“Disaster Recovery” and related thoughts…

Backup, Archive, High Availability, Disaster Recovery, Business Continuity. All related. Yet all different.

One of my colleagues was recently faced with needing to run “a DR [disaster recovery] workshop” for a client. My initial impression was:

  • What disasters are they planning for?
  • I’ll bet they are thinking about Coronavirus and working remotely. That’s not really DR.
  • Or are they really thinking about a backup strategy?

So I decided to turn some of my rambling thoughts into a blog post. Each of these topics could be a post in its own right – I’m just scraping the surface here…

Let’s start with backup (and recovery)

Backups (of data) are a fairly simple concept. Anything that would create a problem if it was lost should be backed up. For example, my digital photos are considered to not exist at all unless they are synchronised (or backed up) to at least two other places (some network-attached storage, and the cloud).

In a business context, we run backups in order to be able to recover (restore) our content (configuration or data) within a given window. We may have weekly full backups and daily incremental or differential backups (perhaps with more regular snapshots), retain parent, grandparent and great-grandparent copies of the full backups (four weeks), and then keep one of these as a (lunar) monthly backup for a year. That’s just an example – each organisation will have its own backup/retention policies, and those backups may be stored on- or off-site, on tape or on disk.
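
To make that retention idea concrete, here’s a minimal sketch in Python (with hypothetical dates and a deliberately simplified rule – not any particular backup product’s policy) of deciding which weekly full backups to keep:

from datetime import date, timedelta

def weekly_fulls_to_keep(today, all_fulls, weekly_generations=4, monthly_keep_days=365):
    """Return the weekly full backups retained under a simple
    parent/grandparent/great-grandparent scheme: the most recent few
    weekly fulls, plus one 'monthly' (every fourth week) kept for a year."""
    keep = set()
    keep.update(sorted(all_fulls, reverse=True)[:weekly_generations])  # parent, grandparent, ...
    for backup_date in all_fulls:
        age = (today - backup_date).days
        if age <= monthly_keep_days and backup_date.isocalendar()[1] % 4 == 0:
            keep.add(backup_date)            # keep one full every ~4 weeks for a year
    return sorted(keep)

# Hypothetical example: weekly fulls every Sunday for the past year
today = date(2020, 4, 5)
fulls = [today - timedelta(weeks=w) for w in range(52)]
print(weekly_fulls_to_keep(today, fulls))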

In summary, backups are about making sure we have an up to date copy of our important configuration information and data, so we can recover it if the primary copy is lost or damaged.

And for bonus content, some services we might consider in a modern infrastructure context include Azure Backup or AWS Backup.

Backups must be verified and periodically tested in order to have any use.

Archiving information

When I wrote about backups above, I mentioned keeping multiple copies covering various points in time. Whilst some may consider that adequate for archival, archival is really about storing data for long-term preservation, with read-only access – for example, documents that must be kept for an extended period (7, 10, 25, even 99 years). Once that would have been paper documents, in boxes. Now it might be digital files (or database contents) on tape or disk (potentially cloud storage).

Archival might still use backup software and associated retention policies, but we’ll think carefully about the medium we store it on. For very long-term physical storage we need to consider the media format: paper is bulky, so it might be transferred to microfiche; old magnetic media degrades, so it’s moved to optical storage; then the optical hardware becomes obsolete, so it’s migrated to yet another format. If storing on disk (on-premises or in the cloud), we can use slower (cheaper) disks and accept that restoration from the archive may take additional time.

In summary, archival is about long-term data storage, generally measured in many years, and archives might be stored off-line or near-line.

Technologies we might use for archival are similar to backups, but we could consider lower-cost storage – e.g. Azure Storage’s Cool or Archive tiers or Amazon S3 Glacier.
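
As an illustration of that lower-cost tier approach, here’s a minimal sketch using Python and boto3 (a hypothetical bucket name and prefix are assumed) of an S3 lifecycle rule that transitions objects to the Glacier storage class after 30 days; an Azure Storage lifecycle management rule could achieve something similar for the Cool or Archive tiers:

import boto3

s3 = boto3.client("s3")

# Hypothetical bucket and prefix: transition anything under archive/
# to the Glacier storage class once it is 30 days old.
s3.put_bucket_lifecycle_configuration(
    Bucket="example-archive-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-after-30-days",
                "Status": "Enabled",
                "Filter": {"Prefix": "archive/"},
                "Transitions": [{"Days": 30, "StorageClass": "GLACIER"}],
            }
        ]
    },
)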

Keeping systems highly available

High Availability (HA) is about making sure that our systems are available for as much time as possible – or certainly within a given service level agreement (SLA).

Traditionally, we used technologies like a redundant array of inexpensive disks (RAID), error-checking (ECC) memory, or redundant power supplies. We might also have created server clusters or farms. All of these methods have the intention of removing single points of failure (SPOFs).

In the cloud, we leave a lot of the infrastructure considerations to the cloud service provider and we design for failure in other ways.

  • We assume that individual virtual machines will fail, so we create availability sets.
  • We plan to scale out across multiple hosts for applications that can take advantage of that architecture.
  • We store data in multiple regions.
  • We may even consider multiple clouds.

Again, the level of redundancy built into the app and its supporting infrastructure must be designed according to requirements – as defined by the SLA. There may be no point in providing expensive “four nines” uptime for an application that’s used once a month by one person who works normal office hours. But, then again, what if that application is business-critical – like payroll? Again, refer to the SLA – and maybe think about business continuity too… more on that in a moment.
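
To put those “nines” into perspective, here’s a quick back-of-the-envelope calculation (a Python sketch) of how much downtime each common SLA level actually allows:

# Allowed downtime for common SLA levels, over a year and a 30-day month
for sla in (99.0, 99.9, 99.95, 99.99, 99.999):
    downtime_fraction = 1 - sla / 100
    per_year = downtime_fraction * 365 * 24 * 60   # minutes per year
    per_month = downtime_fraction * 30 * 24 * 60   # minutes per 30-day month
    print(f"{sla}%: {per_year:8.1f} min/year, {per_month:7.1f} min/month")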

Some of my clients have tried to implement Windows Server clusters in Azure. I’ve yet to be convinced and still consider that it’s old-world thinking applied in a contemporary scenario. There are better ways to design a highly available file service in 2020.

In summary, high availability is about ensuring that an application or service is available within the requirements of the associated service level agreement.

Technologies might include some of the hardware considerations I listed earlier but, these days, we’re probably thinking more about the cloud approaches above: availability sets, scaling out across multiple hosts, storing data in multiple regions and, potentially, multiple clouds.

Remember to also consider other applications/systems upon which an application relies.
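
One way to reason about those dependencies (an illustration with hypothetical figures, not any specific vendor’s SLA) is a composite SLA – for services an application depends on in series, the effective availability is roughly the product of the individual figures:

# Composite SLA: an application is only as available as the chain of
# services it depends on (hypothetical dependencies and figures).
dependencies = {
    "web tier": 99.95,
    "database": 99.99,
    "identity provider": 99.9,
}

composite = 1.0
for service, sla in dependencies.items():
    composite *= sla / 100

print(f"Composite SLA: {composite * 100:.3f}%")   # lower than any single dependency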

Also, quoting from some of Microsoft’s training materials:

“To achieve four 9’s (99.99%), you probably can’t rely on manual intervention to recover from failures. The application must be self-diagnosing and self-healing.

Beyond four 9’s, it is challenging to detect outages quickly enough to meet the SLA.

Think about the time window that your SLA is measured against. The smaller the window, the tighter the tolerances. It probably doesn’t make sense to define your SLA in terms of hourly or daily uptime.”

Microsoft Learn: Design for recoverability and availability in Azure: High Availability

Disaster recovery

As the name suggests, Disaster Recovery (DR) is about recovering from a disaster, whatever that might be.

It could be physical damage to a piece of hardware (a switch, a server) that requires replacement or recovery from backup. It could be a whole server room or datacentre that’s been damaged or destroyed. It could be data loss as a result of malicious or accidental actions by an employee.

This is where DR plans come into play – first analysing the risks that might lead to disaster (including possible data loss and major downtime scenarios) and then looking at recovery objectives: the application’s recovery point objective (RPO) and recovery time objective (RTO).

Quoting Microsoft’s training materials again:

[Illustration: the recovery point objective (RPO) and recovery time objective (RTO), measured in hours from the time of the disaster.]

“Recovery Point Objective (RPO): The maximum duration of acceptable data loss. RPO is measured in units of time, not volume: “30 minutes of data”, “four hours of data”, and so on. RPO is about limiting and recovering from data loss, not data theft.

Recovery Time Objective (RTO): The maximum duration of acceptable downtime, where “downtime” needs to be defined by your specification. For example, if the acceptable downtime duration is eight hours in the event of a disaster, then your RTO is eight hours.”

Microsoft Learn: Design for recoverability and availability in Azure: Disaster Recovery

For example, I may have a database that needs to be able to withstand no more than 15 minutes’ data loss and an associated SLA that dictates no more than 4 hours’ downtime in a given period. For that, my RPO is 15 minutes and the RTO is 4 hours. I need to make sure that I take snapshots (e.g. of transaction logs for replay) at least every 15 minutes and that my restoration process to get from offline to fully recovered takes no more than 4 hours (which will, of course, determine the technologies used).
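
As a sanity check on that example, here’s a minimal Python sketch (with hypothetical figures) comparing a snapshot schedule and an estimated restore time against the stated RPO and RTO:

from datetime import timedelta

rpo = timedelta(minutes=15)          # maximum acceptable data loss
rto = timedelta(hours=4)             # maximum acceptable downtime

snapshot_interval = timedelta(minutes=15)   # transaction log snapshots
estimated_restore = timedelta(hours=3)      # hypothetical figure from the last DR test

# Worst-case data loss is (roughly) one full snapshot interval
print("RPO met" if snapshot_interval <= rpo else "RPO at risk")
print("RTO met" if estimated_restore <= rto else "RTO at risk")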

Considerations when creating a DR plan might include:

  • What are the requirements for each application/service?
  • How are systems linked – what are the dependencies between applications/services?
  • How will you recover within the required RPO and RTO constraints?
  • How can replicated data be switched over?
  • Are there multiple environments (e.g. dev, test and production)?
  • How will you recover from logical errors in a database that might impact several generations of backup, or that may have spread through multiple data replicas?
  • What about cloud services – do you need to back up SaaS data (e.g. Office 365)? (Possibly not, if you’re happy with a retention-period-based restoration from a “recycle bin” or similar – but what if an administrator deletes some data?)

As can be seen, there are many factors here – more than I can go into in this blog post, but a disaster recovery strategy needs to consider backup/recovery, archive, availability (high or otherwise), technology and service (it may help to think about some of the ITIL service design processes).

In summary, disaster recovery is about having a plan to be able to recover from an event that results in downtime and data loss.

Technologies that might help include Azure Site Recovery. Applications can also be designed with data replication and recovery in mind – for example, using the geo-replication capabilities in Azure Storage/Amazon S3 or Azure SQL Database/Amazon RDS, or using a globally-distributed database such as Azure Cosmos DB. And, like backups, DR plans must be periodically tested.

Business continuity

Finally, Business Continuity (BC). This is something that many organisations will have had to contend with over the last few weeks and months.

BC is often confused with DR but they are different. Business continuity is about continuing to conduct business when something goes wrong. That may be how to carry on operating whilst recovering from a disaster. Or it may be how to adapt processes to allow a workforce to continue functioning in compliance with social distancing regulations.

Again, BC needs a plan. But many of those plans will be reconsidered now – if your BC arrangements are that, in the event of an office closure, people go to a hosted DR site with some spare equipment that will be made available within an agreed timescale, that might not help in a global pandemic, when everyone else wants to use that facility too. Instead, how will your workforce continue to work at home? Which systems are important? How will you provide secure remote access to those systems? (How will you serve customers whilst employees are also looking after children?) The list goes on.

Technology may help with BC, but technology alone will not provide a solution. The use of modern approaches to End User Computing will certainly make secure remote and mobile working a possibility (indeed, organisations that have taken a modern approach will probably already be familiar with those practices) but a lot of the issues will relate to people and process.

In summary, Business Continuity plans may be invoked if there is a disaster but they are about adapting business processes to maintain service in times of disruption.

Wrapping up

As I was writing this post, I thought about many tangents that I could go off and cover. I’m pretty sure the topic could be a book and this post scrapes the surface. Nevertheless, I hope my thoughts are useful and show that disaster recovery cannot be considered in isolation.

Where did my blog go? And why didn’t I have backups?


Earlier this week I wrote a blog post about SQL Server. I scheduled it to go live the following lunchtime (as I often do) and then checked to see why IFTTT hadn’t created a new link in bit.ly for me to tweet (I do this manually since the demise of TwitterFeed).

To my horror, my blog had no posts. Not one. Nothing. Nada. All gone.

Where did all the blog posts go?

As I’m sure you can imagine, the loss of 13 years’ work invoked mild panic. But never mind, I have regular backups to Dropbox using the WordPress Backup to Dropbox plugin. Don’t I? Oh. It seems the last one ran in March.

History: Backup completed on Saturday March 11, 2017 at 02:24:40.

Ah.

Having done some research, it seems my backups have probably been failing since my hosting provider updated PHP on the server. My bad for not checking them more carefully. I probably hadn’t noticed because the backup was partially running (so I was seeing the odd file written to Dropbox) and I didn’t realise the job was crashing part-way through.

Luckily for me, the story has a happy ending*. My hosting provider (ascomi) kept backups (thank you Simon!). Although they found that a WordPress restore failed (maybe the source of the data loss was a database corruption), they were able to restore from a separate SQL backup. All back in a couple of hours, except for the most recent post, which I can write again one evening.

So, what about fixing those backups, Mark?

Tonight, before writing another post, I decided to have a look at the broken backups.

I’m using WordPress Backup to Dropbox v4.7.1 on WordPress 4.8, both of which are the latest versions at the time of writing. The Backup Monitor log for my last (manual) attempt at backing up read:

22:55:57: A fatal error occured: The backup is having trouble uploading files to Dropbox, it has failed 10 times and is aborting the backup.
22:55:57: Error uploading ‘/home/mwilson/public_html/blog/wp-content/plugins/wordpress-seo/images/banner/video-seo.png’ to Dropbox: unexpected parameter ‘overwrite’
22:55:56: Error uploading ‘/home/mwilson/public_html/blog/wp-content/plugins/wordpress-seo/images/banner/configuration-service.png’ to Dropbox: unexpected parameter ‘overwrite’
22:55:56: Error uploading ‘/home/mwilson/public_html/blog/wp-content/plugins/wordpress-seo/images/banner/news-seo.png’ to Dropbox: unexpected parameter ‘overwrite’
22:55:55: Error uploading ‘/home/mwilson/public_html/blog/wp-content/plugins/wordpress-seo/images/editicon.png’ to Dropbox: unexpected parameter ‘overwrite’
22:55:54: Processed 736 files. Approximately 14% complete.
22:55:54: Error uploading ‘/home/mwilson/public_html/blog/wp-content/plugins/wordpress-seo/images/link-out-icon.svg’ to Dropbox: unexpected parameter ‘overwrite’
22:55:53: Error uploading ‘/home/mwilson/public_html/blog/wp-content/plugins/wordpress-seo/images/extensions-local.png’ to Dropbox: unexpected parameter ‘overwrite’
22:55:53: Error uploading ‘/home/mwilson/public_html/blog/wp-content/plugins/wordpress-seo/images/question-mark.png’ to Dropbox: unexpected parameter ‘overwrite’
22:55:48: Processed 706 files. Approximately 13% complete.
22:55:42: Processed 654 files. Approximately 12% complete.
22:55:36: Processed 541 files. Approximately 10% complete.
22:55:30: Processed 416 files. Approximately 8% complete.
22:55:24: Processed 302 files. Approximately 6% complete.
22:55:18: Processed 170 files. Approximately 3% complete.
22:55:12: Processed 56 files. Approximately 1% complete.
22:55:10: Error uploading ‘/home/mwilson/public_html/blog/wp-content/languages/plugins/redirection-en_GB.mo’ to Dropbox: unexpected parameter ‘overwrite’
22:55:10: Error uploading ‘/home/mwilson/public_html/blog/wp-content/languages/plugins/widget-logic-en_GB.mo’ to Dropbox: unexpected parameter ‘overwrite’
22:55:09: Error uploading ‘/home/mwilson/public_html/blog/wp-content/languages/plugins/widget-logic-en_GB.po’ to Dropbox: unexpected parameter ‘overwrite’
22:55:09: Error uploading ‘/home/mwilson/public_html/blog/wp-content/languages/plugins/redirection-en_GB.po’ to Dropbox: unexpected parameter ‘overwrite’
22:55:06: SQL backup complete. Starting file backup.
22:55:06: Processed table ‘wp_yoast_seo_meta’.
22:55:06: Processed table ‘wp_yoast_seo_links’.
22:55:06: Processed table ‘wp_wpb2d_processed_files’.
22:55:06: Processed table ‘wp_wpb2d_processed_dbtables’.
22:55:06: Processed table ‘wp_wpb2d_premium_extensions’.
22:55:06: Processed table ‘wp_wpb2d_options’.
22:55:06: Processed table ‘wp_wpb2d_excluded_files’.
22:55:06: Processed table ‘wp_users’.
22:55:06: Processed table ‘wp_usermeta’.
22:55:06: Processed table ‘wp_terms’.
22:55:06: Processed table ‘wp_termmeta’.
22:55:06: Processed table ‘wp_term_taxonomy’.
22:55:06: Processed table ‘wp_term_relationships’.
22:55:05: Processed table ‘wp_redirection_logs’.
22:55:05: Processed table ‘wp_redirection_items’.
22:55:05: Processed table ‘wp_redirection_groups’.
22:55:05: Processed table ‘wp_redirection_404’.
22:55:03: Processed table ‘wp_ratings’.
22:55:03: Processed table ‘wp_posts’.
22:54:54: Processed table ‘wp_postmeta’.
22:54:49: Processed table ‘wp_options’.
22:54:48: Processed table ‘wp_links’.
22:54:48: Processed table ‘wp_feedfooter_rss_map’.
22:54:48: Processed table ‘wp_dynamic_widgets’.
22:54:48: Processed table ‘wp_comments’.
22:54:45: Processed table ‘wp_commentmeta’.
22:54:43: Processed table ‘wp_bad_behavior’.
22:54:43: Processed table ‘wp_auth0_user’.
22:54:43: Processed table ‘wp_auth0_log’.
22:54:43: Processed table ‘wp_auth0_error_logs’.
22:54:43: Starting SQL backup.
22:54:41: Your time limit is 90 seconds and your memory limit is 128M
22:54:41: Backup started on Tuesday July 25, 2017.

Hmm, a fatal error caused by an unexpected ‘overwrite’ parameter when uploading files to Dropbox… I can’t be the only one having this issue, surely?

Indeed not, as a quick Google search led me to a WordPress.org support forum post on how to tweak the WordPress Backup to Dropbox plugin for PHP 7. And, after making the following edits, I ran a successful backup:

“All paths are relative to $YOUR_SITE_DIRECTORY/wp-content/plugins/wordpress-backup-to-dropbox.

In file Dropbox/Dropbox/OAuth/Consumer/Curl.php: comment out the line:
$options[CURLOPT_SAFE_UPLOAD] = false;
(this option is no longer valid in PHP 7)

In file Dropbox/Dropbox/OAuth/Consumer/ConsumerAbstract.php: replace the test if (isset($value[0]) && $value[0] === '@') with if ($value instanceof CURLFile)

In file Dropbox/Dropbox/API.php: replace 'file' => '@' . str_replace('\\', '/', $file) . ';filename=' . $filename with 'file' => new CURLFile(str_replace('\\', '/', $file), "application/octet-stream", $filename)”

(Actually, a comment further down the post highlights a missing comma after $filename in that last edit, so it should be 'file' => new CURLFile(str_replace('\\', '/', $file), "application/octet-stream", $filename),)

So, that’s the backups fixed (thank you @smowton on WordPress.org). I just need to improve my monitoring of them to keep my blog online, and my blood pressure at sensible levels…

 

*I still have some concerns, because the data loss occurred the night after a suspected hacking attempt on my LastPass account, which seems to have been thwarted by second-factor authentication… at least LastPass say it was…

Manually removing servers from an Exchange organization


I’m starting this blog post with a caveat: the process I’m going to describe here is not a good idea, goes against the advice of my colleagues (who have battle scars from when it’s been attempted in a live environment and not gone so well) and is certainly not recommended. In addition, I can’t be held responsible for any unintended consequences of following these steps.

Notwithstanding the above, I found myself trying to configure the Exchange Hybrid Configuration Wizard (HCW) in a customer environment, where the wizard failed because it was looking for servers that don’t exist any more.

I had two choices:

  1. Recover the missing Exchange servers with setup.exe /m:RecoverServer, then uninstall Exchange gracefully (for 12 servers!).
  2. Manually remove the servers using ADSI Edit.

I explained the situation to my customer, who discussed it with his Exchange expert, and they directed me to go for option 2 – this was a test environment, not production, and they were prepared to accept the risk.

Fearing the worst, I made a backup of Active Directory, just in case. This involved:

  1. Installing the Windows Server Backup Command Line Tools feature on the domain controller.
  2. Running wbadmin start systemstatebackup -backuptarget:driveletter:
  3. Sitting back and waiting for the process to complete.

With a backup completed, I could then:

  1. Run ADSI Edit.
  2. Open the configuration naming context.
  3. Navigate to CN=Servers,CN=Exchange Administrative Group (FYDIBOHF23SPDLT),CN=Administrative Groups,CN=organizationname,CN=Microsoft Exchange,CN=Services,CN=Configuration,DC=domainname,DC=tld
  4. Delete the records for the servers that no longer exist.
  5. Restart each of the remaining Exchange servers in the organisation in turn.
  6. Check the server list in ECP.

(Incidentally, FYDIBOHF23SPDLT is a Caesar cipher – each character shifted along by one – for EXCHANGE12ROCKS.)
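
For the curious, a throwaway Python sketch confirms the single-character shift:

def shift_back(text, shift=1):
    """Shift upper-case letters and digits back by one place (a Caesar cipher)."""
    out = []
    for c in text:
        if c.isalpha():
            out.append(chr((ord(c) - ord("A") - shift) % 26 + ord("A")))
        elif c.isdigit():
            out.append(chr((ord(c) - ord("0") - shift) % 10 + ord("0")))
        else:
            out.append(c)
    return "".join(out)

print(shift_back("FYDIBOHF23SPDLT"))   # EXCHANGE12ROCKS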

Murat Yildirimoglu’s Windows IT Pro article entitled “How to Uninstall a Stubborn Exchange Server” goes into more detail, including completely removing an Exchange organisation from Active Directory, should that be required (Christopher Dargel covers that too).

The process seemed to work, but the danger of manually removing servers from an Exchange organization like this lies in the potential side effects of “unknown unknowns” (which you can be sure won’t surface immediately). It did let me progress to the next stage of the HCW though. More on that in a future blog post…

Sorting out my home backups


After my parents-in-law’s recent burglary (and related data loss), I started to think more seriously about my household’s backups, which are spread across a variety of USB drives, NAS units and cloud services (Dropbox, SkyDrive, Box.net, etc.).

My plan is to:

  1. Duplicate – hard drives fail. I know, because I’ve lost data that way – and RAID is no substitute for a proper backup (as I learned the hard way). If it doesn’t exist in (at least) two places, it doesn’t exist.
  2. Consolidate – bits and pieces on various drives are a nightmare – to know that something is definitely backed up, I need to know it’s on the “big backup drive” (as well as in the primary source).
  3. Archive – both physically (media stored in a safe) and virtually (upload to the cloud). Be ready for some long uploads though, over an extended period (I only have ADSL 2 – no fibre here).

Steps 1 and 2 work hand in hand and, last weekend, I picked up a 3TB Seagate Backup Plus Desktop drive. I’m not using the bundled backup software, which offers idiot-proof backups of both local and social media (Facebook, Flickr) data, but installing it on my MacBook includes Paragon NTFS for Mac, which means I can use the drive with Macs and PCs without reformatting. There is a Mac version of the drive too, although the only differences I can see from Seagate’s data sheets for the “normal” and Mac versions are: FireWire and USB 2.0 cables instead of USB 3.0; a downloadable HFS+ driver for Windows instead of a preloaded NTFS driver for Mac OS X; and a three-year warranty instead of two.

Step 3 is more involved. I did some analysis into a variety of cloud services a while ago and found that each one has pros/cons depending on whether you want to back up a single computer or multiple computers, limitations on storage, cost, etc. I didn’t get around to publishing that information but there is a site called Which Online Backup that might help (although I’m not sure how impartial it is – it’s certainly nothing to do with the Which? consumer information/campaign service).

My current thinking is that I’ll continue to use free services like Dropbox to back up and sync many of my commonly-used files (encrypting sensitive information using TrueCrypt), at the same time as creating a sensible archive strategy for long-term storage of photographs, etc. That strategy is likely to include Amazon Glacier but, because of the way the service works, I’ll need to think carefully about how I create my archives – Glacier is not intended for instant access, nor is it for file-level storage.
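
As an illustration of that “archives, not files” model, here’s a minimal sketch in Python using boto3 (with a hypothetical vault name and source directory) that bundles a set of photos into a single tarball and uploads it to a Glacier vault as one archive:

import tarfile
import boto3

# Bundle a year's photos into a single archive - Glacier works best with
# large, infrequently-accessed archives rather than individual files.
archive_path = "photos-2012.tar.gz"
with tarfile.open(archive_path, "w:gz") as tar:
    tar.add("Pictures/2012")                 # hypothetical source directory

glacier = boto3.client("glacier")
with open(archive_path, "rb") as f:
    response = glacier.upload_archive(
        vaultName="family-photos",           # hypothetical vault
        archiveDescription="Photos from 2012",
        body=f,
    )

# Keep the archive ID somewhere safe - it's needed to retrieve the archive later
print(response["archiveId"])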

I’ll write some more as my archive strategy becomes reality but, in the meantime, the mass data copy for the duplicate and consolidate phases has begun, after which all other copies can be considered “uncontrolled”.