Backups

Resources

Slides
Data Loss on Wikipedia
10 Common Causes of Data Loss from Consolidated Technologies
Backup Types Explained: Full, Incremental, Differential, Synthetic, and Forever-Incremental from Nakivo
Understanding RPO and RTO from Druva
What is High Availability? from DigitalOcean
Law Firm Retools its Backup Scenario from Network World
Postmortem of Database Outage of January 31 from GitLab

Video Transcript

Backups are an important part of any system administrator’s toolkit. At the end of the day, no amount of planning and design can completely remove the need for high-quality, reliable backups of the data that an organization needs to function. As a system administrator, you may be responsible for designing and executing a backup strategy for your organization. Unfortunately, there is no “one size fits all” way to approach this task, as each organization’s needs are different. In this video, I’ll discuss some different concepts and techniques that I’ve acquired to help me think through the process of creating a backup strategy.

To begin, I like to ask myself the classic questions that any journalist begins with: “Who? What? When? Where? Why? How?” Specifically, I can phrase these to relate to the task of designing a backup strategy: “Who has the data? What data needs to be stored? When should it be backed up? Where should those backups be stored? Why can data loss occur in this organization? How should I create those backups?” Let’s look at each question in turn to see how that affects our overall backup strategy.

First, who has the data? I recommend starting with a list of all of the systems and people in your organization, and getting a sense of what data each has. In addition, you’ll need to identify all of the servers and network devices that will need to be analyzed. Beyond those systems, there may be credentials and security keys stored in safe locations around the organization that will need to be considered. Finally, especially when dealing with certain types of data, you may also need to consider ownership of the data and any legal issues that may be involved, such as HIPAA for health data, or FERPA for student data. Remember that storing data for your customers doesn’t mean that you have any ownership of that data or the intellectual property within it.

Next, once you’ve identified where the data may be stored, you’ll need to consider the types of data that need to be stored. This could include accounting data and personnel files to help your organization operate smoothly, but also the web assets and user data that may be stored on your website. Beyond that, there is a plethora of data hiding on your systems themselves, such as the network configuration, filesystems, and even the metadata for individual files stored on the systems. At times, this can be the most daunting step once you start to consider the large amount of decentralized data that most organizations have to deal with.

Once you have a good idea of the data you’ll need to back up, the next question you’ll have to consider is when to make your backups. There are many obvious options, from yearly all the way down to instantaneously. So, how do you determine which is right for your organization?

One way to look at it is to consider the impact an issue might have on your organization. Many companies use the terms Recovery Point Objective, or RPO, and Recovery Time Objective, or RTO, to measure these items. RPO refers to how much data may be lost when an issue occurs, and RTO measures how long the systems might be unavailable while waiting for data and access to be restored. Of course, these are goals that your organization will try to meet, so the actual time may be different for each incident.

This timeline shows how they are related. In this example, your organization has created a backup at the point in time labelled “1” in this timeline. Then, at some point in the future, an incident occurs. From that point, your organization has an RPO set, which is the maximum amount of data it is willing to lose. However, the actual data loss may be greater, depending on the incident and your organization’s backup strategy. Similarly, the RTO sets a desired timeline for the restoration of data and service, but the actual time it takes to restore the data may be longer or shorter. So, when creating a backup strategy, you’ll need to choose the frequency based on the RPO and RTO for your organization. Backups made less frequently mean more data could be lost, so they only fit a larger RPO, and backups that are less granular or stored offline take longer to restore, so they only fit a larger RTO.
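To make the timeline concrete, here is a minimal Python sketch that compares the actual data loss and downtime from one incident against an RPO and RTO. The timestamps, the objectives, and the function names are all hypothetical values chosen for illustration, not figures from the video.

```python
from datetime import datetime, timedelta


def data_loss_window(last_backup, incident):
    """Work done between the last good backup and the incident is lost."""
    return incident - last_backup


def meets_objective(actual, objective):
    """An objective is met when the actual duration does not exceed it."""
    return actual <= objective


# Hypothetical incident: nightly backup at 02:00, failure at 15:30,
# service restored at 19:30 the same day.
last_backup = datetime(2024, 1, 1, 2, 0)
incident = datetime(2024, 1, 1, 15, 30)
restored = datetime(2024, 1, 1, 19, 30)

# Hypothetical objectives set by the organization.
rpo = timedelta(hours=24)
rto = timedelta(hours=2)

loss = data_loss_window(last_backup, incident)  # 13 hours 30 minutes
downtime = restored - incident                  # 4 hours

print("RPO met:", meets_objective(loss, rpo))      # RPO met: True
print("RTO met:", meets_objective(downtime, rto))  # RTO met: False
```

In this made-up case the nightly schedule satisfies the RPO, but the restore took twice as long as the RTO allows, which would point toward faster or more accessible backup storage rather than more frequent backups.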

You will also have to consider where you’d like to store the data. One part of that equation is to consider the type of storage device that will be used. Each one comes with its own advantages and disadvantages, in terms of speed, lifetime, storage capacity, and, of course, cost. While it may seem counter-intuitive, one of the leading technologies for long-term data storage is still the traditional magnetic tape, as it already has a proven lifetime of over 50 years, and each tape can hold several terabytes of data. In addition, tape drives can sustain a very high read and write speed for long periods of time. Finally, for some very critical data, storing it in a physical, on-paper format may not be a bad idea either. For passwords, security keys, and more, storing them only on paper keeps them out of reach of any attacker who does not have physical access.

Beyond just the storage media, the location should also be considered. There are many locations where you could store your backup data, relative to where the data was originally stored. For example, a backup can be stored online, which means it is directly and readily accessible to the system it came from in case of an incident. Similarly, data can be stored near-line, meaning that it is close at hand, but may take a few seconds to access. Backups can also be stored offline, such as on a hard disk or tape cartridge that is disconnected from a system but stored in a secure location.

In addition, data can be stored in a variety of ways at a separate location, called an offsite backup. This could be as simple as storing the data in a secure vault in a different building, or even shipping it thousands of miles away in case of a natural disaster. Lastly, some organizations, such as airlines or large retail companies, may even maintain an entire backup site, sometimes called a “disaster recovery center,” which contains enough hardware and stored backup data to allow the organization to resume operations from that location very quickly in the event of an incident at the primary location.

When storing data offsite or creating a backup site, it is always worth considering how that site is related to your main location. For example, could both be affected by the same incident? Are they connected to the same power source? Would a large hurricane possibly hit both sites? Do they use the same internet service provider? All of these can be a single point of failure that may disable both your primary location and your backup location at the same time. Many companies learned this lesson the hard way during recent hurricanes, such as Katrina or Sandy. I’ve posted the story of one such company in the resources section below this video.

Optimization is another major concern when considering where to store the data. Depending on your needs, it may be more cost-effective or efficient to look at compressing the backup before storing it, or performing deduplication to remove duplicated files or records from the company-wide backup. If you are storing secure data, you may also want to encrypt the data before storing it on the backup storage media. Finally, you may even need to engage in staging and refactoring of your data, where backup data is staged in a temporary location while it is being fully stored in its final location. By doing so, your organization can continue to work with the data without worrying about their work affecting the validity of the backup.
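As a rough illustration of deduplication and compression, the sketch below stores each unique piece of content only once, keyed by its SHA-256 hash, and compresses it with `zlib`. The in-memory `files` dictionary and the function names are assumptions made for the example; real backup tools typically deduplicate at the block level and write to disk.

```python
import hashlib
import zlib
from typing import Dict, Tuple


def deduplicate_and_compress(
    files: Dict[str, bytes],
) -> Tuple[Dict[str, bytes], Dict[str, str]]:
    """Return (store, index).

    store maps a content hash to the compressed content, so identical
    files are kept only once. index maps each file name to its hash.
    """
    store: Dict[str, bytes] = {}
    index: Dict[str, str] = {}
    for name, content in files.items():
        digest = hashlib.sha256(content).hexdigest()
        if digest not in store:
            store[digest] = zlib.compress(content)
        index[name] = digest
    return store, index


def restore_file(name: str, store: Dict[str, bytes], index: Dict[str, str]) -> bytes:
    """Look up the file's hash, then decompress the stored content."""
    return zlib.decompress(store[index[name]])


# Hypothetical data: two users saved the same attachment.
files = {
    "alice/handbook.txt": b"company handbook " * 200,
    "bob/handbook.txt": b"company handbook " * 200,
    "bob/notes.txt": b"meeting notes",
}
store, index = deduplicate_and_compress(files)
print(len(files), "files stored as", len(store), "unique objects")
```

Encryption is left out of this sketch; if it is needed, it would normally be applied after compression, since encrypted data does not compress well.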

Another important part of designing a backup strategy is to make sure you understand the ways that an organization could lose data. You may not realize it, but in many cases the top source of data loss at an organization is simply user error or accidental deletion. Many times a user will delete a file without realizing its importance, only to contact IT support to see if it can be recovered. The same can happen to large scale production systems, as GitLab experienced in 2017. I’ve posted a link to their very frank and honest post-mortem of that event in the resources section below this video.

There are many other ways that data could be lost, such as software or hardware failure, or even corruption of data stored on a hard drive itself as sectors slowly degrade over time. You may also have to deal with malicious intent, ranging from insiders trying to sabotage a system all the way to large-scale external hacks. Lastly, every organization must consider the effect natural disasters may have on their data.

Finally, once you’ve identified the data to be backed up, when to create the backups, and where to store them, the last step is to determine how to create those backups themselves. There are many different types of backups, each with their own features and tradeoffs. First, many companies still use an unstructured backup, which I like to think of as “just a bunch of CDs and flash drives” in a safe. In essence, they copy their important files to an external storage device, drop it in a safe, and say that they are covered. While that is definitely better than no backup at all, it is just the first step toward a true backup strategy.

Using software, it is possible to create backups in a variety of ways. The first and simplest is a full backup. Basically, the system makes a full, bit by bit, copy of the data onto a different storage device. While this may be simple, it also requires a large amount of storage space, usually several times the size of the original data if you’d like to store multiple versions of the data.

Next, software can help you create incremental backups. In this case, you start with a full backup, usually made once a week, say on Monday. Then each following day’s backup only stores the data that has changed since the previous day’s backup. So, on Thursday, the backup only stores data that has changed since Wednesday. Then, on Friday, if the backup needs to be restored, the system will need access to all backups since the full backup on Monday in order to restore the data. So, if any incremental backup is lost, it may be impossible to recover the system to its most recent state. However, this method requires much less storage than making a full backup every day, so it can be very useful.
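The restore chain for incremental backups can be sketched in a few lines of Python. This is a simplified model, assuming each day's data is a dictionary of file names to contents; the helper names and the sample files are hypothetical.

```python
def make_delta(previous, current):
    """Record only what changed between two snapshots."""
    changed = {
        name: content
        for name, content in current.items()
        if name not in previous or previous[name] != content
    }
    deleted = [name for name in previous if name not in current]
    return {"changed": changed, "deleted": deleted}


def apply_delta(state, delta):
    """Return a new snapshot with one delta applied."""
    result = dict(state)
    result.update(delta["changed"])
    for name in delta["deleted"]:
        result.pop(name, None)
    return result


def incremental_chain(daily_states):
    """Full backup of the first day, then one delta per following day."""
    full = dict(daily_states[0])
    incrementals = [
        make_delta(daily_states[i - 1], daily_states[i])
        for i in range(1, len(daily_states))
    ]
    return full, incrementals


def restore(full, incrementals):
    """Replay every incremental, in order, on top of the full backup."""
    state = dict(full)
    for delta in incrementals:
        state = apply_delta(state, delta)
    return state


# Hypothetical week: full backup on Monday, incrementals Tuesday to Thursday.
monday = {"report.txt": "v1", "notes.txt": "a"}
tuesday = {"report.txt": "v2", "notes.txt": "a"}
wednesday = {"report.txt": "v2", "notes.txt": "a", "budget.csv": "q1"}
thursday = {"report.txt": "v3", "budget.csv": "q1"}

full, incrementals = incremental_chain([monday, tuesday, wednesday, thursday])
print(restore(full, incrementals) == thursday)  # True

# Losing Wednesday's incremental breaks the chain: budget.csv never comes back.
broken = [incrementals[0], incrementals[2]]
print(restore(full, broken) == thursday)  # False
```

The second restore shows why every link in the chain matters: with Wednesday's backup missing, the restored data no longer matches Thursday's state.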

Similarly, differential backups can also be used. In this case, each daily backup stores the changes since the most recent full backup. So, as long as the full backup hasn’t been lost, the most recent full backup plus a differential backup is enough to restore the data. This requires a bit more storage than an incremental backup, but it can also provide additional data security. It is also possible to keep fewer full backups in this scenario to reduce the storage requirements.
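To get a feel for the storage tradeoff, here is a back-of-the-envelope Python sketch comparing one week of backups under each scheme. It assumes a fixed amount of data changes each day and that each day's changes touch different data, so the differential figure is an upper bound; the sizes and the `weekly_storage` name are hypothetical.

```python
def weekly_storage(full_gb, daily_change_gb, days_after_full=6):
    """Estimate storage (in GB) for one full backup plus the daily backups after it."""
    # Incremental: each day stores only that day's changes.
    incremental = full_gb + daily_change_gb * days_after_full
    # Differential: day d stores everything changed since the full backup.
    differential = full_gb + sum(
        daily_change_gb * day for day in range(1, days_after_full + 1)
    )
    # Baseline: a complete copy every day of the week.
    full_daily = full_gb * (days_after_full + 1)
    return {
        "incremental": incremental,
        "differential": differential,
        "full_daily": full_daily,
    }


# Hypothetical system: 100 GB of data, about 5 GB changing per day.
print(weekly_storage(100, 5))
# {'incremental': 130, 'differential': 205, 'full_daily': 700}
```

Under these assumptions the differential scheme costs more than incremental but far less than daily full backups, and a restore needs only two backup sets instead of the whole chain.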

Finally, some backup systems also support a reverse incremental backup system, sometimes referred to as a “reverse delta” system. In this scenario, each daily backup consists of an incremental backup, then the backup is applied to the existing full backup to keep it up to date. Then, the incremental backups are stored as well, allowing those changes to be reversed if needed. This allows the full backup to be the most recent one, making restoration much simpler than traditional incremental or differential backups. In addition, only the most recent backup is needed to perform a full recovery. However, as with other systems, older data can be lost if the incremental backups are corrupted or missing.
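One way to picture a reverse delta system is the small Python sketch below. It is a simplified model, assuming snapshots are dictionaries of file names to contents; the class and helper names are hypothetical and do not correspond to any particular backup product.

```python
def make_delta(source, target):
    """Describe the changes that turn the source snapshot into the target."""
    changed = {
        name: content
        for name, content in target.items()
        if name not in source or source[name] != content
    }
    deleted = [name for name in source if name not in target]
    return {"changed": changed, "deleted": deleted}


def apply_delta(state, delta):
    """Return a new snapshot with one delta applied."""
    result = dict(state)
    result.update(delta["changed"])
    for name in delta["deleted"]:
        result.pop(name, None)
    return result


class ReverseIncrementalBackup:
    """Keeps an always-current full copy plus deltas that step backward in time."""

    def __init__(self, initial):
        self.full = dict(initial)
        self.reverse_deltas = []  # oldest first

    def backup(self, current):
        # Save how to get from the new state back to the old one,
        # then bring the full copy up to date.
        self.reverse_deltas.append(make_delta(current, self.full))
        self.full = dict(current)

    def restore(self, versions_back=0):
        if not 0 <= versions_back <= len(self.reverse_deltas):
            raise ValueError("no backup that far back")
        state = dict(self.full)
        start = len(self.reverse_deltas) - versions_back
        # Undo the newest change first, then work backward.
        for delta in reversed(self.reverse_deltas[start:]):
            state = apply_delta(state, delta)
        return state


# Hypothetical three days of data.
monday = {"a.txt": "1", "b.txt": "2"}
tuesday = {"a.txt": "1", "b.txt": "3", "c.txt": "4"}
wednesday = {"a.txt": "5", "c.txt": "4"}

backups = ReverseIncrementalBackup(monday)
backups.backup(tuesday)
backups.backup(wednesday)

print(backups.restore() == wednesday)  # True, no deltas needed
print(backups.restore(2) == monday)    # True, two deltas undone
```

Restoring the latest state reads the full copy directly, while older states depend on the stored deltas, which matches the tradeoff described above.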

Lastly, there are a few additional concerns around backup strategies that we haven’t discussed in this video. You may need to consider the security of the data stored in your backup location, as well as how to go about validating that the backup was properly created. Many organizations perform practice data recoveries often, just to make sure the process works properly.
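As one possible approach to validation, the Python sketch below compares SHA-256 checksums of every file in a source directory against its copy in a backup directory. The function names and directory layout are assumptions for illustration, and the comparison is only meaningful if the source has not changed since the backup was made. A checksum match also does not replace a practice restore.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_backup(source_dir, backup_dir):
    """Return relative paths of files that are missing or differ in the backup."""
    source_dir = Path(source_dir)
    backup_dir = Path(backup_dir)
    problems = []
    for source_file in sorted(source_dir.rglob("*")):
        if not source_file.is_file():
            continue
        relative = source_file.relative_to(source_dir)
        backup_file = backup_dir / relative
        if not backup_file.is_file() or sha256_of(source_file) != sha256_of(backup_file):
            problems.append(relative.as_posix())
    return problems
```

Calling `verify_backup("/srv/data", "/mnt/backup/data")` with two hypothetical paths would return an empty list when every file matches, or the list of files that need attention.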

When creating backups, your organization may have a specific backup window, such as overnight or during a planned downtime. However, it may be impossible to make a backup quickly enough during that time. Alternatively, you may need to run backups while the system is being used, so there could be significant performance impacts. In either case, you may need to consider creating a staging environment first.

Also, your organization will have to consider the costs of buying additional storage or hardware to support a proper backup system. Depending on the metric used, it is estimated that you may need as much as 4 times the storage you are using in order to create proper backups. Lastly, as we discussed earlier, you may need to consider how to make your systems available in a distributed environment, in case of a large-scale incident or natural disaster.

Of course, don’t forget that backups are just one part of the picture. For many organizations, high availability is also a desired trait, so your system’s reliability design should not just consist of backups. By properly building and configuring your infrastructure, you could buy yourself valuable time to perform and restore backups, or even acquire new hardware in the case of a failure, just by having the proper system design.

Hopefully this overview of how to create a backup strategy has given you plenty of things to consider. As part of the lab assignment, you’ll work with a couple of different backup scenarios in both Windows and Ubuntu. I also encourage you to review your own personal backup plan at this point, just to make sure you are covered. Sometimes it is simple to analyze these issues on someone else’s environment, but it is harder to find our own failures unless we consciously look for them.