Chapter 7

Backups, Monitoring & DevOps

Automatic deployment & 99.9999% uptime.


Introduction

YouTube Video

Resources

Video Transcript

Welcome to Module 7! In this module, we’ll discuss several additional topics related to system administration. Honestly, each of these items probably deserves a module unto itself, if not an entire course. However, since I am limited in the number of things I can reasonably cover in a single course, I’ve chosen to group these items together into this single module.

First, we’ll discuss DevOps, or development operations, a relatively new field blending the skills of both system administrators and software developers into a single unit. We’ll also discuss a bit of the theory related to system administration, including ITIL, formerly the Information Technology Infrastructure Library, a widely used framework for system administrators to follow.

Then, we’ll cover information related to monitoring your systems on your network and in the cloud, and how to configure some of those tools in our existing infrastructure from previous lab assignments. Finally, we’ll discuss backups, and you’ll perform a couple of backup and restore exercises as part of the lab assignment.

As always, if you have any questions or run into issues, please post in the course discussion forums to get help. Good luck!

DevOps

YouTube Video

Resources

Video Transcript

One of the hottest new terms in system administration and software development today is DevOps. DevOps is a shortened form of “Development Operations,” and encompasses many different ideas and techniques in a single term. I like to think of it as the application of the Agile software development process to system engineering. Just as software developers are focusing on more frequent releases and flexible processes, system administrators have to react to quickly changing needs and environments. In addition, DevOps engineers are responsible for automating much of the software development lifecycle, allowing developers to spend more time coding and less time running builds and tests manually. Finally, to accommodate many of these changes, DevOps engineers have increasingly adopted the concept of “infrastructure as code,” allowing them to define their system configurations and more using tools such as Puppet and Ansible in lieu of doing it manually. As we saw in Module 2, that approach can be quite powerful.

Another way to think of DevOps is the intersection of software development, quality assurance, and operations management. A DevOps engineer would work fluidly with all three of these areas, helping to automate their processes and integrate them tightly together.

There are many different ways to look at the DevOps toolchain. This graphic gives one, which goes through a process of “plan, create, verify, package, release, configure, and monitor.” The first four steps deal with the development process, as developers work to plan, test, and package their code for release. Then, the operations side takes over, as the engineers work to make that code widely available in the real world, dealing with issues related to configuration and monitoring.

For this lecture, I’ll look at a slightly modified process, consisting of these seven steps. Let’s take a look at each one in turn and discuss some of the concepts and tools related to each one.

First is code. As software developers work to create the code for a project, DevOps engineers will be configuring many of the tools that they use. This could be managing a GitHub repository, or even your own GitLab instance. In addition, you could be involved in configuring their development and test environments using tools such as Vagrant and Puppet, among others. Basically, at this phase, the responsibility of a DevOps engineer is to allow developers to focus solely on creating code, without much worry about setting up the environments and tools needed to do their work.

The next steps, build and test, are some of the most critical for this process. As developers create their code, they should also be writing automated unit tests that can be run to verify that the code works properly. This is especially important later on, as bugfixes may inadvertently break code that was previously working. With a properly developed suite of tests, such problems can easily be detected. For a DevOps engineer, the most important part is automating this process, so that each piece of code is properly tested and verified before it is used. Generally, this involves setting up automated building and testing procedures using tools such as Travis and Jenkins. When code is pushed to a Git repository, these tools can automatically download the latest code, perform the build process, and run any automated tests without any user intervention. If the tests fail, they can automatically report that failure back to the developer.
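To make this concrete, here is a rough sketch of the kind of build-and-test script a CI tool such as Jenkins might run on every push. The repository URL, build system, and test runner shown here are hypothetical placeholders; substitute your own project's tools.

#!/bin/bash
set -e                            # abort immediately if any step fails

git clone https://example.com/myuser/myproject.git build    # fetch the latest code (placeholder URL)
cd build
make                              # compile or package the project
make test                         # run the automated test suite
echo "Build and tests passed."    # a real CI tool reports this status back to the developer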

Once the product is ready for release, the package and release processes begin. Again, this process can be fully automated, using tools in the cloud such as Heroku, AWS, and Azure to make the software available to the world. In addition, there are a variety of tools that can be used to make downloadable packages of the software automatically. For web-based applications, this process could even be chained directly after the test process, so that any build that passes the tests is automatically released to production.

Of course, to make a package available in production requires creating an environment to host it from, and there are many automated configuration tools to help with that. We’ve already covered Puppet in this class, and Ansible and SaltStack can perform similar functions in the cloud. In addition, you may choose to use container tools such as Docker to make lightweight containers available, making deployment across a variety of systems quick and painless.
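As a small illustration, a container-based deployment can boil down to a couple of commands like these, assuming the application already has a Dockerfile; the image name, container name, and ports are placeholders for illustration.

docker build -t myapp:latest .                        # package the app and its dependencies into an image
docker run -d --name myapp -p 80:3000 myapp:latest    # start it, mapping host port 80 to the app's port 3000
docker ps                                             # verify that the container is running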

Lastly, once your software is widely available, you’ll want to continuously monitor it to make sure it is available and free of errors. There are many free and paid tools for performing this task, including Nagios, Zabbix, and Munin. As part of this lab’s assignment, you’ll get to set up one of these monitoring tools on your own cloud infrastructure, just to see how they work in practice.
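Conceptually, these tools boil down to repeatedly running checks and alerting someone when a check fails. A minimal sketch of that idea, with a placeholder URL and email address and assuming a working mail setup on the server, might look like this:

#!/bin/bash
if ! curl -sf --max-time 10 https://www.example.com > /dev/null; then
  echo "Site unreachable at $(date)" | mail -s "ALERT: example.com is down" admin@example.com
fi

Dedicated monitoring systems layer scheduling, dashboards, escalation, and history on top of simple checks like this one.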

Of course, one of the major questions to ask is “Why go to all this trouble?” There are many reasons to consider adopting DevOps practices in any organization. First and foremost is to support a faster release cycle. Once those processes are automated, it becomes much easier to quickly build, test, and ship code. In addition, the whole process can be more flexible, as it is very easy to change a setting in the automation process to adapt to new needs. This supports one of the core tenets of the agile lifecycle, which focuses on working software and responding to change. In addition, DevOps allows you to take advantage of the latest developments in automation and virtualization, helping your organization stay on top of the quickly changing technology landscape. Finally, in order for your environment to be fluidly scalable, you’ll need to have a robust automation architecture, allowing you to provision resources without any human interaction. By adopting DevOps principles, you’ll be ready to make that leap as well.

Finally, to give you some experience working with DevOps, let’s do a quick example. I’ve named this the “Hello World” of DevOps, as I feel it is a very good first step into that world, hopefully opening your eyes to what is possible. You’ll perform this activity as part of the assignment for this lab as well.

First, you’ll need to create a project on the K-State CS GitLab server, and push your first commit to it. You can refer to the Git video in the Extras module for an overview of that process. I’m going to use an existing project, which is actually the source project for this course. We’ll be using the Webhooks feature of this server. When certain events happen in this project, we can configure the server to send a web request to a particular URL with the relevant data. We can then create a server to listen for those requests and act upon them.

You will also need to either configure SSH keys for your GitLab repository, or configure the repository to allow public access.

Next, we’ll need to install and configure the webhook server on our DigitalOcean droplet. It is a simple server that allows you to listen for those incoming webhooks from GitHub and GitLab, and then run a script when they are received. Of course, in practice, many times you’ll be responsible for writing this particular piece yourself in your own environment, but this is a good example of what it might look like.

On my Ubuntu cloud server, I can install webhook using APT:

sudo apt update
sudo apt install webhook

Next, I’ll need to create a file at /etc/webhook.conf and add the content needed to create the hook:

[
  {
    "id": "cis527online",
    "execute-command": "/home/cis527/bin/cis527online.sh",
    "command-working-directory": "/var/www/bar/html/",
    "response-message": "Executing checkout script",
    "trigger-rule":
    {
      "match":
      {
        "type": "value",
        "value": "f0e1d2c3b4",
        "parameter":
        {
          "source": "header",
          "name": "X-Gitlab-Token"
        }
      }
    }
  }
]

You can find instructions and sample hook definitions on the webhook documentation linked in the resources section below this video. You’ll need to configure this file to match your environment. Also, you can use this file to filter the types of events that trigger the action. For example, you can look for pushes to specific branches or tags.
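As a sketch based on the webhook documentation, a trigger rule like the one below would only run the script for pushes to the master branch that also carry the correct secret token; the token value and branch name are placeholders you would adjust for your project:

"trigger-rule":
{
  "and":
  [
    {
      "match":
      {
        "type": "value",
        "value": "f0e1d2c3b4",
        "parameter":
        {
          "source": "header",
          "name": "X-Gitlab-Token"
        }
      }
    },
    {
      "match":
      {
        "type": "value",
        "value": "refs/heads/master",
        "parameter":
        {
          "source": "payload",
          "name": "ref"
        }
      }
    }
  ]
}

The and rule requires every nested match to succeed before the command is executed.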

Since this file defines a script to execute when the Webhook is received, I’ll need to create that script as well:

#!/bin/bash

# Pull the latest commits from the repository into the working directory
# specified by command-working-directory in webhook.conf
git pull
exit 0

This simple script uses the git pull command to get the latest updates from Git. I could also place additional commands here if needed for this project.
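For instance, a slightly extended version for a hypothetical project that needs a build step and a service restart after each update might look like the sketch below; adjust the commands to whatever your own project actually requires.

#!/bin/bash

git pull                            # fetch the latest commits into the working directory
npm ci && npm run build             # example build step for a Node.js project
systemctl restart myapp.service     # restart a hypothetical service that serves the app
exit 0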

Once I create the script, I’ll need to modify the permissions to make sure it is executable by all users on the system:

chmod a+x bin/cis527online.sh

Once everything is configured, I can restart the webhook server using this command:

sudo systemctl restart webhook

Then, if everything is working correctly, I can test it using my cloud server’s external IP address on port 9000, with the correct path for my hook. For the one I created above, I would visit http://<ip_address>:9000/hooks/cis527online. You should see the response “Hook rules were not satisfied” displayed. If so, your webhook server is up and running. Of course, you may need to modify your firewall configuration to allow that port.
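You can also exercise the trigger rule itself from the command line. A request like the one below, using the same placeholder token as the configuration above, should satisfy the rule and return the “Executing checkout script” response (the git pull it triggers will only succeed once the repository has been cloned into the working directory, as described below):

curl -s -H "X-Gitlab-Token: f0e1d2c3b4" http://<ip_address>:9000/hooks/cis527online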

Lastly, if my repository requires SSH keys, I’ll need to copy the public and private keys into the root user’s .ssh folder, which can be found at /root/.ssh/. Since webhook runs as a service, the Git commands will be run as that user, and it’ll need access to that key to log in to the repo. There are more advanced ways of doing this, but this is one way that works.
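One minimal way to do that, assuming the key pair already exists in the cis527 user’s home directory, is sketched here; adjust the paths and key names to match your own system:

sudo mkdir -p /root/.ssh
sudo cp /home/cis527/.ssh/id_rsa /home/cis527/.ssh/id_rsa.pub /root/.ssh/
sudo chmod 700 /root/.ssh
sudo chmod 600 /root/.ssh/id_rsa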

I’ll also need to download a copy of the files from my Git repository onto this system in the folder I specified in webhook.conf:

sudo rm /var/www/bar/html/*
sudo git clone <repository_url> /var/www/bar/html

Now, we can go back to the K-State CS GitLab server and configure our webhook. After you’ve opened the project, navigate to Settings and then Integrations in the menu on the left. There, you can enter the URL for your webhook, which we tested above. You’ll also need to provide the token you set in the webhook.conf file. Under the Trigger heading, I’m going to check the “Push events” option so that all push events to the server will trigger this webhook. In addition, I’ll uncheck the option for “Enable SSL verification” since we have not configured an SSL certificate for our webhook server. Finally, I’ll click Add webhook to create it.

Once it is created, you’ll see a Test button below the list of existing webhooks. So, we can choose that option, and Select “Push events” to test the webhook. If all works correctly, it should give us a message stating that the hook executed successfully.

However, the real test is to make a change to the repository, then commit and push that change. Once you do, it should automatically cause the webhook to fire, and within a few seconds you should see the change on your server. I encourage you to test this process for yourself to make sure it is working correctly.

There you go! That should give you a very basic idea of some of the tools and techniques available in the DevOps world. It is a quickly growing field, and it is very useful for both developers and system administrators to understand these concepts. If you are interested in learning more, I encourage you to read some of the materials linked in the resources section below this video, or explore some of the larger projects hosted on GitHub to see what they are doing to automate their processes.

ITIL

YouTube Video

Resources

Video Transcript

In this video, I’ll be taking a detour from all of the technical topics in this class to discuss a more philosophical topic: the art of system administration. Specifically, this video will introduce you to ITIL, one of the core resources for many IT organizations worldwide.

First, you might be wondering what ITIL is. Originally, it was an acronym for the “Information Technology Infrastructure Library,” but it has since been officially shortened to just ITIL. ITIL is a comprehensive set of standards and best practices for all aspects of an IT infrastructure. Many organizations refer to this as “IT Service Management” or “ITSM” for short. By making use of the information found in the ITIL resources, you should be able to keep your IT systems running and make your customers happy.

This slide gives a brief history of ITIL. It was originally a set of recommendations developed by the UK Central Computer & Telecommunications Agency in the 1980s. That information was shared with many government and private entities, and they found it to be a very useful resource. By the 1990s, it was expanded to include over 30 volumes of information on a number of IT processes and services. As the technology matured, those volumes were collected into 9 sets of guidelines in the early 2000s, and by 2007, a complete rework was released, consisting of just 5 volumes encompassing 26 processes in all. This version is known as ITIL Version 3.

In 2011, a small update to version 3 was released by the UK Office of Government Commerce, the maintainers of ITIL at that time. However, starting in 2013, the licensing and maintenance of ITIL was handed over to Axelos. They also handle ITIL certifications. Recently, Axelos announced that they are working on ITIL 4, which should be released sometime in the next year.

As mentioned before, the current version of ITIL consists of 5 volumes: Service Strategy, Service Design, Service Transition, Service Operation, and Continual Service Improvement.

These volumes are interrelated, forming the backbone of any IT organization. This diagram shows how they all fit together. At the core is strategy, which defines the overall goals and expectations of the IT organization. Next, design, transition, and operation deal with how to put those strategies into practice and how to keep them running smoothly. Finally, Continual Service Improvement is all about performing an introspective look at the organization to help it stay ahead of trends in technology, while predicting and responding to upcoming threats or issues.

To break it down even further, this diagram lists most of the 26 processes included in ITIL (for some reason, the “Supplier Management” process under “Design” is omitted from the diagram). As you can see, there is quite a variety of processes and related items contained in each volume of ITIL.

However, I’ve found one of the easiest ways to discuss ITIL is to relate it to a restaurant. This analogy is very commonly found on the internet, but the version I am using can be most closely attributed to Paul Solis. So, let’s say you want to open a restaurant, but you’d like to follow the ITIL principles to make sure it is the best, most successful restaurant ever. So, first you’ll work on your strategy. That includes the genre of food you’d like to serve, the location, and an analysis of future trends in the restaurant business. You don’t want to enter a market that is already saturated, or serve a food genre that is becoming less popular.

Next, you’ll discuss details in the design. This includes determining specific menu items, the hours of operation, staffing needs, and how you’ll go about hiring and training the people you need.

Once you have everything figured out, it’s time to enter the transition phase. During this phase, you’ll test the menu items, maybe have a soft opening to make sure the equipment and staff are all functioning properly, and validate any last-minute details of your design before the grand opening.

Once the restaurant is open for business, it’s time to enter the operational phase. In this phase, you’ll be looking at the details in the day-to-day operations of the restaurant, such as how customers are greeted and seated when they enter, how your wait staff enters the food orders, how the kitchen prepares the food, and finally how your clients are able to pay and be on their way. You’ll always be looking for issues in any of these areas, and work diligently to correct them as quickly as possible.

In addition, great restaurants will undergo continual service improvement, always looking for ways to be better and bring in more business. This includes surveys, market research, testing new recipes, and even undergoing full menu changes to keep things fresh and exciting.

Overall, if you follow these steps, you should be able to build a great restaurant. If you’d like a great exercise applying this in practice, I encourage you to watch an episode or two of a TV show that focuses on fixing a failing restaurant, such as “Restaurant Impossible” or “Kitchen Nightmares,” and see if you can spot exactly where in this process the restaurant started to fail. Was the menu too large and confusing? Were the staff not trained properly? Did they fail to react to customer suggestions? Did the market change, leaving them behind? I’ve seen all of those issues, and more, appear during episodes of these TV shows, but each one always links directly back to one of the volumes and processes in ITIL. My hope is that you can see how these same issues could affect any IT organization as well.

ITIL also proposes a maturity model, allowing organizations to rate themselves based on how well they are conforming to ITIL best practices. Level 0 represents the total absence of a plan, or just outright chaos. At Level 1, the organization is starting to formulate a plan, but at best it is just a reactive one. I like to call this the “putting out fires” phase, but the fires are starting as fast as they can be put out. At Level 2, the organization is starting to be a bit more active in planning, so at this point they aren’t just putting out fires, but they are getting to the fires as fast as they develop.

By Level 3, the organization is becoming much more defined in its approach, and is working proactively to prevent fires before they occur. There are still fires once in a while, but many of them are either predicted or easily mitigated. At Level 4, most risks are managed, and fires are minimal and rare. Instead, they are preemptively fixing problems before they occur, and always working to provide the best service possible. Lastly, at Level 5, everything is fully optimized and automated, and any error that is unexpected is easily dealt with, usually without the customers even noticing. I’d say that Netflix is one of the few organizations that I can say is very comfortably at Level 5, based on the case-study from Module 5. Most organizations are typically in the range of Level 3 or 4.

There are a variety of factors that can limit an organization’s ITIL maturity level, but in my opinion, they usually come down to cost. For an organization to be at level 5, it requires a large investment in IT infrastructure, staff, and resources. However, for many large organizations, IT is always seen as a “red line” on the budget, meaning that it costs the organization more money than it brings in. So, when the budget gets tight, IT is often one of the first groups to be downsized, which can lead to major issues down the line. Many companies simply don’t figure the cost of an IT failure into their budget until after it happens, and by then it is too late.

Consider the recent events at K-State with the Hale Library fire. If K-State was truly operating at Level 5, there would most likely have to be a backup site or distributed recovery center, such that all systems could be back online in a matter of hours or minutes after a failure. However, that would be an astronomical cost to the university, and often it is still cheaper to deal with a few days of downtime instead of paying the additional cost to maintain such a system.

So, in summary, why should an IT organization use ITIL? First, IT is a very complex area, and it is constantly changing. At the same time, user satisfaction is very important, and IT organizations must do their best to keep customers happy. Even though it is a very technical field, IT can easily be seen as a service organization, just like any restaurant. Because of that, many IT organizations can adopt some of the same techniques and processes that come from those more mature fields, and ITIL is a great reference that brings all of that industry knowledge and experience to one place. Lastly, poor management can be very expensive, so even though it may be expensive to maintain your IT systems at a high ITIL maturity level, that expense may end up saving money in the long run by preventing catastrophic errors. If you’d like to know more, feel free to check out some of the resources linked below the video in the resources section. Unfortunately, the ITIL volumes are not freely available, but many additional resources are.


I wanted to come back to this lecture and add a bit more information at the end to clarify a few things regarding the relationship between ITIL and a different term, IT Service Management or ITSM. I added a couple of links below this video that provide very good discussions of what ITSM is and how it relates to ITIL.

In short, IT service management is the catch-all term for how an IT organization handles its work. It could deal with everything from the design of systems to the operational details of keeping them up and running, and even the management of changes that need to be applied.

Within the world of ITSM, one of the most common frameworks for IT Service Management is ITIL. So, while I focus on ITIL in this video as the most common framework, it is far from the only one out there, and it has definite pros and cons.

So, as you move into the world of IT, I recommend looking a bit more broadly at IT Service Management as a practice, and then evaluating the many frameworks that exist for implementing it in your organization.

Assignment

Lab 7 - Backups, Monitoring & DevOps

Instructions

Create two cloud systems meeting the specifications given below. The best way to accomplish this is to treat this assignment like a checklist and check things off as you complete them.

If you have any questions about these items or are unsure what they mean, please contact the instructor. Remember that part of being a system administrator (and a software developer in general) is working within vague specifications to provide what your client is requesting, so eliciting additional information is a very necessary skill.

Note

To be more blunt - this specification may be purposefully designed to be vague, and it is your responsibility to ask questions about any vagaries you find. Once you begin the grading process, you cannot go back and change things, so be sure that your machines meet the expected specification regardless of what is written here. –Russ

Also, to complete many of these items, you may need to refer to additional materials and references not included in this document. System administrators must learn how to make use of available resources, so this is a good first step toward that. Of course, there’s always Google!

Time Expectation

This lab may take anywhere from 1 - 6 hours to complete, depending on your previous experience working with these tools and the speed of the hardware you are using. Working with each of these items can be very time-consuming the first time through the process, but it will be much more familiar by the end of this course.

Info

This lab involves working with resources on the cloud, and will require you to sign up and pay for those services. In general, your total cost should be low, usually around $20 total. If you haven’t already, you can sign up for the GitHub Student Developer Pack to get discounts on most of these items. If you have any concerns about using these services, please contact me to make alternative arrangements! –Russ


Task 0: Droplets & Virtual Machines

For this lab, you will continue to use the two DigitalOcean droplets from Labs 5 and 6, labelled FRONTEND and BACKEND, respectively. This assignment assumes you have completed all steps in the previous labs successfully; if not, you should consult with the instructor to resolve any existing issues before continuing.


Task 1: Backup Ubuntu Web Application

For this task, you will perform the steps to create a backup of the web application installed on your Ubuntu droplet in Lab 6. To complete this item, prepare an archive file (.zip, .tar, .tgz or equivalent) containing the following items:

  1. Website data and configuration files for the web application. This should NOT include the entire application, just the relevant configuration and data files that were modified after installation. For Docker installations, any information required to recreate the Docker environment, such as a Docker Compose file, should also be included.
  2. Relevant Apache or Nginx configuration files (virtual hosts, reverse proxy, etc.)
  3. A complete MySQL server dump of the appropriate MySQL database. It should contain enough information to recreate the database schema and all data (a minimal example is sketched after this list).
  4. Clear, concise instructions in a README file for restoring this backup on a new environment. Assume the systems in that new environment are configured as directed in Lab 5. These instructions would be used by yourself or a system administrator of similar skill and experience to restore this application - that is, you don’t have to pedantically spell out how to perform every step, but you should provide enough information to easily reinstall the application and restore the backup with a minimum of effort and research.
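As a minimal sketch of the database portion only, a dump and restore might look like the commands below; the database name appdb is a placeholder for whatever database your web application actually uses:

mysqldump -u root -p --databases appdb > appdb_backup.sql    # create the dump on the existing server
mysql -u root -p < appdb_backup.sql                          # restore it on the new server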

Resources


Task 2: Ubuntu Monitoring

For this task, you will set up Checkmk to monitor your servers.

  1. Configure the Ubuntu droplet named FRONTEND as the primary host for Checkmk.
  2. Then, add the FRONTEND and BACKEND droplets as two monitored hosts.
  3. Send the URL of the Checkmk dashboard and the password in your grading packet. Make sure that both FRONTEND and BACKEND are appearing in the data.

Of course, you may need to modify your firewall configuration to allow incoming connections for this to work! If your firewall is disabled and/or not configured, there will be a deduction of up to 10% of the total points on this lab.

Tip

As always, you may have to deal with Apache virtual hosts and firewalls for this setup. In addition, you may want to add a new A record to your domain name for this site, and request an SSL certificate via CertBot. –Russ

Resources


Task 3: DevOps

Set up an automatically deployed Git repository on your Ubuntu droplet. For this task, perform the following.

Option 1 - CS GitLab

  1. Create a GitLab repository on the K-State CS GitLab instance. That repository must be public and clonable via HTTPS without any authentication. Unfortunately, K-State blocks cloning repositories via SSH.
  2. Clone that repository on your own system, and verify that you can make changes, commit them, and push them back to the server.
  3. Clone that repository into a web directory on your Ubuntu droplet named BACKEND. You can use the default directory first created in Lab 5 (it should be cis527charlie in your DNS).
  4. Create a Bash script that will simply use the git pull command to get the latest content from the Git repository in the current directory.
  5. Install and configure webhook on your Ubuntu droplet named BACKEND. It should listen for all incoming webhooks from GitLab that match a secret key you choose. When a hook is received, it should run the Bash script created earlier.
  6. Configure a webhook in your GitLab repository for all Push events using that same secret key and the URL of webhook on your server. You may need to make sure your domain name has an A record for the default hostname @ pointing to your BACKEND server.
  7. To test this setup, you should be able to push a change to the GitLab repository, and see that change reflected on the website automatically.
  8. For offline grading, add the instructor (@russfeld) to the repository as a maintainer, and submit the repository and URL where the files can be found in your grading packet. Provided the webhook works correctly, they should be able to see a pushed change to the repository update the website.

Of course, you may need to modify your firewall configuration to allow incoming connections for Webhook! If your firewall is disabled and/or not configured, there will be a deduction of up to 10% of the total points on this lab.

Tip

In the video I use SSH to connect to GitLab, but that is no longer allowed through K-State’s firewall. You’ll need to create a public repository and clone it using HTTPS without authentication for this to work. Alternatively, you can use Option 2 below to configure this through GitHub instead. –Russ

Option 2 - GitHub

  1. Create a repository on GitHub. That repository may be public or private. If private, the repository should be set up to be cloned via SSH.
  2. Clone that repository on your own system, and verify that you can make changes, commit them, and push them back to the server.
  3. Clone that repository into a web directory on your Ubuntu droplet named BACKEND. You can use the default directory first created in Lab 5 (it should be cis527charlie in your DNS).
  4. Create a Bash script that will simply use the git pull command to get the latest content from the Git repository in the current directory.
  5. Install and configure webhook on your Ubuntu droplet named BACKEND. It should listen for all incoming webhooks from GitHub that match a secret key you choose. When a hook is received, it should run the Bash script created earlier.
  6. Configure a webhook in your GitHub repository for all Push events using that same secret key and the URL of webhook on your server. You may need to make sure your domain name has an A record for the default hostname @ pointing to your BACKEND server.
  7. Alternatively, you may configure a GitHub Action to send the webhook. I recommend using Workflow Webhook Action or HTTP Request Action.
  8. To test this setup, you should be able to push a change to the GitHub repository, and see that change reflected on the website automatically.
  9. For offline grading, add the instructor (@russfeld) to the repository as a maintainer, and submit the repository and URL where the files can be found in your grading packet. Provided the webhook works correctly, they should be able to see a pushed change to the repository update the website.

Of course, you may need to modify your firewall configuration to allow incoming connections for Webhook! If your firewall is disabled and/or not configured, there will be a deduction of up to 10% of the total points on this lab.

Tip

Since the Webhook process runs as the root user on BACKEND, you’ll need to make sure a set of SSH keys exist in the root user’s home folder /root/.ssh/ and add the public key from that directory to your GitHub account. You should then use the root account (use sudo su - to log in as root) to run git pull from the appropriate directory on BACKEND at least once so you can accept the SSH fingerprint for the GitHub server. This helps ensure that root can properly run the script. –Russ

Resources


Task 4: Submit Files

This lab may be graded completely offline. To do this, submit the following items via Canvas:

  1. Task 1: An archive file containing a README document as well as any files or information needed as part of the backup of the Ubuntu web application installed in Lab 6.
  2. Task 2: The URL and password of your Checkmk instance, clearly showing data from both FRONTEND and BACKEND.
  3. Task 3: A GitLab/GitHub repository URL and a URL of the website containing those files. Make sure the instructor is added to the repository as a maintainer. They should be able to push to the repository and automatically see the website get updated.

If you are able to submit all 3 of the items above, you do not need to schedule a grading time. The instructor or TA will contact you for clarification if there are any questions on your submission.

For Tasks 2 - 3, you may also choose to do interactive grading, especially if you were unable to complete it and would like to receive partial credit.

Task 5: Schedule A Grading Time

If you are not able to submit information for all 3 tasks for offline grading, you may contact the instructor and schedule a time for interactive grading. You may continue with the next module once grading has been completed.

Monitoring

YouTube Video

Resources

Video Transcript

One major part of any system administrator’s job is to monitor the systems within an organization. For many groups, though, monitoring is treated as an afterthought, since it requires additional time and effort to set up and configure properly. However, there are a few good reasons to invest that time and effort in a proper monitoring system.

First and foremost, proper monitoring will help you keep your systems up and running by alerting you to errors as soon as they occur. In many organizations, systems are in use around the clock, but often IT staff are only available during the working day. So, if an error occurs after hours, it might take quite a while for the correct person to be contacted. A monitoring system, however, can contact that person directly as soon as an error is detected. In addition, it will allow you to more quickly respond to emergencies and cyber threats against your organization by detecting them and acting upon them quickly. Finally, a proper monitoring system can even save money in the long run, as it can help prevent or mitigate large system crashes and downtime.

When monitoring your systems, there are a number of different metrics you may be interested in. First, you’ll want to be able to assess the health of the system, so knowing such things as the CPU, memory, disk, and network usage are key. In addition, you might want to know what software is installed, and have access to the log files for any services running on the system. Finally, many organizations use a monitoring system to help track and maintain inventory, as systems frequently move around an organization.

However, beyond just looking at individual systems themselves, you may want to add additional monitoring to other layers of your infrastructure, such as network devices. You may also be involved in monitoring physical aspects of your environment, such as temperature, humidity, and even vibrations, as well as the data from security cameras and monitors throughout the area. In addition, many organizations set up external alerts, using tools such as Google Analytics to be alerted when their organization is in the news or search traffic suddenly spikes. Finally, there are many tools available to help aggregate these various monitoring systems together into a single, cohesive dashboard, and allow system administrators to set customized alert scenarios.

One great example of a monitoring system is the one for K-State’s own Beocat supercomputer, which is linked in the resources section below this video. They use a frontend powered by Ganglia to provide a dashboard view of over 400 unique systems. I encourage you to check it out and see how much interesting data they are able to collect.

Of course, there are many paid tools for this task as well. This is a screenshot of GFI LanGuard, one of many tools available to perform system monitoring and more. Depending on your organization’s needs, you may end up reviewing a number of tools such as this for your use.

In the next few videos, I’ll discuss some monitoring tools for both Windows and Linux. As part of your lab assignment, you’ll install a monitoring system on your cloud servers running on DigitalOcean, giving you some insight into their performance.

Windows Monitoring

YouTube Video

Resources

Video Transcript

In this video, I’m going to review some of the tools you can use to help monitor the health and performance of your Windows systems. While some of these tools support collecting data remotely, most of them are designed for use directly on the systems themselves. There are many paid products that support remote data collection, as well as a few free ones as well. However, since each one is unique, I won’t be covering them directly here. If you’d like to use such a system, I encourage you to review the options available and choose the best one for your environment.

First, of course, is the built-in Windows Task Manager. You can easily access it by right-clicking on the taskbar and choosing it from the menu, or, as always, by pressing CTRL+ALT+DELETE. When you first open the Task Manager on a system, you may have to click a button to show more details. The Task Manager gives a very concise overview of your system, listing all of the running processes, providing graphs of system performance and resource usage, showing system services and startup tasks, and even displaying all users currently logged in to the system. It serves as a great first tool to give you some quick insight into how your system is running and if there are any major issues.

From the Performance tab of the Task Manager, you can access the Resource Monitor. This tool gives more in-depth details about a system’s resources, showing the CPU, memory, disk, and network usage graphs, as well as which processes are using those resources. Combined with the Task Manager, this tool will help you discover any processes that are consuming a large amount of system resources, helping you diagnose performance issues quickly.

The Windows Performance Monitor, which can be found by searching for it via the Start Menu, also gives access to much of the same information as the Resource Monitor. It also includes the ability to produce graphs, log files, and remotely diagnose another computer. While I’ve found that it isn’t quite as useful as the Resource Monitor, it is yet another tool that could be used to help diagnose performance issues.

Finally, the Windows Reliability Monitor, found on the Control Panel as well as via the Start Menu, provides a combined look at a variety of data sources from your system. Each day, it rates your system’s reliability on a scale of 1 to 10, and gives you a quick look at pertinent log entries related to the most recent issues on your system. This is a quick way to correlate related log entries together and possibly determine the source of any system reliability issues.

The Sysinternals Suite of tools available from Microsoft also includes several useful tools for diagnosing and repairing system issues. While I encourage you to become familiar with the entire suite of tools, there are four tools in particular I’d like to review here.

First is Process Explorer. It displays information about running processes on your system, much like the Task Manager. However, it provides much more information about each process, including the threads, TCP ports, and more. Also, Process Explorer includes an option to replace Task Manager with itself, giving you quick and easy access to this tool.

Next is Process Monitor. We’ve already discussed this tool way back in Module 1. It gives you a deep view into everything your Windows operating system is doing, from opening files to editing individual registry keys. To my knowledge, no other tool gives you as much information about what your system is doing, and by poring over the logged information in Process Monitor, you can discover exactly what a process is doing when it causes issues on your system.

Another tool we’ve already discussed is TCPView, which allows you to view all TCP ports open on your system, as well as any connected sockets representing ongoing network connections on your system. If a program is trying to access the network, this tool is one that will help you figure out what is actually happening.

Lastly, Sysinternals includes a tool called Autoruns, which allows you to view all of the programs and scripts that run each time your system is started. This includes startup programs in the Start Menu, as well as programs hidden in the depths of the Windows Registry. I’ve found this tool to be particularly helpful when a pesky program keeps starting with your system, or if a malicious program constantly tries to reinstall itself because another hidden program is watching it.

Windows also maintains a set of log files that keep track of system events and issues, which you can easily review as part of your monitoring routine. The Windows logs can be found by right-clicking on the Start Button, selecting Computer Management, then expanding the Event Viewer entry in the left-hand menu, and finally selecting the Windows Logs option there. Unlike log files in Ubuntu, which are primarily text-based, the Windows log files are presented in a much more advanced format here. There are a few logs available here, and they give quite a bit of useful information about your system. If you are fighting a particularly difficult issue with Windows, the log files might contain useful information to help you diagnose the problem. In addition, there are several tools available to export these log files as text, XML, or even import them into an external system.

Finally, you can also install Wireshark on Windows to capture network packets, just as we did on Ubuntu in Module 3. Wireshark is a very powerful cross-platform tool for diagnosing issues on your network.

This is just a quick overview of the wide array of tools that are available on Windows to monitor the health and performance of your system. Hopefully you’ll find some of them useful to you as you continue to work with Windows systems, and I encourage you to search online for even more tools that you can use in your work.

Ubuntu Monitoring

YouTube Video

Resources

Video Transcript

Ubuntu also has many useful tools for monitoring system resources, performance, and health. In this video, I’ll cover some of the more commonly used ones.

First and foremost is the System Monitor application. It can easily be found by searching for it on the Activities menu. Similar to the Task Manager on Windows, the System Monitor allows you to view information about all the running processes, system resources, and file systems on Ubuntu. It is a very useful tool for discovering performance issues on your system.

Of course, on Ubuntu there is an entire universe of command-line tools that can be used to monitor the system as well. Let’s take a look at a few of them.

First is top. Similar to the System Monitor, top allows you to see all running processes, and it sorts them to show the processes consuming the most CPU resources first. There are commands you can use to change the information displayed as well. In addition, you may also want to check out the htop command for a more advanced view into the same data, but htop is not installed on Ubuntu by default.

In addition to top, you may also want to install two similar commands, iotop and iftop. As you might expect, iotop shows you all processes that are using IO resources, such as the file systems, and iftop lists the processes that are using network bandwidth. Both of these can help you figure out what a process is doing on your system and diagnose any misbehavior.

To view information about memory resources, you can use either the vmstat or free commands. The free command will show you how much memory is used, and vmstat will show you some additional input and output statistics as well.

Another command you may want to install is iostat, which is part of the sysstat package. This command will show you the current input and output information for your disk drives on your system.

In addition, you can use the lsof command to see which files on the system are open, and which running process is using them. This is a great tool if you’d like to figure out where a particular file is being used. I highly recommend using the grep tool to help you filter this list to find exactly what you are looking for.
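For reference, here is one way to install and try several of these tools; the package names come from the standard Ubuntu repositories, and the lsof filter is only an example:

sudo apt install htop iotop iftop sysstat
free -h                      # memory usage in human-readable units
vmstat 5                     # memory and I/O statistics, refreshed every 5 seconds
iostat -x 5                  # extended per-disk I/O statistics, refreshed every 5 seconds
sudo lsof | grep /var/log    # which processes currently have files open under /var/log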

In Module 3, we discussed a few network troubleshooting tools, such as the ip and ss commands, and Wireshark for packet sniffing. So, as you are working on monitoring your system, remember that you can always use those as well.

Finally, you can also use the watch command on Ubuntu to continuously run a command over and over again. For example, I could use watch tail /var/log/syslog to print the last few lines of the system log, and have that display updated every two seconds. This command is very handy when you need to keep an eye on log files or other commands as you diagnose issues. If you are running a web server, you may also want to keep an eye on your Apache access and error logs in the same manner.

There are also a few programs for Ubuntu that integrate several different commands into a single dashboard, which can be a very useful way to monitor your system. The two that I recommend using are glances and saidar. Both can easily be installed using the APT tool. Glances is a Python-based tool that reads many different statistics about your system and presents them all in a single, unified view. I especially like running it on systems that are not virtualized, as it can report information from the temperature sensors built-in to many system components as well. It even has some built-in alerts to help you notice system issues quickly. Saidar is very similar, but shows a slightly different set of statistics about your system. I tend to use both on my systems, but generally I will go to Glances first.
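Both dashboards can be installed and launched with a couple of commands:

sudo apt install glances saidar
glances    # unified dashboard view; press q to quit
saidar     # lighter-weight statistics view; press Ctrl+C to quit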

As I mentioned a bit earlier, Ubuntu stores a plethora of information in log files stored on the system. You can find most of your system’s log files in the /var/log directory. Some of the more important files there are auth.log which tracks user authentications and failures, kern.log which reports any issues directly from the Linux kernel, and syslog which serves as the generic system log for all sorts of issues. Some programs, such as Samba and Apache, may create their own log files or directories here, so you may want to check those as well, depending on what your needs are. As with Windows, there are many programs that can collect and organize these log files into a central logging system, giving you a very powerful look at your entire infrastructure.
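For example, a couple of quick searches through those files might look like this; the search strings are common patterns rather than an exhaustive list:

grep "Failed password" /var/log/auth.log    # failed SSH login attempts
tail -n 50 /var/log/syslog                  # the most recent general system messages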

Lastly, if you are running systems in the cloud, your cloud provider may also offer monitoring tools for your use. DigitalOcean recently added free monitoring to all droplets, as long as you enable that feature. You can view the system’s monitoring output on your DigitalOcean dashboard under the Graphs option. In addition, you can configure alert policies to contact you when certain conditions are met, such as sustained high CPU usage or low disk space. I encourage you to review some of the available options on DigitalOcean, just to get an idea of what is available to you.

That should give you a quick overview of some of the tools available on an Ubuntu system to monitor its health and performance. There are a number of tools available online, both free and paid, that can also perform monitoring tasks, collect that data into a central hub, and alert you to issues across your system. As part of the lab assignment, you’ll configure a monitoring tool such as Checkmk to discover how those tools work.

Backups

YouTube Video

Resources

Video Transcript

Backups are an important part of any system administrator’s toolkit. At the end of the day, no amount of planning and design can completely remove the need for high quality, reliable backups of the data that an organization needs to function. As a system administrator, you may be responsible for designing and executing a backup strategy for your organization. Unfortunately, there is no “one size fits all” way to approach this task, as each organization’s needs are different. In this video, I’ll discuss some different concepts and techniques that I’ve acquired to help me think through the process of creating a backup strategy.

To begin, I like to ask myself the classic questions that any journalist begins with: “Who? What? When? Where? Why? How?” Specifically, I can phrase these to relate to the task of designing a backup strategy: “Who has the data? What data needs to be stored? When should it be backed up? Where should those backups be stored? Why can data loss occur in this organization? How should I create those backups?” Let’s look at each question in turn to see how that affects our overall backup strategy.

First, who has the data? I recommend starting with a list of all of the systems and people in your organization, and getting a sense of what data each has. In addition, you’ll need to identify all of the servers and network devices that will need to be analyzed. Beyond those systems, there may be credentials and security keys stored in safe locations around the organization that will need to be considered. Finally, especially when dealing with certain types of data, you may also need to consider ownership of the data and any legal issues that may be involved, such as HIPAA for health data, or FERPA for student data. Remember that storing data for your customers doesn’t mean that you have any ownership of that data or the intellectual property within it.

Next, once you’ve identified where the data may be stored, you’ll need to consider the types of data that need to be stored. This could include accounting data and personnel files to help your organization operate smoothly, but also the web assets and user data that may be stored on your website. Beyond that, there is a plethora of data hiding on your systems themselves, such as the network configuration, filesystems, and even the metadata for individual files stored on the systems. At times, this can be the most daunting step once you start to consider the large amount of decentralized data that most organizations have to deal with.

Once you have a good idea of the data you’ll need to back up, the next question you’ll have to consider is when to make your backups. There are many obvious options, from yearly all the way down to instantaneously. So, how do you determine which is right for your organization?

One way to look at it is to consider the impact an issue might have on your organization. Many companies use the terms Recovery Point Objective, or RPO, and Recovery Time Objective, or RTO, to measure these items. RPO refers to how much data may be lost when an issue occurs, and RTO measures how long the systems might be unavailable while waiting for data and access to be restored. Of course, these are goals that your organization will try to meet, so the actual time may be different for each incident.

This timeline shows how they are related. In this example, your organization has created a backup at the point in time labelled “1” in this timeline. Then, some point in the future, an incident occurs. From that point, your organization has an RPO set, which is the maximum amount of data they would like to lose. However, the actual data loss may be greater, depending on the incident and your organization’s backup strategy. Similarly, the RTO sets a desired timeline for the restoration of data and service, but the actual time it takes to restore the data may be longer or shorter. So, when creating a backup strategy, you’ll need to choose the frequency based on the RPO and RTO for your organization. Backups made less frequently could result in a higher RPO, and backups that are less granular or stored offline could result in a higher RTO.

You will also have to consider where you’d like to store the data. One part of that equation is to consider the type of storage device that will be used. Each one comes with its own advantages and disadvantages, in terms of speed, lifetime, storage capacity, and, of course, cost. While it may seem counter-intuitive, one of the leading technologies for long-term data storage is still the traditional magnetic tape, as it already has a proven lifetime of over 50 years, and each tape can hold several terabytes of data. In addition, tape drives can sustain a very high read and write speed for long periods of time. Finally, for some very critical data, storing it in a physical, on-paper format, may not be a bad idea either. For passwords, security keys, and more, storing them completely physically can prevent even the most dedicated hacker from ever accessing them.

Beyond just the storage media, the location should also be considered. There are many locations that you could store your backup data, relative to where the data was originally stored. For example, a backup can be stored online, which means it is directly and readily accessible to the system it came from in case of an incident. Similarly, data can be stored near-line, meaning that it is close at hand, but may require a few seconds to access the data. Backups can also be stored offline, such as on a hard disk or tape cartridge that is disconnected from a system but stored in a secure location.

In addition, data can be stored in a variety of ways at a separate location, called an offsite backup. This could be as simple as storing the data in a secure vault in a different building, or even shipping it thousands of miles away in case of a natural disaster. Lastly, some organizations, such as airlines or large retail companies, may even maintain an entire backup site, sometimes called a “disaster recovery center,” which contains enough hardware and stored backup data to allow the organization to resume operations from that location very quickly in the event of an incident at the primary location.

When storing data offsite or creating a backup site, it is always worth considering how that site is related to your main location. For example, could both be affected by the same incident? Are they connected to the same power source? Would a large hurricane possibly hit both sites? Do they use the same internet service provider? All of these can be a single point of failure that may disable both your primary location and your backup location at the same time. Many companies learned this lesson the hard way during recent hurricanes, such as Katrina or Sandy. I’ve posted the story of one such company in the resources section below this video.

Optimization is another major concern when considering where to store the data. Depending on your needs, it may be more cost-effective or efficient to look at compressing the backup before storing it, or performing deduplication to remove duplicated files or records from the company-wide backup. If you are storing secure data, you may also want to encrypt the data before storing it on the backup storage media. Finally, you may even need to engage in staging and refactoring of your data, where backup data is staged in a temporary location while it is being fully stored in its final location. By doing so, your organization can continue to work with the data without worrying about their work affecting the validity of the backup.

Another important part of designing a backup strategy is to make sure you understand the ways that an organization could lose data. You may not realize it, but in many cases the top source of data loss at an organization is simply user error or accidental deletion. Many times a user will delete a file without realizing its importance, only to contact IT support to see if it can be recovered. The same can happen to large scale production systems, as GitLab experienced in 2017. I’ve posted a link to their very frank and honest post-mortem of that event in the resources section below this video.

There are many other ways that data could be lost, such as software or hardware failure, or the slow corruption of data as sectors on a hard drive degrade over time. You may also have to deal with malicious intent, ranging from insiders trying to sabotage a system to large-scale external attacks. Lastly, every organization must consider the effect natural disasters may have on their data.

Finally, once you’ve identified the data to be backed up, when to create the backups, and where to store them, the last step is to determine how to create those backups themselves. There are many different types of backups, each with their own features and tradeoffs. First, many companies still use an unstructured backup, which I like to think of as “just a bunch of CDs and flash drives” in a safe. In essence, they copy their important files to an external storage device, drop it in a safe, and say that they are covered. While that is definitely better than no backup at all, it is just the first step toward a true backup strategy.

Using software, it is possible to create backups in a variety of ways. The first and simplest is a full backup. Basically, the system makes a complete copy of the data onto a different storage device. While this approach is simple, it also requires a large amount of storage space, usually several times the size of the original data if you’d like to keep multiple versions.
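
For example, a very simple full backup of a single directory tree on Linux might look like the sketch below; the source and destination paths are hypothetical.

```bash
# Full backup: copy everything into a single dated archive.
# /srv/data and /backup are hypothetical paths.
tar -cpzf /backup/full-$(date +%F).tar.gz /srv/data

# Restore later by extracting that archive, for example:
#   tar -xpzf /backup/full-<date>.tar.gz -C /
```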

Next, software can help you create incremental backups. In this case, you start with a full backup, usually made once a week, and then each following day’s backup stores only the data that has changed since the previous day. So, on Thursday, the backup only contains data that has changed since Wednesday. If the data needs to be restored on Friday, the system will need access to Monday’s full backup plus every incremental backup made since then. That means that if any incremental backup in the chain is lost, the system may not be recoverable to its most recent state. However, this method requires much less storage than taking a full backup every day, so it can be very useful.
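
As one possible illustration, GNU tar can produce incremental backups using a snapshot file that records what has already been archived. The paths and weekly schedule below are assumptions made for this example.

```bash
# Start of a new weekly chain: remove the old snapshot file, then take the
# full (level-0) backup. All paths are hypothetical.
rm -f /backup/data.snar
tar --listed-incremental=/backup/data.snar \
    -czpf /backup/full-$(date +%F).tar.gz /srv/data

# Each following day: the same command stores only what changed since the
# previous run, as tracked by the snapshot file.
tar --listed-incremental=/backup/data.snar \
    -czpf /backup/incr-$(date +%F).tar.gz /srv/data

# Restoring requires the full archive plus every incremental archive,
# extracted in chronological order.
```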

Similarly, differential backups can also be used. In this case, each daily backup stores all of the changes since the most recent full backup. So, as long as the full backup hasn’t been lost, the most recent full backup plus the latest differential backup is enough to restore the data. This requires a bit more storage than an incremental backup, but it also provides extra resilience, since a restore depends on only two backup sets instead of an entire chain. It is also possible to take full backups less frequently in this scenario to reduce the overall storage requirements.
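
A simplified way to approximate a differential backup with standard tools is to archive everything modified since the full backup was created. The sketch below uses hypothetical paths; dedicated backup software handles deletions and metadata far more carefully.

```bash
# Weekly full backup (hypothetical paths).
tar -czpf /backup/full.tar.gz /srv/data

# Each day: archive everything modified since the full backup was made,
# using the full archive's own timestamp as the reference point.
tar -czpf /backup/diff-$(date +%F).tar.gz \
    --newer=/backup/full.tar.gz /srv/data

# A restore needs only full.tar.gz plus the most recent differential.
# (Note: this simple check is based on modification times, so it will not
# notice files that were deleted or renamed.)
```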

Finally, some backup systems also support a reverse incremental backup, sometimes referred to as a “reverse delta” system. In this scenario, each day an incremental backup is taken and immediately applied to the existing full backup to keep it up to date. The reverse deltas, the changes needed to undo each update, are stored as well, allowing the full backup to be rolled back to an earlier state if needed. This means the full backup is always the most recent one, making restoration much simpler than with traditional incremental or differential backups. In addition, only the most recent backup is needed to perform a full recovery. However, as with other systems, older versions of the data can be lost if the stored deltas are corrupted or missing.
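
One way to approximate a reverse-delta scheme on Linux is with rsync, which can keep a mirror that always matches the latest data while moving overwritten or deleted files into a dated directory. The paths below are hypothetical, and dedicated tools such as rdiff-backup implement this idea more completely.

```bash
# Keep /backup/mirror/ identical to the live data; anything that would be
# overwritten or deleted is moved into a dated reverse-delta directory.
rsync -a --delete \
      --backup --backup-dir=/backup/deltas/$(date +%F) \
      /srv/data/ /backup/mirror/

# /backup/mirror/ is always the most recent full copy, so a simple restore
# only needs that directory; the folders under /backup/deltas/ hold what is
# needed to step back to older versions.
```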

Lastly, there are a few additional concerns around backup strategies that we haven’t discussed in this video. You may need to consider the security of the data stored in your backup location, as well as how to validate that each backup was created properly. Many organizations regularly perform practice restores, just to make sure the recovery process actually works.

When creating backups, your organization may have a specific backup window, such as overnight or during a planned downtime. However, it may be impossible to complete the backup quickly enough within that window. Alternatively, you may need to run backups while the system is in use, which can have a significant performance impact. In either case, you may need to consider staging the backup data first.

Also, your organization will have to consider the costs of buying additional storage or hardware to support a proper backup system. Depending on the estimate used, you may need as much as four times your current storage capacity to maintain proper backups. Lastly, as we discussed earlier, you may need to consider how to make your systems available in a distributed environment in case of a large-scale incident or natural disaster.

Of course, don’t forget that backups are just one part of the picture. For many organizations, high availability is also a desired trait, so your system’s reliability design should not consist of backups alone. By properly building and configuring your infrastructure, you can buy yourself valuable time to restore from backups, or even to acquire new hardware, in the case of a failure.

Hopefully this overview of how to create a backup strategy has given you plenty of things to consider. As part of the lab assignment, you’ll work with a couple of different backup scenarios in both Windows and Ubuntu. I also encourage you to review your own personal backup plan at this point, just to make sure you are covered. It is often easy to analyze these issues in someone else’s environment, but much harder to spot the gaps in our own unless we consciously look for them.

Windows Backups

YouTube Video

Resources

These resources mostly refer to Windows Server 2012 or 2016, but should work for 2019 as well.

Video Transcript

Windows includes several tools for performing backups directly within the operating system. In this video, I’ll briefly introduce a few of those tools.

First is the Windows File History tool. You can find it in the Settings app under Update & Security. Once there, choose the Backup option from the menu on the left. To use File History, you’ll need to have a second storage location available. It can be another internal drive, another partition on the same drive, or an external hard drive or flash drive. Once it is configured, File History will back up any files that have changed on your system as often as you specify, and it will keep those copies for as long as you’d like, provided there is enough storage space available. By default, it only backs up files stored in your user’s home directory, but you can easily include or exclude additional folders as well.

Right below the File History tool is the Backup and Restore tool from Windows 7. This tool allows you to create a full system image backup and store it on an external drive. This is a great way to create a single, complete backup of your system if you’d like to keep one in a safe location. However, since Windows 10 includes options to refresh your PC if Windows has problems, a full system image is less important to have than it used to be.

However, one tool you may want to be familiar with is the System Restore tool. Many versions of Windows have included this tool, and it is a tried and true way to fix some issues that are caused by installing software or updates on Windows. You can easily search for the System Restore tool on the Start Menu to find it. In essence, System Restore automatically creates restore points on your system when you install any new software or updates, and gives you the ability to undo those changes if something goes wrong. System Restore isn’t a full backup itself, but instead is a snapshot of the Windows Registry and a few other settings at that point in time. If it isn’t enabled on your system, I highly recommend enabling it, just for that extra peace of mind in case something goes wrong. As a side note, if you’ve ever noticed a hidden folder named “System Volume Information” in Windows, that folder is where the System Restore backups are stored. So, I highly recommend not touching that folder unless you really know what you are doing.

In the rare instance that your operating system becomes completely inoperable, you can use Windows to create a system recovery drive. You can then use that drive to boot your system, perform troubleshooting steps, and even reinstall Windows if needed. However, in most instances, I just recommend keeping a copy of the standard Windows installation media, as it can perform most of those tasks as well.

Windows also has many tools for backing up files to the cloud. For example, Microsoft’s OneDrive client is built in, which allows you to automatically keep a copy of files from your system in the cloud as well. While it isn’t a true backup option, it at least gives you a second copy of some of your files. There are many third-party tools that perform this same function as well.

Finally, Windows Server includes the Windows Server Backup tool, which is specifically designed for the types of data that might be stored on a server. You can use this tool to create a backup of your entire server, including the Active Directory Domain Services data. Losing that data could be catastrophic to many organizations, so it is always recommended to have proper backups configured on your domain controllers. As part of the lab assignment, you’ll use this tool to back up your Active Directory domain, and then use that backup to restore an accidentally deleted entry, just to make sure everything is working properly.

Of course, there are many other backup tools available for Windows, both free and paid, that offer a variety of different features. If you are creating a backup strategy for your organization, you may also want to review those tools to see how well they meet your needs.

Ubuntu Backups

YouTube Video

Resources

Video Transcript

There are many different ways to back up files and settings on Ubuntu. In this video, I’ll discuss a few of the most common approaches to creating backups on Ubuntu.

First, Ubuntu 18.04 includes a built-in backup tool, called “Déjà Dup,” that will automatically create copies of specified folders on your system in the location of your choice. It can even store them directly in the cloud, using either Google Drive or Nextcloud as the storage location, as well as in local or network folders. This feature is very similar to the File History feature in Windows 10.

In addition, many system administrators choose to write their own scripts for backing up files on Ubuntu. There are many different ways to go about this, but the resources from the Ubuntu Documentation site, linked below the video, give some great example scripts you can start with. You can even schedule those scripts using Cron, which is covered in the Extras module, to automatically perform backups of your system.

When backing up files on your system, it is very important to consider which files should be included. On Ubuntu, most user-specific data and settings are stored in that user’s home folder, though many of them are included in hidden files, or “dotfiles.” So, you’ll need to make sure those files are included in the backup. In addition, most system-wide settings are stored in the /etc folder, so it is always a good idea to include that folder in any backup schemes. Finally, you may want to include data from other folders, such as /var/www/ for Apache websites.
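
As a simple illustration of the kind of script and cron schedule mentioned above, the sketch below archives those common locations into a dated file. The destination path and script name are placeholders, and a real script would also need rotation and error handling.

```bash
#!/bin/bash
# Hypothetical backup script: archive common locations into a dated file.
DEST="/backup"          # placeholder destination
DATE=$(date +%F)

# Include home directories (dotfiles are picked up automatically),
# system-wide settings in /etc, and web content in /var/www.
tar -cpzf "$DEST/system-$DATE.tar.gz" /home /etc /var/www

# Example crontab entry to run this script every night at 2:00 AM:
#   0 2 * * * /usr/local/bin/backup.sh
```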

If you are running specific software on your system, such as MySQL for databases or OpenLDAP for directory services, you’ll have to consult the documentation for each of those programs to determine the best way to back up that data. As part of this module’s lab assignment, you’ll be creating a backup of a MySQL database to get some experience with that process.
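
For instance, a MySQL database is often dumped with the mysqldump utility before being picked up by the regular file backup; the database name and user below are placeholders for this example.

```bash
# Dump a single database to a dated SQL file; "exampledb" and "backupuser"
# are hypothetical. --single-transaction gives a consistent snapshot of
# InnoDB tables without locking them.
mysqldump --single-transaction -u backupuser -p \
    exampledb > /backup/exampledb-$(date +%F).sql

# To restore, feed the dump back into the server:
#   mysql -u backupuser -p exampledb < /backup/exampledb-YYYY-MM-DD.sql
```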

There are also some tools available for Ubuntu to help create backups similar to the System Restore feature of Windows. One such tool is TimeShift. I’ve linked to a description of the tool in the resources section below the video if you’d like to know more.

Finally, as with Windows, there are a large number of tools, both paid and free, available to help with creating, managing, and restoring backups on Ubuntu and many other Linux distributions. As you work on building a backup strategy for an organization, you’ll definitely want to review some of those tools to see if they adequately meet your needs.

That concludes Module 7! As you continue to work on the lab assignment, feel free to post any questions you have in the course discussion forum on Canvas. Good luck!