Compression - Run Length Encoding

Resources

Slides

Video Script

In this module, we will spend a lot of time talking about compression and error checking in our computers. We’ve already talked about compression a little bit in the data encoding module earlier this semester. But now we’re going to spend an entire module talking about just these two topics. First, we’re going to talk about in compression. What do you think compression is and why do you think it’s useful? As you might recall, compression is reducing the amount of space a particular piece of data needs to be stored on a computer. And if we think about that, that may not even really make sense, how can we take the piece of data and store it in fewer Bits and Bytes than what it originally was? It turns out, there are a lot of different ways to do that. We can do all sorts of different things by replacing repeated chunks of data or re-encoding the data in such a way that we can interpret it differently in our computer systems. And we’ll take a look at that in this module.

Why do you think we would do that? Well, computer resources are very expensive. For example, think about the storage space in your computer. While you might have a couple of terabytes. If you’ve ever played video games, or downloaded a lot of videos, you’ll find that that space gets eaten up very, very quickly by those large files. And so we can use compression to reduce the amount of space that those files take up and store them in fewer hard drives than it would take to store the raw files. Likewise, we can think about transmission bandwidth across the internet. If we need to send a large file across the internet. If our internet connection is very slow, it might be better to compress the file before we send it and then allow the other user on the other end to decompress that file when they receive it. However, of course, this does create a trade off, we’re spending computation time to compress and decompress the data in hopes of saving transmission time in storage space when we actually store or transmit the file. And so depending on the amount of computational power we have, and the amount of transmission or storage space we have available, it may or may not make sense to compress data. But in general, compression is usually a good thing and helps us save more data in a smaller amount of space, and transmit lots of data very quickly across the internet.

So let’s take a look at one example of compression called run length encoding. In run length encoding, we’re looking for runs or repeated sequences in our data. And then we’re going to going to replace them with a shorter version of that data in our actual store data itself. Usually that shorter version would be the sequence that we’re repeating, and account of how many times it gets repeated. So let’s take a look. Here’s an example of data that we could send via run length encoding, we’re using characters and recall of course, the characters can be stored as binary values using ASCII the American Standard Code for Information Interchange. So while we’re representing this data as text, it would actually be stored and transmitted as binary on our computers.

So in run length encoding, we would look for repeated characters and try and replace them with shorter versions of themselves. For example, if we look at these W’s at the top, we can see that they’re an awful lot of them. So let’s count them, we have 1 2 3 4 5 6 7 8 9 10 11 12, there are 12 W’s. So in our in our data, we would say that we have 12 W, then we have one B, so we just write one B, then we have 12 more W’s. So if you count those out, there will be 12, W’s, and so on. And so with run length encoding, we would be able to count the number of each characters and put a number followed by the character that shows up. So if we encode all of this data with run length encoding, we would get 12 W, 1 B, 12 W, 3 B, 24 W, 1 B, 15 W.

As we talked about earlier, the characters W and B can also be stored as binary. And of course, these numbers can be stored as binary. So it should be pretty easy to figure out how we can convert this particular string of data into a binary string that we would send along with our data.

But there might be a problem. If we think about all of that being one long string of binary numbers, how could we tell which parts of it are the numbers and which parts of it are text. And remember, the text itself is just numbers as well. So maybe we need to come up with some sort of a scheme that allows us to structure that data so that we know where the numbers are and where the text is. So let’s take a look at that.

Here’s that same example again, but how do you think we could write it so that we can easily tell which ones are the text and which ones are the numbers? Take a minute to think about that before you move on. A better way to think about this would be to use escape sequences. So in escape coding, what we would do is we would have a double character representing an escape sequence. For example WW; the WW being a repeated character would reduce down to just a single W. But then afterwards, we would have a number. And so looking at this bottom, we see WW followed by 12. The ww is our escape sequence saying we are going to repeat the letter W. And the next bit of binary data is going to be the number of times we repeat that character. So we see WW 12. Then we see just one B. And because B is not repeated, we just assume that that is text.

Next, we have WW again. And since that’s a repeated character, that will tell us that we’re going to repeat W and the next bit of data is going to be binary, that is the number of times we repeat that character. So then we have WW 12. Now we have three B’s, so we have BB 3. Then we have WW 24, a single B, then WW 15. So if you look at this, that is a little bit longer than the example we saw earlier, because we have to double some of the characters. But this example is much easier to read for our computer. We know that every character is text, except if we see the same character twice. And then we will know that the eight bits after that represent a number that we could use to repeat that character that many times. So now that we’ve seen an example of run length encoding, we’re going to let you give it a try in the next quiz.