Map Reduce

Resources

Slides

Video Script

One of the last things I want to talk about here with big data is some of the algorithms that we can actually use to work with this sheer amount of information. So one that I want to highlight here is called MapReduce. Oh, MapReduce is a very well known algorithm and big data realm. And now it’s been, of course, transformed significantly, since its original inception, inception, to handle even larger amounts of information. But the idea here is that we take a very large amount of information, let’s say text, and then we map it to smaller parts, break it out, and then we recombine that and to produce a final result. Okay. And so if you can use this as an example, right, if we’re trying to, let’s say, sort of deck of cards, okay, if I asked if I give you a whole bunch, or if I give you a full deck of cards that is completely shuffled, but I want you to sort it out in numerical order, as well as the suits would actually take you a little bit of time to actually achieve that task. But if I were allow it, if I gave you a deck of cards in a group, well, so let’s say I gave a group of 10 people, one single deck of cards, and I said, sort do the same thing, it will take them significantly less time than it will if I gave just one person a deck of cards to actually achieve that end result.

So that’s the idea of MapReduce, we partition our information out into very small parts. And then each of those small parts has the same task done to it. And once that task has been executed on the small parts, all of the end results are then combined to produce the final results. So let’s take a look at another example here with word count, which is pretty a real classic example of how MapReduce works. So our input here is a very simple section of texts. So a bunch of different words, dear bear, river, car, car, river, deer car bear. And so you can imagine this being a very large book or something like that. And we want to count the count the occurrences of each word in our in our data set. So first thing that we do here, let’s split this data out. So let’s say that each line of text here is our initial split. So deer, Bear River, car, car, river, and deer car bear. So we have these three, these three data sets that are that are our big data set has been split into these individual data sets. Where each the key value we have key value pair, where the key is this as a document, the value is the the text that we actually contain. So each of these documents here are then going to be mapped to a task. And our task here is to count the word occurrences.

So in this mapping task, I’m going to map each word to a number. So each word of course, and individualized is only going to occur once. So deer occurs once bear occurs once and river occurs once. The key here is going to be word and the value here is going to be of course, the word count. As you can see down here in this middle example, where we have two cars, that’s okay, because it’s individual tasks, remember, so each car is still going to be one key value pair here. Because the important part actually comes in the next step. And the next few steps here. So we’re actually going to shuffle this out on the shuffling process is going to take care of essentially sorting the result of our mapping process. Because once it’s actually sorted, it’s a lot easier to easier to actually reduce and combine. So when we actually shuffle all of the bears and get put in one bin, all the cars get put in one bin, all the deer get put in one bin and all of the rivers get put in one bin.

And then all that happens here is actually the reducing so we actually combining one more step actually combining the information. So we sum the word counts. So bear occurred twice, Parker three, deer two and river two. So we’ve taken all of the individual words here, counted them and sort of the mount and summed them. And then the final reduce phase is we’ve combined this all back into a single list, where the key is the word and the value is the total word count over the entire day. To set. But you can imagine this to be significantly faster than having a one single process or one single out or one single computer doing this, we can use this on things like balcatta, a distributed computer system where we can throw a split the data set up and onto a lot of different processors and have each processor each thread actually execute the mapping, shuffling and reducing task. And then they all come in back together at the end to form the final results. But this is just one of the big data algorithms out there. There’s obviously a significant amount of other types of techniques and algorithms and tasks out there, that big data can actually accomplish. So we’ve just scratched the surface here. But if you’re interested in learning more, please reach out and we happy to actually connect to you with more resources.