Chapter 20

Big Data

Image Source: theconversation.com

Hans Rosling's 200 Countries, 200 Years, 4 Minutes - The Joy of Stats - BBC Four

YouTube Video

Introduction

YouTube Video

Resources

Slides

Video Script

Welcome back, everyone. In this video series, we're going to take a look at big data. So exactly how big is big data? Here's a slide of the metric prefixes. Which ones do you think count as big data? Today, we're talking about data anywhere from gigabytes to petabytes and beyond, and in the future we may even be dealing with exabytes of data. I said gigabytes, right? A single video, even one in this lecture series, might be a gigabyte's worth of data, and in the grand scheme of things that's just one video. But if you break down that gigabyte, you can start to chunk it up into lots of different parts. If we're recording in 4K, for example, there is a significant number of frames per second in the video, along with a significant number of pixels in each frame, and then you tack on the sound information as well. So a significant amount of information can be packed into a single gigabyte. Big data deals with three questions: how do we actually store all of this information, how do we use it, and how do we get information out of it? We can have lots of data, but without algorithms or ways of presenting it, that data becomes pretty much useless at the end of the day.
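
To get a feel for how quickly video fills a gigabyte, here is a rough back-of-the-envelope sketch in Python. The resolution, color depth, and frame rate are assumptions picked for illustration (3840×2160 pixels, 3 bytes per pixel, 30 frames per second), and the figure is for uncompressed video with no audio. Real recordings are compressed and come out much smaller.

```python
# Rough estimate of uncompressed 4K video size.
# All constants below are illustrative assumptions, not values from the lecture.
WIDTH, HEIGHT = 3840, 2160      # pixels in one 4K UHD frame
BYTES_PER_PIXEL = 3             # 24-bit color, no alpha channel
FRAMES_PER_SECOND = 30

GIGABYTE = 10**9                # decimal (SI) gigabyte


def raw_video_bytes(seconds: float) -> int:
    """Bytes needed to store `seconds` of uncompressed video (no audio)."""
    bytes_per_frame = WIDTH * HEIGHT * BYTES_PER_PIXEL
    return int(bytes_per_frame * FRAMES_PER_SECOND * seconds)


def seconds_per_gigabyte() -> float:
    """How many seconds of uncompressed video fit in one gigabyte."""
    return GIGABYTE / raw_video_bytes(1)


if __name__ == "__main__":
    print(f"One second of raw video: {raw_video_bytes(1):,} bytes")
    print(f"Seconds of raw video per gigabyte: {seconds_per_gigabyte():.2f}")
```

Under these assumptions a gigabyte holds only a little over one second of raw footage, which is why compression matters so much for streaming services.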

So where is all of this data coming from? A lot of what we come across in the current age is Web 2.0 stuff: social media and video streaming services. If we look at what happens online every minute in 2017 versus 2018, we can see a lot of changes. We have 4.1 million videos watched on YouTube versus 4.3 million in the second year, so not a huge jump. But the popularity of Netflix drastically increased from 2017 to 2018, from 70,000 hours to 266,000 hours of video watched every single minute. Likewise with things like Snapchat, Twitter, Facebook, email, all sorts of different things. We can see how this evolves over a longer period by comparing 2016 to 2019, which is the previous year. We went from 2.78 million views on YouTube to four and a half million, and from 700,000 logins on Facebook to 1 million. The number of Google searches has drastically increased, from 2.4 million to 3.8 million. For Netflix, we're up to almost 700,000 hours watched per minute. Things like online shopping have also drastically increased, along with sharing of pictures on things like Instagram and Snapchat, and music, especially through smart home devices. Even things like Twitch are starting to gain significant popularity as a mainstream streaming service. As you can imagine, the sheer amount of information that streams online every 60 seconds is quite mind-boggling.

How do we actually build infrastructure that can support streaming information at that scale, live and in high definition, and quickly? We don't want to have to sit and wait five hours to download a video like we had to in the late 90s and early 2000s. So how do we create an infrastructure that can support this type of information? And how do we create software that allows a normal user to interact with that data in a meaningful way?

And that's really where the big data stack comes into play. We can use all of these tools and techniques not only to store information, but also to view it. This creates a stacked approach to managing the data, where the bottom layer represents all of the providers of that information. If we look here, we have things like MySQL, Postgres, and Hadoop. These are databases and data storage technologies, and some of them, like Hadoop, are specialized for big data: performing operations on very large amounts of information very quickly. We have speed on one axis and scale on the other. So how fast is a particular technology at working with information, and how much information can it work with? On the scale axis we go from megabytes to petabytes. On the speed axis we go from batch processing, meaning we can process data on demand but it takes a little bit of time to produce a result, to real time, meaning that when we ask for our data, it is there nearly instantaneously and we don't have to wait for it to be produced.

The middle of our stack is where all of the analysis of that information happens. The bottom is where the data is actually stored and provided to the analysis layer, where we use packages like SciPy and NumPy, which are Python packages for scientific analysis, as well as Mahout, which is a machine learning library, and all sorts of other information processing libraries that live at this layer. This is where we do things like machine learning and artificial intelligence. We crunch the data and transform it to prepare it to be displayed to the user, which is done at the service layer.

The service layer at the top is what we all primarily interact with online. When we go to Amazon and see those recommended items, or open up Google News, Flipboard, Pinterest, or any of those curated websites and mobile apps, the news articles and items presented to us are largely curated for our viewing and reading interests. So how do we take all of the news articles in the entire world and collapse them into smaller, bite-sized chunks that you can consume as a user? How do you get to the data that you actually want to consume? That's where these services come into play: things like news curation, weather forecasting, pricing of items online, recommendation of items online, and even online reputation. So how do you gain reputation or fame on the internet for something that's trending on YouTube, Twitter, Instagram, or something like that?

But where is all of this going? We talked about some of the services that are provided using big data, and a lot of this is primarily used for business. As users we consume a significant amount of this information, but under the hood a lot of it is driven by business objectives. One is customization of services. With Web 2.0, nearly everything we use online is a service tailored to you as the user. If you are logged in to Google, the online searches you make and the websites you visit are being tracked, and that leaves a digital footprint. That digital footprint allows websites and companies to provide a service tailored to you specifically. It also allows companies to react to market trends a lot more easily, so they can see what's trending at the moment and make business decisions based on that information. There are also real-time optimizations for identifying costs and making more accurate decisions, and overall better R&D, right, better research and development processes, because we have all of this information available to us and can crunch it in a meaningful way to make better decisions for our company. All of this relates back to the big data stack that we saw before, where we provide information through things like Hadoop, and we'll talk about MapReduce later. There are a lot of other technologies listed there, such as ETL, which stands for extract, transform, load.

So how do we take the raw data that's being generated by things like smart devices and transform it in a way that is easily searchable, queryable, and presentable to users? Then, in the middle area, how do we analyze that data? That's things like Mahout, SciPy, Hive, and a lot of machine learning libraries and statistical software. And then we have the real business objectives, the tasks, such as predictive modeling and sentiment analysis. How do you detect whether a tweet is positive or negative in terms of sentiment? Is this user tweeting negatively or very positively about my product? These are some of the many things we can achieve using big data techniques and technology to transform that information. Great examples of services built this way are Google Analytics and Google Trends.
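
The sentiment question above can be sketched with a very small word-counting approach. This is only a toy illustration: the word lists are hand-picked assumptions rather than a real sentiment lexicon, and production systems typically rely on trained models instead of fixed lists.

```python
import re

# Tiny hand-picked word lists, purely illustrative (an assumption, not a real lexicon).
POSITIVE = {"love", "great", "good", "awesome", "happy"}
NEGATIVE = {"hate", "bad", "terrible", "broken", "awful"}


def sentiment_score(text: str) -> int:
    """Number of positive words minus number of negative words in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)


def label(text: str) -> str:
    """Turn the score into a coarse label."""
    score = sentiment_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


if __name__ == "__main__":
    for tweet in ["I love this product, it is great",
                  "This update is terrible and broken",
                  "It arrived on Tuesday"]:
        print(f"{label(tweet):>8}: {tweet}")
```

A counter like this misses negation ("not good") and sarcasm, which is part of why sentiment analysis is treated as a machine learning problem in practice.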

Google Trends is actually one of my personal favorites, and I'll attach the link in Canvas. Google Trends allows you to see what's trending on Google, as far as who's searching for what. A really interesting thing to look at is Year in Search, where you can see what was trending in past years. So out of all of last year, what were the top five or top 10 things that people searched for online? You can look up things like holidays, politics, gaming, music, movies, you name it, and you can find and compare what was being searched for the most on Google in the previous year, or even right now. And think about the sheer number of people that use Google: if we look at last year, 2019, there were 3.8 million Google searches per minute. So every 60 seconds, 3.8 million searches on Google. How do we transform that into something usable? That's all information that we can look at, rely on, and use to make decisions about what's currently popular, and so forth.

Another interesting service of that nature is healthmap.org. HealthMap is a website that was originally a research project. What they do is collect news articles from across the web, consume them, and look for outbreaks: viruses, diseases, things like that. They look at where those news articles are being written and try to geolocate the outbreaks on a map in order to track them. Taking things like news articles and mapping them by geolocation is called thematic mapping. As I mentioned, this site tracks disease outbreaks mentioned online by location and provides a map showing the current outbreaks across the country. You can also look around the world a little bit as well; I'm just showing a screenshot of the US here. This screenshot is from back in 2015, when we had a big outbreak of bird flu. You can see, for example, that these are articles written in Leavenworth County that identified an outbreak of bird flu in that region. The color and the size of the dots indicate the size and severity of that outbreak. We can even compare that to current events with the COVID-19 crisis, where we can see a lot of dots across the US, and certain regions that have higher concentrations of COVID-19 cases reported in the news.
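
The aggregation behind a map like this can be sketched in a few lines. The records below are made-up placeholders, not real outbreak data, and the function names and marker-size rule are assumptions for illustration. How HealthMap actually processes articles is not described in this lecture.

```python
from collections import Counter

# Hypothetical (location, disease) pairs that an article scraper might produce.
# These are made-up placeholders, not real outbreak data.
reports = [
    ("County A", "bird flu"),
    ("County A", "bird flu"),
    ("County B", "bird flu"),
    ("County C", "measles"),
    ("County A", "bird flu"),
]


def mentions_by_location(records, disease):
    """Count how many articles mention `disease` in each location."""
    return Counter(location for location, d in records if d == disease)


def marker_size(count, base=4, step=2):
    """Map a mention count to a dot radius so busier locations draw larger."""
    return base + step * count


if __name__ == "__main__":
    counts = mentions_by_location(reports, "bird flu")
    for location, count in counts.most_common():
        print(f"{location}: {count} articles, dot radius {marker_size(count)}")
```

A real pipeline would also need to extract the location and disease from free text and convert place names to coordinates, which are the harder parts of the problem.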

So this is a really interesting way to consume news articles in order to track outbreaks of diseases. There have also been very similar use cases of thematic mapping with Twitter data, specifically tracking the impacts of natural disasters: things like hurricanes, tornadoes, and earthquakes. There are a lot of different uses of big data, as you can imagine, and we've only shown a couple of them. A lot of the big topics are listed here, though, starting with topic modeling, where we analyze the text of an article, for example, to determine what it is talking about. Say I'm given three articles: one is about a sports game, one is about a sports video game, and one is about a sports movie. How can a computer tell the difference between the topics? They're all about sports. So how do we write an algorithm to determine which article is about the video game, which is about the actual game, and which is about the movie? Topic modeling helps us teach computers to recognize what a section of text is about. That is hard because having a computer understand natural language, whether written or spoken, is an extremely difficult task. That's what NLP, or natural language processing, is trying to accomplish: how does a computer understand written and spoken words? We've already talked about analytics and data forecasting.
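
As a rough illustration of the three-article problem, the sketch below guesses a topic by counting cue words. To be clear, this is plain keyword matching, which is far simpler than real topic modeling (where topics are learned from the text itself), and the cue-word lists are hypothetical.

```python
import re

# Hypothetical cue words for each topic (an assumption for illustration only).
TOPIC_KEYWORDS = {
    "sports game": {"stadium", "referee", "season", "coach", "playoffs"},
    "sports video game": {"console", "controller", "graphics", "multiplayer", "developer"},
    "sports movie": {"director", "actor", "theaters", "screenplay", "premiere"},
}


def guess_topic(text: str) -> str:
    """Return the topic whose cue words overlap most with the text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    scores = {topic: len(words & cues) for topic, cues in TOPIC_KEYWORDS.items()}
    # On a tie, max() keeps the first topic in dictionary order.
    return max(scores, key=scores.get)


if __name__ == "__main__":
    print(guess_topic("The coach praised the referee after the playoffs"))
    print(guess_topic("The developer improved the graphics on console"))
    print(guess_topic("The director and lead actor attended the premiere"))
```

All three sample sentences are "about sports", yet none of them contains the word, which hints at why the surrounding vocabulary is what separates the topics.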

That's things like Google Trends. We've talked about sentiment analysis and crowdsourcing as well. So is your brand getting good reviews? How do we tell if a comment or a tweet is positive or negative? Can you use that to figure out what the problem is? A lot of companies are really great at this; others, probably not so much. A lot of work with big data also comes down to information visualization. How do you make sense of this large amount of data? One way is to visualize it so people can easily understand it. And this isn't like making a simple x-y chart or graph in algebra, or even using Excel. These are large amounts of information. So how do we transform that into a visualization that makes sense and lets us extract the information we need, or the interesting information that exists in it? And we've already talked about thematic mapping, where we map data items by location and across time to understand what's going on in the world around us.

HealthMap.org

HealthMap

Google Trends

The Four V's

YouTube Video

Resources

Slides

Video Script

To really understand how big data can be useful, we can look at four different aspects that are usually referred to as the four V's of big data. These are volume, how much data there is; variety, how many different kinds of data exist; velocity, how fast data is being produced or received; and veracity, how accurate that data is. There's also a fifth V, typically referred to as value. We won't include it in this lecture, but if you look at big data information online, you might see that fifth V out there. Let's take a deeper dive into each one of these.

Volume, as I mentioned, deals with the scale of information. So how much is there? This graphic is from a few years ago, but it mentions that by 2020 there will be 40 zettabytes of data, or 43 trillion gigabytes. That's an increase of over 300 times from 2005. Also, 6 billion people have cell phones, which is an insane number of cellular devices. And think about how many of those are smartphones, which generate a significant amount of information. It's just amazing how much information we are generating nowadays, especially with so many internet-connected devices: things like smartwatches, smart cars, and even semi-smart cars that aren't truly self-driving but have electronics in them that generate information. Businesses have also increased the sheer amount of data that they deal with, because 20 or 30 years ago there really wasn't the means to generate this level of data, or to store it. Now we basically store everything. Storage has become significantly cheaper than it was in the past, the technology has gotten a lot better, and we can store a lot more information in a much smaller amount of space.
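
The zettabyte-to-gigabyte conversion can be checked with a short sketch. With decimal (power-of-1000) prefixes, 40 zettabytes is exactly 40 trillion gigabytes; the "43 trillion" figure appears to match binary (power-of-1024) units instead. That reading is my assumption, since the graphic does not say which convention it uses.

```python
# Unit sizes in bytes. Decimal (SI) prefixes use powers of 1000;
# binary prefixes use powers of 1024.
GB_DECIMAL = 1000**3
ZB_DECIMAL = 1000**7
GIB_BINARY = 1024**3
ZIB_BINARY = 1024**7


def zettabytes_to_gigabytes(zb: float, binary: bool = False) -> float:
    """Convert zettabytes to gigabytes using decimal or binary prefixes."""
    if binary:
        return zb * ZIB_BINARY / GIB_BINARY
    return zb * ZB_DECIMAL / GB_DECIMAL


if __name__ == "__main__":
    decimal = zettabytes_to_gigabytes(40)
    binary = zettabytes_to_gigabytes(40, binary=True)
    print(f"Decimal prefixes: {decimal / 1e12:.2f} trillion gigabytes")
    print(f"Binary prefixes:  {binary / 1e12:.2f} trillion gigabytes")
```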

Variety of data is really important as well. What kinds of information are we actually generating? We already saw what happens in an internet minute in a previous video. So where does that come from? Internet-connected devices: smartwatches, smartphones, and the IoT devices inside your house, like internet-connected lights. Social media is a huge one. Nearly everyone is using some form of social media, whether it be Facebook, Twitter, Instagram, or Snapchat, along with streaming services like YouTube, Twitch, Netflix, and Hulu. And then we also have health care. 20 or 30 years ago, health care was still all pen and paper. When you went to the doctor's office, they came in with your file on a clipboard and filled out all of your information. Now, the majority of the time, they're going to come in with a laptop instead of a clipboard. Your chart and all of your health information are now online, or at least digitized instead of being on physical paper. Health care data alone was estimated at 150 exabytes, and that was in 2011. Due to our reliance on technology, and the fact that pretty much everyone is online almost all the time during the day, we're generating a huge amount of information in a variety of different contexts. And that provides a lot more interesting use cases for the information we're generating online and in our daily routines.

The third V of big data is velocity. Velocity deals with the speed at which we're generating and retrieving information. For example, the New York Stock Exchange deals with one terabyte of trade information during each trading session. By 2016, there were an estimated almost 19 billion network connections, or almost two and a half connections per person on Earth. And if you stop to think about how many devices you currently own that are connected to the internet at any given point in time, it's a lot. When I was a kid, we might have had one computer that was connected intermittently to the internet through a dial-up connection. Now I have a smartphone, a smartwatch, and Alexa devices in my house that are always online; my TV has a wireless internet connection, and gaming devices have an internet connection. So the number of connections that you have to the outside world has increased drastically, and so has the speed of your internet connection. Before, with dial-up, there was only so much information that could be exchanged every second. Now, with high-speed internet access, many more people are able to generate and consume significantly more information than we ever could before. And that's not just the typical devices. Cars are generating more information than they ever have as well. Even if you don't own a smart car, most modern cars come equipped with far more sensors, which can transmit information about your vehicle back to the manufacturer, or just tell you more about your vehicle than you would have known before.
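
To put the dial-up comparison in numbers, here is a small sketch of transfer time versus connection speed. The two link speeds are assumptions chosen for illustration (a 56 kbit/s modem and a 100 Mbit/s broadband connection), and the calculation ignores protocol overhead and latency.

```python
ONE_GB = 10**9                 # bytes in a decimal gigabyte
DIAL_UP_BPS = 56_000           # assumed 56 kbit/s dial-up modem
BROADBAND_BPS = 100_000_000    # assumed 100 Mbit/s broadband connection


def transfer_seconds(size_bytes: float, bits_per_second: float) -> float:
    """Ideal time to move `size_bytes` over a link, ignoring overhead."""
    return size_bytes * 8 / bits_per_second


if __name__ == "__main__":
    dial_up_hours = transfer_seconds(ONE_GB, DIAL_UP_BPS) / 3600
    broadband_seconds = transfer_seconds(ONE_GB, BROADBAND_BPS)
    print(f"1 GB over dial-up:   about {dial_up_hours:.1f} hours")
    print(f"1 GB over broadband: about {broadband_seconds:.0f} seconds")
```

Under these assumptions the same gigabyte takes well over a day on dial-up and under two minutes on broadband.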

The last V that we'll be talking about today is veracity, which deals with the uncertainty of information. This is a really important one, because we have a significant amount of information and a lot of business revolves around it. Recommender systems on Amazon, Google searches, the digital footprint that you leave online: all of that information is consumed in some way, shape, or form, whether by the company whose service you used to generate it, or by you personally as a user, for things like your health with smartwatches, Fitbit trackers, and that sort of thing.

Is that data accurate? Is that data valid? That's where we're dealing with the veracity, or uncertainty, of that information. And there are some really interesting numbers here: one in three business leaders don't trust the information that they use to make decisions. A little over a quarter of respondents in one survey were unsure how accurate their data actually was. This costs a significant amount of money: it's estimated that poor data costs the US about $3.1 trillion a year. Think about the number of business decisions that are made using this data, or the number of services you use that rely on big data techniques. A lot of business revolves around that, and if the data is incorrect or invalid, that's money lost. It can cause a lot of controversy, and in some cases, in areas like health care, it can actually cost lives as well. So the veracity, or certainty, of data is an extremely important aspect of how we work with big data: how we store it, how we retrieve it, and how we analyze it.
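
One simple way to guard against bad data is a plausibility check before analysis. The sketch below separates heart-rate readings, such as a fitness tracker might report, into plausible and suspect values. The function name and the 30 to 220 beats-per-minute range are assumptions for illustration, not clinical guidance.

```python
def split_valid(readings, low=30, high=220):
    """Separate heart-rate readings (beats per minute) into plausible and suspect.

    The default range is an illustrative assumption, not a medical standard.
    """
    valid, suspect = [], []
    for reading in readings:
        if isinstance(reading, (int, float)) and low <= reading <= high:
            valid.append(reading)
        else:
            # Missing values, wrong types, and out-of-range numbers all land here.
            suspect.append(reading)
    return valid, suspect


if __name__ == "__main__":
    good, bad = split_valid([72, 0, 65, None, 400])
    print("Plausible:", good)
    print("Suspect:  ", bad)
```

A range check only catches obviously impossible values. A reading can be inside the range and still be wrong, which is why veracity remains an open problem.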

Map Reduce

YouTube Video

Resources

Slides

Video Script

One of the last things I want to talk about with big data is some of the algorithms that we can use to work with this sheer amount of information. The one I want to highlight is called MapReduce. MapReduce is a very well-known algorithm in the big data realm. It has, of course, been transformed significantly since its original inception to handle even larger amounts of information. But the idea is that we take a very large amount of information, let's say text, map it out into smaller parts, and then recombine the results to produce a final result. As an example, let's say we're trying to sort a deck of cards. If I give you a full deck of cards that is completely shuffled and ask you to sort it by suit and in numerical order, it would take you a little bit of time to achieve that task. But if I gave one deck of cards to a group of 10 people and asked them to do the same thing, it would take them significantly less time than it would take one person working alone.
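
The card-sorting analogy can be sketched as "deal, sort each pile, merge". In this minimal version the piles are sorted one after another in a single process, so it only shows the structure of the idea; the speed-up in the analogy comes from the ten people sorting at the same time. The card encoding and function names are assumptions for illustration.

```python
import heapq
import random

SUITS = ["clubs", "diamonds", "hearts", "spades"]
RANKS = list(range(1, 14))  # 1 = ace ... 13 = king


def make_deck():
    """A deck as (suit_index, rank) tuples, which sort by suit and then rank."""
    return [(suit, rank) for suit in range(len(SUITS)) for rank in RANKS]


def deal(deck, workers):
    """Split the deck into `workers` piles, dealing round-robin."""
    return [deck[i::workers] for i in range(workers)]


def sort_in_groups(deck, workers=10):
    """Each worker sorts its own pile, then the sorted piles are merged."""
    sorted_piles = [sorted(pile) for pile in deal(deck, workers)]
    return list(heapq.merge(*sorted_piles))


deck = make_deck()
random.Random(0).shuffle(deck)   # fixed seed so the shuffle is repeatable
result = sort_in_groups(deck)
```

`heapq.merge` combines already-sorted piles without re-sorting everything, which mirrors the group laying their sorted piles back together.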

So that's the idea of MapReduce: we partition our information into very small parts, and each of those small parts has the same task done to it. Once that task has been executed on the small parts, all of the intermediate results are combined to produce the final result. Let's take a look at another example, word count, which is a real classic example of how MapReduce works. Our input here is a very simple section of text, just a bunch of words: Deer Bear River, Car Car River, Deer Car Bear. You can imagine this being a very large book or something like that. We want to count the occurrences of each word in our data set. The first thing we do is split the data. Let's say that each line of text is its own split: Deer Bear River; Car Car River; and Deer Car Bear. So our big data set has been split into three individual data sets. Each one is a key-value pair, where the key is the document and the value is the text that it contains. Each of these documents is then going to be mapped to a task, and our task here is to count the word occurrences.
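
The splitting step described above might look like the following sketch. Using the line number as the document key is an assumption made for illustration; the lecture only says that the key identifies the document.

```python
# The example input from the slide, one split per line.
text = """Deer Bear River
Car Car River
Deer Car Bear"""


def split_input(data: str):
    """Return (document_id, line_text) key-value pairs, one per line."""
    return [(doc_id, line) for doc_id, line in enumerate(data.splitlines())]


documents = split_input(text)

if __name__ == "__main__":
    for doc_id, line in documents:
        print(doc_id, "->", line)
```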

In this mapping task, I'm going to map each word to a number. Each individual word, taken on its own, only occurs once, so deer occurs once, bear occurs once, and river occurs once. The key here is the word and the value is the word count. As you can see in the middle example, where we have two cars, that's okay, because each word is handled individually: each car still produces its own key-value pair with a count of one. The important part comes in the next few steps. Next we shuffle, and the shuffling process takes care of sorting the results of our mapping process, because once they're sorted, they're a lot easier to reduce and combine. When we shuffle, all of the bears get put in one bin, all of the cars get put in one bin, all of the deer get put in one bin, and all of the rivers get put in one bin.
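
Here is a minimal sketch of the map and shuffle steps on the same three documents. The function names are my own, and lowercasing the words is an added assumption so that "Deer" and "deer" land in the same bin.

```python
from collections import defaultdict

# (document_id, text) pairs, as produced by the split step.
documents = [(0, "Deer Bear River"), (1, "Car Car River"), (2, "Deer Car Bear")]


def map_words(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    _doc_id, line = document
    return [(word.lower(), 1) for word in line.split()]


def shuffle(mapped_pairs):
    """Shuffle step: group the counts by word, with the words in sorted order."""
    bins = defaultdict(list)
    for word, count in mapped_pairs:
        bins[word].append(count)
    return dict(sorted(bins.items()))


# Flatten the per-document map output into one list of pairs, then bin it.
mapped = [pair for doc in documents for pair in map_words(doc)]
bins = shuffle(mapped)

if __name__ == "__main__":
    print(bins)
```

Note that the two cars in the second document each produce their own `("car", 1)` pair; nothing is added up until the reduce step.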

Then comes the reducing step, where we combine the information by summing the word counts. So bear occurred twice, car three times, deer twice, and river twice. We've taken all of the individual words, sorted them into bins, and summed them. The final result of the reduce phase combines this back into a single list, where the key is the word and the value is the total word count over the entire data set. You can imagine this being significantly faster than having one single process or one single computer do all of the work. We can run this on a distributed computing system, where we split the data set up across a lot of different processors and have each processor or thread execute the mapping, shuffling, and reducing tasks, and then bring everything back together at the end to form the final result. But this is just one of the big data algorithms out there. There is obviously a significant number of other techniques, algorithms, and tasks that big data can accomplish, so we've just scratched the surface here. If you're interested in learning more, please reach out and we'll be happy to connect you with more resources.
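
Putting the phases together, the whole word count might be sketched as below. A thread pool stands in for the distributed system here, which is an assumption for illustration: threads in standard Python mostly take turns on CPU-bound work, so this shows the structure of the computation rather than a real speed-up. Frameworks such as Hadoop run the same phases across many machines.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_words(line: str):
    """Map step: emit a (word, 1) pair for every word in one split."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle step: put all counts for the same word into one bin."""
    bins = defaultdict(list)
    for word, count in pairs:
        bins[word].append(count)
    return dict(sorted(bins.items()))


def reduce_counts(bins):
    """Reduce step: sum the counts in each bin."""
    return {word: sum(counts) for word, counts in bins.items()}


def word_count(text: str, workers: int = 3):
    """Run split, map, shuffle, and reduce over `text`."""
    lines = text.splitlines()                      # split: one line per worker task
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_words, lines))  # map: splits handled independently
    pairs = [pair for chunk in mapped for pair in chunk]
    return reduce_counts(shuffle(pairs))


totals = word_count("Deer Bear River\nCar Car River\nDeer Car Bear")

if __name__ == "__main__":
    print(totals)
```

Because each split is mapped independently, the map step is the natural place to add more workers as the input grows.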
