Introduction

Resources

Slides

Video Script

Welcome back, everyone. In this video series, we’re going to be taking a look at big data. So exactly how big is big data? Here’s a slide of the metric prefixes. Which ones do you think count as big data? Today, we’re talking about data anywhere from gigabytes to petabytes and beyond, and in the future we may even be dealing with exabytes of data. I said gigabytes, right? A single video, even one in this lecture series, might be a gigabyte’s worth of data, and in the grand scheme of things that’s just one video. But if you break down that gigabyte, you can start to chunk it up into lots of different parts. If we’re recording in 4K, for example, there is a significant number of frames per second in the video, along with a significant number of pixels in each frame, and you tack on the sound information as well. A significant amount of information can be packed into a single gigabyte. Big data deals with three questions: How do we actually store all of this information? How do we use it? And how do we get information out of it? We can have lots of data, but without algorithms or ways of presenting it, that data becomes pretty much useless at the end of the day.
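To get a feel for how much is packed into a gigabyte of video, here is a rough back-of-the-envelope sketch. The resolution (3840 by 2160), 3 bytes per pixel, and 30 frames per second are assumptions chosen for illustration, not figures from the slides, and the calculation ignores sound and compression.

```python
# Rough size of uncompressed 4K video. All constants are illustrative
# assumptions, not values taken from the lecture slides.
WIDTH, HEIGHT = 3840, 2160   # assumed 4K UHD resolution in pixels
BYTES_PER_PIXEL = 3          # assumed 8 bits each for red, green, and blue
FRAMES_PER_SECOND = 30       # assumed frame rate
GIGABYTE = 10**9             # metric prefix: giga = 10^9 bytes


def raw_frame_bytes():
    """Bytes needed to store one uncompressed frame."""
    return WIDTH * HEIGHT * BYTES_PER_PIXEL


def raw_bytes_per_second():
    """Bytes needed to store one second of uncompressed video."""
    return raw_frame_bytes() * FRAMES_PER_SECOND


def seconds_per_gigabyte():
    """How many seconds of uncompressed video fit in one gigabyte."""
    return GIGABYTE / raw_bytes_per_second()


if __name__ == "__main__":
    print(f"One frame:  {raw_frame_bytes():,} bytes")
    print(f"One second: {raw_bytes_per_second():,} bytes")
    print(f"One gigabyte holds about {seconds_per_gigabyte():.2f} s of raw video")
```

Under these assumptions, a gigabyte holds only a little over one second of raw 4K footage, which is why real video files rely heavily on compression.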

So where is all of this data coming from? A lot of what we come across in the current age is Web 2.0 stuff: social media and video streaming services. If we look at 2017 versus 2018, we can see a lot of changes in what happens every single minute on the internet. Specifically, we have 4.1 million videos watched on YouTube versus 4.3 million in the second year, so not a huge jump. But the popularity of Netflix drastically increased from 2017 to 2018, from 70,000 hours to 266,000 hours of video watched every single minute. And likewise with things like Snapchat, Twitter, Facebook, emails, all sorts of different things. This keeps evolving over time. We can even go back to some of the older numbers and compare 2016 to 2019, which is the previous year. We went from 2.78 million views on YouTube to four and a half million, and from 700,000 logins on Facebook to 1 million. The number of Google searches has drastically increased, from 2.4 million to 3.8 million. With Netflix, we’re up to almost 700,000 hours watched per minute. Things like online shopping have also drastically increased, along with sharing of pictures on things like Instagram and Snapchat, music, and especially smart home devices as well. Even things like Twitch are starting to gain significant popularity as a mainstream streaming service. As you can imagine, the sheer amount of information that streams online every 60 seconds is quite mind boggling.

How do we actually build infrastructure that can support streaming information at that scale, live and in high definition, or at least quickly? We don’t want to have to sit here and wait five hours to download a YouTube video, like we had to do in the late 90s and early 2000s. So how do we create an infrastructure that can support this type of information? And how do we create software that allows a normal user to interact with that data in a meaningful way?

And that’s really where the big data stack comes into play. We can use all of these tools and techniques not only to store information, but also to view it. This creates somewhat of a stacked approach to managing the data, where the bottom layer represents all the providers of that information. If we look here, we have things like MySQL, Postgres, and Hadoop. All of these are databases or database technologies, and some of them, like Hadoop, are specialized for big data, performing operations on very large amounts of information very quickly. We have speed on one axis and scale on the other. So how fast is that particular technology at working with information? And how much information can it work with? On the scale axis, we go from megabytes to petabytes. On the speed axis, we go from batch processing, meaning we can process stuff on demand but it takes a little bit of time to actually do it, to real time, meaning that when we ask for our data, it is there instantaneously and we don’t have to wait for it to be produced.

In the middle of our stack is where all the analysis of that information happens. The bottom is where the data is stored and provided to the analysis layer. In the analysis layer, we use packages like SciPy and NumPy, which are Python packages for scientific analysis, along with things like Mahout, which is a machine learning library, and all sorts of other information processing libraries. This is where we’re going to do things like machine learning and artificial intelligence. We’re going to crunch that data and transform it to prepare it to be displayed to the user, which is done at the service layer.

Now, the service layer at the top is what we primarily interact with online. When we go to Amazon, we see those recommended items. Or when we open up Google News, Flipboard, Pinterest, or any of those curated websites and mobile apps, we have news articles presented to us, and a lot of those are curated for your viewing and reading interests. So how do we take all of the news articles in the entire world and collapse them into bite-sized chunks that you can consume as a user? How do you consume the data that you want to consume? That’s where these services come into play: things like news curation, weather forecasting, pricing of items online, recommendation of items online, and even online reputation. So how do you gain reputation or fame on the internet for something that’s trending on YouTube, Twitter, Instagram, or something like that?

But where is all this going? We talked about some of the services that are provided using big data, but a lot of this is primarily used for business. We consume a significant amount of this information as users, but under the hood, a lot of it is being driven by business objectives as well. The first is customization of the services that everyone provides. These days, with Web 2.0, everything we use online is a service tailored to you as the user. If you’re logged in on Google, that is being tracked, along with everything you do with online searches and the websites that you visit. You have a digital footprint, and that digital footprint allows websites and companies to provide a service tailored to you specifically. It also allows companies to react to market trends a lot more easily. They can see what’s trending at the moment and make business decisions based on that information. There are real-time optimizations for identifying certain costs and making more accurate decisions. And overall, there are better, more holistic R&D, or research and development, processes available, because we have all this information available to us and can crunch it in a meaningful way to make better decisions for our company. A lot of this maps onto the big data stack that we saw before. At the bottom is how we provide information, through things like Hadoop, and we’ll talk about MapReduce later. There are a lot of other technologies listed there as well, such as ETL: extract, transform, load.
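The extract, transform, load idea mentioned above can be sketched in a few lines. This is a minimal illustration, not how any particular big data tool implements ETL: the thermostat readings, the field layout, and the `readings` table are all hypothetical, and an in-memory SQLite database stands in for a real data store.

```python
import sqlite3

# Hypothetical raw readings, as a smart thermostat might report them.
RAW_READINGS = [
    "kitchen, 2020-03-01T08:00, 68.5F",
    "Kitchen ,2020-03-01T09:00,70.1F",
    "garage,2020-03-01T08:00,not_available",  # malformed reading
]


def extract(lines):
    """Extract: split each raw line into its comma-separated fields."""
    return [[field.strip() for field in line.split(",")] for line in lines]


def transform(rows):
    """Transform: normalize room names, convert Fahrenheit to Celsius,
    and drop readings that cannot be parsed as numbers."""
    cleaned = []
    for room, timestamp, reading in rows:
        try:
            fahrenheit = float(reading.rstrip("F"))
        except ValueError:
            continue  # skip malformed readings
        celsius = round((fahrenheit - 32) * 5 / 9, 1)
        cleaned.append((room.lower(), timestamp, celsius))
    return cleaned


def load(records, connection):
    """Load: store the cleaned records in a queryable table."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS readings "
        "(room TEXT, taken_at TEXT, celsius REAL)"
    )
    connection.executemany("INSERT INTO readings VALUES (?, ?, ?)", records)
    connection.commit()


if __name__ == "__main__":
    db = sqlite3.connect(":memory:")
    load(transform(extract(RAW_READINGS)), db)
    query = "SELECT room, AVG(celsius) FROM readings GROUP BY room"
    for row in db.execute(query):
        print(row)
```

Once the data is loaded, it can be searched and queried with ordinary SQL, which is the point of the transform step: raw, messy input becomes something consistent enough to analyze.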

So how do we take the raw data that’s being generated by things like smart devices and transform it in a way that is easily searchable, queryable, and presentable to users? In the middle area, how do we analyze that data? Here we have things like Mahout, SciPy, Hive, a lot of machine learning libraries, and statistical software. And then at the top are the real business objectives, the tasks: predictive modeling and sentiment analysis. How do you detect if a tweet is positive or negative in terms of sentiment? Is this user tweeting negatively or very positively about my product? These are a lot of the different things that we can achieve using big data techniques and technology to transform that information. Great examples of the services that businesses can provide are Google Analytics and Google Trends.
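One very simple way to approach the tweet question is to count positive and negative words. The sketch below is a word-counting illustration only: the word lists are tiny and hand-picked for this example, and real sentiment analysis relies on much larger lexicons or trained models.

```python
import re

# Tiny hand-picked word lists, purely illustrative.
POSITIVE_WORDS = {"love", "great", "good", "amazing", "happy"}
NEGATIVE_WORDS = {"hate", "terrible", "bad", "broken", "awful"}


def sentiment_score(text):
    """Return (number of positive words) minus (number of negative words)."""
    words = re.findall(r"[a-z']+", text.lower())
    positives = sum(word in POSITIVE_WORDS for word in words)
    negatives = sum(word in NEGATIVE_WORDS for word in words)
    return positives - negatives


def classify(text):
    """Label text as positive, negative, or neutral based on its score."""
    score = sentiment_score(text)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


if __name__ == "__main__":
    for tweet in [
        "I love this product, it works great!",
        "Terrible update, the app is broken.",
        "The package arrived on Tuesday.",
    ]:
        print(classify(tweet), "<-", tweet)
```

A counter like this misses sarcasm, negation ("not good"), and context, which is part of why sentiment analysis at scale is a hard problem.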

Google Trends is actually one of my personal favorites, and I’ll attach the link in Canvas. Google Trends allows you to see what’s trending on Google, as far as who’s searching for what. A really interesting thing to look at is Year in Search, where you can see what has been trending in past years. Out of all of last year, what were the top five or top 10 things that people searched for online? You can look up different holidays, politics, gaming, music, movies, you name it. You can find and compare what was being searched for the most on Google in the previous year, or even right now. And think about the sheer number of people that use Google. If we look at last year, 2019, there were 3.8 million Google searches per minute, so every 60 seconds, 3.8 million searches on Google. So how do we transform that into something usable? That’s all information that we can look at, rely on, and use to make decisions about what’s currently popular, and so forth.
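To put the per-minute figure in perspective, a quick conversion helps. The only input below is the 3.8 million searches per minute quoted above; the per-second and per-day values are simple arithmetic and assume the rate stays constant.

```python
SEARCHES_PER_MINUTE = 3_800_000  # 2019 figure quoted in the lecture


def per_second(per_minute):
    """Convert a per-minute rate to a per-second rate."""
    return per_minute / 60


def per_day(per_minute):
    """Convert a per-minute rate to a per-day rate (24 hours of 60 minutes)."""
    return per_minute * 60 * 24


if __name__ == "__main__":
    print(f"Per second: {per_second(SEARCHES_PER_MINUTE):,.0f}")
    print(f"Per day:    {per_day(SEARCHES_PER_MINUTE):,}")
```

At that rate, this works out to tens of thousands of searches every second and several billion every day.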

Another interesting service of that nature is healthmap.org. Now, healthmap.org is a website that was originally a research project. What they do is collect news articles from across the web, consume them, and look for outbreaks: viruses, diseases, things like that. They look at where those news articles are being written and try to geolocate those outbreaks on a map in order to track disease outbreaks. Taking things like news articles and mapping them to a geolocation is called thematic mapping. As I mentioned, this site tracks disease outbreaks mentioned online by location and provides a map showing the current outbreaks across the country. You can also look around the world a little bit; I’m just showing a screenshot here of the US. This screenshot is from back in 2015, when we had a big outbreak of the bird flu. You can see here, for example, articles written in Leavenworth County that identified an outbreak of bird flu in that region. The color and the size of the dots indicate the size and severity of that outbreak. And we can compare that to current events with the COVID-19 crisis, where we can see a lot of dots across the US, and certain regions and areas that have higher concentrations of COVID-19 cases reported in the news.

So this is a really interesting way to consume news articles in order to track outbreaks of diseases. There have also been very similar use cases of thematic mapping with Twitter data, specifically tracking the impacts of natural disasters like hurricanes, tornadoes, and earthquakes. There are a lot of different uses of big data, as you can imagine, and we’ve only shown a couple of them. A lot of the big topics are listed here, though, starting with topic modeling, where we can analyze the text of an article, for example, to determine what it is talking about. Say I’m given three articles. One of them is about a sports game, one of them is about a sports video game, and one of them is about a sports movie. How can a computer tell the difference between the topics? They’re all about sports. So how do we write an algorithm to determine which article is about the video game, which one is about the actual sports game, and which one is about the movie? There are a lot of different situations that topic modeling can help us solve by teaching computers to recognize a section of text and what that section of text is actually about. Having a computer understand natural language, whether written or spoken, is an extremely difficult task. That’s what NLP, or natural language processing, is trying to accomplish: how does a computer understand written and spoken words? We’ve already talked about analytics and data forecasting.
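The three-article question above can be illustrated with a simple keyword-overlap classifier. To be clear, this is not true topic modeling, which discovers topics from the text statistically; it is a simplified stand-in in which the topics and their keyword sets are hypothetical and chosen by hand.

```python
import re

# Hypothetical, hand-chosen keyword sets for the three topics in the example.
TOPIC_KEYWORDS = {
    "sports game": {"stadium", "referee", "season", "playoffs", "coach"},
    "sports video game": {
        "console", "controller", "graphics", "multiplayer", "developer",
    },
    "sports movie": {"director", "actor", "premiere", "screenplay", "theaters"},
}


def tokenize(text):
    """Lowercase the text and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def guess_topic(text):
    """Return the topic whose keywords overlap most with the text,
    or None when no keyword appears at all."""
    words = tokenize(text)
    scores = {
        topic: len(words & keywords)
        for topic, keywords in TOPIC_KEYWORDS.items()
    }
    best_topic = max(scores, key=scores.get)
    return best_topic if scores[best_topic] > 0 else None


if __name__ == "__main__":
    article = "The developer improved the graphics on every console."
    print(guess_topic(article))
```

All three articles would mention the sport itself, so the words that separate them are the ones around it, such as "referee", "console", or "director". Real topic models learn those distinguishing words from large collections of text rather than from a hand-written list.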

That analytics and forecasting category covers things like Google Trends. We’ve also talked about sentiment analysis and crowdsourcing. Is your brand getting good reviews? How do we tell if a comment or a tweet is positive or negative? Can you use that to figure out what the problem is? A lot of companies are really great at using this; others, probably not so much. A lot of things with big data also reduce down to information visualization. How do you make sense of this large amount of data? One way is to visualize it so people can easily understand it. And this isn’t like making a simple x-y chart or graph in algebra, or even using Excel. These are large amounts of information. So how do we transform that into a visualization that makes sense and allows us to extract the information we need, or the interesting information that exists in it? And we’ve also already talked about thematic mapping, where we can map out data items by location, or even geolocation, and across time to understand what’s going on in the world around us.
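The thematic mapping idea, where the size of a dot on the map reflects how many reports came from a place, can be sketched as a simple aggregation. Everything in this example is hypothetical: the reports, the county names, the coordinates, and the dot-size formula are made up for illustration and are not how healthmap.org works internally.

```python
from collections import Counter

# Hypothetical (disease, county) pairs pulled from news articles.
REPORTS = [
    ("bird flu", "County A"),
    ("bird flu", "County A"),
    ("bird flu", "County B"),
    ("measles", "County B"),
    ("bird flu", "County A"),
]

# Hypothetical (latitude, longitude) for each county.
COUNTY_COORDINATES = {
    "County A": (39.0, -95.0),
    "County B": (41.5, -93.5),
}


def map_markers(reports, disease):
    """Count reports of one disease per county and turn each count into
    a map marker whose dot size grows with the number of reports."""
    counts = Counter(county for name, county in reports if name == disease)
    markers = []
    for county, count in counts.most_common():
        latitude, longitude = COUNTY_COORDINATES[county]
        markers.append({
            "county": county,
            "lat": latitude,
            "lon": longitude,
            "reports": count,
            "dot_size": 5 + 3 * count,  # arbitrary scaling for display
        })
    return markers


if __name__ == "__main__":
    for marker in map_markers(REPORTS, "bird flu"):
        print(marker)
```

The resulting list of markers is what a mapping library would then draw, with larger dots where more articles reported the outbreak.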