Resources

Video Script

Welcome back. In this video, we’re going to take a deeper look at the kinds of queries we can actually ask of Google, or of the web in general. We’ve already looked at simple queries like cat, dog, cat AND dog, or cat OR dog. But that’s really not the most interesting question to ask of the internet, right? What we really want is to find all of the webpages that are about cats, not just the pages that happen to contain the word cat. So how do you know which web pages are more likely to be about cats and which are more likely to be about dogs? How could you tell the difference? And how would you modify the search algorithm we talked about earlier to account for that? This ranking of the internet becomes a little more complicated than our simple, straight-up text queries.

But as we learned in the HTML lectures, pages on the world wide web have structure to them in the form of HTML tags. So we could add information about those tags, called metawords, to our indexing algorithm. Now we could query the index for specific words appearing in, for example, the title. In theory, at a very simple level, an article that contains the word cat in its title is probably more likely to be about cats than about dogs, even if the word dog appears somewhere in the text of that page. But again, this is a very simple approach. The metawords help our algorithm and our searching, but they aren’t a real ranking system, because basing our answers on just the text and these tags won’t yield very good results. It won’t give us a well-curated top-ten list of pages about cats, or tell us how to rank them in priority order, and obviously there are a lot of pages about cats and dogs on the internet. So how does one webpage about cats come first? That’s the question early web search engines had to deal with: as the web got larger and larger, it became more and more difficult to find the good pages among all of the bad or useless ones.
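As a rough illustration (not part of the original lecture), here is a minimal sketch of how metawords might be stored alongside ordinary words in an inverted index, so that a query can be restricted to the title. The page URLs, their contents, and the TITLE_START/TITLE_END marker names are all made up for this example.

```python
from collections import defaultdict

# Hypothetical pages; the <title> markup is the structure we exploit.
pages = {
    "ernie.example/cats": "<title>All about cats</title> cats are great pets dog",
    "bert.example/dogs": "<title>Dog training tips</title> my dog chased a cat",
}

def tokenize(html):
    # Very naive tokenizer: turn <title> tags into metawords, then split.
    return (html.replace("<title>", " TITLE_START ")
                .replace("</title>", " TITLE_END ")
                .lower()
                .split())

index = defaultdict(set)        # word -> pages containing it anywhere
title_index = defaultdict(set)  # word -> pages containing it in the title

for url, html in pages.items():
    in_title = False
    for token in tokenize(html):
        if token == "title_start":
            in_title = True
        elif token == "title_end":
            in_title = False
        else:
            index[token].add(url)
            if in_title:
                title_index[token].add(url)

print(index["cats"] | index["cat"])  # both pages mention cat(s) somewhere
print(title_index["cats"])           # only the page with "cats" in its title
```

The point is only that tags give the indexer extra signals: a title hit can be treated as stronger evidence than a hit buried somewhere in the body text.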

One of the early search engines was called AltaVista, and it grew very quickly at first. In 1996 it ran on a handful of servers with a couple hundred gigabytes of search index, constantly indexing the web, and by the late 90s it had grown to around 500 gigabytes of storage and 130 gigabytes of RAM, which was big at the time. This was a huge, very powerful set of machines for indexing the web, and it was able to handle about 13 million queries a day. Considering this was the late 90s, the internet was still relatively new, but we were also getting very close to the dot-com era. As we matured into the dot-com era, AltaVista was eclipsed by others: Yahoo came in and bought it out in 2003, and eventually shut it down in 2013. But AltaVista was one of the first major successful search engines we had for the internet.

Now we’ll transition to the technology that was developed and published in a paper called The Anatomy of a Large-Scale Hypertextual Web Search Engine, which provided an answer to our question of how to rank the internet. Its authors, Larry Page and Sergey Brin, created a new algorithm called PageRank, named after Page himself, that could rank search engine results based on the authority of the page that was actually found. Let’s see how this actually works.

So, as you know, the world wide web is interconnected with hyperlinks. Say we had two hypothetical web pages: Ernie’s scrambled egg recipe and Bert’s scrambled egg recipe. We can analyze the links going to and from each page to learn more about it. With our previous indexing algorithm, if we found the word eggs on both of these pages, they’re technically both about eggs, but which one should come first? Should Ernie’s come first in our search results, or Bert’s? Which one of these do you think would be the top result for a scrambled egg recipe, and why? In this picture, even if Ernie’s recipe is better than Bert’s, Ernie’s recipe only has one other web page that links to it. That could be a comment, another blog, a Facebook post, whatever it may be, but only one page links to it. Bert’s recipe is evidently more popular, or at least more well known, because it has three pages that link to it. So according to PageRank, Bert’s recipe has more authority about scrambled eggs than Ernie’s does, and Bert’s would come first in our search results, with Ernie’s next, if these were the only two web pages about scrambled eggs. Unfortunately, the world wide web is not as nicely structured as we would like, so things depend heavily on the actual websites themselves and on other factors that make the web a lot more difficult to search through.
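As a toy illustration of that idea (using just the link counts from the Ernie and Bert picture, with made-up names for the linking pages), the simplest possible ranking counts inbound links:

```python
# Who links to whom, mirroring the Ernie/Bert example from the lecture.
links = {
    "comment_page": ["ernie/eggs"],
    "blog_one": ["bert/eggs"],
    "blog_two": ["bert/eggs"],
    "facebook_post": ["bert/eggs"],
}

# Count inbound links per page.
inbound = {}
for source, targets in links.items():
    for target in targets:
        inbound[target] = inbound.get(target, 0) + 1

# Bert's recipe (3 inbound links) outranks Ernie's (1 inbound link).
print(sorted(inbound.items(), key=lambda item: item[1], reverse=True))
```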

But what about this next example? Which page would be ranked higher here? We still have Ernie’s and Bert’s scrambled egg recipes, and now we have John and Alice. John’s web page points to Ernie’s recipe, and Alice’s points to Bert’s. John says of Ernie’s recipe that he only tried it once and it’s not too bad. Alice is clearly a lot more excited about Bert’s recipe; she says it’s clearly one of the best recipes out there. So if we were just basing this on the text, Alice would have more authority than John if we’re looking at the sentiment or the excitement. But this becomes a problem if we’re just looking at links: one link here, one link there. Which one comes first? How do we break that tie? Well, let’s look at the authority of the pages that actually do the linking. John and Alice each have one link, pointing to Ernie’s and Bert’s scrambled egg recipes respectively.

So that’s one link each for Ernie and Bert. But John is a new blogger; he doesn’t have very many followers, so he doesn’t have very much authority. John has only two links pointing to his web page. Alice, on the other hand, is a really popular blogger, so maybe she has 100 different pages pointing to her page, and she ends up having much more authority. Since Alice has more authority, a link from Alice to Bert’s scrambled egg recipe carries far more weight than John’s link to Ernie’s scrambled egg recipe. So Bert’s recipe would still come first in our page ranking over Ernie’s, because Alice’s page has more weight, more importance, more authority than John’s page, even though Ernie and Bert have the same number of links to their recipes.
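A rough sketch of that refinement (not yet the full PageRank formula): each linking page passes its own authority along its outgoing links, so a link from a high-authority page is worth more. The numeric authority values below are made up to match the example’s proportions.

```python
# Hypothetical authority of the linking pages: John has 2 inbound links,
# Alice has roughly 100, as in the lecture's example.
authority = {"john": 2, "alice": 100}

# Each blogger links to one recipe.
links = {"john": ["ernie/eggs"], "alice": ["bert/eggs"]}

# A page splits its authority evenly among its outgoing links.
score = {}
for source, targets in links.items():
    share = authority[source] / len(targets)
    for target in targets:
        score[target] = score.get(target, 0) + share

print(score)  # {'ernie/eggs': 2.0, 'bert/eggs': 100.0} -> Bert still wins
```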

But sadly, the internet is not a nice tree structure either. Sometimes it’s full of cyclic connections like this one. If we follow, for example, A to B to E and back to A, then B, then E again, we get stuck in that cycle. These cycles act as what we call sinks: they suck up all of the authority on the web, because our algorithm gets stuck in them. If our indexer is just out there on the web following links, it will get caught in this cycle and follow it over and over and over again, and all of the authority gets consumed by that cycle. So we can’t simply follow links on the web, because we’ll get stuck in these sinks and all other web pages will end up with no authority whatsoever.

One of the fixes that Page and Brin suggested is called the random surfer model. Imagine you are simulating a random surfer: you start on some web page chosen totally at random, and then just start clicking on random links. You continue that pattern until, at some random point in the future, you teleport to a brand new page and start surfing again. So you start on a page, click random links from page to page, eventually teleport off to another page, and then start the process over. Because you randomly select which link to follow instead of walking the links in a fixed order, and occasionally teleport, you break out of those cycles. If you do this long enough, keeping track of the percentage of visits you made to each particular web page, you can use those percentages as the rank of each page. The intuition is that the more links point to a page, the more likely the random surfer is to end up on it.
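Here is a minimal sketch of that simulation over a toy link graph. The page names, the 100,000 steps, and the 15% teleport probability are illustrative assumptions; the teleport plays the role of the "random point in the future" when the surfer jumps to a fresh page.

```python
import random
from collections import Counter

def random_surfer(graph, steps=100_000, teleport_prob=0.15):
    """Estimate page rank as the fraction of visits a random surfer makes.

    graph maps each page to the list of pages it links to. With probability
    teleport_prob (or whenever a page has no outgoing links), the surfer
    jumps to a page chosen uniformly at random instead of following a link.
    """
    pages = list(graph)
    visits = Counter()
    current = random.choice(pages)
    for _ in range(steps):
        visits[current] += 1
        if random.random() < teleport_prob or not graph[current]:
            current = random.choice(pages)           # teleport anywhere
        else:
            current = random.choice(graph[current])  # follow a random link
    return {page: count / steps for page, count in visits.items()}

# Toy web containing the A -> B -> E -> A cycle from earlier; without the
# teleport step, a surfer entering that cycle would never leave it.
toy_web = {
    "A": ["B"],
    "B": ["E"],
    "E": ["A"],
    "C": ["A", "D"],
    "D": ["C"],
}
print(random_surfer(toy_web))
```

Because of the teleport, pages outside the cycle still get visited, so every page ends up with some share of the authority.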

Let’s take a look at an example of the random surfer with PageRank. Here is the sample web we’re crawling. Site A has a home page, an about page, a product page, and a “more” page, and the product page links out to another website. So site A has these four pages, and the product page on site A links to site B. Now, let’s say we’re rolling a die at every step: the number that comes up decides whether we stay on the current page or follow one of its links, and if the page has multiple links, the roll decides which link we actually follow. If we keep rolling the die and following links, which page ends up with the most authority in this example? Let’s see: if we start at site A and go to the home page, we have three different options there, so the home page has three outgoing links. The about page has three outgoing links and three incoming links, and so does the “more” page.

There are a couple of differences, though: the home page has four incoming links, and the product page has four outgoing links. If we keep randomly rolling our dice, here is what the authority actually looks like. If we run long enough, the about page and the “more” page end up identical in authority, because they each have three outgoing links and three incoming links, so they are equally likely to be visited. The product page has a little more authority than the about and “more” pages because it has four outgoing links and only three incoming, so it has an extra link that can be followed, even though that link leads off to site B. Site B has only one possible way of being visited, so it has the smallest authority. The home page has the highest authority among all of the pages on site A, because it has four incoming links and three outgoing links, so probability-wise it is more likely to be visited than any other page here. And that’s the idea of the random surfer model: even though there are cycles here, such as home to more to product to about and back to home, the random surfer still narrows down which web pages are more likely to be visited based on the links it finds.
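To make that concrete, here is a sketch that checks the claimed ordering using the standard PageRank power iteration rather than dice rolls. The exact link structure is an assumption chosen to match the in- and out-degrees described above (in particular, site B is assumed to link back to site A’s home page, which is what gives home four incoming links), and the 0.85 damping factor is the conventional choice.

```python
# Assumed link graph for site A (home, about, product, more) and site B.
graph = {
    "home":    ["about", "product", "more"],
    "about":   ["home", "product", "more"],
    "product": ["home", "about", "more", "siteB"],
    "more":    ["home", "about", "product"],
    "siteB":   ["home"],
}

def pagerank(graph, damping=0.85, iterations=50):
    """Plain power-iteration PageRank over a small link graph."""
    n = len(graph)
    rank = {page: 1 / n for page in graph}
    for _ in range(iterations):
        new_rank = {page: (1 - damping) / n for page in graph}
        for page, targets in graph.items():
            share = rank[page] / len(targets)
            for target in targets:
                new_rank[target] += damping * share
        rank = new_rank
    return rank

for page, score in sorted(pagerank(graph).items(), key=lambda kv: -kv[1]):
    print(f"{page:8s} {score:.3f}")
# Ordering comes out home > product > about = more > siteB,
# matching the description above.
```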

Another way you can do this is by generating all possible paths that can be found through the links, calculating the length of those paths, and counting the percentage of those paths in which each particular page shows up. If we enumerate all the different paths here that are less than five sites deep, we find that site A occurs more frequently in those paths than all of the others. This is a little different from the random surfer model: it is more discrete, meaning there is more order to things instead of everything being completely random. It avoids some of the randomness of the random surfer model, although we do have to limit the length of the paths, so there are some downsides to this technique.
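As an illustrative sketch of the path-counting idea, here is one way to enumerate every link path up to five pages long in a small graph and measure how often each page shows up. The graph is the same assumed site A / site B structure as in the previous sketch, and the depth limit of five matches the one mentioned above.

```python
# Same assumed link structure as the PageRank sketch above.
graph = {
    "home":    ["about", "product", "more"],
    "about":   ["home", "product", "more"],
    "product": ["home", "about", "more", "siteB"],
    "more":    ["home", "about", "product"],
    "siteB":   ["home"],
}

def all_paths(graph, max_len=5):
    """Yield every link path that contains at most max_len pages."""
    def extend(path):
        yield path
        if len(path) < max_len:
            for nxt in graph[path[-1]]:
                yield from extend(path + [nxt])
    for start in graph:
        yield from extend([start])

# For each page, what fraction of all bounded paths pass through it?
paths = [set(path) for path in all_paths(graph)]
for page in graph:
    share = sum(page in path for path in paths) / len(paths)
    print(f"{page:8s} appears in {share:.0%} of paths")
```

The drawback noted above shows up directly in the code: the number of paths grows exponentially with depth, so the path length has to be capped for this to stay feasible.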

But of course, there are a lot of other problems beyond those cycles that can get in the way of actually searching the web. We have issues like spam. How do we account for all of the ads that are plastered across the internet, even though that’s how a lot of people make their money online? How do you avoid spam, and ads you don’t want in your results? How do you avoid malicious websites? You don’t want a malicious website to be the number one hit on Google. How do you take into account the sheer size of the internet and keep computation time down? How do you search non-textual information? We have a lot of images, videos, and audio on the internet now; how do you take those into account? What about structured versus unstructured data, and lots of other things? That brings us to the difference between Web 1.0 and Web 2.0, which had a major impact on how the world wide web works. After the dot-com bubble burst, no one really knew where the internet was headed, and a lot of conversations were happening about what the internet could become, or how it could be different.

Because at the time, the internet was a place where companies just pushed content to their users. Now the internet is a place where users can communicate and share their own content quickly and easily, and eventually that led to the idea of Web 2.0, which we’ve talked about before. The reason we bring this back up is that the web is now a much more dynamic, much more complicated place. With Web 1.0, we had static web pages that didn’t change very often, which made search algorithms significantly easier to build because they depended mostly on text. Now we have dynamic web pages, where the page I visit can be entirely different from the page you visit, even at the same URL, because it depends on who is logged in and on your browsing history. We also have user-generated content: think about trying to search Reddit. How do Reddit posts show up in Google when new posts are made every single second? How do you take into account multimedia pages, things like YouTube videos, audio, songs, and images, telling pictures of cats from pictures of dogs rather than just searching text? This is an entirely different web from the one we had in the 90s when the early search engines were around. These will be continuing challenges as the web keeps growing, but they highlight just how sophisticated the search engines we use today have become.