Relational Data

Resources

Slides

Video Script

Welcome back everyone. In this video, we’re going to be taking a look at information retrieval. But before we get to that point, we need to start talking about databases and how data is actually stored. So for this example, we’re going to be working on our own social network called K-Stater. Specifically, we’ve been asked to help design a way to keep track of all the data on a site, like a social network, like K-Stater should have. So if you take a few minutes and brainstorm what kind of data or imagine what kind of information we may need, for example, of course, we’ll need to have some sort of way to track our users like maybe your eID, name, birthday, major, and maybe even just your phone number. And, you know, of course, we’ll need a lot of other information to make a full website like this work. But we can start off with the basics right, just keeping track of our users. But since we want to keep track of all that information, we need some way to actually store it. And so there’s tons of different ways that we could actually store that on our computers, right, of how our computer systems are designed to store information in that way.

But thankfully, there was a huge change in how the data on our computers are actually stored, or at least how we even look at data stored on our computer. And that’s mostly thanks to this man here, Edgar F. Codd. While working for IBM in the late 1960s and early 70s, Codd began working on the ideas of relating to the storage and arrangement of data in computer systems. And in 1970, he published a paper called a relational model of data for large shared data banks that laid out this grand idea for a better way of storing data. That idea led to the creation of what we call a relational database. This is a form of database that is by far the most common in the world today. And if you’ve ever used things like Microsoft Access, or MySQL, or something like that, Microsoft Access is a simple type of relational database. That is pretty easy to use for most people just like what you would open up and use with Microsoft Word or even Excel, and Excel to some extent, depending on how you have your sheet set up is like a relational database, or at least relational data in some sense.

The idea behind a relational database revolves around three different terms, a relation, a tuple, and an attribute. A row in a table is called a tuple. It represents a single item that is stored in the database. In a simple example each person stored in our database could be a just a single tuple. Right, or if we are storing addresses for our users, so that address itself would be one tuple, or one row in our database. Each row or tuple has many attributes represented by columns in the table. An attribute could be something like a name, address, or like a street, zip code, phone number, one single piece of information that is associated with that one particular row or record in. The table itself is called a relation because it relates different attributes together uniquely, in order to describe objects as tuples. So our relation as a whole, right, a relational database. So each attribute strung together in a single tuple is like information related information, right? So we have we have a user, our user may have a name, a username, an email, and all sorts of different types of information. Each of those attributes are related to each other, or in an ideal situation of how we design our database. But relating those different attributes together is really kind of what we are, what we benefit, or what we get out of something like relational database, because it makes our data much easier to store because we store like information with like information, right, so all of those records that have similar information, like all of the user information, and it makes it a lot easier to store and search. Especially in a modern age where we have such large amounts of information, having this highly structured data makes it significantly easier to work with and use as you utilize apps like Facebook and Twitter.

But let’s do some hands on work with a relational database. So since we’re creating a social networking site, probably one of the first relations we’ll need to create is for the user. Right, because a social network is pretty much useless without some people to actually use it. So here’s one potential way that we could set that up. The data here is all nicely arranged into rows and columns, and in theory, right should be easy to look things up. Maybe not. Right? At least the way that I have it currently. So what if you wanted to find all of the students here in our in our user database, or our user relation here that are majoring in computer science? How would I go about doing that? Take a moment here to pause the video and take a look at a table here and see if you can find all of the users or come up with a way to find all of the users who majored in computer science. Hopefully you found all of the users there’s that do did majored in computer science is pretty straightforward, right? But for us humans, and for the small number of users, but for a computer to search our table here to find all of the computer science majors, it becomes a little bit more complicated. Obvious, my name is in there, Josh, first row there. My major is comp. sci, which is short for computer science. And we don’t need the information science or Information Systems majors there.

But we have another computer science major there, gameguy down there, his major is computersci, all one word. And so while both of those users are computer science majors, the listing or how can the computer science major is actually listed as entirely different. That makes the solution for a computer much more difficult to actually write because now we have to take into account in this situation, all of the different spellings and arrangements of the computer science major. So this particular relational table or this particular relation, we would refer to as not normalized so the data isn’t normalized, the data isn’t uniform. And so why do we want to normalize the data? Well, big obvious reason that we just experienced is normalized data makes it a lot easier to actually work with. So those data anomalies, right, the differentiation between compsci, CS, computersci, computer science, all of those differentiations, they’ll mean the same thing. But it becomes a lot harder and more difficult to use, and it makes it more prone to error as well. It also makes redesigning your information a lot easier. More often than not, when you’re actually working with an application or designing an application that’s working with a database, you’ll end up finding that your use cases change over time. And so more often than not, you have to go back in and redesign your database or redesign your tables in order to add new columns or attributes or entirely new relationships.

And so having normalized data to begin with makes that task significantly easier because you don’t have any data anomalies in an ideal world that can throw a wrench into your plan. But one of the other things that normalizing our data can actually accomplish is mirroring real world concepts. So we don’t want to store information into our relation in a random computerized way. Because that’s not the real goal that we’re trying to accomplish with these databases. The goal here is to store information that is significantly easier not just for the computer to actually search through, but also humans as well. And so we want to be able to store our relations in a way that mimic what we would normally store that information in real life. For example, addresses and things like that we don’t have need to come up with these crazy elaborate ways to actually store an address. But we want to make sure that it’s stored in a way that makes sense. Of course, and lastly, this also simplifies the search queries that we actually do on our relations or our data. So those that question that I asked of find all the users that majored in computer science is a lot easier if all of the computer science majors have the exact same spelling of the computer science major. And so having normalized information and normalized data also makes our search tasks significantly easier. So let’s take a look at a way that we can actually start to fix our unnormalized relation.

So first off, let’s take a look at the major column because that’s what we encountered first, and that’s what we tried to search first. So what can we do to make sure that the data in that column is easy to search through or easy to query? Now, if you pause to kind of think of a few ideas here, one potential way that I have listed here is to add an ID for each major. And so essentially, what we end up having here is a second relation. So we split the major off into a second table or a second relation, that acts as a lookup table. Okay, so if you imagine using your basic form online, when you’re signing up for stuff, a lot of times you’ll encounter drop down boxes, right? Those act like simple lookup tables, right, you are given a specific set of things to choose from, instead of typing in free text. Free text, when we’re talking about storing things inside of a table, if we want that free text to be uniform, we can’t rely on our users to enter that for us, we have to suggest and complete that and give them the options to actually select from. But since we have a separate table, here, we need something that’s called a primary key that identifies each record uniquely so we can include it in the other table. It also makes it easier to search.

Each relation in your relational database should have a primary key. And that primary key uniquely identifies each record. So each record in a table or an in relation should be able to be uniquely identified by any number of attributes. So a primary key does not have to be one single attribute. But it can be multiple attributes combined together that are unique. And without that uniqueness, there’s no possible way for us to uniquely retrieve any single record from our database. So that particular part is extremely important as far as relational database goes. What we have here in our new table is our major ID. And you can actually use that to look up the actual abbreviation or the actual major itself. So major ID in this table, this is our major table now. This column is going to be the primary key. And over here, we can just assume that our user ID is our primary key, because everyone at K-State has a unique eID. So we can safely assume that that is our primary key that uniquely identifies each user. But instead of having the major column here in our user table, we now have major ID. And so when a user actually selects their major, you can imagine they would be selecting this through a drop down box. But that is replaced not with the actual major, but the ID of that major.

So if we wanted to see what major John was, we take the major ID 3, look that up in our major table. And that tells us that John is an Information Systems major. And that information systems major has an abbreviation of IS. But this makes it significantly easier for us to look up all of the computer science majors because all we need to do is look up find computer science in our major table, what is that ID and then search that ID in our user table. And so that process is made significantly easier now. But let’s also talk about the birthday column, if we can think of different ideas to how to represent that. Because if you look over here, on our birthday column, we have a whole bunch of different variety of ways that we have for our birthday. So June 13th, June 1, February 2nd, Dec 26, Dec. 18th with a period 18th. So we have abbreviations for the months, days, with the th, nd. So there’s lots of different variations here, we and we don’t even have just the pure numbers yet, which you assume that people would add as well. So there is no uniform way of representing the birthday now. But what we could do is add that to our interface, right, we don’t have to necessarily purely rely on our database, although we could enforce a specific format for our column on the database side. But we can also enforce that when the date is actually entered. So we can have like a little calendar picker or something like that, or an algorithm actually checks the text before it’s actually saved. So just something to watch out for when you’re working with information.

It is a very good practice to keep our data normalized, to keep everything not just unique but evenly distributed or even format street same format for everything because if a human’s going through here and reading this, of course, we could also we can all see what the birthday is and understand what each of these mean. But when we’re querying our database or if our algorithm or computers looking at these records, it becomes significantly more difficult to actually understand. But let’s look at another example. What happens if we try to add a phone number to our user table? Now I’ve stripped off some of the other columns here just to make this example a little bit easier to see and fit onto one slide.

But let’s try to add our phone number. So but how many phone numbers does the typical person have these days? Well, landlines really aren’t that that common anymore, but people still have them. I have one in my office. But I also have, you know, my cell phone, I have a digital telephone number as well, and a couple others, right. So someone might have multiple, just even multiple wireless or cell phone numbers without even considering a landline. And so adding multiple phone numbers for a particular user can become a little bit difficult to actually do. And so if we try to just simply add, say, phone number one, phone number two, things get complicated really quickly, because if someone has more than one phone number, we had to add another column for their second one and another one for their third and so on. And so that doesn’t really become a very good practice. Because, in reality, we don’t want to have to add another call to our database or to our table every time we need to add a an extra phone number for a particular user. So like what we did before, let’s try to put that into a another table, its own table. So one of the ways that we could do this is just have a phone number table. and in this situation, Our phone number is going to be the unique or primary key. And then we have the users that connect the relationships together. So we have a we have many users, right to one phone number.

And this works, but still really doesn’t solve our problems, does it? We still have the same issue before, instead of having having to add multiple phone numbers, multiple phone columns, we now have to add multiple user columns, because right, what if an entire family signs up for our social network? That family might share a phone number, so each person in that family is going to have the same phone number. So every time another person from that family signs up, we’re going to have to add another user to that phone number. So we end up with pretty much the same problem that we did before. So if we flipped it again, right, we can still have our phone number table. But let’s flip that. What if we use the user ID for the primary key in both situations, so primary key over here is our user, right? But we’re going to connect that to a user in our phone table. But now instead of just having only the user ID as our primary key, we have the user ID and the phone number as our primary key.

So if we use both of those to uniquely identify a phone record, that becomes a little bit more easier to do, because we don’t duplicate any information here. Although, of course, we do duplicate the user ID down here, which is perfectly fine. But this makes it more easy to search, right. So this is a one to many type relationship, where we have one user too many phone numbers. So we have one user here, but many phone numbers here. And so that describes the relationship between these two tables. And so we can see it over here. Since the user ID and the phone number are unique together, we can have the same phone number for multiple users. So we have phone number 5134. So gameguy has two phone numbers, but Sharpie also shares the same number as gameguy here. Now this is just a simple example of how we might store a large amount of information or related information into a database that is easily retrievable by a computer.