Chapter 9

Obtaining Good Design

Subsections of Obtaining Good Design

Database Design Principles

YouTube Video

Video Transcription

Welcome back everyone. And in this video, we’re going to be starting our discussion on what makes a database a good database. We’ve talked a lot about so far how we might make a SQL query to retrieve data. And later, we’ll also talk about SQL queries that we can use to actually insert and modify and delete data as well. But what makes a database design a good database. So there’s a couple primary things that we’re after as part of a good design. And we’ve talked about some of these items, when we talked about the motivation for why we need a database over something like an Excel sheet, for example. But data integrity is one of the big ones, right? So the data that we actually store is, is sent, received and stored in the same way that the users are actually translating it as so you know, if we need to store some number, or string or whatever it may be inside of our database, that same number that is transmitted is stored. And then when we retrieve that data, that same data is actually retrieved as well, it didn’t actually change or get corrupted or, or anything like that. But the big point behind data integrity is that we’re trying to avoid unintentional changes of our data, whether that be in transmission, writing, or retrieving our information that we’re actually working with. But that also also kind of helps us with maintaining our data as well. So this includes anything, any data that gets updated, deleted, all those sorts of things, and also the maintenance of our code. A poorly designed database is going to typically impede our ability to maintain a consistent state of information, and especially maintaining that consistent state across a larger period of time. poorly designed databases often include multiple copies of the same information, and summon, some copies get updated, some don’t, some get deleted, some don’t. And a variety of other things in between. And I’ll show some examples of this here in just a little bit.

But that also helps us write better code as well, allowing us to maintain good quality SQL code that we write as part of our database alongside application code as well that utilizes the results from that database. This also helps translate usually to better performance, a well designed database will typically take up less storage, especially with a database that doesn’t store duplicate information. Of course, less duplicate information means that we’re having to store the same information only once, rather than multiple times. So that usually translates to better performance as far as storage have better efficiency for storage. And typically, better performance on SQL queries. Although sometimes, we do sacrifice some performance to improve the footprint or storage of our database. But again, it really depends on the use case. Here, we talked about good design that provides data integrity, maintenance of that data, as well as better code as a result for this class. Overall, we’re going to talk about a couple primary, a couple primary design principles that we want to follow. And these are of course, covered at a very high level. And we’ll have some more videos that cover these in a little bit more detail. But these are some of the things that you’ll want to strive for as you’re designing your databases. But the first one here is to avoid unnecessary complexity. Sometimes when we’re working with data and information, it can be a big pitfall to overcomplicate things. So sometimes we make extra tables when we don’t need extra tables, adding more columns, and we don’t need more columns, and a variety of things in between. But unnecessary complexity, of course, also increase in our database also increases the complexity of our SQL that we have to write in turn complicating our code that we have to write, which makes your your program your programming your SQL code, and the program that uses it more susceptible to bugs and issues down the road. This also can affect our data integrity as a result.

So unnecessary complexity is a kind of a difficult one to avoid because it’s not always apparent that you’re going down a road that is more complex than others. But this does come with practice and we’ll see some examples of this as we start designing more databases, our other general design principle that we’re going to focus on here is avoiding redundancy. Most of the time, this just equates to not storing the same data more than once. So this is going to be kind of our mantra as we start going through writing our databases. So I’ll show I’ll, I’ll show this in my upcoming example here. But this is also something that you should consider following when you work on your final projects and go out into the real world and start designing databases as well. This is probably the most common issue that I actually see in database designs, not only in industry, but especially for new database engineers, or designers that come out. So typically, it is easier up front for someone to store the same data in in one column and store that exact same data in another column in another table. But that duplication of data can, again, as we talked about a little bit ago, influence our data integrity, right, the data consistency gets lost, because if one record is updated, but the other is not, then we have an inconsistent state of data of the same kind of record that we’re actually storing. So one of the things that we’re going to really try to focus on here is trying to avoid storing the same information more than once. On occasion, we can’t avoid this, but on more, more often than not, we can avoid this kind of issue. But those are two primary design principles that we’re going to be focusing on as part of this course. And here in the next video, we’ll take a look at an example of the beginning of designing a database

Database Design Example

YouTube Video

Video Transcription

Welcome back everyone. And in this video, we’re going to be taking a look at an example database design. So we’ve shown some basic database, UML diagrams, but we really haven’t taken the time to draw our own yet. So in this example, we’re going to take a look at designing a really basic database for offering college courses on a campus like K State, a course is going to be defined by its number delivered by a department in a college and offered certain semesters at the same time on various days of the week. So this is going to be the basis that we’re going to start off of start from for drawing our database design. And here, I don’t really give a lot of details here, right, we’re just basically defining a college class, right, like cc 520. Right, CC is essentially the department, which is offered in the Arts and Sciences College. And so 520 is the number and typically right, the CC the department and number is going to be a unique combination within a given college. And so there’s a lot of other core details here that we’re missing as part of this. But let’s take a look at how we might start drawing a database that will store college courses. So what we have here is called Lucid Chart, it is a diagramming tool that can be found online, you can download it as well, there is a free version that I’m using here, it does limit you to the number of documents that you can actually create. But for the most part, you can get around that by adding multiple pages like I have down here to the same document, you can achieve the same kind of functionality through a lot of other programs. One other one that I really like is called draw.io, which is a completely free tool that you can use online to draw things like flowcharts. And in this case, UML diagrams for drawing or designing databases.

So here is a UML diagram that I talked about, for our table that we talked about here, or the course that we talked about here, a course identified by a number department college and then offered some semesters, at the same time at various days of the week. So I don’t quite have all of the detail here yet that we talked about. But I do have some of it, right, so we have a course table. So every every entity that we see over here, and you see there’s a couple of different options here that we can actually use, most of the ones that I’ll be using today are going to be just this entity model here, which allows me to identify the columns that are listed here, and my table. So table name, which is course and may actually be better me to make this a shaded header, it doesn’t quite look as clear. So I’ll leave it unshaded. But we have the table name here. And then all of the columns associated with that table or that that that is in that table. And on the left of each of those column names, I can specify whether or not something is a key, like a primary key foreign key, or maybe that has a unique constraint associated with it. So UK is used to denote a unique constraint. And we can specify a variety of other things, other constraints here as well, if we’d like to, if you wanted to get in more detail, we could do this triple the triple field, where are the triple column table where we have the keys or constraints on the left, the column attribute names here in the middle, and then the column types on the right hand side. And today’s example I’m not going to go into assigning types here. But I will talk about the general difference between some of this. So what I’m focusing on for this example is our logical design of our database. Meaning that what we actually put into into action and implementation like in Microsoft SQL Server will actually be a little bit different to what the actual database design actually looks like.

Because the logical sometimes or Some things, we’re not able to concretely model as part of a logical design, versus what we what we actually implement in the actual database. And I’ll try to highlight some of those differences as we go along here. But for now, we’re not going to really try to define the physical, the physical implementation of our database. So I’m not going to bother with the column types for now. So let’s go ahead and take this out. But we have a single course here, a single table. And obviously, we’re not, we don’t have all of the information that we need yet, like days of the week time, this course is going to start all those sorts of things. And even things like semester, right is very vague of what is actually stored there, like, are we going to store fall 2020, fall 2021, spring 2022, whatever it may be, that’s a lot of little different pieces of information that we have to cram all into one column. And we want to avoid those types of things as well. We don’t want a column that we are forced to store more than one bit of information, even if it is related, it is much more powerful to store the those pieces of information in separate columns, or even in separate tables. So we, for one can have more expressive queries. And two, we can actually have better data integrity, right, we don’t have to have as much redundant data being stored as part. So let’s kind of explore what we can try to start adding to this, to make this a little bit better.

So I’m going to use my little copy up here just so I can change things here. But or to make this go a little bit faster. So let’s first start out by taking out some of the dupe the things that would be duplicated, right, because if a let’s say, you know, in general, right, a department is going to offer more than one course, typically. And if a department offers more than one course, things like the college name, and department name, are going to be duplicated across all of those different instances or all those different rows. And again, we want to avoid duplicated data. I’m not going to touch semester for now, let’s go ahead and focus on college and department. So let’s make a department, a department table. So I’m going to have department here, and I’m going to go ahead and do most of my most of my primary keys, for a lot of this data are going to be surrogate keys, meaning that at some unique identifier, typically it’s going to be some auto incrementing number, that as we increase as we insert new data into our table, these surrogate keys are a little bit more beneficial and easier to work with. We could enforce that, you know, a department name be unique, which we’ll probably do anyways here. But having the primary key based off of that can be tricky at times. So in general, it’s better practice or you’ll see this in practice out in the real world where we have this auto ID field that serves as our primary key. And even in other database implementations like MongoDB, which is a no SQL solution, their primary their default primary key is an auto incrementing ID field that you don’t even have to add as part of the the database design is just exists as part of the database implementation. So let’s go ahead and add a couple more fields here. So we’ll need to do let’s add Name, college and let’s say like phone and email or something like that. So here we have the part met name, let’s do college. And then phone and email. And then we don’t have I’m not going to have a key here quite yet. We’ll expand that in a little bit. But I am going to enforce that the department name be unique so we can’t have to department to department of computer science does have to have some unique, some unique name as part of it. So now what we what can we do with this setup?

Well, it I can actually show a relationship between these two tables. And now instead of actually storing department here, as part of my course, I would actually store department ID. And that department ID now becomes a foreign key. And with that, we would, we can still enforce that this be a unique key as well along with number. So we don’t have the same course number twice, for a particular department. If that is a restriction that you want to add, I’m not going to add as many of the like unique keys restrictions here, I’m going to save that as an additional exercise for you to try on your own. But just something to consider. So now we actually have, we don’t have as much duplicate information, along with the department now. So we don’t have duplicate information for the department. But we’re still duplicating information for the college because you can see here, I have college, for the course. And then I also have college for the department. And so it really doesn’t make sense to keep college as part of our course. Because since it belongs, since it is associated with the department, then we can take that out of the association with the course because we don’t need that we can get that through the department. So we can remove that row. Let’s this back over here. And by the way, as I’m drawing these relationships here, we talked about before, where the quote with the crows foot notation in our homework, so the crows foot notation, with the three lines. This is this means many. This one here that I’m putting on that is one or many. And then if we had a relationship that has the three the three ticks, but with an open slot, let me find it, here we are, this one here that is zero, or many, that means a department must have can have zero or more courses, or one or more courses and this situation here. And some of like, as I mentioned, some of these relations here cannot actually be strictly enforced as part of the physical implementation of your database.

So if you initialize your, if you initialize everything, and you say that while the department must have a course, well, some of these things are not actually you’re not actually able to physically enforce as part of the database design in Microsoft SQL Server or Postgres, or MySQL, or whatever database management system you’re actually using. So this is primarily a This is primarily a logical, logical implementation of our database as a whole. But we have our line being drawn here to show that a department will have at least one or more courses. And a course can only be assigned to a can be offered by one and only one department. We’re not going to open up the can of worms here that you know, maybe a courses co taught, and it’s offered by you know, two departments, we’re not going to go down that road for this design. But that could be potentially something that we might have to account for down the road. But let’s continue on expanding this a little bit. Because we also want to consider our college along with this. Because with our college that we have here, again, we have the same issue that we had with our course, we have a we have the opportunity here if a college has more than one department that we’re storing redundant data. So let’s go ahead and make ourselves a college table. And then within this college table, we’re going to have a college ID. And let’s do a couple extra rows here and I’m actually just going to do a name for our college I’m not going to bother quite yet with make this unique and make this our primary key. And then we’re going to draw a relationship from college to department And a college will have, maybe, let’s go ahead and enforce one, at least one department or more. And then a department must be associated with exactly one college.

Also keep in mind as I’m drawing these relationships, the the spots where I actually connect it to on the UML diagram here, whether I connect it down here or up here that has no relation, you won’t actually see the linkage drawn between the actual fields. It’s just the linkage between this table to this table or this entity to that entity. And now then, we actually need to make this here a foreign key and change this to college ID. All right, so now we are on the road to having less duplicate data as a result of writing this out, but our design here is pretty flawed over all, primarily because we still can’t tackle this thing, this aspect right here, a course is offered certain semesters at the same time on various days of the week. And so all what we’re actually able to are all what we’re achieving with that, right now is just this semester column right here, as part of our course. And having the semester all all about information packed into one column is quite flawed, we’re not going to be able to query where our queries aren’t going to be able to be very expressive. It lacks consistency with data integrity, and all those sorts of things. So let’s work on splitting that out from our course table. So I’m going to shift our entities down here a bit. And I’m going to make, let’s see here, probably needs a few extra tables that I’m going to put up. So let’s put one over there. Put one up here, and then I’ll put a couple down here. Alright, so let’s go ahead and see what we can do to pull out semester. So first off, let’s make us semester, a semester table. Because things like fall, spring, summer, all those sorts of things we don’t want to list out and in one column, fall 2020, or fall 2021, or spring 2022, or whatever it may be, because again, we’re duplicating the idea of have the term right fall, spring, summer, or intersession, or things like that. So we don’t want to duplicate that information. So we want to attach an ID to that.

So we could even say this to be term. And let’s make this primary key term, ID. Let’s a nother row here. And this is going to be a unique field. Term name. Over here, let’s see here. What do we need to actually start to pull out from our courses? Well, let’s rename our semester to B term. This is going to be term ID. This would be foreign key. And then we would have something like year here. But even with this, if we drew this as part right, we’d end up duplicating information. As we offer, let’s say cc 520. We offer it this spring, but then we offer it again, the following year, and so on and so forth. And some you know, some courses, of course are offered every semester. And so we end up again, duplicating information about the course write the name, description, all those sorts of things that we don’t really need. That doesn’t really change from semester to semester necessarily, but we we so we don’t want to actually duplicate that part. So let’s kind of keep on abstracting this a little bit farther. To be, let’s say, course schedule, so we can keep track of when something is actually offered. So let’s include here. Course Schedule ID. I’m doing some of this on the fly here. So of course, there’s probably some better ways of doing this by did kind of want to illustrate how we might iterate over our database design as we go. So we have our primary key. And we need at least a couple of fields underneath here. So let’s have a foreign key to be term ID. So let’s put that up there. Let’s shift this up a little bit. So we have some space. And let’s take out let’s take out this row. And then we’ll go ahead and draw this relationship, let’s flip it. So we’ll have a course here with zero or many here. And then let’s say we have course, start date, course and date, but it doesn’t. Well, it’s just not make as much sense here to call this a course schedule.

But let’s take this back out and rename this term schedule instead. Because mostly like fall, the fall 2021 semester starts at a specific dates, and so on. So let’s rename this to term dates. Alright, so we have like, the fall term, or spring term, or so on and so forth. And then we have the date associated with this here. And then with that, maybe we also have Above this we have we could we could get, actually, no, we don’t want, we no longer have to store a year because it’s kind of implied from our terms start date. So let’s go ahead and remove year there. And with that, then we can also remove year from our course. So a course is offered on a schedule let’s go ahead and model this. So like our will assume for now that the course the course, the course names and stuff like that don’t actually change. So let’s then go over here and make our course schedule to do here let’s shift some of this over. Alright, so we have a course, we have our term. And now let’s go ahead and try to extract a little bit more information here. As a result, so we have to actually have some way to combine our term with our course because the course is offered during a specific term. And so we want to be able to model that. So if we we could, in theory, make a relationship directly with the term schedule. We could link a course directly inside here, but that means we You’d also duplicate the start date and end date for a huge number of courses. So we don’t want to do that. And then we’d also have a many to many relationship there, which is also kind of hard to model physically inside of a database. So let’s go ahead and keep on trying to work through this.

Let’s make a schedule ID, this is going to be our primary key here. And then we’ll need to have quite a few row or quite a few columns as part of this Sony course ID will need the term schedule ID, the will also need to know the days of the week, and then also the time that the course starts. Or the location, the location. So let’s then do this number. We’ll do course ID. And then we’ll do term schedule ID. And then I’ll leave this one blank here, this will be the week. Something we haven’t quite fit, we haven’t quite got that table made yet. Love a foreign key here, a foreign key there. And then for now, I’m just going to keep this abstract and say location, delete that. But as I mentioned, here, we need some way to denote when this course is offered. So during that actual terms, so let’s go ahead and do a week schedule. And the week schedule ID. And then let’s add a few rows here. Let’s do days of the week, and then start time stop time. So days, the course the days, and then the start time, stop time. And then let’s go ahead and make these three fields unique together. So we don’t have a duplicate week schedule, right. So like Monday, Wednesday, Friday at 1pm to 2pm, or something like that. So let’s then get this key over here to our car schedule. There we go. And so let’s go ahead and also draw our relationships here. So we have a course, that may be associated to a that’s going to be associated with a course schedule. One schedule item is going to be associated with one course. Because a course can’t be scheduled more than once and a particular term. Or at least in this situation here. That there and then we have terms scheduled to course schedule.

So we’ll once a term schedule may be used and multiple course schedules. So we have, but of course schedule a single course schedule is associated with only one offering there. And you we might have a term that has no courses yet, so we’ll do zero or many there. And then we need to associate our week schedule with our course schedule, and that will be a similar relationship. We may have a time that does not have any courses. But a course schedule. A course that has been scheduled has exactly one one offering during that particular term. But zoom out just a little bit so we can get a bigger picture of what we have going on here. So this is everything that we’ve we’ve done so far. And of course, there’s absolutely some more work that we can do here. For example, we don’t have any enrollment information as part of our courses, of course now. So we have our college department, the course, and the time that the course is actually scheduled. But we have no information on being or we have no way for a student to actually enroll in said course, either. So I am going to leave that portion of this particular database a exercise for you to do on your own. Likewise, things that I haven’t included here, there are a lot of various things in my design here that may or may not work. Or, like, for example, I haven’t put unique keys on things that probably should be unique. I haven’t added all of the other additional tables that we might also need to store this type of information. But you kind of got to start to see the the picture that we did, right, we started off with a very large course table that had everything in it.

And things particularly things like the departments, the semester, all of that, and we broke that out into their own tables. So now we have our own department table, our own college table, that way that information is not duplicated, we now also have fewer ways that the course information is duplicated, because we separated the course from the scheduling. That way, we don’t have to repeat the course name and description and things like that. So this is really the start of trying to design a good database, right trying to reduce redundancy, trying to increase consistency among data. And overall make the your SQL that you can write much more expressive. If everything is crammed into one table, or one column, things like that, it becomes more difficult to write actual SQL queries. But that will include my example, that we talked about here. And the follow on videos will actually spend a good amount of time talking about the underlying theory behind what makes a good database design. We’re not going to have a huge focus as far as a lot of practice on the database theory, but I’ll be covering it in a more general sense. So hopefully that you catch the overarching ideas that is important to designing a good database

Obtaining Good Database Design

YouTube Video

Video Transcription

Welcome back everyone. And in this video, we’re going to start taking a look at some of the fundamental database design theory, that makes a good design better. So we’ve already covered some things that centers are on, on what not to do with a database design, right? primarily centered around inconsistency with data and redundant information. We saw this with the university course example as well. But data redundancy can be a problem with you know, with redundant data, which record is the consistent one, which one is the truth, right? If we have the department name for computer science listed more than once, which spelling is the correct one, this becomes even more of an issue with misspellings of people people’s names or emails or things like that. Um, so data consistency is a huge issue there with data that is copied in into multiple places, especially when we deal with things like deletes or updates to information as well. So if we have to update things in multiple locations, some may get missed. And so that data record is no longer consistent with the original. And so that introduces this idea of data anomalies, behind data, redundancy, data consistency, and all sorts of other issues, including an the worst case scenario losing information. And so bad database design isn’t just about an efficiency or performance issue. It really highlights issues with keeping our data safe, secure, and true or consistent with how we actually store it.

That brings a good question to how do we actually achieve a good database design? Well, we’ve already showcased some of the issues that we encountered in representing a course as part of a university, and trying to reduce the amount of redundant data that we’re actually storing. But it’s much more than just redundant data. So overall, a lot of the underlying theory that we’re going to be focusing on is this idea of normal forms, which have been developed over the course of many years through a lot of different database researchers, a normal form is going to typically guarantee for a database if a database design follows a one of the normal forms, it’s going to have this guarantee the sets have guaranteed different properties associated with it. And we’ll talk about some of those properties here and a little bit, the most common normal forms that you’ll see are third normal form, and Boyce Codd Normal Form. And we will cover both of these and a little bit. But they aren’t going to be the core focus of what we’re talking about today. But third normal form and Boyce Codd Normal Form do guarantee certain types of data redundancy. And there’s quite a few different types of data redundancy overall. But as far as normal form normal forms go, there’s actually five Normal forms in addition to Boyce Codd Normal Form. And the higher you go here, and number. So the more you go down into the center, the more consistent your data is going to be.

But the sweet spot is typically going to be there around Boyce Codd, or third normal form. So these are going to be the most common goals behind the database designs is what you’re trying to shoot for. But what is a normal form? Well, overall, normal forms are going to be defined using two primary things, keys and functional dependencies. We’ve talked about keys in the past and the idea of a primary key which uniquely identifies a row. And we haven’t really hit on functional dependencies quite yet. But generally speaking, normal forms are going to have a lot of restrictions on these two things, particularly around keys and functional dependencies in order to guarantee certain types of data integrity and data consistency. But we’ll talk about those in a following video.

Functional Dependencies

YouTube Video

Video Transcription

Welcome back everyone. In this video, we’re gonna be talking about functional dependencies in regard to normal forms. But what is a functional dependency? Well, a functional dependency is a form of a database constraint. functional dependencies in general aren’t always defined as part of a physical constraint on your database or inside of a table, like a unique constraint, or a foreign key or a primary key. Finding them is really the core essential part of getting a database designed well. In general, functional dependencies are going to be used for normalizing our database or normalizing our tables or relations. Usually starting with some relational schema, overall overarching idea, finding those functional dependencies between tables between entities, and using them to design a better database overall. So what is exactly a functional dependency. So a functional dependency is a set of columns, that implies another set of columns within a table or relation if we’re talking about other terminology. So at a very basic level, here we have, for example, department and course number implies course name, and course description. We can’t just say course number implies course name Course Description, because we could have, you know, CIS 115, or CC 520, or cin CIS 520. So, database essentials, and CIS 520, which is operating systems. And so we can’t just say course number, but if we say department course number.

So if we know cc 520, then we can pretty much know what the course name and the course description are going to be. So formally, we have this kind of notation. So if we have attributes a one a two through a n, so the dot dot dot means eight through. So this works for any number of columns or attributes as part of a table, the right arrow here is going to mean implies. So attributes a one through a n implies B, one through B M. So if we have department course number, that implies that we know also no course name and course description. So this is the formal logic that you would actually see if you try to look up some functional dependency information online. And formally, it’ll look something like this. So if we know columns, a one a two through a n, then we also know b one, b two through BM. So finding these functional dependencies are crucial to creating a good database design. But when does this functional dependency hold? Right? So how do I know that department and course number means if I know department and course number, then I also know course name and course description. So does that How do we know that actually works across our entire database or within our table. So formally speaking, a functional dependency will hold if for all rows, that functional dependency is true. So in other words, if we have this relationship s, with columns, a one through a n and b, one through BM, so we have all of these columns, this is one table inside of our database, and we have rows denoted by t and t prime here.

So imagine that we have you know, anything any number of rows as part of our table, if rows t and t prime agree or a one through a n, then they also agree for B one through BM, meaning that for every single row in our relation, or in our table inside of our database, this functional dependency holds true. So if I had a table that had all of our departments, and course numbers, along with the name and description for all of those, then we could go through row by row and check to see if our statement that we made earlier here, department course number implies course name course description, we could go through each and every single row to make sure that that statement holds true. So really, what functional dependency functional dependencies here and the notation that I’m using really boils back down into some formal logic, and a lot of it can boil back to what you learned with conditional logic Boolean In logic when you first started learning how to program but in the following videos here we’ll take a look at some examples of various kinds of functional dependencies

Functional Dependency Examples

YouTube Video

Video Transcription

Welcome back everyone. In this video, we’re going to be taking a look at some examples of functional dependencies. So first we’ve talked about when a functional dependency may hold officially as part of a relation. So a functional dependency holds or doesn’t hold on an instance of a relation or table. If for all cases are all rows, and that functional dependency holds true, or we find a row, that’s where a row where that functional dependency does not hold true. So we have this example bit of information. Here we have column student id, name, phone and department. And let’s say for the sake of this example, we say that our student ID in informs us that we know we can infer name, phone and department from just the student ID. So if we have a computer science students, we can say that that computer, we, if we have their ID, we can infer their name, and the phone number and department that they belong to. Let’s say we also can say that if we know the department, then we also know the phone number for that department. So let’s see if if these things hold true here. Well, if we look down the student ID column, this holds true right student ID implies name, phone and department because if I look at student ID, all of the bits of information, name, phone and department are unique for every single student ID. So we have no duplicated information, right? So if, for example, though, if both of these students here student ID, 3542 and 1111. If both of them had the same name, then we would have some issues, right, our functional dependency would not hold, because there would be no way for us to tell the two apart. So let’s continue on looking at some of these. So department phone, well, math holds true. So math implies 1234. And g implies 1234. Cis implies 9876 97. Six, that’s consistent.

But we can also then, let’s try this other way. Right? Phone implies department. So if we have phone implies department, let’s look. Well, that works. Okay. But what about this, right, this is where we have an instance, where we have something that does not hold. So if we had, the student ID example is kind of hard, a little bit harder to see. But it’s a little bit more clear. If we talked about phone number. So if we have, if we had the same student ID but different names or different departments, then we’d have an issue with the student ID. But here we have two departments that share the same phone number. And so department implies phone still holds that functional dependency, but phone implies department does not. Because if 1234 implies math, and then here 1234 implies English. And so this functional dependency, we can say does not hold for that situation. So this means what the functional dependency and functional dependencies hold only if the data being stored, holds, if it matches through with all the rows that we’re actually storing inside of our database. If we introduce a record to our table that causes that functional dependency to not hold, then we have some some form of issue either our application is not catching all scenarios and cleaning the data that we need cleaned up before it’s being stored, or we have an inherent flaw in our database design. Another you know, a confusing example I guess with with functional dependencies here are the idea of city state and zip.

Even if we consider like Kansas City, Kansas, Kansas City, Missouri, or even was just Kansas City, Kansas side right? There’s Letha, Shawnee, Olathe North of the South, we have all all these different zip codes. And the zip code does not necessarily imply city name because a zip code can span multiple cities. A city can span multiple zip codes. And a city can also span multiple states. So city state and zip is usually something that someone we all default to well, if we Know, the zip code, then we know, the the city or the state, and so on and so forth. So there’s some really confusing things that can with data that can make it difficult to actually write good databases, or good tables as a result of that. But let’s take a look at this example. Here, we have a few functional dependencies here. And remember, our functional dependencies are constraints inside of our database. On some instances they hold. And on others, they don’t, as we saw with our previous table with the phone number and the department, but do all of these functional dependencies hold for this particular example? So take a pause real quick in the video and take a look at the data and check to see if all of these functional dependencies hold. Well, let’s take a quick look. So we have name implies color. So iPad silver, iPhone, silver. So this is okay, so far. We don’t have so good so far. Right? Category store, categories, store, Gadget, campus store, gadget campus store, good so far. And then we have color and category. So gadget silver implies price 529. So gadget silver 529. Gadget silver for 29. Ah, well, this functional dependency here does not hold according to the data that we have in our database right now.

But what if we added an additional additional row here, right? What if we add an additional row? How about here Well, name, category, name, color, name implies color category implies store. That looks to be all good. But let’s take a look at color category again. So gadget silver implies price 529, gadget black, so we’ve changed the color here, Gadget, Black 429, tablet, Silver 569. So, so far, so good. With this particular situation, right? They all hold here. But just because they hold all all, just because they hold here in this instance, of our table doesn’t necessarily mean it’s a good functional dependency. And we can enforce the idea of a functional dependency on in some cases, we can add things like check constraints, which we’ll talk about later in the course, to enforce functional dependencies inside of our physical design of our database. And then we can also enforce restrictions on the application side to filter out data. So we don’t actually violate any of our functional dependencies that we have defined. But they all they all hold here in this particular example. Generally speaking, we’re going to, or at least our goal with doing a good database design is we want to be able to extract and identify all of our functional dependencies that we have. So how do we actually achieve that task? Well, if we have, just for a sake of example, our functional dependencies that we had just just a second ago, if they all hold, then we can also imply name category implies price. But well, why? Well, if we have a set of known functional dependencies, typically we can actually extract more functional dependencies out of that as a result. And in this case, we can extract the name category implies price through a transitive property. So if name implies color, and color category implies price, then we can also say color character. Then we can also say, name. Category also implies price because name implies color, color, and category implies price. Therefore name and category implies price as well. So just one method for us to extract more functional dependencies. But in the following video, we’ll actually take a look at more ways that we can extract all of our functional dependencies. And a lot of times, we are defining functional dependencies initially, just based off of our inherent knowledge of our data, and then we can use these other techniques to pull out more

Anomolies and Armstrong Rules

YouTube Video

Video Transcription

Welcome back everyone. And in this video, we’re going to be taking a look at various ways of pulling out more functional dependencies. Inside of our relations. Particularly, we’re going to be focusing on how we can avoid anomalies and pulling more functional dependencies out using the Armstrong rules. So first off, what is an anomaly? Well, anomaly. An anomaly can be defined as essentially a bad functional dependency, and particularly where where a bad functional dependency holds based off of the current data that we have. So typically, we will know some of the functional dependencies just based off of our knowledge of how the data works in the real world. But we really, truly need to find all of the functional dependencies, so we can identify the bad ones, because the bad ones are going to be the things that causes data consistency issues, and integrity issues down the road as more data is added into our database. So how do we find all of these as a whole? Well, there are a few methods that we can look at to find all functional dependencies. The first one here are called the Armstrong girls. Now, this will cover a little bit more into the the heavy theory side of things with databases. But we are going to cover these just as an informative step. But Armstrong’s first rule is called the splitting and combining rule. So if we have a one attributes a one through a n implies B, one through B M, that means these are all equivalent, right?

So a one through a n implies b one implies b two, and so on. So if we have a multiattribute, functional dependency on the left, that implies the right, then we can split the right hand side of the functional dependency out into individual functional dependencies. The second rule of Armstrong’s rules is called the trivial rule. And this one is relatively straightforward. So if we have a one through a n implies a i, where i is one of a one through a n, right? So in other words, if we have something like this, ABC implies a, ABC implies B, and ABC implies c. So those are all trivial, right? So you know, the student ID is obviously going to imply that we know the student ID, and so on. Our last Armstrong rule that we’ll be covering is the transitive closure rule. So if we have attributes a one through a n implies B, one through B, M, and B, one through bn implies c one through C p, then we can infer that a one through a n implies c one through CP. We did this already before. Using the name, name and color category implies price example in our previous video, but let’s take a look at some functional more functional dependencies centered around that. So here are the same ones that we had before name, color category, category and Play Store and color category implies price. So from these functional dependencies, we can imply all of these other functional dependencies as a result. And let’s break down which Armstrong rules that actually apply here.

Well, the trivial ones are pretty straightforward and easy to go. So name category implies name. That’s obviously the trivial rule because we have the same column on both sides. So that means this is a trivial functional dependency. Name category implies color. We’ve already covered this particular one in a previous video, but this is the transitivity rule, particularly on the dependencies four and one right since we know name and category implies name. Then name category also implies color because name implies color, we have six, which is a trivial rule again, because we again, we have the same cat or the same attribute or column on both sides of the functional dependency. Seven is going to be the split and combine rule based off of five and six. So since we have named category implies color, named category implies category, therefore we have named category implies color category. So we’re coming Binding five and six into one single functional dependency. And the final line here name category implies price is the transitivity rule based off of seven and three. So since category and color implies price, and we have name category implies color category then we have name category also implies price because color category here, this right hand side of seven is this left hand side of this functional dependency. So therefore, we can substitute price for color category because color category implies price. But this is kind of hard, right? I mean, overall not too bad. Most of the rules are trivial or easy to combine or split. But this is kind of painstakingly slow to actually implement and the larger your tables and relations are, the more difficult the Armstrong rules actually get to apply to extract all of the functional dependencies. So in the following video, we’ll take a look at a little bit of an easier way to extract all of them.

Closure Sets

YouTube Video

Video Transcription

Welcome back everyone. In this video, we’re going to get to our discussion on extracting more functional dependencies from our database. So before we ended on the Armstrong rules, which worked fairly well in pulling out more functional dependencies, but we found that it was a little bit tedious to actually do. And if we applied that to a much larger schema or much larger database becomes a little bit impractical and use. So that brings us to the idea of a closure set. So a closure set is defined as a set of attributes a one through a n, and the closure set of the that set of attributes is defined here, using this plus, right, so this is, if we have a one through a n with the curly brackets and the plus sign. That means this is the closure set of those attributes. The closure set of those attributes is the set of attributes b such that a one through a n implies E. So essentially, everything that you can imply for a given set of attributes, that is a closure set everything that we can imply from a set of attributes. So if we consider our previous example with name implies color category implies store and color category implies price to what are the closure set of these functional dependencies. So to start out, by computing the closure sets, we’ll list out each of the functional dependencies that we have. So name. So we’ll first start out with name. So what can we imply from name. So initially, we can just say, we can imply name and color, right name being the trivial implication, because we can name implies name. And we can get name and color, we can get color from our original functional dependency up here. So what about name and category? So what about name and category? So I’m just listing some of the individual attributes down here.

This isn’t necessarily a direct one to one mapping up here to our functional dependencies, although we’ll use our known functional dependencies to compute our closure sets. So what can we imply using name and category? Well, initially, we can just put the trivial ones first, so name and categories and then category. But what else can we imply? Well, name itself implies color based off of our functional dependency appear. And this existing closure sets, we can also imply store primarily because we have category up here as part of our functional dependency. So category implies store so we can include store here. And since we have color and category now, on the side here, that means we can also imply price based off of this last functional dependency. And so if we have the name and category, we can actually imply all of the other attributes as parts. But with color alone, we can’t imply anything other than just color. Now, this isn’t all closure sets. Based off of these examples. If we wanted to compute all closure sets, we would exhaustively go through all the combinations of attributes on the left hand side, but a lot of those would actually result in the exact same closure set. So we don’t have to compute all closure sets, because a lot of them are going to be duplicates of each other or equivalent in nature. But there is a nice handy algorithm that we can utilize to compute our closure sets. So if we have a set of attributes x a one through a n, we’re going to repeat this algorithm until x does not change. If b one through B, N is a implies c is a functional dependency and b one through B in our all in x, then we’ll add C to x. So let’s take a look at an example here. So if we take a look at our name, and category, right, so we can look through all of our functional dependencies here, right, so we have a name, category, color, store price. So name gets added, because name and category gets added by default, because those are the trivial functional dependencies. That name implies color. So color gets added first, we’ll loop back up and then try a another sets. So or try another. So if we can extract another functional dependency out from here.

So we can now that we have color, we have category and color implies price. So that is another functional dependency, right. And so that means we can add C, which is price to our closure set. And now that we add price here, we don’t actually have any other functional dependencies that we can extract. So our algorithm stops. So hence, right, we have a new functional dependency, right? Name and category implies color store and price, which is something which is a functional dependency that we did not have before, before we started calculating our closure sets. So this is one benefit that we can actually do, in terms of extracting more functional dependencies is computing the closure sets and see what other functional dependencies we can actually pull out. Typically, it is normal to in our closure set include the trivial, the trivial inferences here, like name and category. But when we, when we define a functional dependency, typically our functional pin are we don’t include in our functional dependencies, the trivial attributes, so we don’t include named category as part of on the right hand side as part of our functional dependency. So we stripped those out. Let’s take a look at a quick example here of this working in action. So if we have a relationship, a relation or table, A, B, has columns A, B, C, D, and F. And we have these functional dependencies here. What can we actually achieve here? Well, first off, we’re going to compute two closure sets, just as an example as part of this. So we have kala a closure set of A and B and a closure set of A and F. And we start out by essentially just putting the trivial ones out there, right, the columns that are on the left, we can just immediately put those on the right, so let’s take a look at our individual functional dependencies here.

So first off, we have A and B. So A and B are going to imply C, so we can include that down here and our closure sets. So let’s put C here. And then with the other of the other inferences, that we can make b implies d, so we can add D to this closure set. And then we also now are, we now have a and d as part of our closure set. So that means we can also imply E. So those are the attributes that we include as part of the closure set of A and B, let’s take a look at a an F now. So a an F, by itself, a f is a functional dependency implies B. So that is the first one we include. Now that we have B, we have a and b as our functional dependency up here at the top, so we can put C and R our closure set. Now that we have, we have B and R, we have B as well. So we can add D to our closure set. And now that we have A and D in our closure set, we can use this functional dependency to include E. So this is a I think a much quicker and easier way of pulling out all of the things that we can imply, given a set of attributes. And that allows us to identify more functional dependencies. And in this case, we just identified a couple new functional dependencies. So A and B, also a and b implies C, D, and then a f implies B, C, D E. And so those are two new functional dependencies that we’re able to extract by and by creating our closure sets. In general, why do we care? Why do we need closure? Well, with a closure set, we can find all functional dependencies as part of a relation. And with that, after we if we have a closure set, we can actually confirm whether or not a set of attributes implies another. So if we compute x plus or the closure set of x, so if a is in the closure set, we can confirm that that functional dependency does exist. So this will become a little bit more apparent as we work and build on this foundational database. You Particularly around defining our tables. And so one of the big reasons why we want to extract all of our functional dependencies is so we can make sure that our tables are in normal form. And we can use this as a way to identify keys as part of our relation as well. But that will be continued on in a another video.

Super Keys

YouTube Video

Video Transcription

Welcome back everyone. In this video, we’re gonna continue our discussions on how we might design a good database. So last time we talked about third normal form Boyce Codd Normal Form, and how those normal forms can be defined using functional dependencies, and keys. And in particular, we’ve spent a lot of time talking about functional dependencies and what that entails. And we also defined how you might calculate the closure sets, and how those closure sets could help us find all functional dependencies, along with some other techniques for finding those as well. But in this video, we’re going to focus on the second part of what makes up our normal forms, particularly with keys. So we’ve talked about primary keys, and what that entails with our database. So if you remember, right, primary keys are a column or set of columns that help uniquely identify a row inside of a relation or table. And so there are I’m going to take that just a little bit further and talk about keys and super keys in particular. So what is a super key? So a super key is a set of attributes a one through a n, such that your, I can spell this out here says such that for any other attribute B, we have a one through a n implies B. And remember from our previous videos, the underline the arrow is the implies symbol, to avoid duplicates, a, it’s a very important notion in a database to refer to or have a key of some kind. And that key being a primary key, or in other words, a super key that helps identify all other columns, that in turn, helps uniquely identify any given row inside that relation. So I have another super key that I want to define here.

A minimal super key is a key that is a set of attributes, which is a super key, and has no sub set of attributes, that is also a super key, right? So meaning a key or a super key that has no super key or set of attributes inside of it, that is also a super key. So if we have attributes, you know, a one, a two, a three, a four implies B, so for a super key. But if you know a one and a two is also a super key, then that’s a one, a two, a three, a four is not a minimal super key. So we want typically, when we talk about primary keys, with database design, a primary key is going to be a minimal superkey. Or at least a good primary key will be a minimal superkey. Because we don’t, you know, if we don’t need the additional attributes to make it a primary key a key that uniquely identifies a row, then it’s it’s kind of a waste, right? Both in terms of design, which complicates your queries. And to in terms of efficiency, and, and storage. We’ll talk about later in this class about indexes, and how that impacts you know, your your, the speed of your query, and those keys are going to be indexed. And so if those keys are are larger, it’s going to increase the amount of time that it takes to actually index your tables. But indexing will be a talk for a later lecture.

But for now, let’s talk about let’s talk more about our super keys. So previously, we talked about how we could compute our closure sets, so x plus our closure sets for all attributes in our relation. But primary reason why I’m bringing this back here is that if the closure set of x is all attributes, then x is a super key, right? So if the closure set of you know student ID is all attributes meaning that this that you can imply all other columns with that attribute, then that column is a super key, or set of columns is a super key. And remember, we want only the minimal super keys in the end. So we will spend time to calculate all super keys, and then we want to reduce those super keys down to the minimal possible number of attributes that still maintains a super key property. So let’s take a look at an example of what this may look like. So here we have an example, relation called enrollment that has students address course room and time columns. And for the sake of example, let’s say we have these functional dependencies here. So we have student implies address room and time implies course. And then we have student course implies room and time. So with those functional dependencies, how would we calculate what keys we have here? Well, and and when I say keys, I mean minimal super keys, minimal superkey. So how do we calculate our super keys? Well, we may also have, we may have more than one key here as well, right?

That’s totally viable and reasonable as part of the table, there may be more than one key, or more than one possible key, we will want to only when we actually define our tables in Microsoft SQL Server, or whatever it may be, we want to make sure that the we only define will only have one primary key. So in this sense, we want to start out by trying to figure out what set of columns implies all others? Oh, let’s take, let’s start breaking this down. By by student here. So student implies address. So and what can we get from address? Well? Nothing really. So initially, we’re just going to have students as a as part of the key. But students alone isn’t enough, right? Because we can only imply address from students. How about adding, and address or address implies nothing? So address doesn’t want to be address shouldn’t be part of the key. So let’s say so we have student not address how about course here. So if we have course, and students, right, so student course, implies room and time, right. And then so let’s that adds to our key. And then room and time implies course. And so we get essentially everything that we need from from student and course. Right. But since since we have and I’m going to, let’s go here, I’ll go to my pen here. Since students room and time, right be highlighted attributes here are going to be our minimal superkey. But the attributes that I have underlined in red, there are going to be a another super key because with student stute implies address, so we got that. And then room and time implies course. And so all the other dependencies are all the other attributes are trivial, right, trivial functional dependencies. And so we get two super keys out of this relation, yeah, student room in time, and then student course.

And so really, typically, what we’ll want here is a set of attributes that are a minimal superkey that we want to use as our primary key, and in general, will typically want to go with the key that has the fewest number of attributes. And this is, in particular, it makes your queries a little bit easier to write because you have less columns to worry about to uniquely identify a row and improves your indexing because you have to index on fewer columns. What and so we’ll talk about that in a future course. But here you can see a little bit of an example that I use to show how you can go from your functional dependencies to calculating our keys. And remember, we want for our keys in our table, we want a minimal superkey not just a super key. So if we had if we had a super key. So for example, if I had course as part of my key here on the left. So if I had if I added course over here to student room and time, that would not be a minimal superkey because A student and course is also a super key so it is not minimal so we want to reduce that to make sure that our keys are minimal

BCNF Decomposition with Functional Depenencies

YouTube Video

Video Transcription

Welcome back everyone. In this video, we’re going to be taking a look at how we might start improving our database design. And in particular, decomposing our tables trying to achieve Boyce Codd Normal Form. But first off, how do we use the information that we have now about Super keys to eliminate anomalies in our database design. So, one, one key fact that I’m going to try to hammer home here is that each attributes must provide a fact about the key, the whole key and nothing but the key. This is a an adaptation to William Kent, who is a famous database researcher. But in general, when we start writing our or deciding on our keys, right, all attributes in a relation should provide information about that key, and nothing else. And so we’re trying to reduce our table a reduce a relation to the minimal number of attributes that can be represented using that key and, and transforming the other attributes down into their other into their own relation. This is what we do when we try to normalize a database. So remember, Boyce Codd, normal form a relation, or table R is in Boyce Codd Normal form if and only if, for every functional dependency x implies a. x implies a is a trivial functional dependency, or x is a super key. And so this is where we bring back that super key information. We’ve talked about Boyce Codd Normal Form in the past, but we really didn’t have all of the necessary definitions to go for it. And if you’re looking for the relational algebra definition here, for all, so that’s what this upside down a is. for all for all x, where x is a set of attributes, either the, the closure set of x is x, or the closure set of x is all attributes. So this is this is this is the trivial trivial functional dependency. And then this here, right? Remember, this is a super key.

So let’s take a look at an example. Because it makes a little bit more sense, trying to look at this in a little bit more concrete manner. So we have this relation here that has name ID, phone and department. We’ve seen this before, when we are talking about anomalies, and how we might have bad functional dependencies. So here we have a functional dependency ID implies name and department. So Id implies name and department. So 123, Fred, and C is right, and C is 987. Is Joe and math and Joe and Matt. So this all works out just fine. But what is the key here? What is the key? Well, it can’t just be ID by itself, right? Because we have the situation here with Fred and Joe, who have two different phone numbers for the same ID. And so Id by itself won’t work. We’ll have to tack on phone along with that. So most likely the other the other functional dependency here is phone implies department. But that really makes ID implies name department a bad dependency, right? Because we have another dependency inside here, right? We have we’re missing phone right? For one, right? We’re and Id implies name department has a bad functional dependency, because it doesn’t really capture this right? We have this issue where the ID produces a different phone number for for the individual people. So how do we correct this right? How do we help fix this anomaly and our relation? So I’m going to show this algorithm. It’s a lot easier than what this initially looks like here. But we can decompose our tables using boys Normal Form, particularly using functional dependencies. So Boyce Codd Normal Form decomposition using functional dependencies.

So we’re going to choose a set of attributes a one through a m, such that it implies b one through B in. So this is just a fancy way of saying a functional dependency, right. So we’re going to choose a functional dependency that violates Boyce Codd Normal form looks like I have a typo there BCNF. And then we’re going to split our table into R one, and R two, r one is going to be just the functional dependency. And r two is going to be the left hand side, the left hand side of our functional dependency and the other attributes that were not included. And we’re going to repeat this with R one and R two until we have no more violations of Boyce Codd Normal Form. So this is what it looks like with a Venn diagram. So r1 is going to be his r1 is going to have all of a so this here, right all of a and then the let the right hand side of our functional dependency. And then our two is going to have a and all the other attributes that were not included in our one. Generally speaking, a if a relation has only two attributes, it is always in Boyce Codd Normal form because either right, either there are no trivial functional dependencies, meaning that the two columns imply themselves or the two columns imply the other, or the two columns imply each other. And then otherwise, we have A implies B, but B does not determine a sort of A implies B, but B does not imply a so A is the key. Or three we have it the other way around, we have b implies A, but A does not determine B, so B is the key, and so on and so forth. So or we or we have both right, a implies b and b implies a. So both are the keys.

So if you’re if you get your tables down into two columns, then your table is guaranteed to be in Boyce Codd Normal Form. But this here looks a little bit more complicated than it really is. So let’s take a look at an example of this in action. So here is our table that we had earlier. Remember, we determined that Id implies named department has a bad functional dependency. Because we have this issue with the phone number, we have this issue with the phone number. And so we want to decompose this relation, right. So we’ll have two tables here we’ll have name, ID and department as one. And then we’ll have phone and ID as the second phone and ID as our second, remember, because we are decomposing into r1. So if we let’s go, I’ll go ahead and shift this over here. Right. So this is our decomposition. And then this is our r1. This is our our two, right. So r1 is based off of just the functional dependency, right? And then r two is the left hand side of the functional dependency plus the rest. Okay, so r1 is the functional dependency, the left hand side and the right hand side of our functional dependency. And then our two is the left hand side of our functional dependency, and then the rest. The rest of that meaning all the other attributes that were not included in our one. And now in this case, now we don’t have that weird issue with phone number. Over here, we can just have our ID as our primary key because Id implies name and department. And then over here we have that same same thing that we had before we have ID implies phone, which doesn’t actually work, right because we have 123 but 123 has two different phone numbers. So the primary key here is ID and phone right so I ID and phone the both columns together. But this is now in Boyce Codd Normal Form. This is a now this is now in Boyce Codd Normal form so that decomposition helped us remove the anomaly from our table

BCNF Decomposition with Closure Sets

YouTube Video

Video Transcription

Welcome back everyone. In this video, we’re going to be taking a look at Boyce Codd Normal Form decomposition again. But instead of using functional dependencies for the basis of our decomposition, we’re going to use Closure sets. Now in general, I find closure closure sets to be a little bit more complicated to use for decomposition. So I typically lean on using functional dependencies as the basis for decomposing my tables and normalizing them. But closure sets can be a very consistent way to find all of our functional dependencies and therefore have a more guaranteed normalization results at the end. But I’ll show an example here in a little bit where the decomposition using Closure sets actually can generate more than one set of resulting tables. But what does the algorithm actually look like? So we have this is going to be like said, this is going to be a little bit complicated to rebel, try to read this, read this out slowly here. Find the x such that remember S T stands for such that. So find x such that x is not equal to the closure set. And the closure set is not equal to all attributes. So x is not a super key. And remember, x here is referring to a set of attributes, right? Just like what we had before, right, so x is a set of attributes. So if not found, then r is in Boyce Codd Normal Form. Okay. So this is essentially kind of work kind of going back to our like bad functional dependencies that we use last time to be composed. But here instead, functional dependency, that is not a super key. So let’s keep on breaking this down then. Okay. So if, if we did find a set of attributes, right, if we did find a set of attributes, that is not the full closure set, and not a super key, then let’s create a new set of attributes. That is the that is all attributes in the relation minus the closure set.

And then we get to lynoral decompose our relation r into r one, which is the closure set of x and our two, which is x union y. So very similar to how we how we broke things down with our functional dependency, right? Remember our one before, was, our one before was the just using the functional dependency left and right hand side of the functional dependency. And said here, we’re just using the full closure set of x. And then our two is going to be x union with everything else, right. So the attributes x that we use, that we use to compute the closure set, combined with the rest of the attributes from the relation that we’re decomposing, that are not included in the closure set that are not included in the closure set. And we repeat this until no X is found. So meaning that there is no more closure sets that are not the super key. So this is complicated. In general, this is kind of hard to follow. Let’s take a look at this an example. So here we have this relation student that has a name, ID age, hair color and phone number. We have these two functional dependencies here ID implies name and age, and age implies hair color. So we’ll do our algorithm here. So we’ll we’ll have a couple different iterations here. So we find x such that the closure set of x is not all attributes. So the closure set of ID is name, age, and hair color. Along with ID, of course, the trivial. And that is not all attributes because it excludes phone number, we can imply phone number here. So that’s what we’re referring to the ID is not a super key. So we decompose that into two, call our two tables, right? decompose that into two tables.

So this here is our one. This here is our two. And remember, r one is the closure set of x. And r two is x union y. Right? Where Y, in this case is ID, name and age. So this is IB plus minus the rest. Right? So that would be ID name, age, hair color minus name, ID, age hair color phone number. So why is actually going to be just phone number, right. So the closure set of ID minus the entire minus all attributes. Sorry, let’s realize let me let me rewrite this here real quick. So for our two, then we have x, which is ID, and then y, which is phone number for our two. Now, we don’t we can’t reduce phone any farther than what we had it right, we can’t reduce vote any farther than we have it because there is no set of attributes x here, that is not a super key, right? You have no set of attributes here that we can’t we can’t decompose this any farther. Remember, a table that has two attributes is guaranteed to be in Boyce Codd Normal Form. And so we can’t break that down any farther. But we can break down student one farther, right, we can break down student one farther. Particularly, we’re going to pick on age here. Because age implies hair color. Age implies hair color here. And so an age itself is not a super key. Because we can only get age and hair color out of age. So let’s break this down again. Let’s break this down again. We have r1 Here are two here. Why? In this case is age and hair color. So age, plus minus all which is equal to id a name, right? Because all being all, all here is going to be the attributes from student one, right the attributes from student one. So we have age and hair color minus IB, name, age and hair color.

So if we take out age and hair color, from students, one we we are left with ID and Name. So we have ID and Name. That’s why union x so we have ID name and age. And so in this case now, right, in this case, now, we hair is broken down into everything that we can, because age implies hair color, that’s a super key hair color does not imply anything else. And so therefore that is in Boyce Codd Normal Form. And then Student Two is in Boyce Codd Normal Form. Because name does not imply anything. Age does not imply anything here. Because we don’t have hair color anymore. We don’t have hair color here. So age does not imply anything else. And Id ID is going to imply name and age. So that is our super key. So this is our example here using Boyce Codd Normal Form decomposition using Closure sets. Now I could have done this using functional dependencies. So this relation here, basically using our functional dependencies, right, we have age and hair color. This is one of our functional dependencies. This is our other functional dependency. So we could have decomposed this using functional dependencies But this is just another methodology for getting a relation down in down to Boyce Codd Normal Form. I’ll show a bigger example here. But this instead using more functional dependencies here. So we have a large relation ABCDE and F, and we have these functional dependencies up here, A, B implies c, c implies d, f implies B and D implies a.

So I’m going to start picking on my first functional dependency here, a B, the closure set of A B A, B is not a super key. And so therefore, I can decompose R into R one and R two. So we have R one and R two, R one being the closure sets of a b, r to being the closure set minus all attributes. And so that would be a B, A, B, E, and F. Let’s see, let’s see here, we can, we can break down our one further, because we have c, c implies d, and d implies a. So we have a C, D, which is not a super key. And so therefore, we can break our one down into r1, one and r1 two, which gives us a D, or a CD, a CD, which is the closure set of C, and then our one two, which is just c b, which is AC D minus A, B, C, D, and that leaves us with just the C union y which is BC. So that gives us sorry, not CB just B. And we won’t break down r1 to any farther because that is in Boyce Codd Normal Form, we’re down to two attributes. Over here we have the closure set of D is 80, which is not this not a super key. So in theory, we could also break that down farther. So gives us a d and d see if we broke that down. And then over here, over here, we can break down a B, E, F, and to FB and fa e because the closures that have F which is just b and f is not a super key. So we break that down into the closure set of f and then F union y, y being a and e so FB minus all attributes. And this is this are two one is in Boyce Codd Normal form because we have two attributes.

Our two two is in Boyce Codd Normal Form, because we don’t have any any other functional dependencies here because f we don’t have B here anymore. We don’t have a does it A by itself does it imply anything and he by itself doesn’t imply anything. So therefore we’re in Boyce Codd Normal Form. And then we could also identify all the keys here. So in our one, one, D is our key, D is our key, because D implies a C is our key and our 112 because c implies d our two one f implies B has F is the super key and an AR T one. There is no functional dependency here to base this off of and so all three are our super key. So the general question here is the schema that we have unique right are the tables that I broke everything down into a unique decomposition? Well, the answer to that question is no, it is not unique. So if you spend some, you know take a pause of the video here and follow this trace here. Everything is roughly the same except r 22. r two two is broken down into FC E instead of fa e right. And so this decomposition can be different. The decomposition can be different depending on the functional dependencies that we use and The order that we break things down in. So which solution is better? Which solution is better? Well, we have two, two schemas or two sets of relations or tables that we use, or that we found as a result.

So my first example here, and my second example here, well, seemingly, you know, at first glance, well, we have 1234 tables here and 12345 tables here. And so we have fewer tables here. But does that necessarily mean we have better or better tables? Not necessarily, right? Not necessarily, which solution is better? From a theoretical point of view? Both solutions are good, because they’re both in Boyce Codd Normal Form. From a practical sense, it depends, right? It really depends. In practice, you’ll take a look at how the tables are actually being used. So what are the common ways common things that you’re actually querying for? What are the most common attributes being inserted together looked for together, query together, those sorts of things. And so you want to break those things down into if you have multiple ways you can decompose a relationship relation into you’ll want to decompose it such that you have the fewest number of joins that you actually have to, you have to actually run. So if you’re most commonly using or pulling information from common attributes together, and you decompose that table to make it in Boyce Codd Normal Form, but that separates those two attributes. If you have a decomposition that keeps those two attributes together and still maintains Boyce Codd Normal Form, that that would be the preferred way to go because right, those two things are commonly queried together. And that would over time, increase the efficiency of your database. But both solutions are good because we still maintain Boyce Codd Normal Form

Lossless Decomposition

YouTube Video

Video Transcription

Welcome back everyone, we’re going to continue our discussion on database normalization and obtaining good database design. So previously, what we’ve seen, we’ve talked about third normal form and Boyce Codd Normal Form, how we could calculate the functional dependencies and the super keys, and how we can use those to find that or how we can use those to decompose our relation to break it down into Boyce Codd Normal Form. We’ve defined our functional dependencies, and how we can find all functional dependencies using Closure sets. And whether or not a dependency violates Boyce Codd. Normal Form. We also defined a super key, which is just a set of attributes that imply all other attributes. So a primary key. A super key is a minimal super key when there is no other super key in that set of attributes. So you have so we have no subset of attributes that also is a super key. And that minimal super key is what we use as our primary key for our database, or for our tables. And then we also learned that we can decompose our tables into Boyce Codd Normal Form relations or tables, using those bad functional dependencies as well if we have a bad functional dependency. So Boyce Codd Normal Form as a refresher as well, it’s in a relation R is in Boyce Codd Normal form if and only if for all functional dependencies. x implies a x implies a as a funk as a trivial dependency, or x is a super key. So for all x, either the closure set of x is x, or the closure set of x is all attributes.

So we have trivial functional dependency and the super key right so trivial, super key. Now, let’s take a deeper dive into decomposition. So So we’ve shown the example with our ID, name, and department and phone number. Here we have our name price category table that we’ve seen before. And so we can actually decompose this into name and price and name and category. But because we have a functional dependency name implies price. But this is a problem, right? This is a problem. Because if we do decompose it, well, not necessarily problem. When we decompose it, we have iPad 529 twice here. So this is actually a good thing because we do lose, we lose a row. But we we actually D duplicate, right, we d do D duplicate, we don’t actually have to have iPad 529 twice now, because we have different categories, right? We just have iPad 5.9. So this is last less decomposition. Last less decomposition. But let’s take a look at this example. If we decompose this into name, category and price category what’s incorrect here? What’s incorrect? So we have iPad, tablets, iPhone gadget, iPad gadget, we have 529 tablet for 29 Gadget 529 gadget. Uh huh. Okay. But how do we actually get this information back? Right? So the individual tables are okay, we didn’t lose rows of data. But can we re can we recombine our tables back into the original? Can we recombine this back into the original? Well, the short answer is no, no, we can’t.

Because if we look at if we look at this here, iPad 529 tablet, iPad 5.9 gadget, we can actually pair these back with the correct rows can no longer pair these back with the correct rows. So this is something that is in Boyce Codd. This isn’t normal form right this is a normal form, but this is loss he decomposition. So a decomposition is lost less if and only if we can recover the exact information we started with. So if we start off with a table ABC, and we decompose that into a, b and AC, if we can recover a, b, c, then we have achieved last less decomposition is lossy if we cannot recover ABC as a result of combining those tables again after they’ve been decomposed. So we don’t want that we don’t want to decompose our table, but lose the relationship that we had originally started with, right. So we do have to be careful when we start decomposing our tables. So, in general, if we have a relation are such that we have attributes a one through a n b, one through B M and C one through CP, we decompose that into our one a one through a n u one through BM and a one through a n c one through CP. If a one through a n implies b one through BM, then the decomposition is lossless, because then we can actually re compose the original relation, right? If that is not the case, then we have lossy decomposition. If we are doing Boyce Codd Normal Form, right Boyce Codd Normal Form decomposition is always lossless because they’re based off of our functional dependencies, our closure sets. So keep that in mind. We’re doing our decompositions we can’t just decompose our tables to try to make them better. We do have to be careful because if we decompose arbitrarily, we can have a situation where we actually lose the information we actually lose the related the relation, which is what we’re trying to leverage as part of SQL

Limitations of BCNF Decomposition

YouTube Video

Video Transcription

Welcome back everyone. In this video, we’re going to be taking a look at some limitations behind decomposition using Boyce Codd Normal Form. So previously, we talked about how we, when we’re decomposing our tables, we need to be careful because if we’re not actually adhering to Boyce Codd Normal Form, and we’re decomposing our tables, ad hoc Li, we could actually have lossy decomposition, meaning we actually lose the original information, or at least, we lose, we don’t lose rows of data. But we can’t we can’t actually reconnect those rows of data together. So our joins become useless there. But there are some limitations behind Boyce Codd Normal Form. So Boyce Codd, normal form by itself and we’re decomposing according to it. Our decompositions are always lost less, which is a good thing, which is a good thing. But what about functional dependencies? So if we have this relation here, Professor projects departments, where we have our functional dependencies, Professor implies department and projects department implies professor. This is a Boyce Codd Normal Form violation because we do have a functional dependency. That is not a super key. Okay, so we want to decompose this. And so we have Professor departments, Professor department, and then we’ll have everything else, right, Professor department and everything else. So that’s Professor department, and we have Professor project. But what about this functional dependency here, Project department implies professor.

So this has no functional dependencies. But we lose that dependency, right? We lose this functional dependency project department no longer implies. Professor, right? Project department no longer implies professor. So this is a issue right? Or this is a problem with Boyce Codd Normal Form more of a limitation, right? We still don’t lose information, we still don’t lose information, we can actually join this back together and get our original table back. So what’s the general problem here? Right, what’s the general problem? Well, what about this right, what about this set of data? Right? We have Professor department Professor project, and we have Johnson COEs Robinson CIS Johnson recruitment, Robinson recruitment. So no problem here, right? Because our functional dependencies, local to the individual tables are satisfied, right? That’s fine. But if we pull all the data back into a single table again, ah, Project department implies Professor no longer actually holds, right? Because we have CDs recruitment, CDs, recruitment, and now no longer implies professor. Right, that no longer applies. Professor. So this is either a bad functional dependency, or we can try to attempt juice do more with our decompositions. How do we keep hold of those functional B? How do we keep a hold of those functional dependencies when we decompose our tables? So let’s take a look at how we might do this. So we lose dependencies, when in relation with a dependency x implies y is decomposed, And x ends up in one of the new relations and y ends up only and another. So if a functional dependency is split apart, we lose that functional dependency.

So that decomposition is we refer to it as not dependency preserving, it is lost less, but it is not dependency preserving. So the common form of this issue is a b implies c and c implies b Right? So remember our example. We had Professor implies department, and project department implies professor. So A, B implies c, and c implies B. So this is our limitation, right? This is our limitation with Boyce Codd Normal Form. Boyce Codd. Normal Form decomposition does not always preserve dependencies. Let’s take a look at this example. This is an example that I previously broke down using decompose using Closure sets. But if we attach I’m not going to go over the actual decompose Part, feel free to track, go down the tree here and follow this if you’d like. But this was decomposers enclosure sets. And so let’s take a look at our functional dependencies now, as a result, so we have, we have d implies a, so that holds that holds, we have c implies d, that still holds, we have f implies B, and that holds. But this functional dependency here, a b implies C, no longer holds. So again, another example of how Boyce Codd Normal form of decomposition may not keep all of your functional dependencies intact, it may not keep all of your functional dependencies intact. So, general goals here, why are we decomposing them? Right? Why are we decomposing them? If we lose functional dependencies as part of it? Well, big goal that we want to target here is eliminating anomalies, right anomalies. reduce redundancy, right? When we update and delete data, right? We don’t want to have to update and delete in multiple parts. Because that leads to inconsistency with data.

That is the big problem that we’re trying to solve here. Eliminate anomalies. We want to also be able to recover information when we decompose, right when we decompose a relation. We don’t want to lose data as results. So can we get the original one back? That’s good. But preservation of dependencies, can we enforce functional dependencies without performing joints? Right? Can we achieve this without performing joints? That is our general goals when we try to decompose our tables and there are ways that we can decompose typically if we choose to decompose into a specific or two particular groups of attributes to hold on to those functional dependencies. So generally speaking here, right, Boyce Codd, normal form decomposition, no anomalies. Awesome. Good green thumbs up there. recoverability of information. Also a good big thumbs up there as well. But unfortunately, sometimes we may lose dependencies as a result. So we can get most we can hit most of our goals, right? We can hit most of our goals with Boyce Codd Normal Form decomposition. And sometimes if we have multiple ways of being able to decompose that relation, sometimes we can hold on to those dependencies, but not always right. Not always

Other Normal Forms

YouTube Video

Video Transcription

Welcome back everyone. In this video, we’ll be taking a look at other normal forms. So so far we’ve only we focused on primarily Boyce Codd Normal Form. And we’ve mentioned third normal form. But if you remember this image here, there’s a lot of other things that we could actually achieve here. So we have normal forms one through five, as well as Boyce Codd Normal form, which is somewhere here in the middle here. So if we are in Boyce Codd Normal form, which is one that we’ve been focusing a lot on so far, then that if your relation R is in Boyce Codd Normal Form, then it is also in third normal second normal and first normal form, but it is not guaranteed to be in fourth or fifth normal form. So typically, what we’re going to be targeting on is hopefully, achieving Boyce Codd Normal Form. In this class, we’re not going to cover the fourth or fifth normal form. These are less common in industry, although you may encounter them and in some situations. But before we continue further into discussing our other normal forms, I do want to cover a couple more vocabulary terms here, particularly candidate keys, which is just another name for a minimal super key. And we call them candidate keys because these are keys that we will use as our primary key for that table. We also have prime attributes, which are attributes of a candidate key name, and non prime attributes. So these attributes do not occur in any candidate key. So basically, we have columns that are part of a key, and then columns that are not part of a key. So let’s keep those in mind as we start discussing our other normal forms.

So in this class, we’re going to cover our in this video, in particular, we’re going to cover normal forms one through three. So the first normal form is set is essentially covering the fact that a relationship only have simple attributes, it should only have simple attributes. This, this, in general means that a table should only have a single valued or atomic attribute or columns. So basically, the value stored inside of the column should be of the same domain. And, you know, of course, we still have all columns, of course have unique names, the order of the data here does not matter. But this table in particular is going to violate the First Normal Form because I Can I have more than one piece of information that’s being stored in the same column, right? So place of origin. And the place of origin here is the you know, Liverpool, UK, right? So city country, and that is bad form, right, this is bad form. So in order to normalize this and make it adhere to the first normal form, we would actually split those two pieces of information to their own columns, origin and country, for example. So anything that has any multivalued column or attributes will violate normal form one. And that incense also makes it very difficult. If something is not in first normal form, it’s almost nearly impossible to make it adhere to a third normal form. Because that data is all coupled together in a single column. This type of normalization is not as prevalent in things like no SQL databases, like MongoDB, for example. But we’ll have another lecture series later in the course, that talks about no SQL.

But this is our our first normal form, right? Our columns should have our column should only contain single values, data, single value data, not lists, not you know, multiple pieces of information like City State, Zip or, or anything like that. So one piece of information, one single piece of information. Our second normal form, here we have a relation for albums, and the attribute on the right, captures the country of the artist, right? So we have album and artist ID, the label for that album and then we take me off of the screen here. And then we also have artist country. Now the second normal form says that As artists nine which is the Beatles is a British Group, all their albums are from the UK. So artist implies country. So, but the country right itself should not be an attribute or should be an attribute of the artist not of the album right. So, the album should not be able to imply artists country. Simply put, though a relation is in second normal form if it is in first normal form, and every non prime attribute of the relation is dependent on the whole of every candidate key. So, this in general violates our second normal form because of this. So, album artist is our our key here, album and artists together imply label and country. So that’s our super key or minimal superkey. But we have artists implies country. And so in order to get this in second normal form, we actually need to split this out into two relations. Here, we have artist album and label and then artist name and country name I just added in there just so we can keep track of whose artists name but this is in second normal form, or as this table is not, so more or less, right? Everything. And if we have a composite if, if we have a composite key, all non prime attributes must depend on the full key, right, so we can’t have any sub dependencies as part normalization level three, right? Non prime attributes cannot depend on each other, right non prime attributes cannot depend on each other.

So here again, we have a relation for albums, and the attribute on the right captures the country of the recording studio. So we have studio implies country. In the same way, if the studio Abbey Road here is or so if the studio is Abbey Road, then the recording could only take place in the UK. And so we might want a table with all studios and countries where they’re located. But here right this violates our third normal form, right, because we have our album artists, which again is our super key are minimal superkey. But we have studio implies studio country and so this violates our third normal form, because we have non prime attributes right non prime attributes that are dependent on each other. So if if when you know one attribute, then you always know the second the second one should be in another table, right? So if that is the case, right? This should be in a separate table, if we know one and we know the other than that should be taken out and put into another table, right? This is just a functional dependency, right? A functional dependency that is a non prime attribute, right non prime attribute, you know, one, you know the other. So essentially, third normal form, we don’t want any transitive dependencies. So every non key or non prime attribute must provide a fact about the key, the whole key and nothing but the key. So if we split this out into a, or if we want to normalize this according to third normal form, we would have something like this, where we split the studio out into its own table, and then we link that and to the other table using a foreign key with a studio ID there. So this would adhere to our third normal form. But those are normal forms one through three and we will cover a little bit more particularly on the differences between third normal form and Boyce Codd Normal Form and following videos. And also remember, there are the fourth and fifth normal forms, but we will not be covering those two normal forms for this class.

Third Normal Form vs BCNF

YouTube Video

Video Transcription

Welcome back everyone. In this video, we’re going to be talking more about the third normal form. And so just read your reiterates our statement that we’ve seen, so far every non key attribute must provide a fact about the key the whole key and nothing but the key. This was famously famously done by William Kinte, who was the database researcher, but this really, really fills out our need for third normal form right we have any any non prime attributes must not be a functional dependency right if we have a non prime attribute that is a functional dependency or implies another column that is not the key, then we have an issue we need to split that out into its own table. So, as a refresher right third normal form, if a relation R is in third normal form, if for every non trivial functional dependency in our where a one through a n implies B, then A one through a n must be a super key if it is not or it is not part of the key then it is not in third normal form. So, if the functional dependency the left hand side of a functional dependency is not a super key, or it is not part of a key, then it is not in third normal form. So let’s talk about some differences here with third normal form versus Boyce Codd Normal form because we’ve covered Boyce Codd Normal Form quite a bit.

And remember, a relationship is a relation R is in Boyce Codd Normal Form, if for every non trivial functional dependency a one through a n implies B, then A one through a n is a super key. Boyce Codd Normal Form is slightly stronger than third normal third normal form, right. So if we bring up this picture here, we have Boyce Codd Normal Form is deeper into that image then third normal form. We have fourth and fifth up here that would be stronger than Boyce Codd Normal Form. fourth normal form has no multivalued dependencies. And fifth normal form says that non trivial join dependencies are implied by candidate keys. But like I mentioned before, we’re not going to cover fourth and fifth normal forms. But big big picture item here that we want to remember is that Boyce Codd Normal Form is slightly stronger than third normal form. An example of this is with this example here. So we have a relationship ABC with a B implies c c implies b remember, this is the example that we had that when we decompose this in Boyce Codd Normal Form, we lose functional dependencies or we can lose functional dependencies, but this here is in third normal form, this is in third normal form, because we have no non trivial functional dependencies or that are not part of a key right that are not part of the key b Right. So c implies b b as part of the key and so therefore, therefore, this is in third normal form, but B is not a super key, right this is c implies B is not a super key. Therefore it is not in Boyce Codd Normal Form. So we have a relation that is third normal form, but not in Boyce Codd Normal Form.

So what does this mean for our decompositions? Right? So we we now know that Boyce Codd Normal Form is the stronger normalization. But with third normal form, we can still recover all of our information that we that we have in the original after we’ve decomposed. And Aha, we do preserve dependencies now. Right this is a big benefit of third normal form versus Boyce Codd Normal Form is that we are able to preserve dependencies however, however, third normal form can still have anomalies, right? We can still have anomalies. So that is a big thumbs down for third normal form. So some benefits right over the over Boyce Codd Normal Form, particularly with this right with Boyce Codd Normal Form, we’re not guaranteed to preserve all of our functional dependencies. But with Boyce Codd Normal Form, we do not have any anomalies, right. So we have no anomalies with Boyce Codd Normal Form, but with third normal form. There might still be anomalies as part of it. But some practical advice here. We’re going to add For Boyce Codd Normal Form, but settle for third normal form. And all practicality for third normal form for most databases for most use cases is going to be good enough. But if you really want to focus down on getting a really well formed database, along with, you know, along with lots of extra, like if you have a lot of data if this is going to be a really large database, you know, obviously Boyce Codd Normal Form is going to help you out there by reducing a lot of those data anomalies. Third Normal Form is is perfectly fine for most use cases, especially if the database is on a much smaller scale. But that’s going to conclude most of our discussion on design, particularly around normalization. So we will have some more discussions on designing tables, not necessarily particularly around normalizing them, but how we might organize and design tables to adhere to certain data, data relationships.