Subsections of Transactions and Indexes
Indexes
YouTube Video
Video Transcription
What are indexes? I've mentioned indexes in passing a couple of times in class, but we never really talked about what they do. Indexes are what allow really fast lookups of your data. By default, if you do not specify an index, your index is usually going to be your primary key, because the primary key is the column (or columns, for a multi-column key) that uniquely identifies a row. Indexes allow very fast lookups of the information stored in your database. Think of an array index or a dictionary key: those allow very fast lookups inside an array, list, or dictionary because the structure can almost instantaneously find that particular position and return the value stored there. That is the idea with indexes.
Physically, indexes are stored using B-trees; that is the data structure that actually holds them. You may or may not have seen B-trees in CIS 300. I've offered them a couple of times as a homework assignment in 300, but they're not a primary data structure we cover there. The B doesn't really stand for binary, but a B-tree is a variation on the binary tree, and I'll show you what one looks like in just a minute. B-trees are much flatter than binary trees, because a node can store more than one key and have more than two children. The index keys are used to organize the pages that store your database: your data is stored in pages, and those pages are organized based on the index. The keys themselves are stored in the root of the tree and in the intermediate levels, while the data, meaning all of the other columns that are not part of the index, is stored in the leaves. That's the trick with a B-tree: the key is stored in all nodes, but the data itself is only stored in the leaves.
When I'm actually looking up data, the B-tree works a lot like a binary search tree: the value in each node determines whether you go left or right, less than the node goes left, greater goes right. The value we're keying on is the index, and an index can be more than one column. The index is stored at the root node and at the intermediate levels, and all the other columns, the actual rows of data in your database, are stored in the leaf nodes. The interesting thing about a B-tree is that a node can have more than two children. Each key in a node has pointers: one to the child whose data is less than that key, one to the child whose data is equal to that key, and one to the child whose data is greater than that key, where less than and greater than are compared on the index value. That's what makes it different from a binary tree. In a binary tree it's strictly less than or strictly greater than, and each node holds only one piece of data. Here a single node can hold many elements; in theory one node in the B-tree could reference thousands upon thousands of keys, or pages of your database.
Inside a node, the keys behave like a linked list: each element has a next and previous pointer, so the next pointer points to the next element inside that node. In the diagram, the left line is a pointer to the page whose index values are less than the key at the root, the middle line points to the page whose index values are equal to that key, and the right line points to the page whose index values are greater than it. There can be more pages to the left, more to the right, and more in between. What this buys me is that I can scan from any point without traversing back up the tree: once I get to a node, it's a linked list, so I can move horizontally back and forth in either direction. That is the big benefit.
Once I get down to the bottom of the tree, each entry points to exactly one row that matches the index, because an index must generally be unique. If the index you create by hand is not unique, SQL Server will actually make it unique, because that is a requirement for certain kinds of indexes, a clustered index in particular. You can see the B-tree structure in the letter example: we start with a node that has one key, a left pointer, and a right pointer, but then each of the lower nodes has more than one key inside it, and that is where it differs from a regular binary tree. Each key has two pointers, one to the set of data less than it and one to the set greater than it. C's left pointer goes to the node containing A and B, which are less than C, and C's right pointer goes to the node containing D, E, and F, which are greater than C; that same node is also G's left pointer, since D, E, and F are less than G.
So nodes can share pointers with each other. This letter example is a little easier to visualize than the database version on my slides, but it's a simple data structure that is very useful for storing indexes for a database, because B-trees are much, much flatter than a regular binary tree. A binary tree by itself grows very quickly in terms of the number of nodes, which also increases the time it takes to search it; even though a binary search tree is fairly efficient, when we're talking about thousands or even millions of records, that starts to slow down significantly for database searches. So let's take a look at a type of index. A clustered index basically defines how your data is stored in the database. Again, the leaves of the B-tree contain all of the data of the table, and the intermediate nodes and the root node contain only the index. The restriction is that you can only have one clustered index per table. A table without a clustered index is referred to as a heap, just as it sounds, because it's all heaped together with no specified structure. Sometimes, depending on the database management system, a clustered index is automatically created from your primary key, but it doesn't always happen; it just depends. So if you want a specific clustered index, it is very good practice to explicitly define it.
This is the fake data that I generated: I have an Orders table with an order ID, source order ID, order date, customer ID, customer purchase order number, and order subtotal. I also have a Customers table. The Orders table has an identity column and some constraints, and notice that on the primary key I specified that I want it to be clustered; that is my clustered index. That is one way to say this is my primary key, and I also want it to be the clustered index. Then, over here, I'm looking at the database object Sales.Orders and pulling out information about that table's index. It gives me a bunch of extra stuff, including the index depth, which is how many levels deep the B-tree for this clustered index had to go. Notice that I have 18 million rows, and I was able to store 18 million records in a tree that is only three levels deep, which is kind of crazy; there is no way a binary tree could store that much data at that depth. It also shows the number of pages required to store that amount of data, record sizes, and so on, but the important part is the number of records and the depth of the B-tree.
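The exact script isn't reproduced in the transcript, but a minimal sketch of the pattern being described might look like the following T-SQL. The table and column names (Sales.Orders, OrderID, CustomerPurchaseOrderNumber, and so on) are assumptions based on what is read out in the lecture; sys.dm_db_index_physical_stats is the standard dynamic management function that reports index depth, page counts, and record counts.

```sql
-- Assumed table definition: the primary key is explicitly declared CLUSTERED,
-- so the whole table is physically organized as a B-tree on OrderID.
CREATE TABLE Sales.Orders (
    OrderID                     INT IDENTITY(1,1) NOT NULL,
    SourceOrderID               INT NULL,
    OrderDate                   DATE NOT NULL,
    CustomerID                  INT NOT NULL,
    CustomerPurchaseOrderNumber NVARCHAR(20) NULL,
    OrderSubtotal               DECIMAL(18,2) NOT NULL,
    CONSTRAINT PK_Sales_Orders PRIMARY KEY CLUSTERED (OrderID)
);

-- Inspect the physical structure of the index: index_depth is the number of
-- B-tree levels, page_count the number of pages at each level.
SELECT index_id, index_type_desc, index_depth, index_level,
       page_count, record_count, avg_record_size_in_bytes
FROM sys.dm_db_index_physical_stats(
        DB_ID(),                      -- current database
        OBJECT_ID(N'Sales.Orders'),   -- the table we care about
        NULL, NULL, 'DETAILED');      -- all indexes, all partitions, full detail
```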
A big point here is that for any record we search for based solely on our clustered index, the order ID, we only have to visit at most three nodes in the tree; it only has to go three levels deep to reach the spot where that row is stored, and that is huge. Also notice the little cheat here, DBCC DROPCLEANBUFFERS. Remember, we talked about how the database management system will store things in the buffer cache to make your queries run faster. You can clear out that buffer if you're trying to verify the speed of your SQL queries, then run the query again to see how long it would take from scratch. Now notice how slow this query is: it took about five seconds, and the plan shows a clustered index scan, meaning it scans the whole index and outputs the columns we asked for.
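The demo query itself isn't shown in the transcript; as a sketch, the slow query being timed probably looks something like this, filtering on a column that (so far) has no index. CHECKPOINT before DBCC DROPCLEANBUFFERS is the usual pairing so that dirty pages are written out before the clean buffers are dropped; the filter value is hypothetical.

```sql
-- Flush cached data pages so the next query reads from disk (a "cold" timing).
CHECKPOINT;
DBCC DROPCLEANBUFFERS;

-- With no index on CustomerPurchaseOrderNumber, this forces a clustered index
-- scan: the predicate is checked against every row in the table.
SELECT OrderID, OrderDate, CustomerID, CustomerPurchaseOrderNumber
FROM Sales.Orders
WHERE CustomerPurchaseOrderNumber = N'PO-123456';   -- hypothetical value
```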
It's applying the predicate, the customer purchase order number, to every single row, and remember we have millions of rows, so that's a lot of work. But if we take this other query and run just that, see how quickly it happened: it was almost instantaneous, a fraction of the time, versus the five or ten seconds of the other one. That is a big difference. Let me also open up SQL Server Management Studio, because it may be a little more helpful; I'll get that open while we go. So how do we create an index? What we just saw was the clustered index, which is the base one we want to create, and remember a clustered index must be unique; otherwise SQL Server makes it unique for you. The moral of the story is that when you create a clustered index, make sure it's on a set of columns that is unique. Now, non-clustered indexes: we can only have one clustered index per table, but we can have multiple non-clustered indexes. They are very useful because, for a non-clustered index, a leaf node contains the index key and the clustered index key rather than the full row data.
It may also contain copies of other columns, which are referred to as included columns. The leaf node of the non-clustered index has a pointer over to the clustered index; that is, it points to the corresponding leaf in the clustered index. The clustered index is how your data is physically stored, so the non-clustered index is the same kind of B-tree structure, except that instead of having all the row data stored in the leaf, it stores a pointer to where that row lives in the clustered index B-tree. So let's clear the cache and create a non-clustered index (sketched below); the syntax is the same pattern you would use to create a clustered one, and the column I'm indexing is the customer purchase order number. While that runs, and before we rerun the same query, let me talk about other variations of indexes: there are unique indexes and filtered indexes (filtered ones are non-clustered only), and you can combine unique and filtered. Clustered indexes are unique by default; non-clustered indexes don't have to be.
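As a sketch of the statement being run here, with the index and column names assumed from the lecture:

```sql
-- Build a non-clustered B-tree on the purchase order number; its leaf level
-- stores that value plus the clustered key (OrderID) as a pointer back to the row.
CREATE NONCLUSTERED INDEX IX_Orders_CustomerPurchaseOrderNumber
    ON Sales.Orders (CustomerPurchaseOrderNumber);
```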
You can enforce a non-clustered index to be unique if you would like, and of course unique indexes are more efficient as far as reads go. Filtered indexes operate in a similar manner; I'll show what that looks like in a little bit. By default an index covers all rows and all values, including NULLs. If you don't want your index to include a particular kind of value, you can filter those values out and exclude them from the index; that is what a filtered index does, and unique is just as it sounds (I'll sketch both below). Let's check back in. Okay, it took about a minute to create that index, which brings up a good point: an index is expensive to create, and every time you insert a new row, update a row, or delete a row, that B-tree has to be updated. That can be expensive, and depending on the number of rows that are modified, a simple update or insert could cost you a lot of time.
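Hedged examples of those two variations, again with assumed names; the WHERE clause is what makes an index filtered:

```sql
-- A unique non-clustered index: SQL Server rejects duplicate source order IDs.
-- The WHERE clause also makes it filtered, so NULLs are left out entirely.
CREATE UNIQUE NONCLUSTERED INDEX UX_Orders_SourceOrderID
    ON Sales.Orders (SourceOrderID)
    WHERE SourceOrderID IS NOT NULL;

-- A plain filtered index: only rows that actually have a purchase order number
-- are indexed, which keeps the B-tree smaller.
CREATE NONCLUSTERED INDEX IX_Orders_PO_NotNull
    ON Sales.Orders (CustomerPurchaseOrderNumber)
    WHERE CustomerPurchaseOrderNumber IS NOT NULL;
```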
There are cost trade-offs as a result. If we run our query now, instead of taking five or ten seconds it's instantaneous, now that the non-clustered index is in place. If we look at the query plan, we have a non-clustered index seek that then goes into the clustered index. Both indexes are employed, because the non-clustered index points into the clustered index via the order ID: the seek on the customer purchase order number finds the entry, which then pulls the full order, since all of the data is stored in the clustered index. And notice the non-clustered operation is a seek instead of a scan, which is a big deal. A scan applies the predicate to each and every row, so all 18 million rows would be scanned for this value; but now that those values are indexed, I don't have to scan all rows. I know where those values are stored, so I just have to seek, jumping through the tree to the specific spot where those values are.
That is why this is so much faster: instead of scanning all rows, I know exactly where that data is stored, so I don't have to search the entire table to find it. That is the beauty of B-trees for database indexes. You can look at the physical side of it here: this is a non-clustered index, again 18 million records, but now I'm at depth four. How deep these trees go isn't arbitrary; it depends on how B-trees are implemented and on the values involved, which determine how far the tree has to split out. What can we do better? Notice here I've got INCLUDE, so I can add columns to the index. Again, this will take a minute or two to run, so I'm going to start it and flip back to the slides while it's running, since we've only got five minutes left. Some general recommendations for indexes: the clustered key is used as the reference in all non-clustered leaves, so if you have non-clustered indexes, they all utilize the clustered index. That's just how it works.
In that sense, choose your clustered index wisely. Since everything is going to utilize it, it needs to be efficient and useful information. Don't create an index on something arbitrary; you should be creating an index on a column that you are going to search by very frequently. Otherwise there's no point: if it's not going to be used as part of a query, or is used very rarely, there is no reason to index it. Clustered keys, in that sense, should ideally not change very often. That's why it's usually the primary key: the primary key of a row does not change very often once it's been inserted. And of course it must be unique, and if it's not, SQL Server will make it unique, so make sure the columns are unique; otherwise it's extra cost, because SQL Server has to take the time to create a unique value in place of the one that was there. Also make sure there is not a lot of data associated with it. A clustered index shouldn't be five columns long; it should be as few columns as possible, because it's referenced in pretty much everything, so it should carry a small amount of data. Incremented data works very well for this sort of thing.
Remember the incremented surrogate keys that we create all the time, the 1, 2, 3, 4 identity columns? Identity columns work fantastically for this, because they're sequential and automatically incremented, so you don't have to mess with them. They're one of the best things you can build a clustered index on: very little data, unique, and typically the primary key. Not always, but typically. So, some general considerations: indexes are here to make your queries run faster. That is their goal and the reason they exist; you don't have to specify one as part of your database. The main trade-off to consider for your use case is whether you value fast writes or fast reads. If you would rather have faster reads, indexes are the way to go, because they significantly speed up your queries. But if you write more than you read, indexes should be used lightly, because every write has to update the index, which makes those writes slower. That is the big downside. Now, notice that before, when I ran this query over here, the plan had an extra lookup joining back to the clustered index.
That's because I was also selecting order date and customer ID, and since the non-clustered index only contained the customer purchase order number, it had to go back to the clustered index to pull that data. But now that I've added customer ID and order date as included columns on the customer purchase order number index, this query no longer has to touch the clustered index: the index already contains a reference to order ID, which is the clustered key, and now it also includes order date and customer ID. This takes up more space, mind you, because that data is now duplicated in two different places: customer ID and order date are in the non-clustered index and also in the clustered index. But if the query is being run frequently, it becomes much, much faster.
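A hedged sketch of the covering version of the index described here, with assumed names; DROP_EXISTING lets you rebuild the existing index with the added INCLUDE columns in one statement:

```sql
-- Rebuild the index so its leaf level also carries CustomerID and OrderDate.
-- Queries that only need these columns never have to visit the clustered index.
CREATE NONCLUSTERED INDEX IX_Orders_CustomerPurchaseOrderNumber
    ON Sales.Orders (CustomerPurchaseOrderNumber)
    INCLUDE (CustomerID, OrderDate)
    WITH (DROP_EXISTING = ON);
```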
That is all I have time for today as far as running the examples goes. I encourage you to take my setup query, run it, and then go through and create the indexes yourself to see the differences between them. All the indexes further down the script just show how to create unique indexes, filtered indexes, and that sort of thing; that's the only example I haven't been able to get to today.
Transactions & Concurrency 1
YouTube Video
Video Transcription
Let's check out transactions. As I mentioned before, a transaction is kind of like a receipt for any query that you execute against your database. If you're working with data on just your local database, transactions are a little bit overkill unless that data is very sensitive, in which case transactions are needed to prevent any loss of data in transit. But when you have a bunch of people using the same database at the same time, there are more than likely going to be collisions from time to time: more than one person tries to read, write, update, or delete the same record, and things get a little wonky, because you can't update a record and delete it at the exact same time. That's why we need transactions for multiple users. A transaction, as a whole, is one unit of work.
That usually means an update, a delete, or an insert, something that modifies your database. Transactions aren't really necessary for things like SELECT, although you can wrap almost anything in a transaction; they just aren't needed for statements that don't modify your data. Each SQL statement you run is essentially a transaction on its own, but we can also define transactions explicitly as part of our SQL code, and I would err on the side of defining them explicitly so it is very clear you're working with a transaction. Not all database languages work the same way; most require you to explicitly define your transaction, and that's what we'll primarily focus on here.
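A minimal sketch of an explicit transaction in T-SQL, reusing the assumed Sales.Orders table from the earlier lecture; the point is simply that everything between BEGIN TRANSACTION and COMMIT is one unit of work. The customer ID and discount are made up for illustration.

```sql
BEGIN TRANSACTION;

    -- Both statements succeed or fail together as one unit of work.
    UPDATE Sales.Orders
    SET OrderSubtotal = OrderSubtotal * 0.90     -- hypothetical 10% discount
    WHERE CustomerID = 42;

    INSERT INTO Sales.Orders (OrderDate, CustomerID, OrderSubtotal)
    VALUES (GETDATE(), 42, 0.00);

COMMIT TRANSACTION;
```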
If my demo were working, I would have a demo to display ACID here. ACID is an acronym, and the A stands for atomicity. Essentially, ACID is what allows us to guarantee the integrity of our data, and this has nothing to do with the design of the database at this point; we're not trying to design the tables in a certain way, this is about the SQL you're running. If we're running an update that impacts more than one row, what if you update five rows out of 100 and then something errors out: the disk errors, the power goes out, the internet goes out, the connection fails? That's a partial update; only five of the 100 records actually got updated. And what if you, as a user, assume that all 100 records got updated because the query ran? A lot of the errors that interrupt SQL commands from executing fully aren't necessarily reported back to the user. So atomicity means all succeeds or nothing at all: either everything in the transaction succeeds,
or we roll everything back and reverse what we did. C is for consistency, which ensures that our data moves from one valid state to another. With our updates, if we open a connection and work with a table, everything stays consistent in terms of that connection, the session, and the batch we're executing, across the updates, deletes, and so on. I is isolation: anything that runs concurrently is independent. If two transactions are running on the system at the same time, each one cares nothing about the other transactions that are running. This is very important, because without isolation each transaction would depend on another, and that would create more problems with ACID: what if one transaction doesn't finish and the other one does? Does that mean we have to roll back everything? It becomes a significantly more complicated problem when transactions depend on other transactions, so each transaction should be done in isolation. And lastly, D is durability: anything that we change and commit stays changed.
Basically, if we make an update and the transaction succeeds, then that is a committed update. It's kind of like committing your code to a Git repository: you commit your code and push it to the server, and that is your transaction for storing your code; once it's committed, that change is not lost. That's what we're trying to achieve with transactions. ACID is an extremely important idea for any database used by more than one user, and quite honestly, I've never had a production database that had only one user. Likewise, most databases anymore are connected to some website or web app, where you have hundreds if not thousands of concurrent users depending on the site. If all thousand of those people ran a database update at once on the same table, things would get kind of crazy; but if you are ensuring ACID, then you shouldn't have any issues with data consistency. Cool.
Locks are used to control access for a transaction. There are two main modes: exclusive and shared. As an example of exclusive: suppose you have five users trying to update the people table, updating their addresses, and maybe person one and person two are both trying to update the same row. You can't have them updating the same row at the exact same time, because one may clash with or overwrite the other; it becomes chaos. So we add an exclusive lock on that table so that only one transaction at a time is actually doing the update. The database management system ends up building a queue: as multiple transactions come in and the table they need to modify is locked, they basically have to get in line, like the priority queues you've seen in 300, and wait to execute the SQL that is part of their transaction. Shared is just as it sounds: a shared lock is used primarily for reading, so your SELECTs. You can run transactions with SELECTs, and most of the time, if it's a read-only operation, you only need a shared lock.
With a shared lock, one transaction can share access to a table with another transaction at the same time. But shared locks are still important. Say we want to read from a table, and another transaction over there is trying to update that same table. If the update happens in the middle of your read, then all of a sudden your SELECT becomes invalid, because the data you get is not correct: it's either not up to date, or part of it is and part of it isn't, so it's not consistent with the current state of the table. That's why we use shared locks for SELECT statements: you don't want updates to happen while the table is being read from. But multiple reads can happen at the same time; that is okay, because 500 different reads are not going to impact each other when they're reading from the same table. That's the benefit of a database. Updates, on the other hand, can and do interfere.
Exclusive locks are held until the transaction is fully completed. Once that transaction is complete, the lock is released, and if there's another transaction in the queue that needs to lock that table, the lock is passed on to it. Shared locks are very similar: the table is locked until all transactions that are reading from it have completed, and once all of those read transactions have finished, the shared lock is released and an exclusive lock or another shared lock can be placed on the table. If you have taken 520, or maybe a little of 450, you may have learned about this sort of parallelism. Operating systems do this sort of thing all the time, specifically for threads executing on your computer that are fighting for memory and disk resources; there are queues and locks set up because you can't have more than one thread accessing the same data at the same time if some of them are reading and writing at the same time. A lot of things in computer science use this technique to control access to data and its modification.
A shared lock will block any transaction requesting an exclusive lock, but another transaction requesting another shared lock is fine. So if there's a shared lock on a table and one transaction is reading, and a few seconds later another transaction comes in and wants to read from that table, that's okay; but if another transaction comes in and requests an exclusive lock in order to update or delete, it has to wait. The exclusive lock blocks everything: no other transactions can read or modify the same rows at the same time. The way we talked about it is: readers block writers, and writers block everything. Anytime you're doing a read, a write can't happen, and anytime you are writing, nothing else can happen.
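If you want to see these locks for yourself, a hedged sketch of a query against the standard sys.dm_tran_locks view shows which sessions hold shared (S) or exclusive (X) locks and which requests are still waiting:

```sql
-- Show current lock requests: request_mode is S (shared), X (exclusive), etc.,
-- and request_status tells you whether the lock is GRANTed or still WAITing.
SELECT request_session_id,
       resource_type,
       resource_associated_entity_id,
       request_mode,
       request_status
FROM sys.dm_tran_locks
ORDER BY request_session_id;
```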
The locks we're talking about here happen at the table level. That also implies that if you are updating the people table, and a different query is updating the address table but that update requires a join to the people table, the join will be blocked: the people table is in the process of being modified, so a join against it would be invalid because the data is inconsistent. That join can't happen until the lock on the table is released. So locks don't just affect single-table queries; they also impact queries that do joins. And as soon as you start to have consistency issues, your database becomes, I shouldn't say useless, but invalid for that user. You have to go in and fix it, and I've run into this many times before: in the early days, when you get a database that is inconsistent, you have to fix it by hand, and fixing it by hand can be very painful depending on how big your database is.
It's just not fun at all; transactions are awesome. So, consistency issues: what can happen? First, reading data that is not yet committed. What if you do an update, but that update has not finished yet? Without a transaction, the table is partially updated, you do a SELECT, and maybe the SELECT runs quickly and pulls the entire table, but the update has only touched the first five rows and is still working on the rest. The data you pulled is now inconsistent with what the update is trying to finish; you've read data that has not yet been committed. Second, what if that update changes five rows, you read them, and then the update fails and the transaction rolls back, reversing what it updated so the data stays consistent? You still have a dirty read, because you read data that was never committed and has since been rolled back. It's like sharing code with a friend that has not yet been pushed to your remote repository; when you try to merge, everything fails because you have conflicts. Dirty reads are not very fun. Then there are what we call non-repeatable reads: you read the same row two different times but get two different values. You do a SELECT, then do the SELECT again, and the data is different. That is an inconsistent read, because an update may have happened in between, or during, your reads, or maybe data was rolled back. There are a lot of issues with that.
Then you get what we call phantom rows: you read a set of rows, but you get a different set each time. This happens if you don't use transactions around things like deletes and inserts, so you read from a table in the middle of an insert or delete and get extra rows, or fewer rows, than you should across two reads. These are all real issues, less so for general user data, but even more so for a web app dealing with financial information, e-commerce, shopping, that sort of thing; there transactions are extremely important. Say your website shows that you have 10 of an item in stock, you add it to your cart, and by the time you check out you find that it's actually out of stock, because while it was in your cart five other people added it to their carts and it ran out. Transactions are extremely important in that sense. For each of these elements on my slide, I'm going to have a demo for you.
But let's talk about them in theory first: isolation levels. There are different levels that control the isolation we talked about earlier with ACID; the I in ACID is isolation. We have read uncommitted, which is basically the data-consistency free-for-all; read committed, which prevents dirty reads; repeatable read, which prevents dirty and non-repeatable reads; and serializable, which prevents all consistency issues, including phantom rows. These are different levels of isolation you can employ when creating your transactions and running your SQL code. Basically, a read uncommitted is like a SELECT without any transaction involved, and then you have progressively stricter levels you can enforce. I will showcase each one of these, and hopefully it will make more sense once you see some code examples; the SQL I'm going to show you is already online, it just won't run yet because I couldn't get it set up on the remote server. A little more detail: for an uncommitted read, there are no locks; you read from a table without locking it first, and without a lock on the table there is no way to ensure consistency of the data. On the other hand, you can do a committed read.
A read at READ COMMITTED uses a shared lock but releases it once the read has completed. That prevents dirty reads, which is fine, but it doesn't prevent non-repeatable reads, because once I release the lock, another transaction can go in and update the table, and if I read that table again the result is non-repeatable; the data has changed. So a shared lock guarantees consistency within a single read, but not across multiple reads if the lock is released in between: if you run the same query twice and release the lock between them, there's a gap, and an update, insert, or delete can be sandwiched into that gap. REPEATABLE READ also uses shared locks, but the lock is held until the transaction is complete. That's the difference between the two: if a transaction has more than one read in it, which is fine, and the lock is released between the reads, you're consistent within each read but not between the two reads. If you hold the lock until the transaction is done, you can guarantee consistency between all reads of that data within the transaction. Better.
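A hedged sketch of how you would set that level in T-SQL; SET TRANSACTION ISOLATION LEVEL is the standard statement, and the two reads inside one transaction are the scenario being described (table and filter are carried over from the earlier assumed demo objects):

```sql
-- Hold shared locks until COMMIT so rows read once cannot change underneath us.
SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;

BEGIN TRANSACTION;

    SELECT OrderID, OrderSubtotal
    FROM Sales.Orders
    WHERE CustomerID = 42;          -- first read

    WAITFOR DELAY '00:00:15';       -- simulate other work inside the transaction

    SELECT OrderID, OrderSubtotal
    FROM Sales.Orders
    WHERE CustomerID = 42;          -- the rows read above cannot have been modified,
                                    -- though phantom inserts are still possible

COMMIT TRANSACTION;
```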
SERIALIZABLE is a little more complicated: it uses shared locks plus what we call key-range locks, which prevent anything from being inserted into or deleted from the ranges your queries touch. That is the top level we're looking at here. For the most part, READ COMMITTED or REPEATABLE READ is going to be more than enough for most cases; SERIALIZABLE isn't really necessary unless you're doing a large transaction with multiple steps in between. Again, I will show examples of these, and hopefully it will make a little more sense. And here is what I ask of you for Friday: I know Friday classes have not been as well attended the past few weeks, but if you wouldn't mind, please do come to class or join me live on Zoom. We need five or ten people to actually show up live, whether on Zoom or in person, so we can run multiple transactions at the same time.
That way we can simulate these situations; with just me doing the demo, it doesn't show anything. I will have you run queries with me, so if you have a laptop, please bring it to lecture in person; if you're at home, you're already on your computer on Zoom, unless you're on your phone, in which case please use your laptop at that time. Cool. Snapshotting: I just have a couple more slides left, and then I'll talk about the exam. Snapshots give the same consistency as serializable, but writers don't block readers and readers don't block writers, so in general we get much improved concurrency with snapshots. What we get with snapshots is really kind of cool, and not all databases do this, but most do: snapshots allow writers and readers to proceed without blocking each other.
How does this actually happen? Most of what happens with locks is based on the rows in a table: the lock happens at the table, so an exclusive lock, for example, prevents any row from being modified by anyone other than whoever holds the lock, and a shared lock works in a similar way on all of the table's rows. With a snapshot, we instead attach a version to the rows. Readers get a version of the data as it existed at the start of the transaction: the table is snapshotted, and that snapshot is given to the reader running the transaction, so the reader gets a copy of a version of that table.
This is expensive; it's not cheap. If you snapshot a table and your table is hundreds of thousands of rows, that's a lot, so this better version of concurrency does come with a price. The versions of each of these rows are tracked in the system temporary database; we looked at tempdb and have already played around with it a little. When a row is modified, a new version of that row is created, and what we get is consistency within that snapshot. But again, it's expensive: a lot more disk I/O, and disk I/O is the worst part. CPU is expensive too, of course, but disk reads and writes are by far the slowest thing on your computer or any server, and if you're snapshotting an entire table, it can get quite expensive. So there are always trade-offs: snapshots give you much better concurrency than the serializable isolation level, but they require more CPU and a lot more disk I/O. That being said, there are different types of databases out there, and I'm hoping we have enough time this semester to cover NoSQL; we'll see if we have time at the end, I'm not sure.
Things like NoSQL are often treated as distributed databases, like MongoDB, and distributed databases do this sort of concurrency extremely well. NoSQL is banking on hundreds or thousands of reads and writes happening at the same time; it's designed with that in mind. And with the database being distributed, you don't necessarily have to take a snapshot of the entire table; you can snapshot just the section of the table that a query is actually working with. So there are some improvements in certain kinds of database management systems, depending on the technology you're using.
Transactions & Concurrency 2
YouTube Video
Video Transcription
We're going to do a class activity today. On the Lecture 27 page, Transactions and Concurrency, you need to download the transactions and concurrency participate SQL file; let me zoom in here. That's the one you need, and it will look like this. The first example I'm going to showcase is the blocking issue: remember we have shared locks, which are specifically for a read operation, so multiple transactions can share that lock and multiple reads can happen at the same time; and an exclusive lock blocks everything, all reads and all other writes. That is what we'll start with. On the left here is my demo of what we're trying to showcase for ACID; you can download this too if you want, but you won't be able to run one of these particular commands. I'm just emptying out my table from when we were testing this.
Remember, that's TRUNCATE: it doesn't drop the table, but it drops all the data in the table. What we have as part of this demo is a table called transaction test. It just records what's going on: when I run a transaction, it writes a row to this table so you can see which transactions are currently running. That's the point of it. So let's do this with a transaction and execute it. There is a BEGIN TRY and a catch, so I'm handling a standard error: this throws an error because I'm trying to insert NULL, and the column constraint doesn't allow NULL. If I wrap things in a transaction and any of my SQL throws an exception, I can do a rollback, which lets me reverse what I tried to insert. So now if I check my transaction test table, even though I did initially insert a row, there's nothing there: the row was inserted, the next statement errored out, I hit my BEGIN CATCH block, and I rolled back my transaction, so my transaction was not committed.
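The exact script isn't reproduced in the transcript, but a hedged sketch of the TRY/CATCH pattern being demonstrated looks like this; the TransactionTest table and its columns are assumptions for illustration.

```sql
BEGIN TRY
    BEGIN TRANSACTION;

        -- First insert succeeds...
        INSERT INTO dbo.TransactionTest (Participant, Message)
        VALUES (SUSER_SNAME(), N'first row');

        -- ...but this one violates a NOT NULL constraint and throws.
        INSERT INTO dbo.TransactionTest (Participant, Message)
        VALUES (SUSER_SNAME(), NULL);

    COMMIT TRANSACTION;
END TRY
BEGIN CATCH
    -- Undo everything the transaction did, including the first insert.
    IF @@TRANCOUNT > 0
        ROLLBACK TRANSACTION;
END CATCH;
```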
Yes, it does go to the catch block right away; this is the SQL statement that threw the error, so it didn't commit the transaction. If it had committed the transaction, then all of the operations done as part of the transaction would be finalized; that's what COMMIT does. A rollback doesn't go back to some previous commit; it reverses all of the statements that happened in this transaction, so any rows that were modified, updated, deleted, inserted, and so on are returned to the state they were in when the transaction started. This is my favorite thing about transactions: locking is great, but rollback is awesome, and it will save a lot of headaches with inconsistent data. To go over locking again: a transaction will lock a table that it's working with, and when we lock a table we have two types of locks, shared locks and exclusive locks.
Shared locks are for anything that is reading, so SELECT clauses. A shared lock allows multiple reads on the same table: the table is locked with a shared lock, which allows reads but does not allow writes. An exclusive lock blocks everything; it's for when a table needs to be written to, updated, or deleted from, and it blocks any other external reads, writes, updates, or deletes. Next little bit here: let me empty out the transaction table. Now I'm going to start my transaction but not finish it. My transaction has begun, and I can showcase what happens: notice that my query has started to execute but hasn't finished yet. If you want, go ahead and run the stored procedure as well, and I can look, in theory, at everyone who is running that stored procedure at this time.
Here are all the people running that particular stored procedure, the number of transactions currently running, and how long each transaction has been active, but no one can actually access the table. The show transaction tests procedure basically just captures you as the participant and then selects everything from the transaction test table, but you can't select from that table, because I have it locked due to my transaction: I am inserting into it, so an exclusive lock is automatically placed on it. You don't have to explicitly say lock this table; if you start a transaction and execute a SQL statement, the lock necessary to do that statement inside a transaction is placed on the table automatically. Now, if I commit, the results of my inserts are released to the world: I've been doing updates to the transaction test table, and when I commit my transaction, my lock is released. Locks are released when the transaction is completed, either committed or rolled back.
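As a hedged sketch of this two-session demo (object names assumed, and the two halves are meant to run in separate sessions rather than as one batch), session one holds an open transaction while session two's read blocks until the COMMIT or ROLLBACK:

```sql
-- Session 1 (the instructor): open a transaction and leave it uncommitted.
BEGIN TRANSACTION;
    INSERT INTO dbo.TransactionTest (Participant, Message)
    VALUES (SUSER_SNAME(), N'Hello');
-- ...an exclusive lock is now held; do not COMMIT yet.

-- Session 2 (a participant): this read blocks until session 1 finishes.
SELECT * FROM dbo.TransactionTest;

-- Back in session 1: release the lock one way or the other.
COMMIT TRANSACTION;      -- participants now see the 'Hello' row
-- ROLLBACK TRANSACTION; -- or: the insert is reversed and nobody ever sees it
```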
If I run my commit, everyone should now have this row show up for them, with the message Hello. If I delete that and run the transaction again with a different message, and you run show transaction tests again, we've got another four people waiting. Now if I roll back, you shouldn't see Hello 2: the row was inserted into the table, but the table wasn't accessible to anybody yet, and when I roll back, the insert is reversed. Records that were inserted are removed, and records that were modified are returned to their original state, so after the rollback we get no rows as a result, because the insert was not committed. Now let's look at issues with consistency, because consistency is a pretty big deal, especially when you're dealing with sensitive data and information that is being displayed and used on the web: e-commerce, banking, that sort of thing. There are lots of reasons we want consistent data. So here I'm going to run a query that again works with the transaction test table: I'm going to insert a row, and then I'm also going to update that same row to a different value. Let's start this transaction: one row affected, and it has begun.
Now run the consistency issues one, the run with READ UNCOMMITTED procedure; you're again welcome to follow along. Notice that you can actually run that stored procedure; nobody is locked out at all. But if I roll back and run it again, everything goes away. The problem with the dirty read is that when you ran your statements over here, you didn't get the latest result: you got the record, but two things happened as part of my transaction, an insert and an update, and notice that the value you read was not the updated value yet. That is a consistency concern: an update has been run, but you don't get the latest data. This is a big issue as far as data consistency goes. Was the table locked? Technically yes, because I haven't committed the transaction and it hasn't finished yet, and yet you were still allowed to read.
Well, partially: the data gets inserted, but when you run this with READ UNCOMMITTED, I have not actually committed my updates yet, and your read doesn't honor the lock, so you can pull the data as it sits at that moment, which is not the data as it will be once my updates are committed. That is the consistency issue with reads, and this is what we call a dirty read: you read data from the database that was not consistent with what is actually, currently, in the database. So, on to the next one, and I forgot to take one name out there. If you're connected to my database: I'm not sure why this was working last night and not now, ah, the laws of demos. Anyway, I'm going to begin this transaction, which is a transaction for an update. Run this, and there we go; now we're blocked. Now, if you want, run the fix, the run with READ COMMITTED procedure. We've got a few people running it now.
Now that we've got the read committed version running, if I roll back here, you don't get the value I was trying to write. I had a row that said test updated row, but since I rolled back my transaction, you did not get the updated data, which is okay, because what you read is consistent with the actual data. So why does this fix the dirty read issue? If we navigate to that stored procedure, notice this line: we've set the isolation level to READ COMMITTED, and that takes care of the dirty read. The isolation level is where you fix consistency issues.
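A hedged sketch of what the two participant procedures likely boil down to; the table name is the assumed one from earlier, and SET TRANSACTION ISOLATION LEVEL is the standard T-SQL statement:

```sql
-- Dirty-read version: reads whatever is there, committed or not.
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
SELECT * FROM dbo.TransactionTest;   -- may return rows another session later rolls back

-- Fixed version: shared locks are honored, so only committed data is returned.
SET TRANSACTION ISOLATION LEVEL READ COMMITTED;
SELECT * FROM dbo.TransactionTest;   -- blocks while an exclusive lock is held,
                                     -- then returns the committed state
```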
Transactions do not guarantee consistency on their own; you set that with isolation levels, and different isolation levels provide different benefits, different amounts of concurrency, and different costs, which we'll see with snapshotting. So we've just shown how the READ UNCOMMITTED isolation level gives you dirty reads and how READ COMMITTED fixes that. Now let's do repeatable reads. The next stored procedure just does one read after another; you can see it here in run with repeatable read: it sets the isolation level to REPEATABLE READ, does one read, delays for 15 seconds, and then does another read. So let's do that: I'll start my transaction here, and then let's do the run with repeatable reads.
Okay, that started. You can also run the non-repeatable reads procedure and see how many are going. Now if I roll back and finish my transaction, after 15 seconds, give or take, there it goes. Because you were locked out, this allows a repeatable read. The next part down here is a little different, though. Up here, this was just one single row that was updated and changed, and you ran two reads afterward, so the result was prevented from being different. The idea behind a repeatable read is that within one transaction, if you have more than one read, both of those reads should have the same results; that's what we're after here. So go ahead and do the run with repeatable read again. I'm going to start running mine, and then I'm going to go over here and run an insert and an update while that is going, and let's see what happens. Ah: so the repeatable read held, in part. Here's my before, and here's my after 15 seconds, and both of those rows are accurate.
But now I have what we call a phantom row. Within a single transaction, two identical reads produced a different number of rows. That extra row is a phantom row: data that was inserted in between our two reads, because I didn't have the proper isolation level to prevent it from being inserted. So let's fix that issue. The next one we'll do is the run with serializable. Let me execute my initialize data, and now you'll want to run the run with serializable. If we go down to that stored procedure, the isolation level is SERIALIZABLE, roughly as sketched below. So we're down to the last isolation level here, and this one prevents all of the consistency issues, including phantom rows.
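The serializable version is the same pattern with a stricter isolation level; again, these are stand-in names, not the exact demo procedure:

```sql
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
BEGIN TRANSACTION;

SELECT Id, Name FROM dbo.DemoTable;   -- first read

WAITFOR DELAY '00:00:15';

-- SERIALIZABLE takes key-range locks, so another session's INSERT into the
-- range we read is blocked until this transaction finishes. That is what
-- eliminates phantom rows between the two reads.
SELECT Id, Name FROM dbo.DemoTable;   -- second read: same rows, same count

COMMIT TRANSACTION;
```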
So this is basically the most isolated we can get. That's the idea when we're running SQL on a shared server: if we want things to be consistent, we want to run our stuff in isolation, without anybody else interfering with it. That is the idea of the isolation levels. Okay, so go ahead and run the run with serializable, and as you're running that, I'm going to do another insert. Uh huh: no more phantom rows. This is good. Our isolation level was set so that changes that happened outside of my transaction did not impact what I was doing inside my little bubble, and that is important for consistency. We do also have kind of the ultimate tier, the snapshot isolation level. Remember, snapshot isolation is expensive, because it takes a snapshot of your data at the time the transaction runs. It's good depending on what you're working with: if you have something that is very sensitive and very important, snapshots work very well for that. And essentially, as you increase your isolation, you give something up.
Some things require more resources to accomplish, especially snapshotting; that's the one that's really expensive. With most of the other isolation levels you don't see too much of an impact, but they can be slower in general to execute, because everything is done in isolation. And to some degree, concurrency can take a hit, depending on which isolation level you're working with. SERIALIZABLE, especially, locks a lot more things as part of it. These levels do still use shared locks, because you want to be able to run more than one thing at a time, but the important part with REPEATABLE READ and SERIALIZABLE is that they hold those locks until the transaction is fully done, not just until the individual SQL statement has completed.
That's the big difference between these. Snapshots are better in this sense, because all of those other levels still place locks on tables, and so concurrency suffers: while that lock is on the table, nobody else can use it, because I have it locked down. That's where we get more benefit out of snapshots. If concurrency matters, then snapshots are the way to go, but they are expensive. So let's run that. If you do the run with snapshotting here, and then I run my insert, notice that my insert happened instantly while my snapshot run was still going. Because it was snapshotted, it has its own copy of the data, so it doesn't need to block anybody. This is where things are really beneficial, because if you have a long-running query that takes 15 seconds, which is not abnormal depending on the size of your database, and you're running that query inside a transaction at a high isolation level, then you're blocking everything else from happening. Imagine if I added something to my cart on Amazon and you couldn't buy anything on Amazon because I was buying something. A rough sketch of the snapshot setup is below.
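This sketch assumes a database called DemoDB and the same stand-in table; snapshot isolation has to be enabled at the database level before a session can use it:

```sql
-- One-time setup: allow snapshot isolation in this database. This is what
-- enables the row-versioned copies, and it is also where the extra cost comes from.
ALTER DATABASE DemoDB SET ALLOW_SNAPSHOT_ISOLATION ON;

SET TRANSACTION ISOLATION LEVEL SNAPSHOT;
BEGIN TRANSACTION;

-- Both reads see the data as it was when the transaction started, served from
-- row versions, so writers in other sessions are never blocked by us and
-- their inserts are not visible to us.
SELECT Id, Name FROM dbo.DemoTable;
WAITFOR DELAY '00:00:15';
SELECT Id, Name FROM dbo.DemoTable;

COMMIT TRANSACTION;
```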
That's the big issue with some of these isolation levels, and snapshotting prevents that sort of thing. Of course, a really large production system like Amazon uses a distributed database, spread across very large numbers of servers; the sheer size of the databases that run those sorts of commerce sites is quite impressive.
Transaction Log
YouTube VideoVideo Transcription
This topic I'm going to be mostly skimming through. It's important to be aware of, but I'm not really going to be testing on it; it's more of a what-happens-underneath-the-hood sort of thing. We talked about transactions last time: a transaction is essentially a receipt for any SQL that we execute on our database, including selects, updates, inserts, and deletes. These are really important because they allow us to enforce ACID. Remember atomicity: either all or nothing. If we're updating 100 rows, atomicity implies that we can't update 50 of them and let the other 50 just go off into cyberspace. Then consistency, which is really important: think of a database as just a bunch of states. When we do an update, we transition from one state of the database, where those updates don't exist, to a state of the database where those updates do exist.
And we don't want any in-between states; we want one single transition between those. If we insert a row, we go from not having that row to having that row, nothing in between, roughly like the sketch below. That's where we get our consistency of data, and with our transactions we showed how, if we don't enforce ACID, we can see those inconsistencies if a read happens while a write is in process. Likewise, we talked about isolation: we have different isolation levels with different properties that prevent dirty reads, phantom rows, and things like that. If we execute our SQL in full isolation, then we have no issues whatsoever, but that does come with the cost of less concurrency, so you have to balance between those. And then we also talked about durability: if we do make a change to the database, can we be certain that that change stays? This is what happens when we commit a transaction, because during a transaction we can execute our update or insert or select or whatever.
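Here's that sketch, using a hypothetical dbo.Accounts table: either both updates become visible together, or neither does.

```sql
BEGIN TRY
    BEGIN TRANSACTION;

    UPDATE dbo.Accounts SET Balance = Balance - 100 WHERE Id = 1;
    UPDATE dbo.Accounts SET Balance = Balance + 100 WHERE Id = 2;

    COMMIT TRANSACTION;       -- both changes become durable together
END TRY
BEGIN CATCH
    IF @@TRANCOUNT > 0
        ROLLBACK TRANSACTION; -- any error undoes both changes; there is no
                              -- state where only one of the rows changed
END CATCH;
```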
But if it's not committed, the SQL inside of that transaction hasn't officially happened yet. That is what we get with durability. Okay, so ACID in general is extremely important when we're talking about SQL and relational databases. NoSQL databases like Mongo have different ACID properties, so if we get time to talk about NoSQL, I'll bring that topic back up. The big thing we look at here is how consistency and isolation, in terms of ACID, are accomplished. Oh, yes, question. It depends on which NoSQL database you're talking about: different NoSQL database engines have different properties and forgo or loosen some ACID property in favor of another. Most of the time, NoSQL will abandon certain properties, or lessen the restrictions on them, in favor of high concurrency or high availability. That's typically what NoSQL goes for: super high concurrency and very high availability.
Relational databases, as we saw with transactions, can at times limit concurrency and limit the availability of the data while a transaction is running, depending on the isolation level and the setup. So NoSQL favors concurrency and availability over most other things. NoSQL is valued for its insanely fast reads, not so much for writes; it's mainly for reading data instead of writing data. But again, we'll talk more about that later. So, within ACID, consistency and isolation are primarily accomplished using locks. We've talked about shared locks, exclusive locks, and a little bit about row versioning, especially with things like snapshotting in the extreme case: we can snapshot something, so one transaction runs against one version of the database while a read runs against a different snapshot. Atomicity and durability are accomplished using what we call the transaction log.
Last time I didn't really talk about durability or atomicity; we did a little bit of durability with the commits, but we primarily looked at consistency and isolation in my examples. Okay, so the transaction log is just that, a log: a string of log records that record all data modifications, each associated with an identifying number, the log sequence number (LSN), which is a sequentially increasing number, 1, 2, 3, 4, 5, and so on; really simple. Each record is going to contain the ID of the transaction, because as we go, we're recording the transactions that happen on the database, and each record will also contain a back pointer to the previous record. Remember we did rollbacks: how do you know where to roll back to? That's thanks to the transaction log. This is what gives us durability, whether something is committed or not committed. When a transaction is committed, we no longer need to roll it back, so that part of the transaction log can be released, because we don't need it anymore. But if we roll back, then we can use that backwards pointer to reverse whatever happened in that transaction. That is the benefit of our transaction log.
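If you're curious, you can peek at these records with the undocumented fn_dblog function. Because it is undocumented, treat this purely as a learning tool, not something to rely on; the exact column names can change between SQL Server versions.

```sql
-- Each row is one log record: its LSN, the transaction it belongs to, the
-- operation performed, and a pointer back to the previous record.
SELECT [Current LSN],
       [Transaction ID],
       Operation,              -- e.g. LOP_BEGIN_XACT, LOP_MODIFY_ROW, LOP_COMMIT_XACT
       [Previous LSN]          -- the back pointer used when rolling back
FROM   fn_dblog(NULL, NULL);   -- NULL, NULL = the whole active portion of the log
```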
So it serves many purposes, but mainly ACID; that's the really big reason why we need a transaction log. With it, though, we can also recover certain things. For individual transactions that we execute, we can choose to roll back or commit, but we can also roll transactions forward or backward if anything was incomplete. For example, if you were in the middle of a transaction and the database connection failed, or the database itself failed completely, maybe the power went out and you didn't have a battery backup, that incomplete transaction should still exist in the transaction log. So when the database starts up, if there's an incomplete transaction in the transaction log, it will automatically roll that incomplete transaction back, because it is not consistent, and rolling it back keeps our ACID properties intact. We can even roll back an entire database, not just an individual transaction, if we have a complete transaction log, because that log records every operation that happens. With a complete record of those transactions, we can either roll forward or roll backward, depending on which direction we need to go. Okay.
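Rolling a whole database forward from backups looks roughly like this; the file paths, database name, and timestamp here are hypothetical, and it assumes log backups have been taken.

```sql
-- Restore the last full backup, but leave the database ready for log restores.
RESTORE DATABASE DemoDB
    FROM DISK = 'C:\Backups\DemoDB_full.bak'
    WITH NORECOVERY;

-- Replay the transaction log up to a specific point in time, then recover.
RESTORE LOG DemoDB
    FROM DISK = 'C:\Backups\DemoDB_log.trn'
    WITH STOPAT = '2024-01-15 10:30:00',
         RECOVERY;
```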
SQL Server itself uses this for replication. Say we want to ensure high availability of our data: a lot of times what happens is we have a ghost database. We have database A, which is the one that is front-facing to the world, and then we have database B, which is an exact copy of it. When a transaction happens on database A, that transaction can be replayed onto database B, so database B stays a relatively up-to-date copy of the production database. If the production database gets corrupted or fails in any way, the fallback database can be activated and there are no interruptions as a result. That's one way we can use this, and it encourages high availability of our data: if we can replicate, we can ensure there are no interruptions for our database, which is really important if you're running something like an online service, or even something internal behind a desktop app. So what is recorded? Well, we're dealing with transactions, so the beginning and end of each transaction, of course, and then every data modification in between. Whether or not you explicitly begin a transaction, these get recorded underneath the hood anyway, and you don't have to do anything for any of this.
This is managed and done by SQL Server itself; it's nothing that you really have control over or need to worry about as a database designer. It all just happens underneath the hood. When you write data, update data, et cetera, you're allocating new data space, new pages, and every page allocation is also recorded as part of the transaction log. Likewise, if you create a table or an index (we'll talk about indexes next time), or drop a table, that is also recorded, because we want to be able to roll it back: if that operation doesn't complete fully, it could really mess up your data, so we want to be able to reverse it if we have to. Now, let me see how much time I've got. I might not have enough time to actually draw a log file out; I planned on drawing one, but it doesn't look like I'm going to get to it. So, the log itself is just a wraparound file.
Think of a circular array: you have an array that constantly wraps on itself, and if you run out of space, it just keeps on growing. With the transaction log, if everything has been committed or rolled back and no other transactions are active, that means there is a part of the log file we don't need anymore, because there's no way we would need to roll back into it. When that happens, those records can be released, and the file can wrap back around and reuse that space. In that sense, everything is partitioned into virtual log files, so you can imagine this log file being segmented into chunks.
Each of those chunks can be active or inactive at any time. Inactive means that even if there is data inside that virtual log file, nothing in there is needed anymore, so it can be overwritten or reused, since it no longer holds any active transaction. Each virtual log file stores that indicator, whether it is active or inactive. We also keep track of the minimum LSN, the minimum log sequence number: the specific log sequence number we would have to roll back to if something happens. If we have a bunch of live transactions, where in the transaction log do we have to roll back to in order to guarantee ACID? That is the number the log file keeps track of, and that way, if something does happen, the nuclear option so to speak, we can roll back to that minimum LSN to ensure that our database's ACID properties hold. You can actually peek at the virtual log files yourself, as in the sketch below. Next up: write-ahead logging.
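On newer versions of SQL Server (2016 SP2 and later, if I remember right) there is a dynamic management function that lists the virtual log files and their active flag; consider this a sketch, since the exact column names can vary by version.

```sql
-- One row per virtual log file in the current database's transaction log.
SELECT file_id,
       vlf_begin_offset,
       vlf_size_mb,
       vlf_active        -- 1 = still holds records that may be needed for recovery
FROM   sys.dm_db_log_info(DB_ID());
```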
Write-ahead logging means we log our actions before they actually happen on the disk itself. If a transaction happens and an update happens in that transaction, the record of that update is written to the log file before the update is physically written to your data files on the hard drive. The hard drive is a large source of potential errors; as a physical device, it is subject to failures. So we want to have a record of that update before it physically happens, because if something goes wrong at the disk and the change never makes it to your log file, then as far as the database is concerned it didn't happen, and we can't assure ACID, which is what we're trying to achieve here. We also have the idea of checkpoints.
A checkpoint happens every once in a while: after some number of transactions, or when there are no active transactions in the system, SQL Server will flush anything that is being held in memory for your database and log and physically write it to the hard drive, so nothing is lost. Checkpoints are very useful. And every once in a while the log gets truncated, and inactive virtual log files get flagged for reuse. This helps prevent your log file from getting too big, because you don't want an infinitely growing file in your system; transaction information you wrote two years ago is irrelevant to the data you have now, so the space where that transaction information is stored can be reused later.
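You can force a checkpoint yourself and then look at how full the log is; both of these are standard T-SQL commands.

```sql
-- Force a checkpoint in the current database (SQL Server normally decides
-- when to do this automatically).
CHECKPOINT;

-- Show the log size and percent used for every database on the server.
DBCC SQLPERF(LOGSPACE);
```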
Typically, if a virtual log file is full and has no active transactions in it, nothing that could potentially need to be rolled back to, which is indicated by the minimum log sequence number, then it can be flagged as inactive. If every sequence number in that virtual log file is lower than the minimum LSN, we know we would never roll back into anything in that file, so it can be flagged for reuse and truncated to save space. If I had time, I would totally sit here and draw out a log file, but we are short on time today, so I'm going to skip over this part.
But as you can imagine, in the diagram the markers across the top indicate virtual log files; you can see it segmented into four virtual log files, 1, 2, 3, 4. Down at the bottom is the log sequence number: 0, 1, 2, 3, 4, 5, and so on. In each one of these you would see entries like "1B", the begin of transaction one, and "1C", the commit of transaction one, and so on and so forth. If you want to see some examples of these, a quick Google search will turn some up. Again, this is more just for your information; I'm not expecting you to remember all the details. The really big point from today is that the transaction log exists, and that it is what ensures atomicity and durability for ACID.
Okay. So when does the log get truncated? Typically when the log gets backed up, or after a checkpoint; that's when virtual log files are flagged for reuse. It also happens during database backups and database recoveries: when backups happen, that transaction log is heavily relied on, so we can see truncation happen during that phase as well (a minimal log backup sketch is below). The reason we keep that log file nice and neat and clean is so we can recover things in your database, things like rollbacks. Yeah, checkpoints are done automatically by SQL Server, and off the top of my head I can't remember exactly how it determines when a checkpoint is needed.
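In the FULL recovery model, it's the log backup that tells SQL Server it can mark inactive virtual log files for reuse; the database name and path here are hypothetical.

```sql
-- Back up the transaction log; after this, inactive portions of the log
-- can be reused instead of the log file growing forever.
BACKUP LOG DemoDB
    TO DISK = 'C:\Backups\DemoDB_log.trn';
```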
But yeah, SQL Server handles the checkpoints. During recovery we have a couple of things: redo and undo. Redo is rolling forward committed changes; if something gets interrupted, we have the transaction log, so if a transaction happened but needs to be reapplied, we can do that, and we can also roll a transaction back. We can likewise redo or undo individual logical operations, like individual updates and individual deletes. So transaction-wise we can roll back and roll forward, and we can do the exact same thing for the logical operations, because we have to be able to redo or undo the logical operations in order to redo or undo a transaction. So it makes sense.