Chapter 8

Set Operators

Subsections of Set Operators

Introduction to Set Operators

YouTube Video

Video Transcription

Welcome back everyone. And in this video, we’re going to be talking about sets. So everything that we’ve been doing so far in databases has been primarily working with sets. If you remember from data structures, sets are pretty much just an unordered list. But operations on sets work a little bit differently when comparing to other data structures. So set operators that we’ll be covering today, take in on the left side, a query and the right side a query. And so the input query one input query to this is just a fully executed query results. So if we just run a select star from Table B, and set operators select star from Table A, we can combine all those results in a certain way based off of which set operator we use. And that of course, there’s also an optional order by so remember, sets are just unordered lists. And so our query results unless they are actually ordered, are not guaranteed in any particular sorted order. But let’s talk about the general processing order here with our set operators. So first off, our input queries are going to be actually executed before the operator gets a chance to actually execute, of course, because we need our operands in full first, so the input query on the left will be executed first, and then the input query two is going to be executed. And then those two are then combined using the set operator. Now, all the normal logical operating phases here are relatively normal as you would expect, with the individual input queries, except with the order by order by is a little bit different here and inside of or utilizing a set operator. So it doesn’t really matter at all what order the input queries here are to the set operator. Because sets in nature are an unordered data structure. And so order matters not for running our set operators like Union intersect and except that we’ll cover here in a little bit.

And so input query one and two should really don’t need an ORDER BY clause as part of them. But you can order the results of the set operator by adding order by after the set operator actually finishes executing. So along with the set operators in SQL Server, we have also a set of multi set operators. Now this is also this also exists inside standard SQL as well. So really, what’s the difference here between a set operator and a multi set operator? Well, generally speaking, set operators are typically concerned with just the existence of a row in a set. So if I’m combining a set, set a was set B, what is in set A what rows are in set A what rows are in set B, and then I work on combining those based off of the existence of a value in one set or the other, a multiset operator is going to be concerned with the number of occurrences of that value in each set. So that’s the very minute difference here. But the benefit that the multiset operators actually allow us to have is that if we have things like duplicates, we can actually include those duplicates as part of our results if a multiset operator exists for the set operator we’re actually working with. So I do want to very briefly here, just kind of refresh everyone’s memory for what a set on paper will look like. You should have already seen this with data structures. But I wanted to highlight a couple of differences here. So we’re familiar with this, right? We have 123 here. So that is a list, right? So this is a list. And then we have this notion with parentheses. So we have 123 here, this is a tuple and then our sets are typically denoted using curly brackets 123. So I really just wanted to write this up here very quick like primarily because if you look for any information about sets, whether it be in your math class or online, searching for set operators, things like that, you will see this In this style of writing here when defining a set, so just keep that in mind. But most of the time, most of the work that we’re actually doing with databases is just the result set of a query. But in the following videos, we’re going to be talking about a couple of the set operators that we have access to in SQL Server.

Union

YouTube Video

Video Transcription

Welcome back everyone. And in this video, we’re going to be taking a look at our first set operator union. So the syntax looks very similar to the generic syntax that I’ve shown so far. We have an input query on the left union with an optional all a qualifier. And the right hand side was, which is the input query to add an optional ORDER BY clause. So as remembering as well, the input query, and the two input queries will execute in full and their normal processing order before the union starts to happen. So let’s talk about what union actually contains, the union of two sets is going to create a set containing all members from A and B. So A union B is take everything from a and combine it with B, and then we come and then we end up with one set with everything from both. So what does that mean with our actual operations and databases? Well, in databases, it’s a little more confusing overall. So the mathematical way of representing this is we have A union B here. So the U is the actual relational set operator in algebra. But in databases, we actually have two union operators, we have union all and union distinct. So union distinct is going to be the default. So let’s, this is the default. So if the all keyword is not specified, the distinct union is going to be done. So the actual end result here is going to be the same. So when we union A with union B, we actually get all of a, all of B, and union all gets all of a all of B, and then all of the bits that they share together. So that’s what my Venn diagram here represents. And you should be a little bit familiar with what this Venn diagram looks like. We, we’ve shown this style of Venn diagrams when we were talking about Boolean operators, in CC 110. So the primary difference between the two here is that distinct, does very similar things that you would do a select distinct for. And so union all can actually have duplicate values. So if you have 123, and set A and 123, and set B, then we have double one, two threes, and the result, so we have 123123.

And our final result, if we have 123123 with with a b, then the duplicates actually get removed, and so you’re left with only one of each. So if let’s say here we have 123, and 345. Union, right, this is going to result in a set, that is with all 123345. And again, the ordering is not guaranteed, right, because sets are unordered. So the results can differ depending on the original ordering of the set, or the operation that’s actually done. So let’s do a little wiggle line here. And then over here, let’s do the same exact set operation 123. Union, then we have 345. This is going to result in a set that is 123. Or by this is the primary difference for our distinct union versus our all union so all will contain duplicates, if there are the same value in both the set A and set B. But in our union distinct if had duplicate values exists. So the same value exists in set A that exists in set B, then those those duplicate values are removed and we’re only left with a unique set. Not all of the set operators that will come Today we’ll have the distinct and all options. So most of them will actually have one or the other. But union is a unique situation where we actually have the union all and union distinct. But let’s take a look at a few examples of SQL running the union set operator. Alright, so let’s take a look at this example. Here, we have just a simple query here, where we are are actually selecting the email address from our customer table joined with the people table and trying to figure out who is the what are the emails for the people who are primary contact persons, and what are the emails for the alternate contact persons.

And so if I run each of these separately, let me connect to the database here, we get all of the email addresses as a result here. And we have 663, email email addresses, because there are 663 customers, right. And then the people who serve as alternate contacts, we can check that as well. So we have 402 rows of alternate contacts. So just over just a little over 1000 email addresses, overall. So let’s take a look at the result of actually doing a union here. So this is our input query. One, this is input query, sorry, this is input query two. And then we end on an ORDER BY, and we end on an order by. So that’s the kind of weird operation here because this looks like this is this looks like the order by actually belongs to this query here. But it actually belongs to the result of our union. Alright, so this order by actually comes after the union actually finishes, so the, this query executes, this query executes. And then the results of those two queries are then are then applied the union operator, so then we get the result from our union operator that is then ordered. So input query, then input input, query one input query to union, and then order by so let’s go ahead and give this a run. There we go. And we have a little under 1000 rows, so 984 rows here as a result, so that our that’s all of the email addresses of the primary contacts and the alternate contacts. But notice that it is a little less than 1000. Right? We had, we should have over 1000 email addresses.

Well, the default behavior here is union distinct, right? So if I don’t include so distinct here, we don’t actually have to, we don’t actually write distinct here. And actually, if you try to write distinct here, it’s going to give you a syntax error. So we don’t want to put that there. But that is the default behavior, right? Union distinct is implied if the all key word is not. So if we do union all we get the duplicates. So if we do union all so we have 984 rows with Union distinct, and then if we run this again, we get a Yeah, 1065, for union all so that means we have 80 ish, 80 ish emails that are duplicates, that means that there’s about 80 people who serve as both the primary contact person and the alternate contact person, at least for the email addresses. So that is, the is the essence of a union operator, I actually find union pretty useful if you just want to combine the results of two different queries. It’s actually a quite expressive way of doing so. You can like this particular query here, we’ve done this, but this operation before, we can actually rewrite this using joins, and we could also rewrite this using sub queries and CDs and all sorts of other things. But this is just one other way of actually achieving the same result. And it’s not necessarily more or less efficient. Sometimes, it can be more efficient to do a join here because he In this case, I’m actually having to pull all of the customers twice. And so union between these two is a little less efficient because of that fact. But in some scenarios, you’ll actually find it very difficult to write, rewrite a union operation as a join. So sometimes this actually is impossible to do. So the set operators would be the preferred way to go in those scenarios. But just remember here as well, we have the difference between union distinct which is the default behavior or union all if you actually specify the all key word here next to the union. That will do it for our union set operator. But here Next, we’ll start talking about intersect and accept

Intersect

YouTube Video

Video Transcription

Welcome back everyone. And in this video, we’re going to be taking a look at our next set operator intersect. So the the query syntax is very similar to what we’ve seen so far, we have an input query on the left and input query on the right the set operator in between, and an optional order by at the end. But do note that intersect does not have an intersect all option. So let’s take a look at what the intersect actually does. So the intersection between two sets results in the elements that are shared between the two sets. So the elements in B and the elements in A that are shared, that is what is left after the intersect actually fully happens. Now, as I mentioned, the intersect itself does not have an intersect all in T SQL, although we can do a little bit of a workaround here to get that to happen, if that is something that is needed. But let’s take a look at how this works with our Venn diagram. So intersect works very much like the end operator in Boolean algebra. So if we had a and b, we had the things that A and B shared. So for intersect A intersect B is going to result in a set such that the values and set that are in set A and in set B are returned. And we can actually also draw a Another similar example of we as we did with Union. So let’s get that going to have a set one, two, and three here, intersect a set that has three, four, and five, which will result in a set where that only has the numbers that are shared between sets A and B. So only three. All right, so we just did union here just a little bit ago. And let’s take a look at what this looks like with intersect. And I’ll be using the same exact, mostly the same exact query throughout all of my union examples just so it’s a little bit more consistent. So this is the exact same query that I ran with Union, except now we have intersect in between. And remember, the order of operations here is the input query one, followed by input query two, and then the intersect operation between the results of those two queries.

And then the ORDER BY THE ORDER BY is not part of the second query. So let’s go ahead and give this a run. Sorry, there we go. So we get our email addresses again, and you can see that we actually only have 26 rows now. So the intersection between customers who are the primary contact person, their emails that are the primary contact, and emails that are alternate contact, there are only 26 customers who have the same contact for both. So email addresses that are both the primary and alternate contacts. And again, we have done this already before using joints. But just kind of showing you an example of how intersect will work between the result of two queries. Now, as we mentioned earlier, the intersect operator does not have intersect all, but we can simulate the effect of intersect all so something like this here. Now, this example what Wide World databases are while Wide World importers database doesn’t actually work in full. So when you run this, you actually still get 26 rows. But it kind of highlights how this work round would work if we had data in our database that actually had the overlap. But primarily there is there is no the reason why there’s only 26 years because there isn’t an email that is used more than once for the primary and alternate, so it’s not there’s no duplicates here. So that is the primary reason there but we can actually look at the ending results here and showcase this so you remember the over clause, we are doing a window function right? Row number over.

And then here’s our window function write partition by email, and then order by. And then this is essentially going to select the, the first value as part of our window and then sort by that. So essentially, it’s sorting by our sorting by sorting our rows, sorting our numbers here, the row ID, the weird thing here with this, we can’t really order as we normally would, right. So that’s kind of why we have this, this, this order by here. So if we actually take this out, see here. And run this. Alright, we can actually run row number, so we’re actually going to order by here, so order by select zero, and give this a run. And what that select Order by select zero is going to let us do is actually order the window that we’re actually picking. So if we run just this individual query, this will actually be a little bit more informative. Here, we can actually already see some of the results here, if I scroll down just a little bit, rows six and seven, we have able, at example, comm here, and we have row ID one and two, you can kind of see how this query would actually end up helping us doing a intersect with duplicates, as a result. So if we wanted to, if we wanted if our second query here had that same email address, so if I actually, let’s go ahead and select this, run that and you see Abel doesn’t actually appear appear here, because Abel does isn’t actually used as an alternate contact email. So that’s why this intersect. Trick doesn’t actually work with this example, because there’s just no duplicates to join on or to include as part of the intersect all. But if we did have duplicates here, this trick with the row number would actually get as good as there, because it works here, you can actually see that we have emails that are used more than once there. But if we don’t include this window function, then the intersect ends up pulling that out, because it combines all the same email address into one. But that unique numbering between duplicate emails works pretty well.

And so if you find yourself in a situation where you need to do an intersect all this would be a trick to actually try, I can show the end result here, if we actually let me let me undo here. Let’s go back to this result. And I’m going to actually change the top and bottom to be the exact same query just to show what can happen with intersect all so here we have 623 rows with the intersect between these two queries. And so just one of these queries by itself, remember a 663 emails, so we have roughly 40 ish speaking that are duplicates. And so if we wanted to do that exact same thing, but do the intersect all, you know, I’m going to do the exact same thing as I did just a little bit ago. And I’m going to put the same query down here on the bottom. That way, when we run this, we actually get the duplicate emails. So now you see I have all 663 rows, and Abel gets actually put out here twice now. So the row the emails that are used twice as a primary contact. So just a little example of how we could get around the SQL limitation without the intersect all but again, by default, the intersect is just intersect distinct, and even in this case, this is still intersect distinct, I am just kind of exploiting the the intersect operator by making the duplicate emails not unique. So they get left in using the intersect operator. Of course, generally speaking, we wouldn’t actually want the row ID to be in the end result of our query here. So we could strip that out by using a CTE so we can actually use a set operator and inside of a look for a CTE as well. So, with intersect all CTE, our CTE there is accepted. Now that we just ran, and now we can run this and strip out the strip out the row ID not including that. We don’t want to strip that out inside here. Because without that row ID the intersect all wouldn’t work. So we want to leave that row ID for the actual set operator. And then if we don’t want it as returning that results to our users or whatever, we can strip that out by utilizing a CT and an additional query

Except

YouTube Video

Video Transcription

Welcome back everyone. And in this video, we’re going to be taking a look at the Accept set operator. So accept is going to have similar syntax as we’ve seen so far. Again, input to query one on the left input query two on the right with the except operator in between, and an optional order by to end with. So what does the except operator actually do? So the except operator is going to be the difference between two sets. So the result is going to be everything. So a except B is going to be things that belong to a but not to be just like a subtraction operator, essentially. Except all does not exist, just like intersect all does not exist, and T SQL. So again, though, we can use a similar workaround that we’ve done with intersect all. But again, the default behavior for accept is accept distinct, so no duplicates. So what does our Venn diagram look like for the Accept set operator, so the Accept set operator works very much like our subtraction, right, so we have all of a except B, so A, but not B, if we’re talking in Boolean, so all of a but we exclude anything that anything of a that is shared with B. So if a value is in A and B, that value is removed, as part of this except set operator, let’s take a look at an example of this set operator being applied to two sets. So we’re going to draw out the set operator that are the sets that we had previously with Union and intersect. So this would be our one, two, and three for our first set, except which looks like our minus sign.

And then our input set to is going to be three, four, and five. So we have set a 123, set B three, four, or five. So if we have a except B, we have all of a except for values that are shared with set B. So we have one and two here, but three gets removed, because three exists also in set B, so all of a except B. But let’s take a look at some examples of this working in SQL. Alright, so here we have the similar query that we’ve seen. So far, I’m dealing with the email addresses for the primary contact and alternate contacts again, but now we’re doing the Accept. So if we execute this query, we get 597 rows this time around. But what does this actually give us? Right? What does this actually give us? Well, if we have the primary contact emails, except alternate contact emails, this gives us email addresses that are primary contacts, but not alternate contacts. So that’s what the except operator will actually get us here for our email addresses. So fairly expressive thing that is, again, somewhat that more difficult to write as a join, but still possible, still possible, just a little bit more difficult to actually write in other ways, right. But just like what we have with programming languages, right, we can the same, you know, expression, we can write in a same Boolean expression or function or whatever, we can write, or even just tic tac toe, right, we can write tic tac toe 1000s of different ways in Python, or Java or C sharp, but it doesn’t necessarily mean that any one of them is the best way. It’s just that there’s different ways of going about actually achieving that same task. So similar idea that we have in SQL, we can achieve the similar results that we tried last time for getting around the fact that T SQL does not support accept all. So we still have to use the original accepted distinct, but we still, we can still use the window trick. So if we have a window function that applies a unique number to each of the email addresses that are duplicates, we can run this and now we can actually see that we actually get more emails as a result. And then if we scroll down here, we can see there’s our friend Abel, that has it now a duplicate email as as a result of the Accept.

So this example does work very well with the accept all you can kind of see the exploitation that I can achieve here by avoiding by getting around the except operator that pulls out or that removes duplicates. But since these two rows have a unique number now attached to them, those rows are no longer the same, they are different. But same query, same exact query, just tossing all that into a CTE. So we can run just a simple select without that row number and our end query results. That is our accept operator. And that’s actually going to be our last set operator that we’re actually going to be covering as part of this course. So there are some things to consider when you are going to use multiple set operators or chaining set operators together. So, intersect is going to actually take precedence over union and accept. So intersect will happen first, and then union and then accept, although union is not exclusive, union and accept have the same precedence. So if union comes first, so union will happen first if it comes first before accept, or vice versa. But you can just like algebraic equations and other things in SQL, you can enforce order by using parentheses. So we can take a look at a couple of example, or here, let’s just take a look at one example.

And I’ll include more examples as part of the notes. But here is a simple query, where I have a list of brands, champion, Nike Under Armour, and then we have Nike Under Armour down here and then intersect Nike, Adidas. So the end result of this is champion and Under Armour. So let’s break this down real quick. If we do this query here, we get champion so champion, Nike Under Armour, except Nike Under Armour. So that leaves us with just champion left because that’s the one that is in set A but not in set B. And then we have intersect with Nike, Adidas, but that does, that’s not the actual execution order, right? Because we don’t execute the Accept first, because if we did, then champion intersect Nike, Adidas would actually result in an empty set would actually result in an empty set. What actually happens first, is this intersect, right? So the intersect happens, so we run that we get Nike, right? We get Nike, and then champion Nike Under Armour intersect Nike is going to just take out Nike and leave us with champion and Under Armour as a result. So we can actually force that order if we wanted to change things a little bit. So let’s say I wanted to force this, except to happen first, I can actually wrap that in parentheses. And so if I run that, Aha, there we go. Now we get what we expect, right? So this whole statement executes first as part of input query, one for the intersect function. So that is going to be something that we can achieve using parentheses. So just a quick little example to show that precedence of the set operators does matter. So order of operations is still going to apply when you’re utilizing the set operators, much like what you expect with many other things in SQL and your programming languages.