LINQ
Bridging the Gap Between Databases and Data Structures
Bridging the Gap Between Databases and Data Structures
We saw in the last chapter how many web applications are built around the concept of resources. An online store has products to sell. A social media application has people with accounts. Resource-oriented web applications therefore need to store these resources in some fashion - possibly as objects in some kind of collection, or possibly with a database.
Some key terms to learn in this chapter are:
The key skills to develop in this chapter are:
We use the term “Database” to describe a program primarily designed to efficiently store and retrieve data. These are often used in conjunction with other kinds of applications - such as web applications or desktop applications. Why use a separate program rather than building this storage and retrieval into the application we are building? There are a lot of possible reasons, but some common ones are:
Centralized access. Consider a desktop application that tracks inventory for a large company. You might want to have the application installed on every computer in the sales department, and you want the inventory to stay up-to-date for all of these. If each desktop app connects to the same centralized database, they will all be kept up-to-date.
Efficiency of specialization. By designing a database program for the sole purpose of storing and retrieving data, it can be designed by programmers who specialize in this area, and be designed to be as efficient as possible… often far more than we would be able to build into our own applications.
The most common type of database in use today are relational databases. These are databases that organize data into structured tables consisting of columns with a specific data type. Each record is therefore a row in the database. Visually, such a database appears much like a spreadsheet. These databases are typically communicated with using Structured Query Language (SQL). This is a special programming language developed specifically to express the kinds of operations we need to carry out on these databases.
For example, to retrieve the first and last names of all students in the students table who are earning above a 3.0, we would use the query:
SELECT first, last FROM students WHERE gpa > 3.0;
And the results would be provided as a stream of text:
first | last
Mike | Rowe
Jan | Herting
Sam | Sprat
You’ll learn more about these databases in CIS 560.
Relational databases were developed in the 60’s and 70’s to deal with the specific challenges of computer systems in those days. The way they organize data lends itself to large, flat files, and SQL and its results are handled as streams of data. When object-orientation became a popular programming language paradigm, object-oriented databases - databases that store data as objects, were also developed. However, while object-oriented programming languages became quite popular, the corresponding databases failed to. As a result, object-oriented databases remain quite niche.
The popularity of JSON and XML with the web led to another category of databases being developed, document-based databases (sometimes called No-SQL databases). These databases store records in a serialized format like JSON or XML, and downplay relationships between records. Unlike relational databases, document-based databases can be flexible in what data exists for an individual record.
For example, if we were to create a Student table in a relational database, and wanted to hold the student’s degree plan, we’d need two columns to represent a dual-major. All students in the table would have two columns, even if they only had one degree plan. In contrast, with a document-based database most students would only have one degree plan, but dual major would have two, and if we had a rare triple major, they could have three!
Thus, document-based databases add flexibility at the cost of some efficiency in retrieval.
Despite the growing popularity of Document-Based Databases, the vast majority of databases remain relational (MySql, MSSQL, PostgreSQL, Oracle, SQLite). Yet, most of the programs that utilize them are now object-oriented. This creates a bit of a disconnect for many programmers working with both. One of the solutions devised to help with this challenge are object-relational mappers (ORMs). These are libraries that are written for object-oriented languages that facilitate talking to the relational database. When a query is run against the database with an ORM, the results are provided as a collection of objects instead of the normal text (basically, the ORM parses the results and transforms them into an object representation). This simplifies working with databases for the programmer.
One solution that Microsoft has developed to tackle the disconnect between relational databases and C# is Language Integrated Query (LINQ). This technology integrates querying directly into the C# language. It can act as a bridge to a MS SQL server, replacing the need for writing SQL queries and processing the results. But it can also be used to query collections that implement the IEnumerable
interface, as well as XML files. As such, we can design the logic of our program to use LINQ, and change out the data source for any of these options with minimal refactoring.
LINQ queries are written as either query expressions or by using query operators. Let’s examine the two.
LINQ query expressions appear much like SQL statements. For example, the query from the last section, selecting the names of students whose GPA is greater than 3.0 as a LINQ query expression would be:
var highGPAStudents = from student in students where student.GPA > 3.0 select new { First = student.First, Last = student.Last};
This format looks much like SQL, and is often more comfortable for programmers who come from a SQL background.
LINQ queries are actually implemented through C# operators. Query expressions, like the one above, are compiled into an equivalent series of query operators. Programmers can also use these operators directly. For example, the above query expressed using operators would be:
var highGPAStudents = students.Where(student => student.GPA > 3.0).Select(student => new {First = student.First, Last = student.Last});
Programmers from an object-oriented background often find the operators (which are essentially methods that take lambda expressions as parameters) more comfortable.
Queries are not actually executed until you start iterating over their results. This allows you to chain additional queries on an existing query, and allows the compiler to optimize queries by grouping query expressions together. Consider this compound query to select half of low-performing students to assign to an advisor:
var strugglingStudents = from student in students where student.GPA < 2.0 select student;
var strugglingStudentsAtoN = from strugglingStudents where student.Last.CharAt(0) >= 'A' && student.Last.CharAt(0) < 'N' select student;
If we wrote this as C# algorithm, we might do something like:
var strugglingStudents = new List<Student>();
foreach(var student in students) {
if(student.GPA < 2.0) strugglingStudents.Add(student);
}
var strugglingStudentsAtoN = new List<Student>();
foreach(var student in strugglingStudents) {
if(student.Last.CharAt(0) >= 'A' && student.Last.CharAt(0) < 'N') strugglingStudentsAtoN.Add(student);
}
As you can see, this results in two iterations over lists of students. In the worst case (when every student is struggling) this requires 2*n operations.
On the other hand, by delaying the execution of the query until the first time its values are used, LINQ can refactor the query into a form like:
var strugglingStudentsAtoN = new List<Student>();
foreach(var student in students) {
if(student.GPA < 2.0 && student.Last.CharAt(0) >= 'A' && student.Last.CharAt(0) < 'N')
strugglingStudents.Add(student);
}
With this refactoring, we only need one iteration over our list - our query would run twice as fast! Also, if we never use strugglingStudentsAtoN
, the query is never executed, so the cost is constant time. This might seem nonsensical, but consider if we have some kind of conditional, i.e.:
switch(advisor.Number) {
case 1:
ReportStudents(strugglingStudentsAtoN);
break;
case 2:
ReportStudents(strugglingStudentsNtoZ);
break;
}
We only end up executing the query necessary for the logged-in advisor.
LINQ uses a programming pattern known as method chaining, where each query method returns an object upon which additional query operations can be performed. Thus, it is perfectly legal to write a query like:
var query = students.Where(s => s.GPA > 2.0).Where(s => s.Age > 25).Where(s => s.Last.CharAt(0) == 'C');
While this may seem silly (as we could have expressed this with one where clause), it makes more sense when we have user interface filters that may or may not have a value, i.e.:
var query = students.All();
if(minGPA != null) query = query.Where(s => s.GPA >= minGPA);
if(maxGPA != null) query = query.Where(s => s.GPA <= maxGPA);
if(first != null) query = query.Where(s => s.First == first);
if(last != null) query = query.Where(s => s.Last == last);
The result of the query is therefore a specialized object created by LINQ that implements the IEnumerable<T>
interface. The type of T
depends on the query (queries are always strongly typed, though the type can be inferred). For example, in our strugglingStudents
query, the result type is IEnumerable<Student>
:
var strugglingStudents = from student in students where student.GPA < 2.0 select student;
In contrast, the highGPAStudents
result uses an anonymous type:
var highGPAStudents = from student in students where student.GPA > 3.0 select new { First = student.First, Last = student.Last};
The anonymous type is created by the expression new { First = student.First, Last = student.Last}
. Basically, it’s an object with a First
and Last
property (and no methods or other properties). Anonymous types are created by the interpreter at runtime (much like an auto-Property’s backing field). As such, we aren’t able to use its type in our code.
If we want the query to return a specific type, we can instead declare a struct or object to return, i.e.:
class StudentName {
public string First;
public string Last;
public StudentName(string first, string last) {
First = first;
Last = last;
}
}
And then set this as the projection type:
var highGPAStudents = from student in students where student.GPA > 3.0 select new StudentName(student.First, student.Last);
Let’s take a deeper look at LINQ syntax next.
Let’s discuss some of the more common operations we might want to perform with LINQ. Please note that this is not an exhaustive list - just some of the most common operations you will be encountering at this time.
For these examples, we’ll be using a data source consisting of Student
objects. These are defined by the class:
public class Student {
public string EID;
public string First;
public string Last;
public double GPA;
public int Age;
public string Major;
}
And the variable students
is a List<Student>
that we can assume is populated with initialized student objects.
We select this as our data source with the LINQ from
operator. In query syntax, this would be:
var query = from student in students ...
And with method syntax, we would simply use the students
list:
var query = students.SomeQueryMethod(student => ...)
To create a query from a data source using method syntax without applying any query methods (useful for chaining optional queries), we can invoke All()
on the collection:
var query = students.All();
To use a different data source, we would just swap students
for that source, an object that supports either the IEnumerable
(usually data structures) or IQueryable
(typically SQL or XML data sources) interface.
Projecting refers to selecting specific data from a data source. For example, if we wanted to select the full name of every Student
, we could do so with this query syntax:
var studentNames = from student in students select $"{student.First} {student.Last}";
Or with method syntax:
var studentNames = students.Select(student => $"{student.First} {student.Last}");
As the name is simply a string, the select above simply constructs the string, and the type of studentNames
is inferred to be IEnumerable<string>
.
We can also project an anonymous type. This is a special kind of object whose type is created at runtime. Anonymous types are basically collections of properties and nothing more (they cannot have methods). For example, if we wanted just the student’s full name and age, we would use this query syntax:
var studentInfo = from student in students select new {FullName = $"{student.First} {student.Last}", Age = student.Age};
or this method syntax:
var studentInfo = students.Select(student => new {FullName = $"{student.First} {student.Last}", Age = student.Age});
Finally, we could also define a new data type (i.e. class) and create an instance of it as our projection:
class StudentInfo {
public string FullName {get; set;}
public int Age {get; set;}
public StudentInfo(string fullName, int age)
{
FullName = fullName;
Age = age;
}
}
Using query syntax:
var studentInfo = from student in students select new StudentInfo($"{student.First} {student.Last}", student.Age);
or this method syntax:
var studentInfo = students.Select(student => new StudentInfo($"{student.First} {student.Last}", student.Age));
One of the most common operations you will do with a query is filter the data, so the results contain only part of the original data. This is done with the where
operator takes a statement that resolves to a boolean. If this boolean expression resolves to true, then the data is included in the results; if it is false, it is excluded. For example, to find all students older than 25, we would use this query syntax:
var olderStudents = from student in students where student.Age > 25 select student;
or this method syntax:
var olderStudents = students.Where(student => student.Age > 25);
If we have a list that contains multiple types, we can filter for specific types with the where
operator or the OfType
operator (this is an instance where query and operator syntax vary more greatly). Consider the case where our Student
class is a base class to GraduateStudent
and UndergraduateStudent
classes. If we wanted to get a list of only the undergraduates, we could use a where
query syntax combined with an is
casting test:
var undergraduates = from student in students where student is UndergraduateStudent select student;
In this case, the result would be an IEnumerable<UndergraduateStudent>
. But the corresponding where in operator syntax would result in an IEnumerable<Student>
that contained only UndergraduateStudent
objects. To perform a cast as part of the filtering, we would instead use the OfType<T>()
method:
var undergraduates = students.OfType<UndergraduateStudent>();
Often we want to apply some form of sorting to our results, i.e. we might want to sort students by GPA. This can be done with an orderby
operator. In query syntax it would be:
var studentsByGPA = from student in students orderby student.GPA select student;
And in method syntax:
var studentsByGPA = students.OrderBy(student => student.GPA);
The orderby
operator sorts in ascending order (so students with the lowest grades would come first in the list). If we wanted to sort in descending order, we would need to specify descending order in our query syntax:
var studentsByGPA = from student in students orderby student.GPA descending select student;
There is also an ascending keyword you can use. This is helpful if you can’t remember the default or want to make it clear to other programmers that the list will be sorted in ascending order:
var studentsByGPA = from student in students orderby student.GPA ascending select student;
However, in method syntax this is accomplished by a separate operator, the OrderByDescending()
method:
var studentsByGPA = students.OrderByDescending(student => student.GPA);
If we need to order by multiple properties, i.e. first and last names, this is accomplished by a comma-separated list in query syntax:
var studentsByName = from student in students orderby student.Last, student.First select student;
But in method syntax, we need to use a ThenBy()
operator for subsequent sorting options:
var studentsByName = students.OrderBy(student => student.Last).ThenBy(student => student.First);
We can mix and match ascending and descending sorting as well - for example, to sort students by descending GPA, then by names in alphabetical order we would use the query syntax:
var studentsByGPAAndName = from student in students orderby student.GPA descending, student.Last, student.First select student;
The corresponding method syntax would need separate operators for each sorting:
var studentsByGPAAndName = students.OrderByDescending(student => student.GPA).ThenBy(student => student.Last).ThenBy(student => student.First);
There is also a ThenByDescending()
operator for chaining descending sorts.
Finally, there is also a Reverse()
operator which simply reverses the order of items in the collection without sorting.
We often want to split our results into groups, which can be accomplished with the group by
operator. Consider the case where we want to split our students by the value of their Major
field. We can accomplish this with query syntax:
var studentsByMajor = from student in students group student by student.Major select student;
or using method syntax:
var studentsByMajor = students.GroupBy(student => student.Major);
The result type of a grouping operation is an IEnumerable<IGrouping<TKey, TSource>>
; the IGrouping
is essentially a key/value pair with the key being the type we were grouping by. In the example it would be IEnumerable<IGrouping<string, Student>>
(Seeing this, you can probably begin to appreciate why we normally use var
for query variables).
To print out each student in each category, we’d need to iterate over this collection, and then over the groupings:
foreach(var group in studentsByMajor) {
Console.WriteLine($"{group.Key} Students");
foreach(var student in group) {
Console.WriteLine($"{student.First} {student.Last}");
}
}
A common strategy with large data sets is to separate them into pages, i.e. the first 20 items might appear on page 1, and by clicking the page 2 link, the user could view the next twenty items, and so on. This paging functionality is implemented in LINQ using the Skip()
and Take()
operators. The Skip()
operator specifies how many records to skip over, while the Take()
operator indicates how many records to include. Thus, to take the second page of students when each page displays twenty students, we would use:
var pagedStudents = students.Skip(20).Take(20);
Note that there is no query syntax corresponding to the Skip()
and Take()
operations, so to use them with query syntax, we wrap the query in parenthesis and invoke the methods on the result. I.e. sorting students alphabetically and then taking the third page of twenty would be:
var pagedSortedStudents = (from student in students orderby last, first select student).Skip(40).Take(20);
Sometimes we want to know if a particular record exists in our data source. The Any()
operator can be used to perform such a check. It evaluates to true
if the query has any results, or false
if it does not. Like the Skip()
and Take()
, it does not have a query syntax form, so it must be invoked using the method syntax. For example, to determine if we have at least one student named Bob Smith, we could use:
var bobSmithExists = (from student in students where student.First == "Bob" && student.Last == "Smith" select student).Any();
Or, in method syntax:
var bobSmithExists = students.Any(student => student.First == "Bob" && student.Last == "Smith");
Alternatively, if we wanted to retrieve Bob Smith’s record instead of simply determining if we had one, we could use First()
:
var bobSmith = (from student in students where student.First == "Bob" && student.Last == "Smith" select student).First();
or in method syntax:
var bobSmith = students.First(student => student.First == "Bob" && student.Last == "Smith");
This evaluates to the first matching result of the query (if we have multiple Bob Smiths, we’ll only get the first one). If there is no matching record, an InvalidOperationException
is thrown. Alternatively, we can use FirstOrDefault()
which returns a default value corresponding to the query data type. For classes, this would be null
, so given this query:
var bobSmith = students.FirstOrDefault(student => student.First == "Bob" && student.Last == "Smith");
The value of bobSmith
would be his record (if he is in the collection) or null
(if he was not).
Sometimes we want to perform aggregate operations upon a data source, i.e. counting, summing, or averaging. As with paging, these are accomplished via method-only operators. For example, to count all students in our data source, we could use:
var studentCount = students.Count();
This can be combined with any LINQ query, i.e. the count of students with a GPA above 3.0 in query syntax would be:
var studentsAbove3GPA = (from student in students where student.GPA > 3.0 select student).Count();
or in method syntax:
var studentsAbove3GPA = students.Where(student => student.GPA > 3.0).Count();
Similarly, to compute the average GPA we would use the Average()
method in conjunction with a projection. In query syntax:
var averageGPA = (from student in students select student.GPA).Average();
or in method syntax:
var averageGPA = students.Select(student => student.GPA).Average();
or we can move the selector Predicate into the Average()
directly:
var averageGPA = students.Average(student => student.GPA);
We can create more complex queries to address specific questions. For example, with a group by
we could compute the average GPA by major:
var gpaByMajor = from student in students group student by student.Major into majorGroup select new
{
Major = majorGroup.Key,
AverageGPA = majorGroup.Average(student => student.GPA);
}
The Sum()
operator works similarly, summing a value. To sum the ages of all students, we could use:
var sumOfAges = (from student in students select student.Age).Sum();
or
var sumOfAges = students.Select(student => student.Age).Sum();
or
var sumOfAges = students.Sum(student => student.Age);
There are also Min()
and Max()
aggregate operators which select the minimum and maximum values, respectively. For example, we could find the maximum and minimum earned GPAs with:
var minGPA = students.Min(student => student.GPA);
var maxGPA = students.Max(student => student.GPA);
Finally, there is a generic Aggregate()
method which provides an aggregator variable that we can use to build any kind of aggregate function. Let’s first see how it can be used to duplicate the functionality of the Sum()
method:
var sumOfAges = students.Aggregate((sum, student) => sum + student.Age);
Here, sum
is inferred to be an int
, as student.Age
is an int. So it starts at 0
, and each time the Aggregate
method processes a student, it adds that student’s Age
into the sum
.
Now let’s use this method for something new - generating a string of email addresses for a bulk mailing. Assume our email application needs a list of semicolon-separated email addresses. In that case, we could generate the emails for all students from:
var emails = students.Aggregate((emails, student) => emails + $"; {student.EID}@k-state.edu");
If we had students with EIDs “mary”, “brb30”, and “stan”, the resulting string would be:
mary@k-state.edu; brb30@k-state.edu; stan@ksu.edu
You may have heard the term map/reduce in the context of functional programming or big data. This is an algorithmic approach to processing data. This pattern can be duplicated in LINQ using a query as the mapping function, and Aggregate()
as the reduce function.
In order to use LINQ with your IEnumerable
collections, you must include this using statement:
using System.LINQ
Without it, the LINQ collection methods will not be available. You might be wondering why, as your collections are mostly defined in the System.Collections
or System.Collections.Generic
namespaces.
The answer is that LINQ on collections is implemented using extension methods. This is a C# feature that allows you to add methods to any class, even a sealed
class. But how can we add methods to a sealed class? Isn’t the point of sealing a class to prevent altering it?
The answer is that extension methods don’t actually modify the class itself. Instead, they make available additional methods that the compiler “pretends” are a part of the class. But these are defined separate from the class, and cannot modify it, nor access private or protected members of the class. As the LINQ extension methods are defined in the System.LINQ
namespace, we must make them available with a using statement before the compiler and intellisense will let us use them.
Let’s see an example of creating our own extension methods. As programmers, we often use class names in Pascal case, that might be useful to convert into human-readable strings. Let’s write an extension method to do this transformation:
using System.Text;
namespace StringExtensions {
/// <summary>
/// Converts a camel-case or pascal case string into a human-readable one
/// </summary>
public static string Humanize(this String s)
{
StringBuilder sb = new StringBuilder();
int start = 0;
for(int i = 1; i < s.Length; i++)
{
// An upper case character is the start of a new word
if(Char.IsUpper(s[i]))
{
// So we'll add the last word to the StringBuilder
string word = s.Substring(start, i - start);
sb.Append(word);
// Since that wasn't the last word, add a space
sb.Append(" ");
// Mark the start of the new word
start = i;
}
}
// We should have one more word left
sb.Append(s.Substring(start));
return sb.ToString();
}
}
Notice a couple of important features. First, the method is defined as static
. All extension methods are static
methods. Second, note the use of this
in the first parameter, this string s
. The use of the this
keyword is what tells C# the method is an extension method. Moreover, it indicates the class that is being extended - String
(which is equivalent to string
).
Other than these two details, the rest of the method looks much like any other method. But any time this method is in scope (i.e. within the StringExtensions
namespace, or in any file with a using StringExtensions
statement), there will be an additional method available on string
, Humanize()
.
That’s all there is to writing an extension method. Go ahead and try writing your own to convert human-readable strings into Pascal or Camel case!
In this chapter we learned about LINQ, Microsoft’s query technology for querying collections, relational databases, and XML files (along with community implementations for many other queryable things). We saw how to use both query and method syntax provided by LINQ to perform common query operations. Finally, we examined the mechanism LINQ uses to provide new functionality to IEnumerable
collections - extension methods. We even implemented a few extension methods of our own!
Now you should be ready to use LINQ in your own projects, as well as define your own custom extension methods.