Chapter 5

LINQ

Bridging the Gap Between Databases and Data Structures

Subsections of LINQ

Introduction

We saw in the last chapter how many web applications are built around the concept of resources. An online store has products to sell. A social media application has people with accounts. Resource-oriented web applications therefore need to store these resources in some fashion - possibly as objects in some kind of collection, or possibly with a database.

Key Terms

Some key terms to learn in this chapter are:

  • Database
  • Relational Database
  • Object-Oriented Database
  • Document-Based Database
  • Structured Query Language (SQL)
  • Language Integrated Query (LINQ)
  • Anonymous Types
  • Extension Methods

Key Skills

The key skills to develop in this chapter are:

  • Using LINQ to perform query operations on data sets
  • Writing extension methods to provide new functionality to existing classes

Databases

We use the term “Database” to describe a program primarily designed to efficiently store and retrieve data. These are often used in conjunction with other kinds of applications - such as web applications or desktop applications. Why use a separate program rather than building this storage and retrieval into the application we are building? There are a lot of possible reasons, but some common ones are:

  1. Centralized access. Consider a desktop application that tracks inventory for a large company. You might want to have the application installed on every computer in the sales department, and you want the inventory to stay up-to-date for all of these. If each desktop app connects to the same centralized database, they will all be kept up-to-date.

  2. Efficiency of specialization. By designing a database program for the sole purpose of storing and retrieving data, it can be designed by programmers who specialize in this area, and be designed to be as efficient as possible… often far more than we would be able to build into our own applications.

Relational Databases

The most common type of database in use today are relational databases. These are databases that organize data into structured tables consisting of columns with a specific data type. Each record is therefore a row in the database. Visually, such a database appears much like a spreadsheet. These databases are typically communicated with using Structured Query Language (SQL). This is a special programming language developed specifically to express the kinds of operations we need to carry out on these databases.

For example, to retrieve the first and last names of all students in the students table who are earning above a 3.0, we would use the query:

SELECT first, last FROM students WHERE gpa > 3.0;

And the results would be provided as a stream of text:

first | last
Mike | Rowe
Jan | Herting
Sam | Sprat

You’ll learn more about these databases in CIS 560.

Object-Oriented Databases

Relational databases were developed in the 60’s and 70’s to deal with the specific challenges of computer systems in those days. The way they organize data lends itself to large, flat files, and SQL and its results are handled as streams of data. When object-orientation became a popular programming language paradigm, object-oriented databases - databases that store data as objects, were also developed. However, while object-oriented programming languages became quite popular, the corresponding databases failed to. As a result, object-oriented databases remain quite niche.

Document-Based Databases

The popularity of JSON and XML with the web led to another category of databases being developed, document-based databases (sometimes called No-SQL databases). These databases store records in a serialized format like JSON or XML, and downplay relationships between records. Unlike relational databases, document-based databases can be flexible in what data exists for an individual record.

For example, if we were to create a Student table in a relational database, and wanted to hold the student’s degree plan, we’d need two columns to represent a dual-major. All students in the table would have two columns, even if they only had one degree plan. In contrast, with a document-based database most students would only have one degree plan, but dual major would have two, and if we had a rare triple major, they could have three!

Thus, document-based databases add flexibility at the cost of some efficiency in retrieval.

Object-Relational Mapping

Despite the growing popularity of Document-Based Databases, the vast majority of databases remain relational (MySql, MSSQL, PostgreSQL, Oracle, SQLite). Yet, most of the programs that utilize them are now object-oriented. This creates a bit of a disconnect for many programmers working with both. One of the solutions devised to help with this challenge are object-relational mappers (ORMs). These are libraries that are written for object-oriented languages that facilitate talking to the relational database. When a query is run against the database with an ORM, the results are provided as a collection of objects instead of the normal text (basically, the ORM parses the results and transforms them into an object representation). This simplifies working with databases for the programmer.

LINQ

One solution that Microsoft has developed to tackle the disconnect between relational databases and C# is Language Integrated Query (LINQ). This technology integrates querying directly into the C# language. It can act as a bridge to a MS SQL server, replacing the need for writing SQL queries and processing the results. But it can also be used to query collections that implement the IEnumerable interface, as well as XML files. As such, we can design the logic of our program to use LINQ, and change out the data source for any of these options with minimal refactoring.

LINQ queries are written as either query expressions or by using query operators. Let’s examine the two.

Query Expressions

LINQ query expressions appear much like SQL statements. For example, the query from the last section, selecting the names of students whose GPA is greater than 3.0 as a LINQ query expression would be:

var highGPAStudents = from student in students where student.GPA > 3.0 select new { First = student.First, Last = student.Last};

This format looks much like SQL, and is often more comfortable for programmers who come from a SQL background.

Query Operators

LINQ queries are actually implemented through C# operators. Query expressions, like the one above, are compiled into an equivalent series of query operators. Programmers can also use these operators directly. For example, the above query expressed using operators would be:

var highGPAStudents = students.Where(student => student.GPA > 3.0).Select(student => new {First = student.First, Last = student.Last});

Programmers from an object-oriented background often find the operators (which are essentially methods that take lambda expressions as parameters) more comfortable.

Query Execution

Queries are not actually executed until you start iterating over their results. This allows you to chain additional queries on an existing query, and allows the compiler to optimize queries by grouping query expressions together. Consider this compound query to select half of low-performing students to assign to an advisor:

var strugglingStudents = from student in students where student.GPA < 2.0 select student;
var strugglingStudentsAtoN = from strugglingStudents where student.Last.CharAt(0) >= 'A' && student.Last.CharAt(0) < 'N' select student;

If we wrote this as C# algorithm, we might do something like:

var strugglingStudents = new List<Student>();
foreach(var student in students) {
    if(student.GPA < 2.0) strugglingStudents.Add(student);
}
var strugglingStudentsAtoN = new List<Student>();
foreach(var student in strugglingStudents) {
    if(student.Last.CharAt(0) >= 'A' && student.Last.CharAt(0) < 'N') strugglingStudentsAtoN.Add(student);
}

As you can see, this results in two iterations over lists of students. In the worst case (when every student is struggling) this requires 2*n operations.

On the other hand, by delaying the execution of the query until the first time its values are used, LINQ can refactor the query into a form like:

var strugglingStudentsAtoN = new List<Student>();
foreach(var student in students) {
    if(student.GPA < 2.0 && student.Last.CharAt(0) >= 'A' && student.Last.CharAt(0) < 'N') 
        strugglingStudents.Add(student);
}

With this refactoring, we only need one iteration over our list - our query would run twice as fast! Also, if we never use strugglingStudentsAtoN, the query is never executed, so the cost is constant time. This might seem nonsensical, but consider if we have some kind of conditional, i.e.:

switch(advisor.Number) {
    case 1:
        ReportStudents(strugglingStudentsAtoN);
        break;
    case 2: 
        ReportStudents(strugglingStudentsNtoZ);
        break;
}

We only end up executing the query necessary for the logged-in advisor.

Info

LINQ uses a programming pattern known as method chaining, where each query method returns an object upon which additional query operations can be performed. Thus, it is perfectly legal to write a query like:

var query = students.Where(s => s.GPA > 2.0).Where(s => s.Age > 25).Where(s => s.Last.CharAt(0) == 'C');

While this may seem silly (as we could have expressed this with one where clause), it makes more sense when we have user interface filters that may or may not have a value, i.e.:

var query = students.All();
if(minGPA != null) query = query.Where(s => s.GPA >= minGPA);
if(maxGPA != null) query = query.Where(s => s.GPA <= maxGPA);
if(first != null) query = query.Where(s => s.First == first);
if(last != null) query = query.Where(s => s.Last == last);

Query Results

The result of the query is therefore a specialized object created by LINQ that implements the IEnumerable<T> interface. The type of T depends on the query (queries are always strongly typed, though the type can be inferred). For example, in our strugglingStudents query, the result type is IEnumerable<Student>:

var strugglingStudents = from student in students where student.GPA < 2.0 select student;

In contrast, the highGPAStudents result uses an anonymous type:

var highGPAStudents = from student in students where student.GPA > 3.0 select new { First = student.First, Last = student.Last};

The anonymous type is created by the expression new { First = student.First, Last = student.Last}. Basically, it’s an object with a First and Last property (and no methods or other properties). Anonymous types are created by the interpreter at runtime (much like an auto-Property’s backing field). As such, we aren’t able to use its type in our code.

If we want the query to return a specific type, we can instead declare a struct or object to return, i.e.:

class StudentName {
    public string First;
    public string Last;
    public StudentName(string first, string last) {
        First = first;
        Last = last;
    }
}

And then set this as the projection type:

var highGPAStudents = from student in students where student.GPA > 3.0 select new StudentName(student.First, student.Last);

Let’s take a deeper look at LINQ syntax next.

LINQ Syntax

Let’s discuss some of the more common operations we might want to perform with LINQ. Please note that this is not an exhaustive list - just some of the most common operations you will be encountering at this time.

Data Source

For these examples, we’ll be using a data source consisting of Student objects. These are defined by the class:

public class Student {
    public string EID;
    public string First;
    public string Last;
    public double GPA;
    public int Age;    
    public string Major;
}

And the variable students is a List<Student> that we can assume is populated with initialized student objects.

We select this as our data source with the LINQ from operator. In query syntax, this would be:

var query = from student in students ...

And with method syntax, we would simply use the students list:

var query = students.SomeQueryMethod(student => ...)

To create a query from a data source using method syntax without applying any query methods (useful for chaining optional queries), we can invoke All() on the collection:

var query = students.All();

To use a different data source, we would just swap students for that source, an object that supports either the IEnumerable (usually data structures) or IQueryable (typically SQL or XML data sources) interface.

Projecting

Projecting refers to selecting specific data from a data source. For example, if we wanted to select the full name of every Student, we could do so with this query syntax:

var studentNames = from student in students select $"{student.First} {student.Last}";

Or with method syntax:

var studentNames = students.Select(student => $"{student.First} {student.Last}");

As the name is simply a string, the select above simply constructs the string, and the type of studentNames is inferred to be IEnumerable<string>.

We can also project an anonymous type. This is a special kind of object whose type is created at runtime. Anonymous types are basically collections of properties and nothing more (they cannot have methods). For example, if we wanted just the student’s full name and age, we would use this query syntax:

var studentInfo = from student in students select new {FullName = $"{student.First} {student.Last}", Age = student.Age};

or this method syntax:

var studentInfo = students.Select(student => new {FullName = $"{student.First} {student.Last}", Age = student.Age});

Finally, we could also define a new data type (i.e. class) and create an instance of it as our projection:

class StudentInfo {
    public string FullName {get; set;}
    public int Age {get; set;}
    public StudentInfo(string fullName, int age)
    {
        FullName = fullName;
        Age = age;
    }
}

Using query syntax:

var studentInfo = from student in students select new StudentInfo($"{student.First} {student.Last}", student.Age);

or this method syntax:

var studentInfo = students.Select(student => new StudentInfo($"{student.First} {student.Last}", student.Age));

Filtering

One of the most common operations you will do with a query is filter the data, so the results contain only part of the original data. This is done with the where operator takes a statement that resolves to a boolean. If this boolean expression resolves to true, then the data is included in the results; if it is false, it is excluded. For example, to find all students older than 25, we would use this query syntax:

var olderStudents = from student in students where student.Age > 25 select student;

or this method syntax:

var olderStudents = students.Where(student => student.Age > 25);

Filtering By Type

If we have a list that contains multiple types, we can filter for specific types with the where operator or the OfType operator (this is an instance where query and operator syntax vary more greatly). Consider the case where our Student class is a base class to GraduateStudent and UndergraduateStudent classes. If we wanted to get a list of only the undergraduates, we could use a where query syntax combined with an is casting test:

var undergraduates = from student in students where student is UndergraduateStudent select student;

In this case, the result would be an IEnumerable<UndergraduateStudent>. But the corresponding where in operator syntax would result in an IEnumerable<Student> that contained only UndergraduateStudent objects. To perform a cast as part of the filtering, we would instead use the OfType<T>() method:

var undergraduates = students.OfType<UndergraduateStudent>();

Ordering

Often we want to apply some form of sorting to our results, i.e. we might want to sort students by GPA. This can be done with an orderby operator. In query syntax it would be:

var studentsByGPA = from student in students orderby student.GPA select student;

And in method syntax:

var studentsByGPA = students.OrderBy(student => student.GPA);

The orderby operator sorts in ascending order (so students with the lowest grades would come first in the list). If we wanted to sort in descending order, we would need to specify descending order in our query syntax:

var studentsByGPA = from student in students orderby student.GPA descending select student;
Tip

There is also an ascending keyword you can use. This is helpful if you can’t remember the default or want to make it clear to other programmers that the list will be sorted in ascending order:

var studentsByGPA = from student in students orderby student.GPA ascending select student;

However, in method syntax this is accomplished by a separate operator, the OrderByDescending() method:

var studentsByGPA = students.OrderByDescending(student => student.GPA);

If we need to order by multiple properties, i.e. first and last names, this is accomplished by a comma-separated list in query syntax:

var studentsByName = from student in students orderby student.Last, student.First select student;

But in method syntax, we need to use a ThenBy() operator for subsequent sorting options:

var studentsByName = students.OrderBy(student => student.Last).ThenBy(student => student.First);

We can mix and match ascending and descending sorting as well - for example, to sort students by descending GPA, then by names in alphabetical order we would use the query syntax:

var studentsByGPAAndName = from student in students orderby student.GPA descending, student.Last, student.First select student;

The corresponding method syntax would need separate operators for each sorting:

var studentsByGPAAndName = students.OrderByDescending(student => student.GPA).ThenBy(student => student.Last).ThenBy(student => student.First);

There is also a ThenByDescending() operator for chaining descending sorts.

Finally, there is also a Reverse() operator which simply reverses the order of items in the collection without sorting.

Grouping

We often want to split our results into groups, which can be accomplished with the group by operator. Consider the case where we want to split our students by the value of their Major field. We can accomplish this with query syntax:

var studentsByMajor = from student in students group student by student.Major select student;

or using method syntax:

var studentsByMajor = students.GroupBy(student => student.Major);

The result type of a grouping operation is an IEnumerable<IGrouping<TKey, TSource>>; the IGrouping is essentially a key/value pair with the key being the type we were grouping by. In the example it would be IEnumerable<IGrouping<string, Student>> (Seeing this, you can probably begin to appreciate why we normally use var for query variables).

To print out each student in each category, we’d need to iterate over this collection, and then over the groupings:

foreach(var group in studentsByMajor) {
    Console.WriteLine($"{group.Key} Students");
    foreach(var student in group) {
        Console.WriteLine($"{student.First} {student.Last}");
    }
}

Paging

A common strategy with large data sets is to separate them into pages, i.e. the first 20 items might appear on page 1, and by clicking the page 2 link, the user could view the next twenty items, and so on. This paging functionality is implemented in LINQ using the Skip() and Take() operators. The Skip() operator specifies how many records to skip over, while the Take() operator indicates how many records to include. Thus, to take the second page of students when each page displays twenty students, we would use:

var pagedStudents = students.Skip(20).Take(20);

Note that there is no query syntax corresponding to the Skip() and Take() operations, so to use them with query syntax, we wrap the query in parenthesis and invoke the methods on the result. I.e. sorting students alphabetically and then taking the third page of twenty would be:

var pagedSortedStudents = (from student in students orderby last, first select student).Skip(40).Take(20);

Existence Checks

Sometimes we want to know if a particular record exists in our data source. The Any() operator can be used to perform such a check. It evaluates to true if the query has any results, or false if it does not. Like the Skip() and Take(), it does not have a query syntax form, so it must be invoked using the method syntax. For example, to determine if we have at least one student named Bob Smith, we could use:

var bobSmithExists = (from student in students where student.First == "Bob" && student.Last == "Smith" select student).Any();

Or, in method syntax:

var bobSmithExists = students.Any(student => student.First == "Bob" && student.Last == "Smith");

Alternatively, if we wanted to retrieve Bob Smith’s record instead of simply determining if we had one, we could use First():

var bobSmith = (from student in students where student.First == "Bob" && student.Last == "Smith" select student).First();

or in method syntax:

var bobSmith = students.First(student => student.First == "Bob" && student.Last == "Smith");

This evaluates to the first matching result of the query (if we have multiple Bob Smiths, we’ll only get the first one). If there is no matching record, an InvalidOperationException is thrown. Alternatively, we can use FirstOrDefault() which returns a default value corresponding to the query data type. For classes, this would be null, so given this query:

var bobSmith = students.FirstOrDefault(student => student.First == "Bob" && student.Last == "Smith");

The value of bobSmith would be his record (if he is in the collection) or null (if he was not).

Aggregating

Sometimes we want to perform aggregate operations upon a data source, i.e. counting, summing, or averaging. As with paging, these are accomplished via method-only operators. For example, to count all students in our data source, we could use:

var studentCount = students.Count();

This can be combined with any LINQ query, i.e. the count of students with a GPA above 3.0 in query syntax would be:

var studentsAbove3GPA = (from student in students where student.GPA > 3.0 select student).Count();

or in method syntax:

var studentsAbove3GPA = students.Where(student => student.GPA > 3.0).Count();

Similarly, to compute the average GPA we would use the Average() method in conjunction with a projection. In query syntax:

var averageGPA = (from student in students select student.GPA).Average();

or in method syntax:

var averageGPA = students.Select(student => student.GPA).Average();

or we can move the selector Predicate into the Average() directly:

var averageGPA = students.Average(student => student.GPA);

We can create more complex queries to address specific questions. For example, with a group by we could compute the average GPA by major:

var gpaByMajor = from student in students group student by student.Major into majorGroup select new 
    { 
        Major = majorGroup.Key,
        AverageGPA = majorGroup.Average(student => student.GPA);
    }

The Sum() operator works similarly, summing a value. To sum the ages of all students, we could use:

var sumOfAges = (from student in students select student.Age).Sum();

or

var sumOfAges = students.Select(student => student.Age).Sum();

or

var sumOfAges = students.Sum(student => student.Age);

There are also Min() and Max() aggregate operators which select the minimum and maximum values, respectively. For example, we could find the maximum and minimum earned GPAs with:

var minGPA = students.Min(student => student.GPA);
var maxGPA = students.Max(student => student.GPA);

Finally, there is a generic Aggregate() method which provides an aggregator variable that we can use to build any kind of aggregate function. Let’s first see how it can be used to duplicate the functionality of the Sum() method:

var sumOfAges = students.Aggregate((sum, student) => sum + student.Age);

Here, sum is inferred to be an int, as student.Age is an int. So it starts at 0, and each time the Aggregate method processes a student, it adds that student’s Age into the sum.

Now let’s use this method for something new - generating a string of email addresses for a bulk mailing. Assume our email application needs a list of semicolon-separated email addresses. In that case, we could generate the emails for all students from:

var emails = students.Aggregate((emails, student) => emails + $"; {student.EID}@k-state.edu");

If we had students with EIDs “mary”, “brb30”, and “stan”, the resulting string would be:

mary@k-state.edu; brb30@k-state.edu; stan@ksu.edu
Info

You may have heard the term map/reduce in the context of functional programming or big data. This is an algorithmic approach to processing data. This pattern can be duplicated in LINQ using a query as the mapping function, and Aggregate() as the reduce function.

Extension Methods

In order to use LINQ with your IEnumerable collections, you must include this using statement:

using System.LINQ 

Without it, the LINQ collection methods will not be available. You might be wondering why, as your collections are mostly defined in the System.Collections or System.Collections.Generic namespaces.

The answer is that LINQ on collections is implemented using extension methods. This is a C# feature that allows you to add methods to any class, even a sealed class. But how can we add methods to a sealed class? Isn’t the point of sealing a class to prevent altering it?

The answer is that extension methods don’t actually modify the class itself. Instead, they make available additional methods that the compiler “pretends” are a part of the class. But these are defined separate from the class, and cannot modify it, nor access private or protected members of the class. As the LINQ extension methods are defined in the System.LINQ namespace, we must make them available with a using statement before the compiler and intellisense will let us use them.

Let’s see an example of creating our own extension methods. As programmers, we often use class names in Pascal case, that might be useful to convert into human-readable strings. Let’s write an extension method to do this transformation:

using System.Text;

namespace StringExtensions {
    
    /// <summary>
    /// Converts a camel-case or pascal case string into a human-readable one 
    /// </summary>
    public static string Humanize(this String s)
    {
        StringBuilder sb = new StringBuilder();
        int start = 0;
        for(int i = 1; i < s.Length; i++) 
        {
            // An upper case character is the start of a new word
            if(Char.IsUpper(s[i]))
            {
                // So we'll add the last word to the StringBuilder
                string word = s.Substring(start, i - start);
                sb.Append(word);
                // Since that wasn't the last word, add a space 
                sb.Append(" ");
                // Mark the start of the new word 
                start = i;
            }
        } 
        // We should have one more word left 
        sb.Append(s.Substring(start));
        return sb.ToString();
    }

}

Notice a couple of important features. First, the method is defined as static. All extension methods are static methods. Second, note the use of this in the first parameter, this string s. The use of the this keyword is what tells C# the method is an extension method. Moreover, it indicates the class that is being extended - String (which is equivalent to string).

Other than these two details, the rest of the method looks much like any other method. But any time this method is in scope (i.e. within the StringExtensions namespace, or in any file with a using StringExtensions statement), there will be an additional method available on string, Humanize().

That’s all there is to writing an extension method. Go ahead and try writing your own to convert human-readable strings into Pascal or Camel case!

Summary

In this chapter we learned about LINQ, Microsoft’s query technology for querying collections, relational databases, and XML files (along with community implementations for many other queryable things). We saw how to use both query and method syntax provided by LINQ to perform common query operations. Finally, we examined the mechanism LINQ uses to provide new functionality to IEnumerable collections - extension methods. We even implemented a few extension methods of our own!

Now you should be ready to use LINQ in your own projects, as well as define your own custom extension methods.