Chapter 7

Persistent Storage

Tupperware for the Web

Introduction

Of course, what made dynamic web servers interesting was that they could build content on the fly. In this approach, the HTML the server sends as a response does not need to be stored on the server as an HTML file; rather, it can be constructed when a request is made.

That said, most web applications still need a persistent storage mechanism to use in dynamically creating those pages. If we’re dealing with a forum, we need to store the text of the posts. If we’re running an e-commerce site, we aren’t making up products; we’re selling those already in our inventory.

Thus, we need some kind of storage mechanism to hold the information we need to create our dynamic pages. We could use a variable and hold those values in memory, but we’d also need a mechanism to persist those values when our server restarts (as the contents of volatile memory are lost when power is no longer supplied to RAM).

Data Serialization

Perhaps the simplest persistent storage mechanism we can adopt is to use a combination of an in-memory variable and a file. For example, we could set up a simple database mechanism as a Node module:

const fs = require('fs');

/** @module database 
 * A simple in-memory database implementation, 
 * providing a mechanism for getting and saving 
 * a database object
 */ 
module.exports = { get, set };

// We retrieve and deserialize the database from 
// a file named data.json.  We deliberately don't 
// catch errors, as if this process fails, we want 
// to stop the server loading process
var fileData = fs.readFileSync('data.json');
var data = JSON.parse(fileData);

/** @function get()
 * Returns the database object 
 * @returns {object} data - the data object
 */
function get() {
    return data;
}

/** @function set()
 * Saves the provided object as the database data, 
 * overwriting the current object
 * @param {object} newData - the object to save 
 * @param {function} callback - triggered after save 
 */
function set(newData, callback) {
    // Here we don't want the server to crash on 
    // an error, so we do wrap it in a try/catch
    try {
        var fileData = JSON.stringify(newData);
        fs.writeFile("data.json", fileData, (err) => {
            // If there was an error writing the data, we pass it 
            // forward in the callback and don't save the changes
            // to the data object
            if(err) return callback(err);
            // If there was no error, we save the changes to the 
            // module's data object (the variable data declared above)
            data = newData;
            // Then we invoke the callback to notify of success by sending 
            // a value of null for the error
            callback(null);
        });
    } catch (err) {
        // If we catch an error in the JSON stringification process,
        // we'll pass it through the callback 
        callback(err);
    }
}

In this module, we exploit a feature of Node’s require(): it caches the value returned for each unique argument. So the first time we use require('./database'), the code above is executed, and Node internally stores the result (an object with the get() and set() methods) in its module cache; every subsequent require('./database') in our program returns that same cached object. Effectively, you end up using the same data variable every time you require('./database'). This is an application of the Singleton Pattern in a form unique to Node.
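We can see this caching in action with a small sketch (assuming the module above is saved as database.js alongside this script):

const db1 = require('./database');
const db2 = require('./database');

// Both variables reference the exact same object held in the module cache
console.log(db1 === db2); // prints: true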

Refactoring for Better Error Prevention

While this module can work well, it does suffer from a number of limitations. Perhaps the most important to recognize is that the get() method returns a reference to our aforementioned data variable - so if you modify the returned object, you change it for all future get() calls, while sidestepping the persistent file write embodied in our set(). Instead of providing a reference, we could provide a copy. A simple trick for doing so in JavaScript is to serialize and then deserialize the object (turning it into a JSON string and then back into an object). The refactored get() would then look like:

/** @function get()
 * Returns A COPY OF the database object 
 * @returns {object} data - A COPY OF the data object
 */
function get() {
    return JSON.parse(JSON.stringify(data));
}

Note that there is a possibility this process can fail, so it really should be wrapped in a try/catch and refactored to use a callback to pass on this possible error and the data object:

/** @function get()
 * Provides a copy of the data object
 * @param {function} callback - Provides as the first parameter any errors, 
 * and as the second a copy of the data object.
 */
function get(callback) {
    try {
        var dataCopy = JSON.parse(JSON.stringify(data));
        callback(null, dataCopy);
    } catch (err) {
        callback(err);
    }
}

Notice that with this refactoring, we are using the same pattern common to the Node modules we’ve been working with. A second benefit is that if we ever needed to convert our get() from a synchronous to an asynchronous implementation, our function definition wouldn’t change (the set() is already asynchronous, as we use the asynchronous fs.writeFile()).
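Putting the pieces together, a hypothetical use of the refactored module from elsewhere in our server might look like this (the visits property is purely illustrative):

const database = require('./database');

database.get((err, data) => {
    if(err) return console.error(err);
    // We can freely modify this copy without affecting the module's data
    data.visits = (data.visits || 0) + 1;
    // Persisting the change requires an explicit set()
    database.set(data, (err) => {
        if(err) return console.error(err);
        console.log("Database saved");
    });
});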

Other Limitations

This database implementation is still pretty basic - we retrieve an entire object rather than just the portion of the database we need, and we write the entire database on every change as well. If our database gets to be very large, this will become an expensive operation.

There is also a lot of missed opportunity for optimizing how we get the specific data we need. As you have learned in your algorithms and data structures course, the right search algorithm, coupled with the right data structure, can vastly improve how quickly your program can run. If we think about the work that a webserver does, retrieving the data used to build dynamic HTML is easily one of its most time-consuming tasks, so optimizing here can make each page creation move much faster. Faster page creation means shorter response times, and more users served per minute with less processing power. That, in turn, means less electricity, less bandwidth, and less hardware is required to run your website. For heavily utilized websites, this can equate to a lot of savings! And for websites hosted on elastic hosting services (those that only charge for the resources you use), it can result in significant savings as well.

Thus, we might want to spend more time developing a robust database program that would offer these kinds of optimizations. Or, we could do what most full-stack developers do, and use an already existing database program that was designed with these kinds of optimizations in mind. We’ll take a look at that approach next.

Databases

As we suggested in the previous section, using an already existing database application is a very common strategy for full-stack web development. Doing so has clear benefits - such programs are typically stable, secure, and optimized for data storage and retrieval. They are well beyond what we could achieve ourselves without a significant investment of time, and using one avoids the necessity of “reinventing the wheel”.

That said, making effective use of a third-party database system does require you to develop familiarity with how the database operates and is organized. Gaining the benefits of a database’s optimizations requires structuring your data in the way its developers anticipated, and likewise retrieving it in such a manner. The exact details involved vary between database systems and are beyond the scope of this course, which is why you have a database course in your required curriculum. Here I will just introduce the details of how to integrate the use of a database into your web development efforts, and I’ll leave the learning of how to best optimize your database structure and queries for that course.

In the web development world, we primarily work with two kinds of databases, which start from very different conceptual foundations that determine their structure. These are Relational Databases and Document-oriented Databases (sometimes called No-SQL databases). There is a third category, Object-oriented Databases, which has not been widely adopted, along with a slew of less well-known and less-utilized technologies. There are also cloud services that take over the role traditionally filled by databases, which we’ll save for a later chapter. For now, we’ll focus on the first two, as these are the most commonly adopted in the web development industry.

Relational Databases

Relational databases (also called SQL databases) provide a highly-structured storage mechanism. Data is organized into tables, with each column representing a single field with a defined data type, and each row representing one entry. Conceptually, this organization is similar to tables you have seen in Excel and on the web. An example persons table is listed below:

id  First  Last
0   Lisa   Merkowsky
1   Frank  Stiles
3   Mary   Cotts

Relational databases are often called SQL databases as we use Structured Query Language (SQL) to communicate with them. This is a domain-specific language similar to the LINQ you may have learned about in CIS 400 (actually, LINQ derives much of its syntax from SQL). Queries are streamed to a relational database across a socket or other connection, much like HTTP requests and responses are. The response is also received as text, which must be parsed before it can be used.

SQL is used to define the structure of the database. For example, we could create the above table with the SQL command:

CREATE TABLE persons (
    id INTEGER PRIMARY KEY,
    First TEXT,
    Last TEXT
);

SQL is also used to query the database. For example, to find all people with the last name “Smith”, we would use the query:

SELECT * FROM persons WHERE last='Smith';

You will learn more about writing SQL queries in the CIS 560 or CC 520 course. You can also find an excellent guide at W3Schools, with interactive tutorials. We’ll briefly cover some of the most important aspects of relational databases for web developers here, but you would be wise to seek out additional learning opportunities. Understanding relational databases well can make a great deal of difference in how performant your web applications are.

The key to performance in relational databases is the use of keys and indices. A key is a column whose values are unique (not allowed to be repeated). For example, the id column in the table above is a key. Specifically, it is a sequential primary key - its value increases for each row we add to the table, and its value is determined by the database. Note the jump from 1 to 3 - there is no guarantee each key will be exactly one more than the previous one (though it commonly is), and if we delete rows from the table, the remaining keys are not renumbered.
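For example, when inserting a row we don’t supply the id at all - the database assigns it. A minimal sketch using the node-sqlite3 driver illustrates this (the example.db filename and the inserted values are illustrative):

const sqlite3 = require('sqlite3');
const db = new sqlite3.Database('example.db');

// We supply values only for First and Last; the database chooses the id.
// Note we use a regular function (not an arrow function) so the driver 
// can bind this.lastID, the key the database assigned to the new row.
db.run("INSERT INTO persons (First, Last) VALUES (?, ?);", "Mary", "Cotts", function(err) {
    if(err) return console.error(err);
    console.log("Assigned id:", this.lastID);
});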

Indices

An index is a specialized data structure that makes searching for data faster. Consider the above table. If we wanted to find all people with the last name “Smith”, we’d need to iterate over each row, looking for “Smith”. That is a linear $O(n)$ complexity. It would work quickly in our example, but when our table has thousands or millions of rows (not uncommon for large web apps), it would be painfully slow.

Remember learning about dictionaries or hash tables in your data structures course? The lookup time for one of those structures is constant $O(1)$. Indices work like this - we create a specialized data structure that maps the value we search by to the rows that contain it, i.e. an index built on the Last column would map last => id. Then we could retrieve the ids of all “Smith” last names from this structure in constant time $O(1)$. Since we know the primary key is unique and ordered, we can use some kind of divide-and-conquer search strategy to find all rows whose primary key is in our set of matches, with a complexity of $O(log(n))$. Thus, the complete lookup would be $O(log(n)) + O(1)$, which we would simplify to $O(log(n))$ - much faster than $O(n)$ for a large $n$.

Info

In truth, most SQL databases use Balanced Trees (B-Trees) for their indices; but the exact data structure is unimportant to us as web developers, as long as retrieval is efficient.

We can create an index using SQL. For example, to create an index on the column last in our example, we would use:

CREATE INDEX last_index ON persons (last);

An index can involve more than one column - for example, if we expected to search by both first and last name, we’d probably create an index that uses both as the key. The SQL to do so for both first and last names would be:

CREATE INDEX both_names ON persons (last, first);

Each index effectively creates a new data structure consuming additional memory, so you should consider which indices are really necessary. Any column or combination of columns you frequently look up values by (i.e. that appear in the WHERE clause of a query) should be indexed. Columns that are only rarely or never used this way should not be included in indices.

Relationships

The second important idea behind a relational database is that we can define relationships between tables. Let’s add a second table to our example, addresses:

id  person_id  street          city         state
0   0          Anderson Ave.   Manhattan    KS
1   1          Sesame St.      Baltimore    MD
2   1          Mulholland Dr.  Hollywood    CA
3   3          Cooper Street   Silver City  NM

Here person_id is a foreign key, and corresponds to the id in the persons table. Thus, we can look up the address of Lisa Merkowsky by her id of 0. The row in the addresses table with the value of 0 for person_id is “Anderson Ave., Manhattan KS”.

Note too that it is possible for one row in one table to correspond to more than one row in another - in this example, Frank Stiles has two addresses, one in Baltimore and one in Hollywood.

If one row in one table corresponds to a single row in another table, we often call this a one-to-one relationship. If one row corresponds to more than one row in another table, we call this a one-to-many relationship. We retrieve these values using a query with a JOIN clause, i.e. to get each person with their addresses, we might use:

SELECT last, first, street, city, state FROM persons LEFT JOIN addresses ON persons.id = addresses.person_id;

The result will also be structured as a table with columns last, first, street, city, and state, containing a row for each person. Actually, there will be two rows for Frank - one for each of his two addresses.

Finally, it is possible to have a many-to-many relationship, which requires a special table to sit in between the two called a join table. Consider if we had a jobs table:

id  name       description
0   Doctor     A qualified practitioner of medicine; a physician.
1   Lawyer     A person who practices or studies law; an attorney or a counselor.
2   Producer   A person responsible for the financial and managerial aspects of making of a movie or broadcast or for staging a play, opera, etc.
3   Detective  A person, especially a police officer, whose occupation is to investigate and solve crimes.

(definitions provided by Oxford Languages)

Because more than one person can have the same job, and we might want to look up people by their jobs, or a list of jobs that a specific person has, we would need a join table to connect the two. This could be named persons_jobs and would have a foreign key to both:

id  person_id  job_id
0   0          0
1   1          2
2   3          0
3   3          3

Thus Lisa is a doctor, Frank a producer, and Mary is both a doctor and detective! We could query for every doctor using a SQL statement with two joins, i.e.:

SELECT first, last 
FROM jobs 
LEFT JOIN persons_jobs ON jobs.id = persons_jobs.job_id
LEFT JOIN persons ON persons_jobs.person_id = persons.id
WHERE jobs.name = 'Doctor';

As suggested before, this is just scraping the surface of relational databases. You’ll definitely want to spend more time studying them, as they remain the most common form of persistent storage used on the web and in other industries as well.

SQL Injection

Along with the use of relational databases and SQL comes one very important attack to be aware of - SQL injection. This attack takes advantage of the way many developers write SQL queries within their programs. Consider the simple relational database we laid out earlier. Let’s assume our web application lets us search for people by their last names. To find Mary Cotts, we would then need a SQL query like:

SELECT * FROM persons WHERE last='Cotts';

Since we want our users to be able to specify who to look for, we’d probably give them a search form in our HTML, something like:

<form>
    <label for="last">Last Name:</label>
    <input type="text" name="last">
    <input type="submit" value="Search">
</form>

And in our code, extract the submitted last name from the query string and use it to construct the SQL query using string concatenation:

const querystring = require('querystring');

var url = new URL(req.url, "http://localhost");
var qs = querystring.parse(url.search.slice(1));
var name = qs.last;

var query = `SELECT * FROM persons WHERE last='${name}';`;

The problem with this approach is a savvy adversary could submit a “name” that completes the SQL command and begins a new one, i.e.:

bob'; UPDATE users SET admin=true WHERE username='saavyhacker

Our naive concatenation approach then creates two SQL commands:

SELECT * FROM persons WHERE last='bob'; UPDATE users SET admin=true WHERE username='saavyhacker';

When we run this query, our savvy hacker just made themselves an admin on our site (assuming our database has a users table with an admin column we use to determine their role)!

This is just the tip of the iceberg for SQL injection attacks - there are many, many variations on the theme.

Preventing SQL Injection

Every database driver provides the ability to build parameterized queries, i.e. a preformatted query with “slots” that you assign values to. The exact mechanism for using these depends on the driver you are using to connect to the database. For example, if we were using the node-sqlite3 driver to connect our Node application to a SQLite database, we would use:

const querystring = require('querystring');

var url = new URL(req.url, "http://localhost");
var qs = querystring.parse(url.search.slice(1));
var name = qs.last;

var query = `SELECT * FROM persons WHERE last=?;`;

// assuming a variable db is declared 
db.all(query, name, (err, rows) => {
    // do something with rows...
});

Because the slot corresponds to a specific column, the database engine converts the supplied argument to that column’s type before applying it. In this case, if our canny hacker uses his string, it would be interpreted as the literal last name “bob’; UPDATE users SET admin=true WHERE username=‘saavyhacker”. Which probably wouldn’t exist in our database, because what kind of parent would saddle a child with such a name?

Object-Relational Mapping

If you think learning and writing SQL looks challenging, you’re not alone. Early full-stack developers did as well. There was also the need to convert the responses from the relational database from text into objects the program could use. It shouldn’t be surprising that libraries were quickly developed to manage this process. The most basic of these are drivers, simple programs which manage the connection between the database and the program using it, sending SQL queries to the database and parsing the results into data types native to the language (usually an array of arrays or an array of dictionaries - the outer array for the rows, and the inner array or dictionary for the column values within the row).
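For example, with the node-sqlite3 driver, the result of a query arrives already parsed into an array of objects, one per row (a minimal sketch; the example.db file and its contents are assumed):

const sqlite3 = require('sqlite3');
const db = new sqlite3.Database('example.db');

// The driver sends the SQL across the connection, then parses the 
// textual response into objects with column names as keys
db.all("SELECT * FROM persons;", (err, rows) => {
    if(err) return console.error(err);
    rows.forEach((row) => console.log(row.First, row.Last));
});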

However, the use of drivers was soon supplanted by Object Relational Mapping (ORM), a software layer that sits between the database and the dynamic web server program. Unlike a driver, ORM libraries provide additional conversion logic: most will convert function calls to SQL statements and execute them on the server, and all will convert the results of a query into an object in the language of the server.

Let’s look at a few ORMs in practice.

ActiveRecord

The ActiveRecord ORM is part of the Ruby on Rails framework. It takes advantage of the Ruby programming language’s dynamic nature to vastly simplify the code a developer must write to use a database. If we use our previous example of the database containing persons, addresses, and jobs tables, we would write classes to represent each table that inherit from the ActiveRecord base class:

# person.rb
class Person < ApplicationRecord 
  has_many :addresses
  has_and_belongs_to_many :jobs
end

# address.rb 
class Address < ApplicationRecord 
  belongs_to :person 
end 

# job.rb 
class Job < ApplicationRecord 
  has_and_belongs_to_many :people
end

Then, to retrieve all people, we would use a query method of the Person class:

all_people = Person.all

Or to retrieve all doctors, we would chain query methods of the Person class:

all_doctors = Person.includes(:jobs).where(jobs: {name: "Doctor"})

While we haven’t been working in the Ruby language, I show this to help you understand just how far from SQL some ORMs go. For a Ruby programmer, this is far more comfortable syntax than SQL, much like LINQ’s method syntax is often more comfortable to C# programmers.

That said, there are always some queries - especially when we’re trying to squeeze out the most efficiency possible or working with a convoluted database - that are beyond the simple conversion abilities of an ORM to manage. So most ORMs offer a way to run a SQL query directly on the database as well.

Finally, you should understand that any ORM that provides a functional interface for constructing SQL queries is effectively a domain-specific language that duplicates the functionality of SQL. The SQL generated by these libraries can be far less efficient than a hand-written SQL query written by a knowledgeable programmer. While it can be more comfortable to program in, it may be more valuable to spend the effort developing a solid understanding of SQL and how relational databases work.

Entity Framework

The .NET platform has its own ORM similar to ActiveRecord, the Entity Framework. Like ActiveRecord, you can create the entire database by defining model objects and then generating migration scripts that will create the database to match (this is called code-first migration in Entity Framework lingo). As with ActiveRecord, you must define the relationships between tables. To set up the objects (and tables) we saw in the ActiveRecord example, we would write:

public class Person 
{
  public int ID {get; set;}

  public string LastName {get; set;}

  public string FirstName {get; set;}

  public Address Address {get; set;}

  public ICollection<Job> Jobs {get; set;} 
}

public class Address 
{
  public int ID {get; set;}

  public string Line1 {get; set;}

  public string Line2 {get; set;}

  public string City {get; set;}

  public string State {get; set;}

  public int Zip {get; set;}

  public ICollection<Person> People {get; set;}
}  

public class Job 
{
  public int ID {get; set;}

  public string Name {get; set;}

  public ICollection<Person> People {get; set;}
}

Note that this is a lot more verbose; the Entity Framework approach doesn’t rely on the database telling it what the available columns are; rather, the model classes are used to determine what the database should be. We also have to set up the relationships, which is done using a class that extends DbContext, i.e.:

public class ExampleContext : DbContext 
{
  public DbSet<Person> Persons {get; set;}

  public DbSet<Address> Addresses {get; set;}

  public DbSet<Job> Jobs {get; set;}

  protected override void OnModelCreating(ModelBuilder modelBuilder)
  {
    modelBuilder.Entity<Person>().ToTable("People");
    modelBuilder.Entity<Address>().ToTable("Addresses");
    modelBuilder.Entity<Job>().ToTable("Jobs");
    modelBuilder.Entity<Person>()
      .HasOne(p => p.Address)
      .WithMany(a => a.People);
    modelBuilder.Entity<Person>()
      .HasMany(p => p.Jobs)
      .WithMany(j => j.People)
      .UsingEntity(j => j.ToTable("JobsPersons"));
  }
}

Note the use of the .HasOne() and .HasMany() methods to set up the relationships between the tables. With this established, we can run queries using an instance of the ExampleContext:

var context = new ExampleContext();

// Get all people
var allPeople = context.Persons.ToList();

// Get all doctors
var allDoctors = context.Persons
  .Include(p => p.Jobs)
  .Where(p => p.Jobs.Any(j => j.Name == "Doctor"));

Massive

For Node applications, I’ve really come to prefer MassiveJS, an ORM for the Postgres relational database. Like other ORMs, it converts query results into a JavaScript object or array of objects, with the column names as keys and the values converted to the appropriate JavaScript types.

Like other ORMs, it provides a slew of functions for generating queries programmatically. But it also allows you to specify a folder in your project where you can place SQL files, which are automatically converted into JavaScript functions corresponding to the (parameterized) queries they contain. For example, to perform our “All doctors” query, we would write a file scripts/allDoctors.sql:

SELECT * FROM persons 
INNER JOIN persons_jobs ON persons.id = persons_jobs.person_id 
INNER JOIN jobs ON jobs.id = persons_jobs.job_id 
WHERE jobs.name = 'Doctor';

While this may seem less convenient than the ActiveRecord or Entity Framework approaches, it is immediately clear to a seasoned SQL programmer what the tables involved are and how they are related. Moreover, a good SQL programmer can often write a more efficient query than an ORM library can generate.
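Once Massive has been initialized, that file becomes a function we can invoke directly. A hypothetical sketch (the connection string and the scripts loader option are illustrative, and assume a Postgres database matching our example):

const massive = require('massive');

massive(
    { connectionString: 'postgres://localhost:5432/example' },
    { scripts: 'scripts' } // the folder holding our SQL files
).then((db) => {
    // scripts/allDoctors.sql is exposed as db.allDoctors()
    db.allDoctors().then((doctors) => {
        doctors.forEach((doctor) => console.log(doctor.first, doctor.last));
    });
});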