Relational databases (also called SQL databases) provide a highly structured storage mechanism. Data is organized into tables, with each column holding a single value of a specific data type, and each row representing one entry. Conceptually, this organization is similar to tables you have seen in Excel and on the web. An example persons table is listed below:
Relational databases are often called SQL databases because we use Structured Query Language (SQL) to communicate with them. This is a domain-specific language similar to the LINQ you may have learned about in CIS 400 (actually, LINQ derives much of its syntax from SQL). Queries are streamed to a relational database across a socket or other connection, much like HTTP requests and responses are. The response is also received as text, which must be parsed before it can be used.
SQL is used to define the structure of the database. For example, we could create the above table with the SQL command:
CREATE TABLE persons ( id INTEGER PRIMARY KEY, last TEXT, first TEXT );
SQL is also used to query the database. For example, to find all people with the last name “Smith”, we would use the query:
SELECT * FROM persons WHERE last='Smith';
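The two statements above can be tried without installing a database server. The following is a minimal sketch, assuming Python's built-in sqlite3 module and an in-memory SQLite database (neither is prescribed by this text); the sample rows are made up for illustration.

```python
import sqlite3

# In-memory SQLite database; nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE persons (id INTEGER PRIMARY KEY, last TEXT, first TEXT)")

# Hypothetical sample rows; the database assigns each id.
conn.executemany(
    "INSERT INTO persons (last, first) VALUES (?, ?)",
    [("Smith", "Alice"), ("Jones", "Bob"), ("Smith", "Carol")],
)

# The ? placeholder lets the driver supply the value safely.
smiths = conn.execute(
    "SELECT * FROM persons WHERE last = ? ORDER BY id", ("Smith",)
).fetchall()

for row in smiths:
    print(row)
```

Each result row comes back as a tuple of (id, last, first), already parsed from the database's response by the driver.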
You will learn more about writing SQL queries in the CIS 560 or CC 520 course. You can also find an excellent guide on W3Schools, with interactive tutorials. We’ll briefly cover some of the most important aspects of relational databases for web developers here, but you would be wise to seek out additional learning opportunities. Understanding relational databases well can make a great deal of difference in how performant your web applications are.
The key to performance in relational databases is the use of keys and indices. A key is a column whose values are unique (not allowed to be repeated). For example, the id column in the table above is a key. Specifically, it is a sequential primary key: for each row we add to the table it increases, and its value is determined by the database. Note the jump to 3 in the id values: there is no guarantee that each key will be exactly one more than the previous one (though it commonly is), and if we delete rows from a table, the keys of the remaining rows stay the same.
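To see this behavior in practice, here is a small sketch, again assuming Python's sqlite3 module and SQLite's INTEGER PRIMARY KEY behavior. The rows are hypothetical, and the exact key values a database hands out (SQLite starts at 1 here, for instance) vary between database systems.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE persons (id INTEGER PRIMARY KEY, last TEXT, first TEXT)")

# We never supply an id; the database chooses one for each new row.
for last, first in [("Smith", "Alice"), ("Jones", "Bob"), ("Lee", "Carol")]:
    conn.execute("INSERT INTO persons (last, first) VALUES (?, ?)", (last, first))

# Deleting a row leaves a gap; the remaining keys are not renumbered.
conn.execute("DELETE FROM persons WHERE last = ?", ("Jones",))
ids_after_delete = [row[0] for row in conn.execute("SELECT id FROM persons ORDER BY id")]

# A new row gets a fresh key rather than filling the gap.
cursor = conn.execute("INSERT INTO persons (last, first) VALUES (?, ?)", ("Kim", "Dana"))
new_id = cursor.lastrowid

print(ids_after_delete, new_id)
```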
An index is a specialized data structure that makes searching for data faster. Consider the above table. If we wanted to find all people with the last name “Smith”, we’d need to iterate over each row, looking for “Smith”. That is a linear $O(n)$ complexity. It would work quickly in our example, but when our table has thousands or millions of rows (not uncommon for large web apps), it would be painfully slow.
Remember learning about dictionaries or hash tables in your data structures course? The lookup time for one of those structures is constant, $O(1)$. Indices work like this: we create a specialized data structure that maps the value we are searching by to the rows that contain it, e.g. an index built on the last column would map last => id. Then we could retrieve the ids of all “Smith” rows from this structure in constant time, $O(1)$. Since we know the primary key is unique and ordered, we can use some kind of divide-and-conquer search strategy to find each row whose primary key is in our set of matches, with a complexity of $O(\log n)$. Thus, the complete lookup would be $O(\log n) + O(1)$, which we would simplify to $O(\log n)$, much faster for a large $n$ than $O(n)$.
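The idea can be modeled in a few lines of Python. This is only a toy illustration of the reasoning above, not how a real database engine stores its data: a dictionary plays the role of the index on last, and a binary search over the sorted primary keys stands in for the divide-and-conquer row lookup. The rows and the names last_index, find_by_id, and find_by_last are all hypothetical.

```python
from bisect import bisect_left
from collections import defaultdict

# Hypothetical rows (id, last, first), kept sorted by primary key.
rows = [(0, "Smith", "Alice"), (1, "Jones", "Bob"), (3, "Smith", "Carol"), (4, "Lee", "Dana")]
ids = [row[0] for row in rows]

# Build the index once: last name -> list of primary keys.
last_index = defaultdict(list)
for row_id, last, _first in rows:
    last_index[last].append(row_id)


def find_by_id(row_id):
    """Binary search over the sorted primary keys: O(log n)."""
    pos = bisect_left(ids, row_id)
    if pos < len(ids) and ids[pos] == row_id:
        return rows[pos]
    return None


def find_by_last(last):
    """Hash lookup for the matching keys, then one binary search per match."""
    return [find_by_id(row_id) for row_id in last_index.get(last, [])]


print(find_by_last("Smith"))
```

Building the index costs time and memory up front, which is the trade-off to weigh when deciding what to index.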
In truth, most SQL databases use B-trees (a kind of self-balancing tree) for their indices, so the index lookup itself is $O(\log n)$ rather than $O(1)$; but the exact data structure is unimportant to us as web developers, as long as retrieval is efficient.
We can create an index using SQL. For example, to create an index on the column last in our example, we would use:
CREATE INDEX last_index ON persons (last);
An index can involve more than one column. For example, if we expected to search by both first and last name, we’d probably create an index that used both as the key. The SQL to do so for both first and last names would be:
CREATE INDEX both_names ON persons (last, first);
Each index effectively creates a new data structure consuming additional memory, so you should consider which indices are really necessary. Any column or columns you frequently look up values by (i.e. that are part of the WHERE clause of a query) should be indexed. Columns that are only rarely or never used this way should not be included in indices.
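One way to check whether a query actually benefits from an index is to ask the database how it plans to run the query. The sketch below assumes SQLite (via Python's sqlite3 module) and its EXPLAIN QUERY PLAN statement; other databases offer similar but differently named tools, and the exact wording of the plan text differs between SQLite versions. The helper name plan is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE persons (id INTEGER PRIMARY KEY, last TEXT, first TEXT)")

query = "SELECT * FROM persons WHERE last = 'Smith'"


def plan(sql):
    """Return SQLite's description of how it would run the query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # The last column of each row is a human-readable detail string.
    return " | ".join(row[-1] for row in rows)


plan_before = plan(query)  # no index yet: a full scan of persons
conn.execute("CREATE INDEX last_index ON persons (last)")
plan_after = plan(query)   # now a search that uses last_index

print(plan_before)
print(plan_after)
```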
The second important idea behind a relational database is that we can define relationships between tables. Let’s add a second table to our example, addresses:
|3||3||Cooper Street||Silver City||NM|
The person_id column is a foreign key, and corresponds to the id in the persons table. Thus, we can look up the address of Lisa Merkowsky by her id of 0. The row in the addresses table with the value of 0 for person_id is “Anderson Ave., Manhattan KS”.
Note too that it is possible for one row in one table to correspond to more than one row in another - in this example Frank Styles has two addresses, one in Baltimore and one in Hollywood.
If one row in one table corresponds to a single row in another table, we often call this a one-to-one relationship. If one row corresponds to more than one row in another table, we call this a one-to-many relationship. We retrieve these values using a query with a JOIN clause, e.g. to get each person with their addresses, we might use:
SELECT last, first, street, city, state FROM persons LEFT JOIN addresses ON persons.id = addresses.person_id;
The result will also be structured as a table with columns last, first, street, city, and state, containing a row for each person. Actually, there will be two rows for Frank, one containing each of his two addresses.
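The join can be tried against a small in-memory database. This is a sketch assuming Python's sqlite3 module; the names and cities follow the example in the text, while the values marked hypothetical in the comments (Frank's id, his street names and states, and the address ids) are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE persons (id INTEGER PRIMARY KEY, last TEXT, first TEXT);
CREATE TABLE addresses (
    id INTEGER PRIMARY KEY,
    person_id INTEGER REFERENCES persons (id),
    street TEXT, city TEXT, state TEXT
);
""")

conn.executemany("INSERT INTO persons VALUES (?, ?, ?)", [
    (0, "Merkowsky", "Lisa"),
    (1, "Styles", "Frank"),  # hypothetical id
])
conn.executemany("INSERT INTO addresses VALUES (?, ?, ?, ?, ?)", [
    (0, 0, "Anderson Ave.", "Manhattan", "KS"),
    (1, 1, "Example St.", "Baltimore", "MD"),   # hypothetical street and state
    (2, 1, "Sample Blvd.", "Hollywood", "CA"),  # hypothetical street and state
])

# One output row per (person, address) pair.
joined = conn.execute("""
    SELECT last, first, street, city, state
    FROM persons
    LEFT JOIN addresses ON persons.id = addresses.person_id
    ORDER BY persons.id, addresses.id
""").fetchall()

for row in joined:
    print(row)
```

Because this is a LEFT JOIN, a person with no address at all would still appear once, with empty (NULL) address columns.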
Finally, it is possible to have a many-to-many relationship, which requires a special table, called a join table, to sit in between the two. Consider if we had a jobs table:
|0||Doctor||A qualified practitioner of medicine; a physician.|
|1||Lawyer||A person who practices or studies law; an attorney or a counselor.|
|2||Producer||A person responsible for the financial and managerial aspects of making of a movie or broadcast or for staging a play, opera, etc.|
|3||Detective||A person, especially a police officer, whose occupation is to investigate and solve crimes.|
Because more than one person can have the same job, and we might want to look up people by their jobs, or a list of jobs that a specific person has, we would need a join table to connect the two. This could be named persons_jobs and would have a foreign key to both:
Thus Lisa is a doctor, Frank a producer, and Mary is both a doctor and detective! We could query for every doctor using a SQL statement with two joins, i.e.:
SELECT first, last FROM jobs LEFT JOIN persons_jobs ON jobs.id = persons_jobs.job_id LEFT JOIN persons ON persons_jobs.person_id = persons.id WHERE jobs.name = 'Doctor';
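The two-join query can also be run end to end. The sketch below assumes Python's sqlite3 module; the job ids and names come from the jobs table above, while the person ids for Frank and Mary, Mary's last name, and the description column name are hypothetical stand-ins.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE persons (id INTEGER PRIMARY KEY, last TEXT, first TEXT);
CREATE TABLE jobs (id INTEGER PRIMARY KEY, name TEXT, description TEXT);
CREATE TABLE persons_jobs (
    person_id INTEGER REFERENCES persons (id),
    job_id INTEGER REFERENCES jobs (id)
);
""")

conn.executemany("INSERT INTO persons VALUES (?, ?, ?)", [
    (0, "Merkowsky", "Lisa"),
    (1, "Styles", "Frank"),   # hypothetical id
    (3, "Example", "Mary"),   # hypothetical id and last name
])
conn.executemany("INSERT INTO jobs (id, name) VALUES (?, ?)", [
    (0, "Doctor"), (1, "Lawyer"), (2, "Producer"), (3, "Detective"),
])
# Each join-table row links one person to one job.
conn.executemany("INSERT INTO persons_jobs VALUES (?, ?)", [
    (0, 0),  # Lisa is a doctor
    (1, 2),  # Frank is a producer
    (3, 0),  # Mary is a doctor...
    (3, 3),  # ...and a detective
])

doctors = conn.execute("""
    SELECT first, last FROM jobs
    LEFT JOIN persons_jobs ON jobs.id = persons_jobs.job_id
    LEFT JOIN persons ON persons_jobs.person_id = persons.id
    WHERE jobs.name = 'Doctor'
    ORDER BY persons.id
""").fetchall()

print(doctors)
```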
As suggested before, this is just scratching the surface of relational databases. You’ll definitely want to spend more time studying them, as they remain the most common form of persistent storage used on the web and in other industries as well.