Video Script
In this chapter, we’re going to discuss serialization, another very important topic to consider when building an application. Earlier in this course, we discussed the fact that object-oriented programs can be thought of as two different parts: the state of the program, consisting of the variables and data stored in memory, and the behavior of the program, consisting of the functions and source code of the program itself. Serialization is the process of saving a program’s state so that it can be restored back into the program at some point in the future; the restoring step is often called deserialization. Through serialization, we can work with our program to create the state we need, then save it and shut down our program. Later on, when we restart the program, we can load that saved state and be right back where we were, ready to continue working. It’s really one of the most fundamental things that most programs should be able to do.
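As a minimal sketch of that save-and-restore cycle, the Python example below writes a small piece of state to a JSON file and then reads it back. The `Counter` class, its fields, and the file name are hypothetical, chosen only for illustration; the examples in the textbook may look different.

```python
import json
import os
import tempfile


class Counter:
    """Hypothetical class whose state is a name and a running count."""

    def __init__(self, name, count=0):
        self.name = name
        self.count = count

    def increment(self):
        self.count += 1

    def save(self, path):
        # Write only the state (the data), not the behavior (the methods).
        with open(path, "w", encoding="utf-8") as f:
            json.dump({"name": self.name, "count": self.count}, f)

    @classmethod
    def load(cls, path):
        # Rebuild an object from the state stored in the file.
        with open(path, "r", encoding="utf-8") as f:
            data = json.load(f)
        return cls(data["name"], data["count"])


def demo():
    path = os.path.join(tempfile.mkdtemp(), "counter.json")
    original = Counter("clicks")
    original.increment()
    original.increment()
    original.save(path)            # the state is now on disk
    restored = Counter.load(path)  # later, pick up where we left off
    return restored.name, restored.count
```

Notice that only the data is written to the file. The methods come from the source code, so they are already available when the program starts again.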
When serializing data, we have a variety of file formats that we can choose from. The two broad categories are text files and binary files. When writing the state to a text file, we also have a choice of text formats, ranging from standard formats such as XML, JSON, and YAML, which all store structured data in slightly different ways, all the way to a custom file format of our own design. Each text-based file format has its advantages and disadvantages, which are discussed in the textbook. Alternatively, we can store our state in a binary file format, which closely mimics the way the state is actually stored in memory while the program is running.
So, let’s discuss some of the pros and cons of each option. While data stored in a textual format might be a bit larger than its binary counterpart, it makes up for that by being easily portable between programming languages. This is a huge benefit for situations where the data generated by one program might need to be read by another. This is difficult to do using a binary format, but a text format such as JSON or XML is well suited to the task. Likewise, if we want users to be able to edit the file directly using a text editor, text files are the best option by far. Binary files, on the other hand, are typically smaller and faster to read and write than text files, making them a great choice when we want things to be fast, or when storage space is limited. However, unlike text files, binary files can typically be read only by programs written in the same language, or even only by the program that created them, making them much less portable. Finally, due to their binary format, it is much more difficult to edit these files outside of the application that created them, making it harder to make quick changes to the saved data using a simple text editor. So, each option has definite advantages and disadvantages, and it is usually up to us as programmers to weigh the options and choose the best fit for our program.
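To make the size tradeoff concrete, here is a small Python sketch that stores the same data once as JSON text and once in a hand-rolled binary layout built with the standard `struct` module. The data and the binary layout (a 4-byte count followed by two 4-byte integers per point) are assumptions made up for this example, not a standard format.

```python
import json
import struct

# Hypothetical state: 100 points of the form (x, y).
points = [(i, i * i) for i in range(100)]


def to_text(pts):
    # JSON text: readable in any editor and by most programming languages.
    return json.dumps(pts).encode("utf-8")


def from_text(data):
    # JSON has no tuple type, so convert each inner list back to a tuple.
    return [tuple(p) for p in json.loads(data.decode("utf-8"))]


def to_binary(pts):
    # Assumed layout: unsigned 4-byte count, then two signed 4-byte
    # integers per point, all little-endian.
    out = struct.pack("<I", len(pts))
    for x, y in pts:
        out += struct.pack("<ii", x, y)
    return out


def from_binary(data):
    # The reader must know the exact layout used by the writer.
    (count,) = struct.unpack_from("<I", data, 0)
    return [struct.unpack_from("<ii", data, 4 + 8 * i) for i in range(count)]


text_bytes = to_text(points)
binary_bytes = to_binary(points)
print(len(text_bytes), len(binary_bytes))
```

For this particular data the binary version takes 804 bytes and the JSON version is larger, but the exact difference depends on the values being stored. The binary version also shows the portability cost: nothing in the file describes its own layout, so any other program would need to be told exactly how the bytes are arranged.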
When serializing data from a program, we should also be aware of the fact that some items simply don’t serialize well, and should not be included. For example, items in our state that are provided by the operating system itself, such as open files or threads, should not be serialized. This is because the state of those objects is tied to the operating system, and reloading the object into memory in our program won’t necessarily cause the operating system to perform the operations needed to recreate its side of that state. So, we typically avoid serializing threads and open files directly, and instead choose to serialize the data used to construct those objects. That way, we can perform those actions again to reopen the file or create the threads again when the program is loaded in the future. Finally, to save space, we typically won’t serialize data that duplicates, or can be rebuilt from, other pieces of data. For example, if our program has several variables that all point to the same object in memory, such as in a list or graph data structure, we may only wish to store a single copy of each object in our output, greatly reducing the size of the output. When we read the file and load the state, we can read each object once and update all of the variables that should point to it.
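Both of those ideas can be sketched in a few lines of Python. In the example below, an open file is saved as just its path and reopened on load, and an object that appears in two lists is written once and referred to by its position. The `LogFile` and `Player` classes, the team names, and the JSON layout are all hypothetical, and this is only one of several reasonable ways to handle shared objects.

```python
import json
import os
import tempfile


class LogFile:
    """Hypothetical wrapper around an open file, an operating system resource."""

    def __init__(self, path):
        self.path = path
        self.handle = open(path, "a", encoding="utf-8")

    def to_dict(self):
        # Store the data used to construct the object, not the handle itself.
        return {"path": self.path}

    @classmethod
    def from_dict(cls, data):
        # Constructing the object again repeats the call that opens the file.
        return cls(data["path"])


class Player:
    """Hypothetical object that several variables may point to."""

    def __init__(self, name):
        self.name = name


def save_state(log, teams):
    table = []     # each distinct Player is written exactly once
    index_of = {}  # id(player) -> position of that player in table
    refs = {}
    for team, players in teams.items():
        refs[team] = []
        for player in players:
            if id(player) not in index_of:
                index_of[id(player)] = len(table)
                table.append({"name": player.name})
            refs[team].append(index_of[id(player)])
    return json.dumps({"log": log.to_dict(), "players": table, "teams": refs})


def load_state(text):
    data = json.loads(text)
    log = LogFile.from_dict(data["log"])
    # Build each object once, then point every reference at it.
    players = [Player(entry["name"]) for entry in data["players"]]
    teams = {team: [players[i] for i in indexes]
             for team, indexes in data["teams"].items()}
    return log, teams


def demo():
    path = os.path.join(tempfile.mkdtemp(), "app.log")
    log = LogFile(path)
    alice, bob = Player("alice"), Player("bob")
    teams = {"red": [alice, bob], "blue": [alice]}  # alice is shared
    text = save_state(log, teams)
    log.handle.close()

    new_log, new_teams = load_state(text)
    copies = text.count('"alice"')                 # times alice was written
    shared = new_teams["red"][0] is new_teams["blue"][0]
    reopened = not new_log.handle.closed
    new_log.handle.close()
    return copies, shared, reopened
```

After loading, both teams point to the same `Player` object again, just as they did before saving, and the log file has been reopened by running the constructor rather than by trying to restore the old handle.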
Finally, we should also briefly mention databases. As you may know, a database is a special piece of software designed to help manage large amounts of data. It has special functionality to make the data easily searchable, and many enterprise-level databases are much faster and more reliable than simply storing data in large files. There are also database systems that operate on individual files themselves, instead of requiring massive dedicated servers. So, in practice, when we need to serialize the state of a program, we will often choose to use a database instead of storing data directly to a file. However, databases are a complex topic in themselves, so we won’t cover them in this course. If you are interested in learning about databases, a later course in the Computational Core program covers that topic in depth, and you are welcome to take it.
That covers the basics of serialization. The textbook includes some more detailed examples for reading and writing data in a variety of formats, both text-based and binary. Hopefully you’ll be able to put this knowledge to good use on your ongoing project. Good luck!