So, now that we understand state, let’s talk about how we can serialize it in a way that is easy to parse and understand. There are really two major options we can choose from: a textual representation of the data, and a binary representation. Let’s look at text formats first.
Text Data Formats
There are many different ways that we can serialize data into a textual format. In fact, we’ve already covered how to read data from and write data to text files many times throughout this curriculum, and it is probably one of the first things most programmers learn how to do.
At its core, the concept of serialization to a text file is pretty much the same as writing any data to a text file. We simply must write all the data stored in the program to a text file in an organized way. Then, when we need to load that file back into our program’s state, we can simply read and parse the data, storing it in objects and variables as needed.
So, let’s look at a simple example and explore the various ways that we could store this data in a textual format. Consider a Person object that has a name attribute and an age attribute. In addition, that object stores an instance of Pet, which also has a name, a breed and an age.
With that structure in mind, there are several different formats we could use to store the data.
For many novice programmers, the first choice might be to simply create a custom text format to store the data. Here is one possible approach:
Name = Willie Wildcat
Age = 42
Name = Reggie
Age = 4
Breed = Shorkie
This format definitely stores all of the data in the program’s state, and it looks like it can easily be read and parsed back into the program without too much work. However, such a custom text format has several disadvantages:
- The code to create this text file and read it back into the program must be custom written for each type of object.
- The format doesn’t store any hierarchical structure - how do we know which pet belongs to which person?
- What if a person can have multiple pets, or their name includes a newline character?
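To make these disadvantages concrete, here is a minimal sketch of the hand-written code such a format requires. The Person and Pet dataclasses and the save_person and load_person helpers are hypothetical names invented for this illustration; they are not part of any library.

```python
from dataclasses import dataclass


@dataclass
class Pet:
    name: str
    age: int
    breed: str


@dataclass
class Person:
    name: str
    age: int
    pet: Pet


def save_person(person):
    """Write the fields in a fixed order, one 'Key = value' pair per line."""
    pet = person.pet
    return "\n".join([
        f"Name = {person.name}",
        f"Age = {person.age}",
        f"Name = {pet.name}",
        f"Age = {pet.age}",
        f"Breed = {pet.breed}",
    ])


def load_person(text):
    """Rebuild the objects, relying only on line order to tell person from pet."""
    values = [line.split(" = ", 1)[1] for line in text.splitlines()]
    pet = Pet(name=values[2], age=int(values[3]), breed=values[4])
    return Person(name=values[0], age=int(values[1]), pet=pet)


willie = Person("Willie Wildcat", 42, Pet("Reggie", 4, "Shorkie"))
saved = save_person(willie)
restored = load_person(saved)
```

Notice that load_person only works because it knows the exact order of the lines. A second pet, or a name containing a newline character, would shift the lines and break the parser, and every new type of object would need its own pair of functions.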
Of course, all of these concerns can be addressed by adding either additional rules to the structure or additional complexity to the code for reading and writing these files. However, let’s look at some other widely used formats that already address some of these concerns and see how they compare. Many of them also already have pre-written libraries we can use in our code as well.
The Extensible Markup Language, or XML, is a popular choice for data serialization. XML uses a format very similar to HTML, and it can represent nested data structures directly. Here’s an example of the same state translated into an XML document:
<Person>
  <Name>Willie Wildcat</Name>
  <Age>42</Age>
  <Pet>
    <Name>Reggie</Name>
    <Age>4</Age>
    <Breed>Shorkie</Breed>
  </Pet>
</Person>
As we can see, each object and attribute becomes its own tag in XML. We can even place the Pet object directly inside of the Person object, showing the hierarchical structure of the data.
XML also supports the use of a document type definition, or DTD, which provides rules about the structure of the XML document itself. Validating an XML document against a DTD confirms that the document is structured as expected, which makes it easier to parse reliably.
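To see how a program might produce and read such a document, here is a minimal sketch using Python’s standard xml.etree.ElementTree module. The tag names (Person, Name, Age, Pet, Breed) and the build_person_xml helper are assumptions chosen to mirror the example state. Note that this module checks that a document is well formed, but it does not validate the document against a DTD.

```python
import xml.etree.ElementTree as ET


def build_person_xml():
    """Build a Person element with a nested Pet element."""
    person = ET.Element("Person")
    ET.SubElement(person, "Name").text = "Willie Wildcat"
    ET.SubElement(person, "Age").text = "42"
    # Nesting the Pet element inside Person records which pet belongs to whom.
    pet = ET.SubElement(person, "Pet")
    ET.SubElement(pet, "Name").text = "Reggie"
    ET.SubElement(pet, "Age").text = "4"
    ET.SubElement(pet, "Breed").text = "Shorkie"
    return person


# Serialize the element tree to a string, then parse it back.
xml_text = ET.tostring(build_person_xml(), encoding="unicode")
parsed = ET.fromstring(xml_text)

# XML stores everything as text, so numbers must be converted after reading.
person_age = int(parsed.findtext("Age"))
pet_name = parsed.find("Pet/Name").text
```

All values come back as strings, so the reading code still has to know which fields are numbers.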
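Another widely used format is JavaScript Object Notation, or JSON. A JSON representation of the same state is shown below:
{
  "Name": "Willie Wildcat",
  "Age": 42,
  "Pet": {
    "Name": "Reggie",
    "Age": 4,
    "Breed": "Shorkie"
  }
}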
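In Python, the standard json module can write and read this kind of data without any custom parsing code. The sketch below assumes the state is held in a nested dictionary; the key names are chosen to match the example and are not required by JSON itself.

```python
import json

# The program state as a nested dictionary: the pet lives inside the person.
state = {
    "Name": "Willie Wildcat",
    "Age": 42,
    "Pet": {
        "Name": "Reggie",
        "Age": 4,
        "Breed": "Shorkie",
    },
}

# Serialize to indented JSON text, then load it back into a dictionary.
json_text = json.dumps(state, indent=2)
loaded = json.loads(json_text)
```

Unlike the XML sketch, numbers are stored as numbers, so loaded["Age"] comes back as an integer rather than a string.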
Another choice that is commonly used is YAML, a recursive acronym for “YAML Ain’t Markup Language.” YAML is very similar to JSON in many ways, and in fact JSON files can be considered valid YAML files themselves.
Here is a YAML representation of the same state:
Name: Willie Wildcat
Age: 42
Pet:
  Name: Reggie
  Age: 4
  Breed: Shorkie
YAML uses indentation to denote the hierarchical structure of the document, very similar to Python code. As we can see, the structure of a YAML document is very similar to the custom text format we saw earlier.
However, while this YAML document seems very simple, the YAML specification includes many features that are omitted in JSON, such as the ability to include comments in the data. Unfortunately, YAML also suffers from many of the same problems as Python code, such as the difficulty of keeping track of the indentation when manually editing a file, and the fact that truncated files may be interpreted as complete since there are no termination markers.
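The truncation problem can be illustrated with a small experiment. Python’s standard library has no YAML module, so the sketch below hand-writes a toy emitter and parser (the hypothetical to_yaml and from_yaml functions) that handle only nested dictionaries of simple values. It is not an implementation of the YAML specification; real programs would normally rely on a dedicated YAML library.

```python
import json


def to_yaml(data, indent=0):
    """Emit nested dictionaries as indented 'key: value' lines."""
    lines = []
    pad = "  " * indent
    for key, value in data.items():
        if isinstance(value, dict):
            lines.append(f"{pad}{key}:")
            lines.extend(to_yaml(value, indent + 1))
        else:
            lines.append(f"{pad}{key}: {value}")
    return lines


def from_yaml(text):
    """Parse the same tiny subset back; every value is kept as a string."""
    root = {}
    stack = [(-1, root)]  # (indentation width, dictionary at that level)
    for line in text.splitlines():
        if not line.strip():
            continue
        width = len(line) - len(line.lstrip(" "))
        key, _, value = line.strip().partition(":")
        # Leave any nested blocks that this line is no longer indented under.
        while stack[-1][0] >= width:
            stack.pop()
        parent = stack[-1][1]
        value = value.strip()
        if value:
            parent[key] = value
        else:
            parent[key] = {}
            stack.append((width, parent[key]))
    return root


state = {
    "Name": "Willie Wildcat",
    "Age": 42,
    "Pet": {"Name": "Reggie", "Age": 4, "Breed": "Shorkie"},
}
yaml_text = "\n".join(to_yaml(state))
json_text = json.dumps(state)

# Simulate a truncated file: drop the last YAML line and the last 10 JSON characters.
partial = from_yaml("\n".join(yaml_text.splitlines()[:-1]))
try:
    json.loads(json_text[:-10])
    json_rejected = False
except json.JSONDecodeError:
    json_rejected = True
```

The truncated JSON text is rejected because its closing braces are missing, while the truncated YAML text still parses and silently loses the pet’s breed. A file cut off at a line boundary is still valid YAML, so a full YAML library would be expected to accept it as well.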