strings and StringBuilders :: Data Structures in C#

strings and StringBuilders

C# and .NET provide two data structures for representing sequences of characters - strings and StringBuilders. Each of these data structures has its own advantages and disadvantages. In this chapter, we will examine how these two types are used and implemented. In the process, we will note the tradeoffs involved in using one or the other.

strings

Instances of the string class are immutable sequences of characters. Because string is a class, it is a reference type. Because instances are immutable, once they are constructed, their contents cannot change. Note that this does not mean that string variables cannot change - we can assign a string variable s the value “abc” and later assign it the value “xyz”. These assignments simply assign to s references to different instances of the string class. What immutability does mean is that there is no way to change any of the characters in either of these instances (i.e., in either “abc” or “xyz”). As a result, it is safe to copy a string by simply assigning the value of one string variable to another; for example, if s is a string variable, we can write:

string t = s;

Note that this is not safe when dealing with mutable reference types, such as arrays. For example, let a be an int[ ] with at least one element, and consider the following code sequence:

int[ ] b = a;
b[0]++;

Because a and b refer to the same array, a[0] is incremented as well. This danger is absent for strings because they are immutable.

We access individual characters in a string by indexing; i.e., if s is a string variable, then s[0] retrieves its first character, s[1] retrieves its second character, etc. For example, if s refers to the string, “abc”, then after executing

char c = s[1];

c will contain ‘b’. Note that a statement like

s[0] = 'x';

is prohibited in order to enforce immutability.

We obtain the number of characters in a string using its Length property; for example:

int len = s.Length;

Note

A string may have a length of 0. This means that it is the empty string, denoted by “”. Note that "" is different from a null reference - for example, if s refers to “”, then s.Length has a value of 0, but if s is null, then this expression will throw a NullReferenceException.

We can concatenate two strings using the + operator. For example, if s refers to the string “abc” and t refers to the string “xyz”, then

string u = s + t;

will assign the string “abcxyz” to u.

Because strings are immutable, building long strings directly from many small pieces is very inefficient. Suppose, for example, that we want to convert all the lower-case characters in the string text to upper-case, and to convert all upper-case letters in text to lower-case. All other characters we will leave unchanged. We can do this with the following code:

string result = "";
for (int i = 0; i < text.Length; i++)
{
    char c = text[i];
    if (char.IsLower(c))
    {
        result += char.ToUpper(c);
    }
    else if (char.IsUpper(c))
    {
        result += char.ToLower(c);
    }
    else
    {
        result += c;
    }
}

Now suppose that text contains 100,000 characters. Each iteration of the loop executes one of the three branches of the if-statement, each of which concatenates one character to the string accumulated so far. Because strings are immutable, this concatenation must be done by copying all the characters in result, along with the concatenated character, to a new string. As a result, if we were to add up the total number of characters copied over the course of the entire loop, we would come up with 5,000,050,000 character copies done. This may take a while. In general, we say that this code runs in O(n²) time, where n is the length of text. This means that as n increases, the running time of the code is at worst proportional to n². In the next section, we will see how we can do this much more efficiently using another data structure.

strings have many other methods to allow various kinds of manipulation - see the documentation for the string class for details.

StringBuilders

In the previous section, we saw that building large strings from small pieces by concatenating them together is very inefficient. This inefficiency is due to the fact that strings are immutable. In order to overcome the inefficiency of concatenation, we need an alternative data structure that we can modify. The StringBuilder class fills this need.

Like strings, StringBuilders implement sequences of characters, but the contents of StringBuilders can be changed. The StringBuilder class has six constructors. The simplest StringBuilder constructor takes no parameters and constructs an empty StringBuilder — i.e., a StringBuilder containing no characters:

StringBuilder sb = new();

We can then modify a StringBuilder in various ways. First, we may append a char to the end using its Append method. This method not only changes the contents of the StringBuilder, but it also returns a reference to it. Thus if we have char variables, a, b, and c, and a StringBuilder variable sb, we can write code such as:

sb.Append(a).Append(b).Append(c);

The first call to Append appends the contents of a to sb, and returns sb. Thus, the second call to Append also applies to sb - it appends the contents of b. Likewise, the third call appends the contents of c.

Because this method changes a StringBuilder, rather than constructing a new one, its implementation is very efficient - in most cases, only the appended character needs to be copied (see “Implementation of StringBuilders” for details). This class has other Append methods as well, including one that appends the contents of a given string. This method likewise only needs to copy the appended characters.

Let us now return to the problem of converting all lower-case letters in a string to upper-case, converting all upper-case letters to lower-case, and leaving all other characters unchanged. We can use a StringBuilder as an intermediate data structure to do this much more efficiently than the code presented in the previous section:

StringBuilder sb = new();
for (int i = 0; i < text.Length; i++)
{
    char c = text[i];
    if (char.IsLower(c))
    {
        sb.Append(char.ToUpper(c));
    }
    else if (char.IsUpper(c))
    {
        sb.Append(char.ToLower(c));
    }
    else
    {
        sb.Append(c);
    }
}
string result = sb.ToString();

On most iterations, the above loop only copies one character. In addition, the call to the StringBuilder’s ToString method copies each character in the result. If text is 100,000 characters long, this is a total of 200,000 character copies. Using the StringBuilder implementation described in the next section, there are some iterations that copy more than one character, but even if we account for this, it turns out that fewer than 400,000 characters are copied, as opposed to over five billion character copies when strings are used directly (see the previous section). The .NET StringBuilder implementation performs even better. In either case, the above code runs in O(n) time, where n is the length of text; i.e., as n gets large, the running time is at worst proportional to n. Thus, its performance degrades much less rapidly than the O(n²) code that uses strings directly.

A program that runs the above code and the code given in the previous section on user-provided text files can be obtained by creating a Git repository (see “Git Repositories”) using this URL. A noticeable performance difference can be seen on text files larger than 100K - for example, the full text of Lewis Carroll’s Through the Looking Glass.

StringBuilders have some features in common with strings. For example, we access individual characters in a StringBuilder by indexing; i.e., if sb is a StringBuilder variable, then sb[0] accesses its first character, sb[1] accesses its second character, etc. The main difference here is that with a StringBuilder, we may use indexing to change characters; e.g., we may do the following, provided sb contains at least 3 characters:

sb[2] = 'x';

A StringBuilder also has a Length property, which gets the number of characters contained. However, we may also set this property to any nonnegative value, provided there is enough memory available to provide this length. For example, we may write:

sb.Length = 10;

If the new length is shorter than the old, characters are removed from the end of the StringBuilder. If the new length is longer that the old, chars containing the Unicode NULL value (0 in decimal) are appended.

Warning

The Unicode NULL value is different from a null reference. Because char is a value type, a char cannot store a null reference.

StringBuilders have many other methods to allow various kinds of manipulation - see the documentation for the StringBuilder class for details. There is also a StringBuilder constructor that takes a string as its only parameter and constructs a StringBuilder containing the contents of that string.

Implementation of StringBuilders

In this section, we will examine some of the implementation details of the StringBuilder class. There are several reasons for doing this. First, by examining these details, we can begin to understand why a StringBuilder is so much more efficient than a string when it comes to building long strings a character at a time. Second, by studying implementations of data structures, we can learn techniques that might be useful to us if we need to build our own data structures. Finally, a computing professional who better understands the underlying software will be better equipped to use that software effectively.

In what follows, we will develop an implementation of a simplified StringBuilder class. Specifically, we will only implement enough to support the program that flips the case of all characters in a string (see the previous section). Most other features of a StringBuilder have a rather straightforward implementation once the basics are done (we will show how to implement an indexer in a later section).

Note

The implementation described here is much simpler than the .NET implementation, which achieves even better performance.

In order to illustrate more clearly the techniques used to implement a StringBuilder, we will present an implementation that uses only those types provided by the C# core language, rather than those found in a library such as .NET. One of the more useful data structures that the C# core language provides for building more advanced data structures is the array. We can represent the characters in a StringBuilder using a char[ ]. One difficulty in using an array, however, is that we don’t know how many characters our StringBuilder might need. We will return to this issue shortly, but for now, let’s just arbitrarily pick a size for our array, and define:

/// <summary>
/// The initial capacity of the underlying array.
/// </summary>
private const int _initialCapacity = 100;

/// <summary>
/// The character in this StringBuilder.
/// </summary>
private char[] _characters = new char[_initialCapacity];

An array with 100 elements will give us room enough to store up to 100 characters. In fact, initializing the array in this way actually gives us 100 characters, as each array element is initialized to a Unicode NULL character (a char with a decimal value of 0). Because char is a value type, each array element is going to store a char - it’s just a question of which char it is going to store. Therefore, if we want to be able to represent a sequence of fewer than 100 characters, we need an additional field to keep track of how many characters of the array actually represent characters in the StringBuilder. We therefore define:

/// <summary>
/// The number of characters in this StringBuilder.
/// </summary>
private int _length = 0;

Thus, for example, if _length is 25, the first 25 characters in _characters will be the characters in the StringBuilder.

Because both fields have initializers, the default constructor will initialize them both; hence, we don’t need to write a constructor. Let’s focus instead on the Append method. This method needs to take a char as its only parameter and return a StringBuilder (itself). Its effect needs to be to add the given char to the end of the sequence of characters in the StringBuilder.

In order to see how this can be done, consider how our fields together represent the sequence of characters:

Within the array referred to by _characters, the first _length locations (i.e., locations 0 through _length - 1) store the characters in the StringBuilder. This means that _characters[_length] is the next available location, provided this is a valid array location. In this case, we can simply place the char to be appended in _characters[_length], increment _length (because the number of characters in the StringBuilder has increased by 1), and return the StringBuilder.

However, what if we are already using all of the array locations for characters in the StringBuilder? In this case, _length is the length of the array, and therefore is not a valid array location. In order to handle this case, we need to make more room. The only way to do this to construct a new, larger array, and copy all of the characters into it. We will then make _characters refer to the new array. (.NET actually provides a method to do all this, but in order to show the details of what is happening, we will not use it.) Now that there is enough room, we can append the new character as above. The code is as follows:

/// <summary>
/// Appends the given character to the end of this StringBuilder.
/// </summary>
/// <param name="c">The character to append.</param>
/// <returns>This StringBuilder.</returns>
public StringBuilder Append(char c)
{
    if (_length == _characters.Length)
    {
        char[] chars = new char[2 * _length];
        _characters.CopyTo(chars, 0);
        _characters = chars;
    }
    _characters[_length] = c;
    _length++;
    return this;
}

A few comments on the above code are in order. First, when we need a new array, we allocate one of twice the size as the original array. We do this for a couple of reasons. First, notice that copying every character from one array to another is expensive if there are a lot of characters. For this reason, we don’t want to do it very often. By doubling the size of the array every time we run out of room, we increase the size by enough that it will be a while before we need to do it again. On the other hand, doubling the array doesn’t waste too much space if we don’t need to fill it entirely.

The CopyTo method used above copies all of the elements in the array to which this method belongs (in this case, _characters) to the array given by the first parameter (chars in this case), placing them beginning at the location given by the second parameter (0 in this case). Thus, we are copying all the elements of _characters to chars, placing them beginning at location 0.

The last statement within the if block assigns the reference stored in chars to _characters; i.e., it makes _characters refer to the same array as does chars. The last statement in the method returns the StringBuilder whose Append method was called.

To complete this simple implementation, we need to provide a ToString method. This method is already defined for every object; hence, StringBuilder inherits this definition by default. However, the ToString method defined for objects doesn’t give us the string we want. Fortunately, though, this method is a virtual method, meaning that we can re-define by overriding it. We do this by using the keyword, override, in its definition. Visual Studio®’s auto-complete feature is helpful here, as when we type the word override, it presents us with a list of the methods that can be overridden. Selecting ToString from this list will fill in a template for the method with a correct parameter list and return type.

We want this method to return the string formed from the first _length characters in _characters. We can form such a string using one of the string constructors. This constructor takes three parameters:

a char[ ] containing the characters to form the string;
an int giving the index in this array of the first character to use; and
an int giving the number of characters to use.

We can therefore define the ToString method as follows:

/// <summary>
/// Converts this StringBuilder to a string.
/// </summary>
/// <returns>The string equivalent of this StringBuilder.</returns>
public override string ToString()
{
    return new string(_characters, 0, _length);
}

You can obtain a program containing the complete class definition by creating a Git repository (see “Git Repositories”) using this URL. This program is a modification of the program used in the previous section to compare the performance differences between using strings or StringBuilders when building strings a character at a time. Its only modification is to use this StringBuilder class, defined within a class library, instead of the class defined in .NET. By running the program on long strings, you can verify that the performance of this StringBuilder class is comparable to that of the StringBuilder in .NET.

Now that we have the details of a StringBuilder implementation, we can begin to see why it is so much more efficient to build a string a character at a time using a StringBuilder, as opposed to using a string. As we have noted, allocating a new array and copying all characters to it is expensive; however, we have tried to reduce the number of times this is done. To see how this is accomplished, suppose we are building a string of 100,000 characters. The first time we need a larger array, we will copy 100 characters to a 200-element array. The next time, we will copy 200 characters to a 400-element array. This will continue until we copy 51,200 characters to a 102,400-element array, which is large enough to hold all of the characters. If we add up all of the character copies we have done when allocating new arrays, we find that there are a total of 102,300 copies. In addition, each time we call Append, we copy the char parameter to the array. This is another 100,000 copies. Finally, the ToString method must copy all of the characters to the string it is constructing. This is another 100,000 character copies, for a total of 302,300 copies. In general, the number of character copies will always be less than 4n, where n is the length of the string being built.

strings and StringBuilders