strings
strings
Instances of the
string
class are immutable sequences of characters. Because string is a
class, it is a reference
type. Because
instances are immutable, once they are constructed, their contents
cannot change. Note that this does not mean that string variables
cannot change - we can assign a string variable s
the value “abc”
and later assign it the value “xyz”. These assignments simply assign to
s
references to different instances of the string class. What
immutability does mean is that there is no way to change any of the
characters in either of these instances (i.e., in either “abc” or
“xyz”). As a result, it is safe to copy a string by simply assigning
the value of one string variable to another; for example, if s
is
a string variable, we can write:
Note that this is not safe when dealing with mutable reference types,
such as arrays. For example, let a
be an int[ ] with at least
one element, and consider the following code sequence:
Because a
and b
refer to the same array, a[0]
is incremented as
well. This danger is absent for strings because they are immutable.
We access individual characters in a string by indexing; i.e., if
s
is a string variable, then s[0]
retrieves its first character,
s[1]
retrieves its second character, etc. For example, if s
refers
to the string, “abc”, then after executing
c
will contain ‘b’. Note that a statement like
is prohibited in order to enforce immutability.
We obtain the number of characters in a string using its
Length
property; for example:
Note
A string may have a length of 0. This means that it is the empty
string, denoted by “”. Note that "" is different from a null
reference - for example, if s
refers to “”, then s.Length
has a value of 0, but if s
is null, then this expression will
throw a NullReferenceException.
We can concatenate two strings using the +
operator. For example, if
s
refers to the string “abc” and t
refers to the string
“xyz”, then
will assign the string “abcxyz” to u
.
Because strings are immutable, building long strings directly
from many small pieces is very inefficient. Suppose, for example, that
we want to convert all the lower-case characters in the string
text
to upper-case, and to convert all upper-case letters in text
to
lower-case. All other characters we will leave unchanged. We can do this
with the following code:
string result = "";
for (int i = 0; i < text.Length; i++)
{
char c = text[i];
if (char.IsLower(c))
{
result += char.ToUpper(c);
}
else if (char.IsUpper(c))
{
result += char.ToLower(c);
}
else
{
result += c;
}
}
Now suppose that text
contains 100,000 characters. Each iteration of
the loop executes one of the three branches of the if-statement,
each of which concatenates one character to the string accumulated
so far. Because strings are immutable, this concatenation must be
done by copying all the characters in result
, along with the
concatenated character, to a new string. As a result, if we were to
add up the total number of characters copied over the course of the
entire loop, we would come up with 5,000,050,000 character copies done.
This may take a while. In general, we say that this code runs in
O(n2) time, where n is the length of text
. This means
that as n increases, the running time of the code is at worst
proportional to n2. In the next
section, we will see
how we can do this much more efficiently using another data structure.
strings have many other methods to allow various kinds of
manipulation - see the documentation for the string
class
for details.
StringBuilders
StringBuilders
In the previous
section, we saw that
building large strings from small pieces by concatenating them
together is very inefficient. This inefficiency is due to the fact that
strings are immutable. In order to overcome the inefficiency of
concatenation, we need an alternative data structure that we can modify.
The
StringBuilder
class fills this need.
Like strings, StringBuilders implement sequences of characters,
but the contents of StringBuilders can be changed. The
StringBuilder class has six constructors. The simplest
StringBuilder
constructor
takes no parameters and constructs an empty StringBuilder — i.e., a
StringBuilder containing no characters:
StringBuilder sb = new();
We can then modify a StringBuilder in various ways. First, we may
append a char to the end using its
Append
method. This method not only changes the contents of the
StringBuilder, but it also returns a reference to it. Thus if we
have char variables, a
, b
, and c
, and a StringBuilder
variable sb
, we can write code such as:
sb.Append(a).Append(b).Append(c);
The first call to Append appends the contents of a
to sb
, and
returns sb
. Thus, the second call to Append also applies to sb
-
it appends the contents of b
. Likewise, the third call appends the
contents of c
.
Because this method changes a StringBuilder, rather than
constructing a new one, its implementation is very efficient - in most
cases, only the appended character needs to be copied (see
“Implementation of
StringBuilders”
for details). This class has other Append methods as well, including
one that appends the contents of a given
string.
This method likewise only needs to copy the appended characters.
Let us now return to the problem of converting all lower-case letters in
a string to upper-case, converting all upper-case letters to
lower-case, and leaving all other characters unchanged. We can use a
StringBuilder as an intermediate data structure to do this much more
efficiently than the code presented in the previous
section:
StringBuilder sb = new();
for (int i = 0; i < text.Length; i++)
{
char c = text[i];
if (char.IsLower(c))
{
sb.Append(char.ToUpper(c));
}
else if (char.IsUpper(c))
{
sb.Append(char.ToLower(c));
}
else
{
sb.Append(c);
}
}
string result = sb.ToString();
On most iterations, the above loop only copies one character. In
addition, the call to the StringBuilder’s ToString method copies
each character in the result. If text
is 100,000 characters long, this
is a total of 200,000 character copies. Using the StringBuilder
implementation described in the next
section, there
are some iterations that copy more than one character, but even if we
account for this, it turns out that fewer than 400,000 characters are
copied, as opposed to over five billion character copies when
strings are used directly (see the previous
section). The .NET
StringBuilder implementation performs even
better. In either case, the above code runs in O(n) time, where n
is the length of text
; i.e., as n gets large, the running time is at
worst proportional to n. Thus, its performance degrades much less
rapidly than the O(n2) code that uses strings
directly.
A program that runs the above code and the code given in the previous
section on user-provided
text files can be obtained by creating a Git repository (see “Git
Repositories”) using
this URL. A noticeable
performance difference can be seen on text files larger than 100K - for
example, the full text of Lewis Carroll’s Through the Looking
Glass.
StringBuilders have some features in common with strings. For
example, we access individual characters in a StringBuilder by
indexing; i.e., if sb
is a StringBuilder variable, then sb[0]
accesses its first character, sb[1]
accesses its second character,
etc. The main difference here is that with a StringBuilder, we may
use indexing to change characters; e.g., we may do the following,
provided sb
contains at least 3 characters:
A StringBuilder also has a
Length
property, which gets the number of characters contained. However, we may
also set this property to any nonnegative value, provided there is
enough memory available to provide this length. For example, we may
write:
If the new length is shorter than the old, characters are removed from
the end of the StringBuilder. If the new length is longer that the
old, chars containing the Unicode NULL value (0 in decimal) are
appended.
Warning
The Unicode NULL value is different from a null
reference. Because char is a value
type, a char
cannot store a null reference.
StringBuilders have many other methods to allow various kinds of
manipulation - see the documentation for the StringBuilder
class
for details. There is also a StringBuilder
constructor
that takes a string as its only parameter and constructs a
StringBuilder containing the contents of that string.
Implementation of StringBuilders
Implementation of StringBuilders
In this section, we will examine some of the implementation details of
the StringBuilder class. There are several reasons for doing this.
First, by examining these details, we can begin to understand why a
StringBuilder is so much more efficient than a string when it
comes to building long strings a character at a time. Second, by
studying implementations of data structures, we can learn techniques
that might be useful to us if we need to build our own data structures.
Finally, a computing professional who better understands the underlying
software will be better equipped to use that software effectively.
In what follows, we will develop an implementation of a simplified
StringBuilder class. Specifically, we will only implement enough to
support the program that flips the case of all characters in a
string (see the previous
section). Most other
features of a StringBuilder have a rather straightforward
implementation once the basics are done (we will show how to implement
an indexer in a later
section).
Note
The
implementation described here is much simpler than the .NET
implementation, which achieves
even better performance.
In order to illustrate more clearly the techniques used to implement a
StringBuilder, we will present an implementation that uses only
those types provided by the C#
core language, rather than those found in a library such as .NET. One
of the more useful data
structures that the C# core language provides for building more
advanced data structures
is the array. We can represent the characters in a StringBuilder
using a char[ ]. One difficulty in using an array, however, is
that we don’t know how many characters our StringBuilder might need.
We will return to this issue shortly, but for now, let’s just
arbitrarily pick a size for our array, and define:
/// <summary>
/// The initial capacity of the underlying array.
/// </summary>
private const int _initialCapacity = 100;
/// <summary>
/// The character in this StringBuilder.
/// </summary>
private char[] _characters = new char[_initialCapacity];
An array with 100 elements will give us room enough to store up to 100
characters. In fact, initializing the array in this way actually gives
us 100 characters, as each array element is initialized to a Unicode
NULL character (a char with a decimal value of 0). Because char
is a value type,
each array element is going to store a char - it’s just a question
of which char it is going to store. Therefore, if we want to be able
to represent a sequence of fewer than 100 characters, we need an
additional field to keep track of how many characters of the array
actually represent characters in the StringBuilder. We therefore
define:
/// <summary>
/// The number of characters in this StringBuilder.
/// </summary>
private int _length = 0;
Thus, for example, if _length
is 25, the first 25 characters in
_characters
will be the characters in the StringBuilder.
Because both fields have initializers, the default constructor will
initialize them both; hence, we don’t need to write a constructor. Let’s
focus instead on the Append method. This method needs to take a
char as its only parameter and return a StringBuilder (itself).
Its effect needs to be to add the given char to the end of the
sequence of characters in the StringBuilder.
In order to see how this can be done, consider how our fields together
represent the sequence of characters:
Within the array referred to by _characters
, the first _length
locations (i.e., locations 0 through _length
- 1) store the
characters in the StringBuilder. This means that
_characters[_length]
is the next available location, provided this is
a valid array location. In this case, we can simply place the char
to be appended in _characters[_length]
, increment _length
(because
the number of characters in the StringBuilder has increased by 1),
and return the StringBuilder.
However, what if we are already using all of the array locations for
characters in the StringBuilder? In this case, _length
is the
length of the array, and therefore is not a valid array location. In
order to handle this case, we need to make more room. The only way to do
this to construct a new, larger array, and copy all of the characters
into it. We will then make _characters
refer to the new array. (.NET actually provides a method to do all this, but in order
to show the details of what is happening, we will not use it.) Now that
there is enough room, we can append the new character as above. The code
is as follows:
/// <summary>
/// Appends the given character to the end of this StringBuilder.
/// </summary>
/// <param name="c">The character to append.</param>
/// <returns>This StringBuilder.</returns>
public StringBuilder Append(char c)
{
if (_length == _characters.Length)
{
char[] chars = new char[2 * _length];
_characters.CopyTo(chars, 0);
_characters = chars;
}
_characters[_length] = c;
_length++;
return this;
}
A few comments on the above code are in order. First, when we need a new
array, we allocate one of twice the size as the original array. We do
this for a couple of reasons. First, notice that copying every character
from one array to another is expensive if there are a lot of characters.
For this reason, we don’t want to do it very often. By doubling the size
of the array every time we run out of room, we increase the size by
enough that it will be a while before we need to do it again. On the
other hand, doubling the array doesn’t waste too much space if we don’t
need to fill it entirely.
The
CopyTo
method used above copies all of the elements in the array to which this
method belongs (in this case, _characters
) to the array given by the
first parameter (chars
in this case), placing them beginning at the
location given by the second parameter (0 in this case). Thus, we are
copying all the elements of _characters
to chars
, placing them
beginning at location 0.
The last statement within the if block assigns the reference stored
in chars
to _characters
; i.e., it makes _characters
refer to the
same array as does chars
. The last statement in the method returns the
StringBuilder whose Append method was called.
To complete this simple implementation, we
need to provide a ToString method. This method is already defined
for every object; hence, StringBuilder inherits this
definition by default. However, the ToString method defined for
objects doesn’t give us the string we want. Fortunately, though,
this method is a virtual method, meaning that we can re-define by
overriding it. We do this by using the keyword, override, in its
definition. Visual Studio®’s auto-complete feature is helpful here,
as when we type the word override, it presents us with a list of the
methods that can be overridden. Selecting ToString from this list
will fill in a template for the method with a correct parameter list and
return type.
We want this method to return the string formed from the first
_length
characters in _characters
. We can form such a string using
one of the string
constructors.
This constructor takes three parameters:
- a char[ ] containing the characters to form the string;
- an int giving the index in this array of the first character to
use; and
- an int giving the number of characters to use.
We can therefore define the ToString method as follows:
/// <summary>
/// Converts this StringBuilder to a string.
/// </summary>
/// <returns>The string equivalent of this StringBuilder.</returns>
public override string ToString()
{
return new string(_characters, 0, _length);
}
You can obtain a program containing the complete class definition by
creating a Git repository (see “Git
Repositories”) using
this URL. This program is a
modification of the program used in the previous
section to compare the performance
differences between using
strings or StringBuilders when building strings a
character at a time. Its only modification is to use this
StringBuilder class, defined within a class
library, instead of
the class defined in .NET. By running the program on long
strings, you can verify that the performance of this
StringBuilder class is comparable to that of the StringBuilder
in .NET.
Now that we have the details of a StringBuilder implementation, we
can begin to see why it is so much more efficient to build a string
a character at a time using a StringBuilder, as opposed to using a
string. As we have noted, allocating a new array and copying all
characters to it is expensive; however, we have tried to reduce the
number of times this is done. To see how this is accomplished, suppose
we are building a string of 100,000 characters. The first time we
need a larger array, we will copy 100 characters to a 200-element array.
The next time, we will copy 200 characters to a 400-element array. This
will continue until we copy 51,200 characters to a 102,400-element
array, which is large enough to hold all of the characters. If we add up
all of the character copies we have done when allocating new arrays, we
find that there are a total of 102,300 copies. In addition, each time we
call Append, we copy the char parameter to the array. This is
another 100,000 copies. Finally, the ToString method must copy all
of the characters to the string it is constructing. This is another
100,000 character copies, for a total of 302,300 copies. In general, the
number of character copies will always be less than 4n, where n is
the length of the string being built.