Addressing a long-standing complaint from non-C programmers . . .

Contiguous character strings in C++

©Conrad Weisert, January, 2010


The problem

Experienced programmers learning their first C-family language (C, C++, Java, C#) express shock and dismay when they learn that constant-length character strings within a record (struct or object) are not contiguous with the record. They're advised that only a pointer to the characters is a member of the record. The actual character-string data will be in an unnamed area of heap memory accessible only through that pointer. It follows that:

Those are troublesome and somewhat error-prone processes. In addition, the record formats are incompatible with other programming languages outside the C family.
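
To make the complaint concrete, here is a minimal sketch of a record that holds only a pointer to its text. The names PointerRecord and rawCopy are hypothetical, chosen for illustration. A byte-for-byte copy of such a record, which is what a naive file or database write amounts to, duplicates the address but none of the characters.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical record: the member is a pointer, so the characters
// themselves live somewhere outside the record.
struct PointerRecord {
    const char* description;   // address of the characters, not the characters
    int         onHandQ;
};

// Copy the raw bytes of a record, as a naive file write would.
inline PointerRecord rawCopy(const PointerRecord& source) {
    PointerRecord copy;
    std::memcpy(&copy, &source, sizeof copy);
    // Both records now refer to the very same characters in memory.
    assert(copy.description == source.description);
    return copy;
}
```

Written to a file and read back by another program, that copied address would no longer refer to anything meaningful.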

The programmers may point out that the languages they've used before (Cobol, PL/1, etc.) imposed no such complications. A character-string data item was simply a member of the record.

The explanation

The origin of this complication lies in the original design of the C programming language in the 1970s. The language was intended mainly for systems programming¹ as a substitute for machine-dependent assembly languages. Its focus was on machine words: integers, addresses, floating-point numbers, and sometimes individual bits. The designers² believed they were keeping the language simple by not providing a character-string data type. Instead, on those occasions when the programmer had to deal with a character string, he or she could simulate it through an array of single characters (actually 8-bit integers). That was what an assembly-language programmer would do.

That expedient was further complicated by another expedient. In order to make array access efficient, subscripting was just a notational convenience for address arithmetic. The origin of an array was a pointer, which the programmer could increment. A program's access to array elements, including characters within a string, had to be through pointers.

That led to the notion that a character string is not an elementary data item but a container! Even now, four decades later, many C++ programmers will tell you that character strings are containers and have to be handled that way.

Decontainerizing character strings

Fortunately, C++ provides a handy way to handle strings as contiguous elementary data. When we first learn about class templates we see examples in which the classes are containers and the template parameters are the names of types or classes:

  template<typename T> class Thing  { . .
Courses and textbooks often overlook another form of class template, where the template parameter is an integer:
 template<int size> class Cstring {
     char data[size];
     . . .
    };
That template parameter will be specified as a constant whenever a client programmer declares a data item:
   class Product {
      Cstring<8>  identifierCode;
      Cstring<48> description;
      Money       price;
      int         onHandQ;
      int         onOrderQ;
   };
How big is a Product object, assuming that a Money object occupies 8 bytes, an int is 4 bytes, and a char is a single byte? What will sizeof(Product) return? What does a programmer have to do to store a Product object in a database and retrieve it later? How would all that change if we used the STL's string class instead of Cstring?
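
One way to explore those questions is to let the compiler answer them. The sketch below is an illustration under assumptions, not the article's own code: Money is a hypothetical 8-byte stand-in, Product is written as a struct so the members are accessible, and roundTrip is an invented helper that copies a record through a raw byte buffer the way a file or database write and read would. With a 4-byte int the members add up to 72 bytes, but the compiler is free to add padding, so check sizeof on your own platform.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>
#include <type_traits>

template<int size> class Cstring {
public:
    char data[size];           // the characters themselves, nothing else
};

// Hypothetical stand-in for the article's Money class: 8 bytes.
struct Money { std::int64_t cents; };

// Same members as the article's Product, declared as a struct for access.
struct Product {
    Cstring<8>  identifierCode;
    Cstring<48> description;
    Money       price;
    int         onHandQ;
    int         onOrderQ;
};

static_assert(sizeof(Cstring<8>) == 8, "eight characters, eight bytes");
static_assert(std::is_trivially_copyable<Product>::value,
              "the whole record can be moved as raw bytes");
static_assert(!std::is_trivially_copyable<std::string>::value,
              "a record containing std::string could not be");

// Copy a record to a byte buffer and back, as a file write and read would.
inline Product roundTrip(const Product& original) {
    unsigned char buffer[sizeof(Product)];
    std::memcpy(buffer, &original, sizeof original);
    Product restored;
    std::memcpy(&restored, buffer, sizeof restored);
    return restored;
}
```

Because the record contains no pointers, the restored copy carries its own characters. Replacing Cstring with std::string would make the second static_assert fail, and storing the record would then require explicit serialization code.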

Completing the Cstring class template

Obviously, we need to provide functionality to go with the data. We won't show the details in this article. In addition to the usual constructors, operators, and substring methods, we'll need methods to convert between this class and other string classes. To keep the representation pure and compatible with other languages, let's not append the null-termination character that characterizes pseudo-strings in C.

Note that there is zero overhead in this class. The string object contains no pointer, no length field, no terminator, no reference count—just the value.
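
As a rough idea of what such functionality might look like, here is a minimal sketch. It is not the author's implementation, which the article defers to a later piece. The blank-padding rule for short values is an assumption borrowed from fixed-length strings in languages such as Cobol and PL/1, and the member names str and operator[] are chosen only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

template<int size> class Cstring {
    char data[size];                       // the only data member
public:
    // Default: all blanks (assumed padding convention).
    Cstring() { std::memset(data, ' ', size); }

    // From a null-terminated C string: truncate if too long,
    // pad with blanks if too short. No terminator is stored.
    Cstring(const char* source) {
        std::size_t n = std::strlen(source);
        if (n > static_cast<std::size_t>(size)) n = size;
        std::memcpy(data, source, n);
        std::memset(data + n, ' ', size - n);
    }

    // Convert to std::string, keeping all size characters, padding included.
    std::string str() const { return std::string(data, size); }

    bool operator==(const Cstring& other) const {
        return std::memcmp(data, other.data, size) == 0;
    }

    char operator[](int i) const { return data[i]; }
};

// The object is exactly as big as its characters.
static_assert(sizeof(Cstring<8>) == 8, "no pointer, length field, or terminator");
```

For example, Cstring<8>("AB12").str() yields the eight characters "AB12" followed by four blanks.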

Unfortunately, the compiler will produce a separate class, with its own copy of every method the program actually uses, for every value of size that a client programmer uses to instantiate objects. To avoid burdening the program with dozens of Cstring classes, organizations can establish conventions to limit the options, e.g. only multiples of 8.
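
A convention like that can be checked by the compiler rather than by code review. The following sketch shows one possible way, assuming a multiple-of-8 rule. The helper roundedSize and the alias ShortCode are hypothetical names, not part of the article's design.

```cpp
#include <cassert>
#include <type_traits>

template<int size> class Cstring {
    // Reject sizes that break the (assumed) house convention.
    static_assert(size > 0 && size % 8 == 0,
                  "Cstring size must be a positive multiple of 8");
    char data[size];
};

// Each distinct size really is a distinct class.
static_assert(!std::is_same<Cstring<8>, Cstring<16> >::value,
              "Cstring<8> and Cstring<16> are unrelated types");

// Hypothetical helper: round a requested length up to the next multiple of 8.
constexpr int roundedSize(int requested) {
    return (requested + 7) / 8 * 8;
}

// A 5-character field then shares the Cstring<8> class instead of adding one.
using ShortCode = Cstring<roundedSize(5)>;
static_assert(std::is_same<ShortCode, Cstring<8> >::value,
              "rounded sizes reuse an existing class");

// Run-time spot checks of the rounding helper.
inline void checkRounding() {
    assert(roundedSize(8) == 8);
    assert(roundedSize(9) == 16);
}
```

With this in place, a declaration such as Cstring<5> fails to compile with the message above.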

Most uses of Cstring will themselves be members of classes, as in the Product example above, rather than in inline application code. That helps to control the proliferation of separate Cstring classes.

When should we use Cstring?

For text manipulation, the STL's string class³ provides far more power and flexibility than our little contiguous string class. Cstring is recommended mainly for fields within a data record that:

If readers are interested, we'll show the code in a later article.


1—Specifically for implementing the UNIX operating system.
2—Mainly Dennis Ritchie at Bell Telephone Laboratories.
3—or one of the character-string classes that preceded the availability of the STL.
