Language only partly to blame for buffer overflows

Coping with Character Strings in C
First of a three-part discussion

Conrad Weisert, August 5, 2003
©2003 Information Disciplines, Inc.

This article may be freely circulated, as long as the copyright credit is included.


Background and recent interest

Andrew Koenig & Barbara Moo¹ offer more excellent advice in this month's C/C++ Users Journal. Although their article "Four First Steps to Modern C++ Programming" is aimed at programmers who became fluent in C before C++ was available, everyone who writes programs or manages projects with C/C++ should find the article helpful.

One of Koenig & Moo's four points is a plea to abandon, if you haven't done so already, C's clumsy and error-prone facilities for manipulating character strings. No knowledgeable programmer would disagree, but there is a bit more to the issue than the authors have presented. In this three-part series, I'd like to offer some additional advice.

What's wrong with C strings?

Just about everything.

C has the weakest character string capability of any general-purpose programming language. Strictly speaking, there are no character strings in C², just arrays of single characters, which are really small integers. If s1 and s2 are such "strings", a program cannot:

A set of standard C library functions (#include <string.h>) provides limited support for the first three, one step per call as in assembly language. By convention the end of a string is delimited by the non-printable null character (0 value), but there's no indication of the amount of memory allocated. Consequently, both user code and standard library functions can overwrite memory outside the space allocated for the array of characters.
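To make the hazard concrete, here is a minimal sketch of what a careful caller has to do with the raw library. The function name join_checked and its cap parameter are my own illustrative assumptions, not part of any standard library. Because the array carries no record of its allocated size, the caller has to pass that size in separately and test it before every copy or concatenation:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not a standard function): joins two C strings
   into buf. The caller must supply cap, the number of bytes allocated
   for buf, because the array itself does not record it.
   Returns 0 on success, -1 if the result plus its null would not fit. */
int join_checked(char *buf, size_t cap, const char *a, const char *b)
{
    assert(buf != NULL && a != NULL && b != NULL);

    size_t need = strlen(a) + strlen(b) + 1;   /* +1 for the null */
    if (need > cap)
        return -1;          /* refuse rather than overwrite memory */

    strcpy(buf, a);         /* one library call to copy ...        */
    strcat(buf, b);         /* ... and another to concatenate      */
    return 0;
}
```

Omit the size test and the same two library calls will write past the end of buf whenever the inputs are too long, with no complaint from the compiler or the library.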

Some C partisans have grown comfortable with all this and proud of their mastery of the techniques. Like COBOL enthusiasts of the 1960s, they may even claim to like them! Here's a well-publicized example:

"One of the great strengths of C from its earliest days has been its ability to manipulate sequences of characters."
-- P. J. Plauger, C/C++ Users Journal, July, 1995

What can a C programmer do?

Koenig & Moo correctly point out that the C++ programmer doesn't have to put up with any of that, but what if you're constrained, by contract or compiler environment, to write a program in pure C?

Enlightened organizations, recognizing the dangers of C string handling, adopted programming standards to minimize the risks. One approach was to wrap strings in a structure like this:

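    struct Str {
        int   space;   // bytes allocated for data, including the null
        int   size;    // characters in use, not counting the null
        char* data;    // the characters themselves
    };
    typedef struct Str String;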

That looks a lot like many C++ string classes, but, of course, you can't automate control over copying, assignment, and memory allocation as a C++ class does. Nevertheless, an organization can adopt disciplines for those operations, and prepare local standard library functions that support those disciplines. Application programs that use those library functions may be almost as clumsy to code and to read as raw C-library ones, but a lot less error-prone.
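As a rough illustration of such local library functions, the sketch below shows one way a pure-C "constructor" and pseudo-destructor for the structure might look. The names string_init and string_destroy, and the convention of returning -1 on allocation failure, are assumptions of mine rather than anything prescribed by a standard:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef struct Str {
    int   space;   /* bytes allocated for data, including the null */
    int   size;    /* characters in use, not counting the null     */
    char *data;    /* heap buffer, kept null-terminated            */
} String;

/* Hypothetical constructor: makes *s an independent copy of text.
   Returns 0 on success, -1 if malloc fails (leaving *s empty). */
int string_init(String *s, const char *text)
{
    assert(s != NULL && text != NULL);

    size_t len = strlen(text);
    s->data = malloc(len + 1);
    if (s->data == NULL) {
        s->space = 0;
        s->size  = 0;
        return -1;
    }
    memcpy(s->data, text, len + 1);   /* copies the null as well */
    s->space = (int)(len + 1);
    s->size  = (int)len;
    return 0;
}

/* Hypothetical pseudo-destructor: releases the buffer and leaves
   *s in a state that is safe to destroy a second time. */
void string_destroy(String *s)
{
    assert(s != NULL);
    free(s->data);
    s->data  = NULL;
    s->space = 0;
    s->size  = 0;
}
```

Every String would be passed to string_init before its first use and to string_destroy before it goes out of scope. C will not enforce that pairing, which is why it has to be a local programming standard.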

Application programs will refrain from manipulating the structure members directly, but will do so only through the safe library functions. In particular, you can guarantee an end to the notorious and inexcusable buffer overflow bugs. For example, here's a safer replacement for strcpy and strncpy:

    void stringcopy(String& s1, const String& s2)   // See note 1
    {
        assert(s1.space > s2.size);
        s1.size = s2.size;
        strcpy(s1.data, s2.data);   // (or write out the loop)
    }

Or, mimicking C++ assignment operators, this more automatic version:

    void stringcopy(String& s1, const String& s2)    // See note 1
    {
        if (s1.space <= s2.size) {          // too small: get a bigger buffer
            delete[] s1.data;               // array form, to match new char[]
            s1.space = s2.size + 1;
            s1.data = new char[s1.space];   // See note 2
        }
        s1.size = s2.size;
        strcpy(s1.data, s2.data);
    }

Memory leaks are still possible. Programs must remember to invoke a pseudo-destructor function before allowing a String to pass out of scope, and should avoid making copies of a String structure, since two copies would share a single data pointer. Memory leaks can be fatal to a long-running program and are sometimes hard to find, but, unlike buffer overflows, they don't invite viruses.

Note 1: Pure C doesn't support reference parameters, but you can do the equivalent thing with pointers.
Note 2: In pure C you'd have to use the error-prone malloc function instead of the new operator.
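Putting the two notes together, a pure-C rendering of the automatic version might look something like the following sketch. The int return value for reporting allocation failure, the self-copy check, and the choice to allocate the new buffer before freeing the old one are my own assumptions, not requirements:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef struct Str {
    int   space;   /* bytes allocated for data, including the null */
    int   size;    /* characters in use, not counting the null     */
    char *data;
} String;

/* Pointer-based sketch of stringcopy (note 1) using malloc (note 2).
   Returns 0 on success, -1 if memory could not be obtained, in which
   case *dst is left exactly as it was. */
int stringcopy(String *dst, const String *src)
{
    assert(dst != NULL && src != NULL && src->data != NULL);

    if (dst == src)
        return 0;                       /* copying onto itself: nothing to do */

    if (dst->space <= src->size) {      /* too small: get a bigger buffer */
        char *fresh = malloc((size_t)src->size + 1);
        if (fresh == NULL)
            return -1;
        free(dst->data);                /* free(NULL) is harmless */
        dst->data  = fresh;
        dst->space = src->size + 1;
    }
    dst->size = src->size;
    memcpy(dst->data, src->data, (size_t)src->size + 1);  /* includes the null */
    return 0;
}
```

Because the copy length comes from the size member, which has already been tested against space, the function cannot write beyond the destination buffer as long as callers leave the structure members alone.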

Summary

If C++ is available, you can and should use a string class. If not, although string handling using C's built-in and standard library facilities is a nightmare, you can make the best of it and avoid the most catastrophic bugs. Whenever a responsible software designer encounters a serious limitation in a tool, he or she takes steps to localize its impact and conceal it from the rest of the program.

In any case, programmers and software vendors should stop blaming shortcomings in the programming language for their own serious blunders.

The next article in this three-part series will look at the relationship between character strings and objects.


1 -- Authors of Accelerated C++
2 -- except for the special case of quoted string literals

Last modified August 17, 2003