Why is this everyday problem so complicated?

Comparing Character Strings

by Conrad Weisert
November 2, 2013
©2013 Information Disciplines, Inc.


Background

I have to confront the character-string ordering issue twice this month:

I have to offer some useful advice to both groups about matching and ordering commonly-occurring data items such as names of people. We have to sort lists of names in order to produce directories. Let's review some of the subtleties we have to cater to.

Case insensitivity

I sometimes give students a list of people's names and ask them to write a program to produce a sorted list. If the list includes "de la Renta,Oscar" and "De Gaulle,Charles", they may be surprised to discover Oscar de la Renta at the end of their sorted results. Once they understand character encoding, of course, they know that lower-case characters come after upper-case, and they start thinking of ways to do a case-independent comparison.

At worst, they may have to invoke case conversion methods:
  if (s1.toUpper() > s2.toUpper()) swap(s1,s2);
or they may be using a string class that provides a case-independent comparison option.
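
As a minimal sketch of the case-independent approach, here is how it might look in Python (chosen here only for illustration). The two names from the example above are joined by two invented ones, and the standard str.casefold method serves as the sort key, so the stored names themselves are never altered.

```python
# Sample data: the two names from the text plus two invented ones.
names = ["de la Renta,Oscar", "Zola,Emile", "De Gaulle,Charles", "Adams,John"]

# A plain sort compares character codes, so every upper-case initial
# precedes every lower-case one.
naive = sorted(names)

# Sorting on a case-folded key ignores case without changing the data.
case_blind = sorted(names, key=str.casefold)

print(naive)       # de la Renta,Oscar lands at the end
print(case_blind)  # de la Renta,Oscar follows De Gaulle,Charles
```

Using a key rather than converting the strings in place keeps the original capitalization available for printing the directory.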

Diacritical marks

A similar character-conversion strategy can be used to compare some strings containing characters that carry accent marks, such as é and ü. In most European languages those characters collate with their unaccented equivalents, but there are exceptions. In Swedish, for example, ö is a distinct character at the end of the alphabetic sequence, while in German the same graphic just collates with the letter o. Norwegians and Danes avoid that confusion by using the graphic ø instead of the Swedish version, but then disputes may arise in designing general pan-Scandinavian lists.
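
The two conventions can be sketched in Python as a pair of sort keys. This is only an illustration under stated assumptions: the function names, the word list, and the SWEDISH_TAIL table are invented here, and a production system would more likely rely on a full collation library. The accent-blind key uses the standard unicodedata module to decompose each character and drop its combining marks.

```python
import unicodedata

def strip_accents(text):
    """Drop combining marks, so that é becomes e and ö becomes o."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def accent_blind_key(text):
    """Accented letters collate with their unaccented equivalents."""
    return strip_accents(text).casefold()

# Hypothetical table: letters that sort after z, in this order.
SWEDISH_TAIL = "åäö"

def swedish_style_key(text):
    """Like accent_blind_key, except that å, ä, ö sort after z."""
    key = []
    for ch in unicodedata.normalize("NFC", text.casefold()):
        if ch in SWEDISH_TAIL:
            key.append((1, SWEDISH_TAIL.index(ch), ""))
        else:
            key.append((0, 0, strip_accents(ch)))
    return key

words = ["zebra", "öl", "olive", "élan", "eagle"]
print(sorted(words, key=accent_blind_key))   # öl sorts among the o words
print(sorted(words, key=swedish_style_key))  # öl sorts after zebra
```

Note that ø is a letter in its own right rather than an o with a combining mark, so strip_accents leaves it unchanged; handling it would need an explicit table entry.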

Double character options

When we learn German, we accept the equivalence of schloss and schloß (see note 1). That's one character position equivalent to two. When should an application observe that equivalence, and when is it all right to ignore it? Similar choices exist in other languages; for example, some Norwegians have catered to international alphabets by changing å to aa, but a Norwegian telephone directory may list them together.
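
One way to observe such equivalences when matching names is to expand the single character into its two-character form before comparing. The Python sketch below is an assumption-laden illustration: match_key and the EXPANSIONS table are hypothetical names, and whether å should match aa at all is exactly the kind of decision the application's stakeholders have to make. Python's standard str.casefold already folds ß to ss, so only å needs a table entry.

```python
import unicodedata

# Hypothetical table of one-to-two character equivalences.
# ß is not listed because str.casefold already maps it to ss.
EXPANSIONS = {"å": "aa"}

def match_key(text):
    """Build a key under which schloß matches schloss and å matches aa."""
    folded = unicodedata.normalize("NFC", text.casefold())
    return "".join(EXPANSIONS.get(ch, ch) for ch in folded)

print(match_key("Schloß") == match_key("schloss"))  # True
print(match_key("Håkon") == match_key("Haakon"))    # True
print(match_key("Håkon") == match_key("Hakon"))     # False
```

The same key could be passed to a sort so that both spellings are listed together, although it says nothing about where in the alphabet that group should appear.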

Bottom line

If you're designing an application or a data base that's intended to serve customers, members, vendors, or others in multiple cultures, you'll need to document thoroughly the support your application will give to various alphabets and you'll have to obtain agreement in advance from stakeholders.

On the other hand, if you're developing an application to be based in a single culture and expect it to draw its customers, members, or other users mainly from that culture, you just need to support the kinds of information your users will use and expect. At the very minimum it should support lower-case letters in their dictionary sequence and those single-character accented letters that we expect to collate with their unaccented equivalents. You may have to code custom comparison operators or you may find what you need in a standard component library.


1—The spelling reform of 1996 endorses ss in some situations where ß was formerly standard.

Last modified November 2, 2013