Are developers being pressured into using an unsuitable tool?

Some doubts about XML

© Conrad Weisert, Information Disciplines, Inc.,
14 March 2006

"Despite its document-centric roots, XML is evolving to the data format of choice for the transmission and storage of information on the Internet."

- Harold & Means: XML in a Nutshell, O'Reilly, ISBN 0-596-00058-8, p. 225.

The so-called "extensible markup language" (XML) occupies a central position in new computer applications, especially web-based systems. A command of XML and its related tools and auxiliary languages is required of applicants for developer roles, and the use of XML is often taken for granted for many kinds of data.

Early evidence raises concern that:

It's too soon to judge whether those shortcomings are minor annoyances or serious obstacles. I'd like to share some of my concerns and solicit your feedback. Here are the things that so far make me uneasy about XML:

  1. XML was intended as a markup language for documents, but it's now often being used for structured data records. That requires embedding the record description ("markup") in every record, even when the records (or relational rows) all have the same structure.

    Even if the record-description notations were compact, that would be extremely wasteful for a file of identically structured records. Documents and databases are not at all the same thing.

  2. But XML's record-description notations, embedded pairs of tags, are far from compact, often consuming much more space than the data items themselves. For example, I took this from a pro-XML demonstration:
              <location>Central Park</location>
              <time>08:10</time>
              <temperature>31.0</temperature>
              <humidity>44%</humidity>
  3. Books and articles promoting XML cite the character representation of both tags and data as a plus, since an XML file is then "human readable". But the humans who read them almost always do so at a computer, which would find a more efficient format not only more compact but also simpler and less error prone. Does any processor today not understand a simple binary integer?

  4. Unlike HTML, XML requires that tags be well nested. That is:
    This is legal
     <a> <b> <c> 
        data
    </c></b></a>
    but this isn't:
     <a> <b> <c> 
        data
    </a></b></c>
    That's a reasonable requirement, but shouldn't it relieve us of having to repeat the name in the closing tag? There would be no ambiguity in just using </> instead of </ProductName>. The name in the closing tag serves no purpose at all. (In fact, for further economy, a named closing tag could be used to denote multiple closure like this: <a><b><c> data </a>.)

    The combination of repeated record descriptions, clumsy character representation, and redundant closing tags can easily yield an order of magnitude more bytes than the data require. For example, we'd expect the value of a temperature reading to fit in 4 bytes, while "<temperature>31.0</temperature>" consumes 31 bytes! (See note 1.) That's sure to inflict serious performance penalties, especially when the data are transmitted over the Internet or other telecommunication facilities. (And that doesn't even count the overhead of scanning and decoding a numeric data item before a program can do arithmetic on it.) Storage and transmission capacity have become orders of magnitude less expensive than they used to be, but they're still far from free or infinite. That suggests that XML is unsuitable for high-volume applications.

  5. Although "X" stands for "extensible", the language itself doesn't provide a facility for defining extensions – no variables, no named constants, no arithmetic or character-string expressions. Furthermore, the syntax inconsistently leads one to expect such capabilities; for example, every attribute value must be enclosed in quotes, even a numeric literal:
    . . width="100" bgcolor="white".
    What function do those quotation marks serve?

    A 40-year-old principle of formal language design (see note 2) asserts that:

    A language that disregards this long-accepted principle is unnecessarily crude and user-unfriendly (and hardly extensible).
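The byte counts in item 4 are easy to check. Here is a minimal Python sketch, offered as a rough illustration rather than a benchmark. It measures the temperature element from the sample in item 2 and compares it with a packed 4-byte float. The fixed binary record layout (RECORD_LAYOUT) and the helper encoded_size are hypothetical names and choices of my own for illustration; they are not part of the quoted demonstration, and whitespace between elements is not counted.

```python
import struct

# Field values taken from the weather sample quoted in item 2
# (indentation and line breaks between elements are left out).
XML_RECORD = (
    "<location>Central Park</location>"
    "<time>08:10</time>"
    "<temperature>31.0</temperature>"
    "<humidity>44%</humidity>"
)
XML_TEMPERATURE = "<temperature>31.0</temperature>"

# Hypothetical fixed layout, assumed purely for illustration:
# 12-byte location, 2-byte minutes since midnight,
# 4-byte float temperature, 1-byte humidity percentage.
# The "<" prefix selects little-endian with no padding.
RECORD_LAYOUT = "<12sHfB"


def encoded_size(text: str, encoding: str) -> int:
    """Return the number of bytes `text` occupies in `encoding`."""
    return len(text.encode(encoding))


# One temperature reading as a 4-byte IEEE 754 float.
binary_temperature = struct.pack("<f", 31.0)

# The whole record under the assumed fixed layout.
binary_record = struct.pack(
    RECORD_LAYOUT, b"Central Park", 8 * 60 + 10, 31.0, 44
)

if __name__ == "__main__":
    print("temperature, XML (ASCII): ", encoded_size(XML_TEMPERATURE, "ascii"))
    print("temperature, XML (UTF-16):", encoded_size(XML_TEMPERATURE, "utf-16-le"))
    print("temperature, binary float:", len(binary_temperature))
    print("record, XML (ASCII):      ", encoded_size(XML_RECORD, "ascii"))
    print("record, assumed binary:   ", len(binary_record))
```

On this sample the temperature element comes to 31 bytes in ASCII and 62 bytes in UTF-16, against 4 bytes for the packed float. The whole-record comparison depends entirely on the layout assumed here, so treat that ratio as indicative only.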

XML is an evolving standard, and I may have overlooked features that are newer than my reference books. Let me know (cweisert@acm.org):

and I'll update this article accordingly or provide links to other relevant material.
1 -- That's assuming single-byte ASCII characters. With two-byte Unicode characters (UTF-16), which are standard in Java, the XML representation of that 4-byte data item would consume 62 bytes.

2 -- I first heard this from Alan Perlis.


Last modified March 16, 2006