Fad methodologies overlook central focus for user requirements . . .

Defining Data Items
© Conrad Weisert, Information Disciplines, Inc., Chicago
29 June 2003

Background -- one more missing component

The failed-project documentation I described last month lacked not only output specifications, but, worse, any data dictionary, i.e. a repository containing rigorous definitions of the data items referred to throughout the rest of the documentation. Some, but far from all, of the data items were defined within the body of a use case that referred to the data item.

A project team member explained that this was fine, because their methodology (he meant specifically the life cycle) is "use-case driven". He assured me that a use case is the natural place for such information, just where a reader would be likely to look for a given data definition.

I later found a number of instances where the same data item was defined inside multiple use cases, in slightly different language. Even if no two of them are now in conflict, the perils of future change should be obvious.
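
Duplicated definitions of this kind are easy to detect mechanically once the definitions have been pulled out of the use cases. Here is a minimal sketch in Python. It assumes, hypothetically, that each definition has been collected as a (use case, item name, definition text) triple; the use-case labels and the sample wording are invented for illustration and do not come from the project described above.

```python
from collections import defaultdict


def find_divergent_definitions(entries):
    """Return {item_name: {wording: [use_cases]}} for items with more than one wording.

    `entries` is an iterable of (use_case, item_name, definition_text) tuples.
    """
    by_item = defaultdict(lambda: defaultdict(list))
    for use_case, item_name, text in entries:
        # Normalize case and whitespace so purely cosmetic differences are ignored.
        normalized = " ".join(text.lower().split())
        by_item[item_name][normalized].append(use_case)
    # Keep only the items whose definitions differ in wording.
    return {
        name: dict(wordings)
        for name, wordings in by_item.items()
        if len(wordings) > 1
    }


# Hypothetical sample data, for illustration only.
SAMPLE = [
    ("UC-03", "date_hired", "Date the employee first reported for work"),
    ("UC-07", "date_hired", "Most recent date the employee reported for work"),
    ("UC-07", "customer_id", "Unique identifier of a customer"),
    ("UC-09", "customer_id", "unique identifier  of a customer"),
]

if __name__ == "__main__":
    for name, wordings in find_divergent_definitions(SAMPLE).items():
        print(name, "is defined", len(wordings), "different ways")
```

A check like this only finds differences in wording. Deciding which wording is right still takes a systems analyst and a sponsoring user.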

Why is rigorous data definition so important?

Experience shows that misunderstandings about the meanings of data items account for many instances of failed projects, i.e. massive cost overruns, schedule slippages, or worse. Some of those misunderstandings become obvious only in the late stages of system testing and can be extremely expensive to correct. For example:

What kinds of data?

In his classic Structured Analysis and System Specification, DeMarco [1] calls for rigorous definition of composite data items but, surprisingly, ignores the equally vital definition of elementary data items. (Data Item Taxonomy explains the difference.)

Every data item referred to anywhere in the documentation must be defined in a data dictionary, unless its name is so self-documenting that its meaning is obvious to every reader with no chance of differing interpretations. Experience shows that surprisingly few data names meet that criterion.

Where do we find and identify data items?

The system output specifications are an excellent starting point. Once we know what information the system must produce, we can infer what data items are needed to produce that information. The systems analyst should make sure that every field listed in the content of any system output is either rigorously defined in the data dictionary or a simple function of other defined data items.
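
The check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three-argument layout, the output name `vacation_statement`, and the field `employee_name` are hypothetical, and a real project would read the same information from whatever form its data dictionary takes.

```python
def undefined_output_fields(output_specs, dictionary, derived):
    """Return {output_name: [fields that lack a definition]}.

    output_specs: {output_name: [field, ...]} taken from the output specifications
    dictionary:   set of item names defined in the data dictionary
    derived:      {field: [items it is computed from]}
    """
    problems = {}
    for output_name, fields in output_specs.items():
        missing = []
        for field_name in fields:
            if field_name in dictionary:
                continue
            sources = derived.get(field_name)
            # A derived field is acceptable only if every input is itself defined.
            if sources and all(s in dictionary for s in sources):
                continue
            missing.append(field_name)
        if missing:
            problems[output_name] = missing
    return problems


# Hypothetical example data.
OUTPUTS = {"vacation_statement": ["employee_name", "date_hired", "vacation_days_due"]}
DICTIONARY = {
    "date_hired",
    "current_vacation_days_earned",
    "carryover_vacation_days",
    "current_vacation_days_used",
}
DERIVED = {
    "vacation_days_due": [
        "current_vacation_days_earned",
        "carryover_vacation_days",
        "current_vacation_days_used",
    ]
}

if __name__ == "__main__":
    print(undefined_output_fields(OUTPUTS, DICTIONARY, DERIVED))
```

In this sample the only field reported is `employee_name`, which is neither defined nor derived from defined items.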

Defining data items is an ongoing process, however. By the end of the requirements definition phase we must have defined every data item of interest to the sponsoring users. Later, during design and programming, project team members will continue to identify and define data items.

What data dictionary?

If your organization maintains a corporate or global repository of data definitions then you'll probably find some of the data items you need already defined, and your project will contribute some new ones.

But if your organization has no such central data dictionary, you'll need to establish a data dictionary for your project. A number of C.A.S.E. tools support (or claim to support) a data dictionary, but even if you can't find a suitable, affordable one, you still need a project data dictionary. A simple spreadsheet or even index cards, crude as they are, are far better than no data dictionary at all. The criterion is understandability, not ease of maintenance.

Defining a composite data item

A composite data item is fully defined by listing its components. DeMarco proposed a notation that some systems analysts continue to use, but any equivalent is acceptable as long as the user-audience can understand it with minimal instruction.

DeMarco's notation supported conditional inclusion, alternative fields, and repeating groups. Note that we're concerned here with defining the meaning of data items, not yet with normalizing a relational data model. Here's an example:

  order  = order_number
         + customer_id
         + shipping_address
         + billing_address
         + 1{order_item}maxItems
         + [discount_coupon]
         + (credit_card_id | cash_payment_record)

Based on this definition we can answer a number of questions, such as:

If those answers are unacceptable, now is the time to find out, not five weeks before the final delivery target date.

With the possible exception of credit_card_id, none of the component data items has a self-defining name. Further data-dictionary entries are mandatory.
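
To see how a composite definition settles such questions, here is a minimal sketch in Python. It reads the example's notation as follows, which is my interpretation of the example rather than a statement of DeMarco's rules: `1{order_item}maxItems` means between 1 and maxItems repetitions, square brackets mark an optional component, and a parenthesized group with `|` means exactly one of the alternatives. The value of `MAX_ITEMS` and the use of plain strings for every component are placeholders, since the definition gives neither.

```python
from dataclasses import dataclass, field
from typing import List, Optional

MAX_ITEMS = 50  # assumption: the definition names maxItems but gives no value


@dataclass
class Order:
    # Component types are placeholders; each needs its own data-dictionary entry.
    order_number: str
    customer_id: str
    shipping_address: str
    billing_address: str
    order_items: List[str] = field(default_factory=list)
    discount_coupon: Optional[str] = None        # optional component
    credit_card_id: Optional[str] = None         # payment alternative 1
    cash_payment_record: Optional[str] = None    # payment alternative 2

    def violations(self) -> List[str]:
        """List the ways this order departs from the composite definition."""
        found = []
        if not 1 <= len(self.order_items) <= MAX_ITEMS:
            found.append(f"order_item count must be 1..{MAX_ITEMS}")
        # Exactly one of the two payment alternatives must be present.
        payments = [self.credit_card_id, self.cash_payment_record]
        if sum(p is not None for p in payments) != 1:
            found.append("exactly one of credit_card_id or cash_payment_record is required")
        return found


if __name__ == "__main__":
    empty = Order("A-1", "C-9", "ship-to", "bill-to")
    print(empty.violations())
```

On this reading, an order with no items is invalid, an order carries at most one discount coupon, and a payment cannot be split between a credit card and cash.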

Defining an elementary data item

While a composite data item is defined in terms of its components, an elementary data item is defined in terms of its exact real-world meaning. That can be expressed in carefully worded text:

  Date Hired: The most recent date upon which the employee reported for work; i.e. the date from which the current or former employee was continuously on our staff.

or as a simple formula:

  vacation_days_due = current_vacation_days_earned
                    + carryover_vacation_days
                    - current_vacation_days_used 
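
A formula definition of this kind translates directly into code. The sketch below is one possible rendering in Python. The use of `Decimal` and the allowance for fractional days are my assumptions, since the formula itself does not state units or precision.

```python
from decimal import Decimal


def vacation_days_due(current_vacation_days_earned,
                      carryover_vacation_days,
                      current_vacation_days_used):
    """Apply the data-dictionary formula; all three inputs are in days."""
    return (Decimal(current_vacation_days_earned)
            + Decimal(carryover_vacation_days)
            - Decimal(current_vacation_days_used))


if __name__ == "__main__":
    # Worked example: 15 days earned + 3 carried over - 6.5 used = 11.5 days due.
    print(vacation_days_due("15", "3", "6.5"))
```

Each of the three inputs still needs its own data-dictionary entry before the formula is fully defined.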

Attributes of numeric data items

If the elementary data item is numeric, then the definition should specify these attributes:

Defining a data type -- object-orientation

Even if our project doesn't plan to exploit object-oriented technology, we can greatly simplify the data dictionary by "factoring out" common properties and defining new data types in the data dictionary. This eliminates a lot of repetitious redefinition and relieves the systems analysts of the need to make (possibly inconsistent) choices for each data item.

For example, when we define streetAddress as a type, we settle once and for all questions about how many lines should be provided, or whether to include the (redundant) state code. We can then define the composite data item shipping_address as simply an instance of streetAddress, and billing_address as either a streetAddress or a P.O._box_address.

New data types can themselves be defined in terms of other previously defined data types. UML class diagrams can help the audience to visualize the relationships.
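
Here is a minimal sketch, in Python, of how the address example might look once the types are factored out. The field names, the three-line limit, and the decision to leave out the state code are all hypothetical choices made for illustration. The point is only that each choice is made once, in the type, rather than separately for every data item.

```python
from dataclasses import dataclass
from typing import Tuple, Union

MAX_ADDRESS_LINES = 3  # assumption: settled once here, not per data item


@dataclass(frozen=True)
class StreetAddress:
    lines: Tuple[str, ...]
    city: str
    postal_code: str

    def __post_init__(self):
        if not 1 <= len(self.lines) <= MAX_ADDRESS_LINES:
            raise ValueError(f"a street address has 1..{MAX_ADDRESS_LINES} lines")


@dataclass(frozen=True)
class PoBoxAddress:
    box_number: str
    city: str
    postal_code: str


# A new type defined in terms of previously defined types.
BillingAddress = Union[StreetAddress, PoBoxAddress]


@dataclass(frozen=True)
class OrderAddresses:
    shipping_address: StreetAddress
    billing_address: BillingAddress

    def __post_init__(self):
        # Goods cannot be shipped to a P.O. box under this definition.
        if not isinstance(self.shipping_address, StreetAddress):
            raise TypeError("shipping_address must be a StreetAddress")
        if not isinstance(self.billing_address, (StreetAddress, PoBoxAddress)):
            raise TypeError("billing_address must be a StreetAddress or a PoBoxAddress")
```

Nothing here requires the project to adopt object-oriented programming. The same types could be recorded as ordinary entries in the data dictionary.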

Further guidance

The foregoing is barely a shallow introduction to this complex topic. IDI offers a course that explores data definition in depth.

1 -- Tom DeMarco: Structured Analysis and System Specification, 1978, Yourdon Press, ISBN 0-917072-07-3
