“Dictionaries are like watches; the worst is better than none,
and the best cannot be expected to go quite true.”
Mrs. Piozzi, Anecdotes of Samuel Johnson, 1786

IN THIS CHAPTER, YOU WILL LEARN:
-
Why we need a data dictionary in a systems development project;
-
The notation for data dictionary definitions;
-
How a data dictionary should be presented to the user; and
-
How to implement a data dictionary.
The second important
modeling tool that we will discuss is the data
dictionary.
Though it doesn’t have the glamour and graphical
appeal of dataflow
diagrams,
entity-relationship
diagrams, and state-transition diagrams, the data
dictionary is crucial. Without it,
your model of the user’s requirements cannot possibly
be
considered complete; all
you
will have is a rough sketch, an “artist’s rendering” of
the system.
The importance of a data dictionary is
often lost on many adults, for they have not used
a dictionary for 10 or 20 years. Try to think back
to
your elementary
school days, when you were
constantly besieged with new words in your schoolwork.
Think back also to your
foreign language courses, particularly the ones that
required you to read books and magazines. Without
a dictionary, you would have been lost. The
same is true
of a data dictionary in systems analysis: without
it, you
will be lost, and the user won’t be sure you have
understood the details of the
application.
The phrase data dictionary is almost
self-defining. The data dictionary is an organized
listing of all
the data elements that are pertinent to the system,
with precise,
rigorous definitions so
that both user
and systems analyst will have a common understanding
of all inputs, outputs, components of stores, and
intermediate calculations. The data dictionary
defines
the data elements by doing the following:
-
Describing the meaning
of the flows and stores shown in the dataflow diagrams.
-
Describing the composition
of aggregate packets of data moving along the flows, that is, complex packets
(such as a customer address) that can be broken into more elementary items
(such as city, state, and postal code).
-
Describing the composition
of packets of data in stores.
-
Specifying the relevant
values and units of elementary chunks of information in the dataflows
and data stores.
-
Describing
the details of relationships between stores
that are highlighted in an entity-relationship diagram.
This aspect of the data dictionary will be discussed
in more detail in Chapter
12 after we have introduced the entity-relationship
notation.

10.1 THE NEED FOR DATA DICTIONARY NOTATION
In most real-world
systems that you will work on, the packets, or
data elements, will be sufficiently
complex that you will need to describe them in terms of other
things. Complex data elements are
defined in terms of simpler data elements, and simple data elements
are defined in terms of the legitimate units and values they
can take on.
Think, for example, about the way you
would respond to the following question from a Martian (which
is the way many users think of systems analysts!) about the meaning
of a person’s
name:
Martian: “So
what is this thing called a name?”
You (shrugging
impatiently): “Well,
you know, it’s just a name. I mean, like, well,
it’s what we call each other.”
Martian (puzzled): “Does
that mean you can call them something different
when you’re
angry than when you’re happy?”
You (slightly amazed
at the ignorance of this alien): “No,
of course not. A name is the same all the time.
A
person’s name
is what we use
to
distinguish him or her
from other people.”
Martian (suddenly understanding): “Ahh,
now I understand. We do the same thing on my
planet. My name is 3.141592653589793238462643.”
You (incredulous): “But
that’s a number, not a name.”
Martian: “And
a very good name it is, too. I’m proud of it.
Nobody has anything close.”
You: “But what about
your first name? Or is your first name 3, and
your last name 1415926535?”
Martian: “What’s
this about first name and last name? I don’t
understand. I have only
one name, and it’s always the same.”
You: “Well, that’s not the way
it works here. We have a first name, and a last
name, and sometimes we have a middle name too.”
Martian: “Does that mean you
could be called 23 45 99?”
You: “No, we don’t allow numbers
in our names. You can only use the alphabetic
characters A through Z.”
As you
can imagine, the conversation could continue for
a very long time. You might think the example is
contrived,
because we rarely run into Martians who have no concept of the meaning
of a name. But it is not too far from the discussions
that take place (or should
take place) between a systems analyst and a user, in which the following
questions might be raised:
-
Must
everyone have a first name? What about the
character “Mr.
T” on the old TV series,
“The A Team”?
-
What
about punctuation characters in a person’s
last name; for example, “D’Arcy”?
-
Are
abbreviated middle names allowed, for example, “John X James”?
-
Is
there a minimal length required of a person’s
name? For example, is
the name “X Y” legal?
(One could imagine that it would wreak
havoc with many computer systems throughout
the country,
but is
there any legal/business reason why
a person couldn’t
give himself a first name of X and a last
name of Y?)
-
How
should we treat the suffixes that sometimes
follow a last name?
For example, the name “John
Jones, Jr.” is presumably legitimate,
but is the Jr. to be considered part of
the last name or
a special new category? And if it is a
new category, shouldn’t we allow numeric
digits,
too; for example, Sam Smith 3rd?
Note, by the way,
that none of these questions has anything to do
with the way we will eventually store the
information on a computer; we are simply trying to determine,
as a matter of business policy, what constitutes
a valid
name [1].
As you can imagine, it gets rather
tedious describing the composition of data elements in a rambling
narrative form. We need a concise, compact notation, just as
a standard dictionary like
Webster’s has a compact, concise notation for defining the
meaning of ordinary words.
10.2 DATA DICTIONARY NOTATION
There are many common notational schemes used by systems
analysts. The one shown below is among the more common, and it uses a number
of simple symbols:
=    is composed of
+    and
( )  optional (may be present or absent)
{ }  iteration
[ ]  select one of several alternative choices
* *  comment
@    identifier (key field) for a store
|    separates alternative choices in the [ ] construct
As an example, we might define
name for our friendly Martian as follows:
name = courtesy-title + first-name + (middle-name) + last-name
courtesy-title = [Mr. | Miss | Mrs. | Ms. | Dr. | Professor]
first-name = {legal-character}
middle-name = {legal-character}
last-name = {legal-character}
legal-character = [A-Z|a-z|0-9|'|-| | ]
As you can see, the
symbols look rather mathematical; you may be worried
that it’s far too complicated to understand. As
we will soon see, though, the notation is quite easy to read. The
experience of
several thousands of IT development projects and several tens of
thousands of users has shown us that the notation
is also quite understandable to almost all
users if it is presented properly; we will discuss this in
Section 10.3.
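To see how mechanically the definition of name can be read, here is a minimal Python sketch that checks a name against it. The function name is_valid_name and the decision to pass the components in already separated are assumptions made for illustration; they are not part of the notation. Because {legal-character} means zero or more occurrences, the sketch accepts an empty component, exactly as the definition above does.

```python
import re

# courtesy-title = [Mr. | Miss | Mrs. | Ms. | Dr. | Professor]
COURTESY_TITLES = {"Mr.", "Miss", "Mrs.", "Ms.", "Dr.", "Professor"}

# legal-character = [A-Z|a-z|0-9|'|-| | ]
# {legal-character} is "zero or more", hence the trailing "*".
LEGAL_CHARACTERS = re.compile(r"[A-Za-z0-9' -]*")


def is_valid_name(courtesy_title, first_name, last_name, middle_name=None):
    """Check name = courtesy-title + first-name + (middle-name) + last-name.

    middle_name=None models an absent optional component.
    """
    if courtesy_title not in COURTESY_TITLES:
        return False
    parts = [first_name, last_name]
    if middle_name is not None:
        parts.append(middle_name)
    return all(LEGAL_CHARACTERS.fullmatch(part) is not None for part in parts)
```

Note that, read literally, the definition allows digits in every component, so a call such as is_valid_name("Mr.", "3", "1415926535") succeeds. Whether that is what the user intends is exactly the kind of question the notation is meant to bring to the surface.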
10.2.1 Definitions
A definition of a data element is
introduced with the symbol “=”; in this context, the “=”
is read as “is defined as,” or “is composed of,” or
simply “means.” Thus, the notation
A = B + C
could be read in any of the following ways:
-
A is defined as B and C.
-
A is composed of B and C.
-
A means B and C.
To completely define a data element, our
definition will include the following:
-
The meaning of
the data element within the context of this user’s
application.
This is usually provided as a comment, using the “* *” notation.
-
The composition of the data element, if it
is composed of meaningful elementary components.
-
The legal values
that the data element can take on, if it is an elementary data element that
cannot be decomposed any further.
Thus, if we were building a medical
system that kept track of patients, we might define the terms weight and
height in the following way:
weight = *patient’s weight upon admission to the hospital*
         *units: kilograms; range: 1-200*
height = *patient’s height upon admission to the hospital*
         *units: centimeters; range: 20-200*
Note that we have
described the relevant units and
the relevant range within matching “*” characters.
Again, this is a notational convention that many IT
organizations find useful, but it can be changed if necessary.
In
addition to the units and range, you
may also need to specify the accuracy or precision with
which the data element is measured. For a data element
like price,
for example, it is important to indicate whether the
values will be expressed in whole dollars, to the
nearest penny, and so
on [2]. And in many
engineering and scientific applications, it is important
to indicate the number of significant digits in the value
of data
elements.
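As an illustration, the weight and height definitions above could be recorded in a small Python structure that keeps the meaning, units, and range together. This is only a sketch: the class name ElementaryElement, its field names, and the accepts method are hypothetical. It also assumes whole-number values, because the ranges above are written as whole numbers; a real dictionary entry would state the precision explicitly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ElementaryElement:
    """One elementary data dictionary entry (hypothetical structure)."""
    name: str
    meaning: str   # the narrative comment
    units: str
    low: int       # smallest legal value, inclusive
    high: int      # largest legal value, inclusive

    def accepts(self, value) -> bool:
        # Assumption: only whole numbers are legal, since the ranges
        # in the text are given as whole numbers.
        return isinstance(value, int) and self.low <= value <= self.high


weight = ElementaryElement(
    name="weight",
    meaning="patient's weight upon admission to the hospital",
    units="kilograms",
    low=1,
    high=200,
)

height = ElementaryElement(
    name="height",
    meaning="patient's height upon admission to the hospital",
    units="centimeters",
    low=20,
    high=200,
)
```

With these entries, weight.accepts(70) is true, while weight.accepts(0) and height.accepts(201) are false.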
10.2.2 Elementary Data Elements
Elementary data elements are those for
which there is no meaningful decomposition in the context
of the user’s environment. This is often a matter of
interpretation and one that you must
explore carefully with the user. For example, we have
seen in the discussion above that the term name could
be decomposed
into last-name, first-name, middle-name,
and courtesy-title. But perhaps in
some user environments no such decomposition is necessary,
relevant, or even meaningful (i.e., where the terms last-name, etc.,
have no meaning to the user).
When we have identified elementary data
items, they must be entered in the data dictionary.
As indicated above, the data dictionary should provide
a brief
narrative
comment, enclosed
within “*” characters,
describing the meaning of the term within the
user’s context. Of course, there will be some terms
that are self-defining, that is, terms whose meaning
is universally
the same for all information systems, or
where the systems analyst might agree that no further
elaboration is necessary. For example, the following
might be considered
self-defining terms in a system
that maintains information about people:
current-height
current-weight
date-of-birth
sex
home-phone-number
In these cases,
no narrative comment is necessary; many systems analysts
will use the notation “**” to
indicate a “null comment” when the data element is self-defining.
However, it is important to specify the values and units of measure
that the elementary data item can take on. For example:
current-weight = **
                 *units: pounds; range: 1-400*
current-height = **
                 *units: inches; range: 1-96*
date-of-birth = **
                *units: days since Jan 1, 1900; range: 0-36500*
sex = *values: [M | F]*
10.2.3 Optional Data Elements
An optional data element, as the phrase
implies, is one that may or may not be present as a component of a composite
data element. There are many examples of optional data elements in information
systems:
-
A
customer’s name may or may not include a middle
name.
-
A
customer’s street address may or may not include
such secondary information
as an apartment number.
-
A
customer’s order may contain a billing address,
a shipping address, or possibly
both.
Situations like the last one must be
carefully verified with the user and must be accurately documented in the data
dictionary. For example, the notation
customer-address = (shipping-address) + (billing-address)
means,
quite literally, that the customer-address might consist
of:
-
just a shipping-address; or
-
just a billing-address; or
-
a shipping-address and
a billing-address; or
-
neither a
shipping-address nor a
billing-address
This
last possibility is rather dubious. It is far more
likely that the user really means that the
customer-address must consist of a shipping-address or a billing-address
or both. This could be expressed in the following way:
customer-address = [shipping-address | billing-address | shipping-address + billing-address]
One
could also argue that, in a mail-order business,
one always needs a billing
address to ensure that the order will be paid for;
a separate shipping address (e.g., if the customer’s
accounting department is in a separate location)
is optional.
Thus, it is possible that the user’s real business
policy is better expressed
by
customer-address = billing-address + (shipping-address)
But of course the
only way to know this is to ask the user and
to carefully
explain the
implications of the different notations shown
above [3].
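One way to explain the implications to the user is to list the combinations that each of the three definitions of customer-address permits. The Python sketch below does this; each combination is a pair (has_shipping_address, has_billing_address), and the three function names are hypothetical labels chosen for this illustration.

```python
from itertools import product


def both_optional():
    # customer-address = (shipping-address) + (billing-address)
    # Each component may independently be present or absent.
    return set(product([False, True], repeat=2))


def at_least_one():
    # customer-address =
    #   [shipping-address | billing-address | shipping-address + billing-address]
    return {(True, False), (False, True), (True, True)}


def billing_required():
    # customer-address = billing-address + (shipping-address)
    return {(has_shipping, True) for has_shipping in (False, True)}
```

The first definition permits four combinations, including the dubious (False, False); the second permits three; the third permits only the two in which a billing address is present. Each set is a strict subset of the one before it, which makes the narrowing of the business policy easy to see.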
10.2.4 Iteration
The iteration notation is used to
indicate the repeated occurrence of a component
of a data element. It is read as “zero or more occurrences of.” Thus,
the notation
order = customer-name + shipping-address + {item}
means
that an order must always contain a customer-name,
and must
always contain
a shipping-address, and will also
contain zero
or more occurrences of
an item. Thus, we may be
dealing with a customer who places
an order involving only one item
or two
items,
or someone on a
shopping binge who decides to order
397 different items [4].
In many real-world situations, the
user will want to specify upper and
lower limits
to the iteration. For instance, in the
example above,
the
user will probably
point out that it does not make sense
for a customer to place an order with
zero items; there must be at least
one item
in the order.
And the user
may want to specify an upper limit; perhaps
10 items is the most that will be allowed.
We can
indicate upper and lower limits in the
following way:
order = customer-name + shipping-address + 1{item}10
It’s okay to specify just a
lower limit, or just an
upper limit, or both or neither.
Thus, all of the following are
allowable:
a = 1{b}
a = {b}10
a = 1{b}10
a = {b}
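The four forms above can be read mechanically. The following Python sketch parses a single iteration term and checks a count against its limits. The function names parse_iteration and count_allowed are hypothetical, and the sketch assumes one un-nested term such as 1{item}10; it does not attempt to handle a complete definition or nested iterations. A missing lower limit is treated as zero and a missing upper limit as unbounded, following the "zero or more occurrences" reading given earlier.

```python
import re
from typing import Optional, Tuple

# optional lower limit, {component}, optional upper limit
ITERATION = re.compile(r"^(\d*)\{([^{}]+)\}(\d*)$")


def parse_iteration(term: str) -> Tuple[int, Optional[int], str]:
    """Return (lower limit, upper limit or None, component name)."""
    match = ITERATION.match(term.replace(" ", ""))
    if match is None:
        raise ValueError(f"not a simple iteration term: {term!r}")
    low_text, component, high_text = match.groups()
    low = int(low_text) if low_text else 0
    high = int(high_text) if high_text else None
    if high is not None and low > high:
        raise ValueError(f"lower limit exceeds upper limit: {term!r}")
    return low, high, component


def count_allowed(term: str, count: int) -> bool:
    """True if `count` occurrences satisfy the limits in `term`."""
    low, high, _ = parse_iteration(term)
    return count >= low and (high is None or count <= high)
```

For example, count_allowed("1{item}10", 0) is false because at least one item is required, while count_allowed("{item}", 397) is true because no upper limit has been stated.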
10.2.5 Selection
The selection notation indicates
that a data element consists of exactly one of a set
of alternative choices. The choices are enclosed by the square
brackets “[” and
“]” and separated by the vertical bar “|” character.
Typical examples are:
sex = [Male | Female]
customer-type = [Government | Industry | University | Other]
It is
important to review the selection choices
with the user to ensure that all possibilities
have been identified.
In the last example, the user might tend to concentrate
her or his attention on the “government,” “industry” and “university”
customers, and might require some prodding to remember that some customers fall
into the “none of the above” category.
10.2.6 Aliases
An alias, as the term implies, is
an alternative name for a data element. It is a common
occurrence when dealing with a diverse group of users,
often in different departments or different
geographical locations (and sometimes with different
nationalities and different languages), who insist
on using different names to mean the same thing.
The
alias is included in the data dictionary for completeness,
and it is cross-referenced to the primary or official
data name. For
example:
client =
*alias for customer*
Note that the definition
of client does
not show the composition (i.e., it does not
show that a client consists of a name, address, telephone
number, etc.). All this detail should be provided
only for the primary data name, in order to
minimize the redundancy in the
model [5].
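The cross-reference can be pictured as a simple lookup from alias to primary name, with the composition stored once. In the Python sketch below, the table contents and the function names resolve and definition_of are hypothetical, and the composition shown for customer is only a placeholder built from the components mentioned above.

```python
# Composition is recorded only for the primary data name.
DEFINITIONS = {
    # Placeholder composition, for illustration only.
    "customer": "name + address + telephone-number",
}

# Each alias points at its primary (official) data name.
ALIASES = {
    "client": "customer",
}


def resolve(name, aliases):
    """Follow alias cross-references until a primary name is reached."""
    seen = set()
    while name in aliases:
        if name in seen:
            raise ValueError(f"aliases refer to each other in a cycle: {name}")
        seen.add(name)
        name = aliases[name]
    return name


def definition_of(name):
    """Look up the single stored definition, whichever name is used."""
    return DEFINITIONS[resolve(name, ALIASES)]
```

Because definition_of("client") and definition_of("customer") read the same stored entry, a change to the primary definition is automatically seen under every alias.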
Even though the data dictionary correctly
cross-references the aliases to the primary data
name, you should avoid using aliases whenever possible.
This is because the data names are usually first
seen, and are most visible to all users, on the dataflow
diagrams, where it
may not be obvious that customer and client are
aliases for one another. It is far better, if
at all possible, to get the users to agree on one
common
name [6].
10.3 SHOWING THE DATA DICTIONARY TO THE USER
The data dictionary is created by the
systems analyst during the development of the system
model, but the user must be capable of reading
and understanding the data dictionary in order
to verify the
model. This raises some obvious questions:
-
Will the users be able to
understand the data dictionary
notation?
-
How should
the users verify that the dictionary is complete and
correct?
-
How is the
dictionary created?
The question of user acceptance of the
dictionary notation is a “red herring” in
most cases. Yes, the dictionary notation
looks somewhat mathematical; but, as we
have seen, the
number of symbols that the user has to
learn is very small. Users are accustomed
to a variety of formal notations in their
work and personal life; consider, for example,
the notation for musical scores, which
is far more
complex.

Figure 10.1: Musical score notation
Similarly, the
notation for bridge, chess, and a variety of
other activities
is at least as complex as that of the data dictionary
notation shown in this chapter.

Figure 10.2: Chess notation
The question of user
verification of the data dictionary usually
leads to
this question: “Should the users read
through the entire dictionary,
item by item, to ensure that it is correct?” It
is difficult to imagine that any user would
be willing
to do this! It is more likely
that the user will verify the correctness
of the
data dictionary in conjunction
with the dataflow diagram, entity-relationship
diagram, state-transition diagram,
or process specification that
he or she is reading.
There are a number of “correctness” checks
that the systems analyst can carry out on his own, without
the assistance of the user: he can ensure
that the dictionary is
complete, consistent, and non-contradictory.
Thus, he can examine the dictionary
on his own and ask the following
questions:
-
Has every flow on the dataflow
diagram been defined in the data dictionary?
-
Have all the components
of composite data elements been defined?
-
Has any data element been
defined more than once?
-
Has the correct notation
been used for all data dictionary definitions?
-
Are there any data elements
in the data dictionary that are not referenced in the dataflow diagrams,
entity-relationship diagrams, or state-transition
diagrams?
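Several of the questions above lend themselves to automation. The Python sketch below checks four of them: undefined flows, undefined components, duplicate definitions, and unreferenced entries. It rests on simplifying assumptions: the dictionary is supplied as a list of (name, definition) pairs, the right-hand sides mention only data element names and * * comments (no literal values such as Mr.), and only dataflow diagram names are considered. The function names and the sample entries are hypothetical.

```python
import re
from collections import Counter


def components(definition):
    """Names mentioned on the right-hand side, ignoring *...* comments."""
    without_comments = re.sub(r"\*[^*]*\*", " ", definition)
    return set(re.findall(r"[A-Za-z][A-Za-z0-9-]*", without_comments))


def check_dictionary(entries, dfd_flows):
    """entries: list of (name, definition); dfd_flows: names used on the DFDs."""
    names = [name for name, _ in entries]
    defined = set(names)
    used_in_definitions = set()
    for _, definition in entries:
        used_in_definitions |= components(definition)
    referenced = set(dfd_flows) | used_in_definitions
    return {
        "undefined_flows": set(dfd_flows) - defined,
        "undefined_components": used_in_definitions - defined,
        "duplicates": {n for n, count in Counter(names).items() if count > 1},
        "unreferenced": defined - referenced,
    }


# Hypothetical sample dictionary with one problem of each kind.
sample_entries = [
    ("order", "customer-name + shipping-address + 1{item}10"),
    ("customer-name", "*name of the customer placing the order*"),
    ("item", "*one product on the order*"),
    ("item", "*a second, conflicting entry*"),
    ("fax-number", "**"),
]
report = check_dictionary(sample_entries, dfd_flows={"order", "invoice"})
```

For this sample, the report flags invoice as an undefined flow, shipping-address as an undefined component, item as defined more than once, and fax-number as unreferenced. Checking that the correct notation has been used would require a real parser and is outside the scope of this sketch.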
10.4 IMPLEMENTATION OF THE DATA DICTIONARY
On a medium- or large-sized system, the
data dictionary can represent a formidable amount of work. It is not uncommon
to see a data dictionary with several thousand entries, and even a relatively
simple system will have several hundred entries. Thus, some thought must be
given to the way the dictionary will be developed, or the task is likely to
overwhelm the systems analyst.
The easiest approach is to make use of an
automated (computerized) facility to enter dictionary definitions, check them
for completeness and consistency, and produce appropriate reports. If your
organization is using any modern database management system (e.g., DB2, Oracle,
Sybase, Microsoft Access), a dictionary facility is already available. In this
case, you should take advantage of the facility and use it to build your data
dictionary. However, beware of the following potential
limitations:
-
You
may be forced to limit your data names to a
certain length (e.g., 15 or 32 characters). This
probably won’t be a major
problem, but you may find that your user may
insist on
a name such as destination-of-customer-shipment and
that your data dictionary
package forces you to
abbreviate the name to dest-of-cust-ship.
-
Other
artificial limitations may
be placed on the name. For example,
the hyphen character “-” may not be
allowed, and you may be forced to use the underscore “_” character
instead. Or you may be forced
to prefix (or suffix) all your
names
with a project
code
indicating the name of the
systems development project,
leading to
such names as
acct.pay.GHZ345P14.vendor_phone_number.
-
You may be forced to assign
physical attributes (e.g., the number of bytes, or blocks of disk storage,
or such data representations as packed decimal) to an item of data, even
though
it is not a matter of user policy. The data dictionary discussed in
this chapter should be an analysis dictionary and should not require
unnecessary or irrelevant
implementation decisions.
Some
systems analysts are also beginning to use automated
toolkit packages that include graphic facilities for
dataflow diagrams, and the like, as well as
data dictionary capabilities. Again, if such a facility
exists, you should make use of it. Automated toolkits
are discussed in more detail in Appendix
A.
If you have no automated facility for
building the data dictionary, you should at least be able to use a conventional
word-processing system to build a text file of data dictionary definitions. Or,
if you have access to a personal computer, you can use any of the common
file-management and database management programs (e.g., Microsoft Access for
Windows-based computers, or FileMaker for Macintosh computers) to construct and
manage your data dictionary.
Only in the most extreme case should you
resort to a manual data dictionary, that is, separate, 3-by-5 index
cards for each dictionary entry. This was often necessary, prior
to the 1990s;
even when
PCs were already widely deployed, it was discouraging to see how
many organizations kept their programmers and systems analysts
in the Dark Ages.
The cobbler’s children, as the saying goes, are usually the last
to get shoes. But
today, it is unforgivable; if you are working on a project where
you do not have
access to a data dictionary package or an automated analyst’s toolkit
or a personal computer or a word processing system, then you should
(1) quit
and
find a better job, or (2) get your own personal computer, or (3)
both of the above.
10.5 SUMMARY
Building a data dictionary is one of the
more tedious, time-consuming aspects of systems analysis. But
it is also one of the more important aspects:
without a formal dictionary that defines the
meaning
of all the terms, there can be no hope for precision.
In the next chapter, we will see how to
use the data dictionary and the dataflow diagram to build process
specifications for each of the bottom-level processes.

QUESTIONS AND EXERCISES
-
Give a definition of data dictionary.
-
Why is a data dictionary important in systems analysis?
-
What information does a data dictionary provide about a data element?
-
What is the meaning of the “=” notation in a data dictionary?
-
What is the meaning of the “+” notation in a data dictionary?
-
What is the meaning of the “( )” notation in a data dictionary?
-
What is the meaning of the “{ }” notation in a data dictionary?
-
What is the meaning of the “[ | ]” notation in a data dictionary?
-
Do you think the users you work with can understand the standard
data dictionary notation provided in this chapter? If not, can you suggest
an alternative?
-
Give an example of an elementary data item.
-
Give three examples of optional data elements.
-
What are the possible meanings of the following:
(a) address = (city) + (state)
(b) address = street-address + city + (state) + (zipcode)
-
Give an example of the use of the iteration {} notation.
-
What is the meaning of each of the following notations:
(a) a = 1{b}
(b) a = {b}10
(c) a = 1{b}10
(d) a = 10{b}10
-
Does it make sense to have an order defined in the following way? Why or
why not?
order = customer-name + shipping-address + 6{item}
-
Give an example of the selection (“[ ]”) construct.
-
What is the meaning of an alias in a data dictionary?
-
Why should the use of aliases be minimized wherever possible?
-
What kind of annotation can be used on a DFD to indicate
that a data element is an alias?
-
What are the three major issues when a user looks at a data dictionary?
-
Do you think the users in your organization will be able
to understand data dictionary notation?
-
Do you think that the data dictionary notation shown in this chapter
is more complex or less complex than musical notation?
-
What are the three error-checking activities that the systems analyst
can carry out on a data dictionary without the user?
-
What are the likely limitations of an automated data dictionary package?
-
Give a data dictionary definition of customer-name based on the following
verbal specification from a user: “When we record a customer’s name, we’re
very careful to include a courtesy title. This can be either ‘Mr.,’ ‘Miss,’
‘Ms.,’ ‘Mrs.,’ or ‘Dr.’ (There are lots of other titles like ‘Professor,’
‘Sir,’ etc., but we don’t bother with them.) Every one of our customers has
a first name, but we allow a single initial if they prefer. Middle names
are optional. And of course, the last name is required; we allow a pretty
broad range of last names, including names that have hyphens
(‘Smith-Frisby,’ for example) and apostrophes (‘D’Arcy’) and so forth. We
even allow an optional suffix, to allow for things like ‘Tom Smith, Jr.’
or ‘Harvey Shmrdlu 3rd.’”
-
What is wrong with the following data dictionary definitions:
(a) a = b c d
(b) a = b + + c
(c) a = {b
(d) a = 4{b}3
(e) a = {x)
(f) x = ((y))
(g) p = 4{6{y}8}6
-
In the hospital example of Section 10.2.1, what are the implications of
the definition of height and weight?
Comment: It would imply that we are only measuring in integral units
and are not keeping track of fractional centimeters, and so on.
-
Write a data dictionary definition of the information contained on
your driver’s license. If you don’t have a driver’s license,
find a friend who does.
-
Write a data dictionary definition of the information contained on
a typical bank credit card (e.g., MasterCard or Visa).
-
Write a data dictionary definition of the information contained in
a passport.
-
Write a data dictionary definition of the information contained in
a lottery ticket.

FOOTNOTES
-
[1] On
the other hand, it is likely that the
business policy presently in place has been strongly
influenced
by
the computer systems that the organization
has been using for the past 30 years. Fifty years ago, someone
might have been considered eccentric if he decided
to
call himself “Fre5d Smi7th” but
it probably would have been accepted
by most organizations, because names were
transcribed onto
pieces of paper
by human hands. Early computer systems
(and most of the ones in place today) have a lot more trouble
with such nonstandard names.
-
[2] Not
only that, we need to specify whether we’re dealing in U.S. dollars,
Canadian dollars, Australian dollars, Hong Kong dollars, etc.
-
[3] There
is one possibility that might explain the
absence of both shipping address and billing
address in a customer order: the walk-in customer who wishes
to purchase an item and carry it away with him. It is likely
that we would want to explicitly identify
such a customer (by defining
a new data element called
walk-in that could have a value of true or false) because
(1) walk-in customers may need to be treated
differently
(for example, their
orders won’t have any
shipping charges), and (2) it’s a good way to double-check
and ensure that the missing shipping-address
or billing-address was
not a mistake.
-
[4]
Keep in mind once again that we are defining the
intrinsic business meaning of a data element without
regard to the technology that will eventually be
used to implement it. Eventually, for example, our
systems designers are likely to ask for a reasonable
upper limit on the number of different items that
can be contained in a single order. “In order
to make things work efficiently with our SUPERWHIZ
database management system, we’ll have to
restrict the number of items to 64. It’s unlikely
that anyone would want to order more than 64 different
items anyway, and if they do, they can simply place
multiple orders.” And the user may have his
own limitations, based on the paper forms or printed
reports that he deals with; this is part of the
user implementation model, which we will discuss
in Chapter
21.
-
[5] You
may wish to ignore this advice if you are using a computerized
data dictionary package that can manage and control the redundancy;
however, this is fairly
uncommon. The crucial thing to remember is that if we change
the definition of a primary data element (e.g., if we decide
that the
definition of a customer
should no longer include the phone-number) then the change
must apply to all the aliases as well.
-
[6] An
alternative is to annotate the flow on the
dataflow diagram to indicate that it is an
alias for something else; an asterisk, for example,
could be appended to the end of alias names.
For example, the
notation client* could be used to indicate that client
is an alias for
something else. But even this is
cumbersome.