Discovering Digital Resources
Two Approaches to Data Description
Hans Joergen Marker
Data description is a prerequisite for the preservation of
machine-readable data. Information description is also a necessary
part of archival activity. In order to be able to retrieve
information, you need to describe it. Thus the data archives
and the general archives both need the means for description
of data, and the problem has been addressed in both settings.
The solutions proposed are very different, and thus worth
investigating more closely.
The Standard Study Description
The Standard Study Description originates from the Social
Science Data Archive tradition. It was developed in the early
seventies. The data that it was intended to describe was flat
file social science data, especially survey data. The mothers
of the SSD were very conscious of the available tools for
storage and retrieval of information. As these tools were
fairly crude at the time, the SSD is a rather ugly thing. But appearance
alone is not enough to judge the SSD on. The SSD is basically
a file structure upon which applications can be built. The
applications need not be ugly just because the system files
are. The information to be provided in the SSD is divided
into items which in turn have numbers. Most of the items are
codes with a finite and often small number of allowed code-values.
The emphasis on coding was not only due to traditions of social
science data processing, but also a deliberate attempt to
make the study description independent of language. In the
SSD the unit being described is a data material, which is
supposed to consist of one rectangular file with certain characteristics.
If a research project results in a data collection which is
organised in another way, some items of the SSD become difficult
to fill in (for instance 211 and 212).
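The structure of numbered items with coded values can be illustrated with a minimal sketch in Python. The item numbers 004 and 202 come from the SSD item list, but the code-values and their labels below are invented for the illustration and are not the actual SSD code lists; the function names are likewise hypothetical.

```python
# Hypothetical code lists: the item numbers exist in the SSD, but the
# code-values and labels are invented for this sketch.
CODE_LISTS = {
    "004": {1: {"en": "English", "da": "engelsk"},
            2: {"en": "Danish", "da": "dansk"}},
    "202": {1: {"en": "Survey data", "da": "surveydata"},
            2: {"en": "Register data", "da": "registerdata"}},
}


def validate(description):
    """Return the item numbers whose code-value is not in the allowed list."""
    return [
        item
        for item, value in sorted(description.items())
        if item in CODE_LISTS and value not in CODE_LISTS[item]
    ]


def render(description, language):
    """Translate the coded items into labels in the requested language."""
    return {
        item: CODE_LISTS[item][value][language]
        for item, value in description.items()
        if item in CODE_LISTS and value in CODE_LISTS[item]
    }


# A stored description holds only item numbers and codes, no words.
stored = {"004": 2, "202": 1}
print(validate(stored))
print(render(stored, "en"))
print(render(stored, "da"))
```

Because only numbers are stored, the same description can be shown in any language for which a label table exists, which is the language independence the coding was meant to achieve.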
As the SSD is a real life thing which is actually being used,
it is not surprising that the items of the SSD are used somewhat
differently at different archives. The many items being offered
are not always relevant to particular studies, and some items,
though perhaps relevant, are rarely used because archive practices
just do not work that way. Not many study descriptions have
the 500-group items filled in, even though most studies have the
background variables sought for in this group of items. Viewing
the collection of data at a data archive as consisting of
separate items (studies), which are distinct and can be handled
and described separately in a similar manner is very much
in line with traditional data archive thinking. Actually it
forms the backbone of data archive practices. It was very
true at the time when the SSD was conceived. But this point
of view has since been challenged by reality. Essentially
the challenge has been met with work-arounds, such as
splitting the data produced by a particular research project
into several studies.
A real life example with some problems
In the years 1971 to 1973 the Danish historian Hans Christian
Johansen made a study of rural demography in the 18th century
(1). The sources for this study were parish
registers from the period 1741 to 1801 and the census lists
from 1787 and 1801 from 26 selected country parishes. The
information from these sources was key punched and a family
reconstitution was carried out. The data from this study were
deposited at the DDA on May 7th 1976. These data were far
from the single rectangular file surveys for which the SSD
(and data archive practices in general) was designed. The
work-around employed was to construct six studies:
DDA-0101: Christenings in Selected Rural Parishes, 1741-1801,
DDA-0102: Burials in Selected Rural Parishes, 1741-1801,
DDA-0103: Weddings in Selected Rural Parishes, 1755-1801,
DDA-0106: Reconstituted families in Selected Rural Parishes,
1741-1801,
DDA-0181: Census Register for Selected Rural Parishes 1787,
DDA-0182: Census Register for Selected Rural Parishes 1801.
The resulting SSDs can be obtained from the DDA web site
(http://www.dda.dk/). Needless to say, these SSDs show some
duplication. All the background
stuff about principal investigator, conduct of the study,
deposition at the archive, publications and accessibility
is completely identical in the six study descriptions. We usually
do things a little differently today than we did 20 years
ago, but the example illustrates very nicely the difficulties
involved in applying the SSD.
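The duplication can be made concrete with a small Python sketch. The study numbers, titles, date of deposit and principal investigator are taken from the text above; everything else is an assumption made for the illustration: the descriptions are reduced to three entries each, the "title" key is not an SSD item number, and the helper names are hypothetical. The real SSDs at the DDA contain many more items.

```python
# Study numbers and titles are taken from the list above.
STUDIES = {
    "DDA-0101": "Christenings in Selected Rural Parishes, 1741-1801",
    "DDA-0102": "Burials in Selected Rural Parishes, 1741-1801",
    "DDA-0103": "Weddings in Selected Rural Parishes, 1755-1801",
    "DDA-0106": "Reconstituted families in Selected Rural Parishes, 1741-1801",
    "DDA-0181": "Census Register for Selected Rural Parishes 1787",
    "DDA-0182": "Census Register for Selected Rural Parishes 1801",
}


def build_description(title):
    """Build a simplified stand-in for one SSD (not the real DDA record)."""
    return {
        "title": title,                    # assumed key, not an SSD item number
        "122": "1976-05-07",               # deposit at the archive
        "131": "Hans Christian Johansen",  # principal investigator
    }


def shared_items(descriptions):
    """Return the item/value pairs that are identical in every description."""
    item_sets = [set(d.items()) for d in descriptions.values()]
    return dict(set.intersection(*item_sets))


descriptions = {sid: build_description(t) for sid, t in STUDIES.items()}
common = shared_items(descriptions)

# Each shared item is stored once per study, so all copies but one are redundant.
repeated = len(common) * (len(descriptions) - 1)
print(sorted(common))
print(repeated)
```

In this reduced form only the title differs between the six descriptions; every other entry is stored six times.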
The General International Standard Archival Description
The ISAD(G)(2) is much younger than the
SSD, and thus it is designed with much more advanced data
processing in mind. Consequently it is not nearly as ugly
as the SSD. The ISAD(G) was developed in the early nineties
by an ad hoc commission sponsored by UNESCO and various national
archives. The ISAD(G) is aimed at archival material in general
and is not, like the SSD, specifically directed toward machine-readable
files. The ISAD(G)(3) is a description
hierarchy based on four rules. Within the framework of these
rules a number of elements of description are applied. The
consequence of this approach is that the archive holdings
are viewed as a hierarchy, in the extreme case with the entire
archive at the top and single items of information at the bottom.
Between top and bottom the divisions of the hierarchy are the
levels of description. In principle all elements of description
could be used on all levels. A particular bit of information
is given at the lowest level where it does not result in duplication.
So what about the rural parishes in ISAD(G)?
I haven't actually made a full ISAD(G) description of the
data from Hans Christian Johansen's study of rural demography.
I would propose a description on two levels, study and file,
disregarding the variable level in this example. Another possible
level of description would be parish. The information about
study, principal investigator, accessibility and references
goes on the study level. Description of sources and file characteristics
goes on the file level.
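A two-level description of this kind might be sketched in Python as follows. This is not a full ISAD(G) description, only an illustration of rule 2.4 (non-repetition of information): an element is stored once at the study level and looked up from the file level. The class and method names are hypothetical, and mapping the principal investigator to element 3.2.1 (Name of creator) and the study titles to 3.1.2 (Title) is an assumption made for the example.

```python
class Unit:
    """One unit of description at a given level of the hierarchy."""

    def __init__(self, level, elements):
        self.level = level
        self.elements = elements
        self.parent = None
        self.children = []

    def add(self, child):
        """Attach a lower-level unit of description to this one."""
        child.parent = self
        self.children.append(child)
        return child

    def resolve(self, element):
        """Look the element up here, then at successively higher levels."""
        unit = self
        while unit is not None:
            if element in unit.elements:
                return unit.elements[element]
            unit = unit.parent
        return None


# Study level: information common to all files is given once.
study = Unit("study", {"3.2.1": "Hans Christian Johansen"})  # assumed mapping

# File level: only what is specific to each file.
FILE_TITLES = [
    "Christenings in Selected Rural Parishes, 1741-1801",
    "Burials in Selected Rural Parishes, 1741-1801",
    "Weddings in Selected Rural Parishes, 1755-1801",
    "Reconstituted families in Selected Rural Parishes, 1741-1801",
    "Census Register for Selected Rural Parishes 1787",
    "Census Register for Selected Rural Parishes 1801",
]
for title in FILE_TITLES:
    study.add(Unit("file", {"3.1.2": title}))

christenings = study.children[0]
print(christenings.resolve("3.1.2"))  # found at the file level
print(christenings.resolve("3.2.1"))  # found at the study level
```

Adding a parish level between study and file would only mean inserting one more layer of units; the lookup stays the same.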
Documentation in the next millennium
Presently a lot of SSDs exist. At the DDA alone we have a
couple of thousand. Having a lot of old stuff is a reason
for resistance to change. Furthermore the SSD is still a recognised
standard for description in the data archive community. This
makes exchange of descriptions feasible. Work on a new data
description standard is ongoing. This work is intended to
take into account the advances in technology over the last
quarter century. The results of that work are awaited in breathless
suspense.
Items in the SSD
General information
001 Status of the study in the data archive
002 Classification of the study in cluster(s)
003 Relevant keywords for the study
004 Language employed in the present study description
005 Abstract of the study description
Identifications and acknowledgements
101 Bibliographical reference
111 Local data archive where the study is stored
112 Data archive where the study was originally stored
121 Depositor (donor)
122 Date of deposit
131 Principal investigator (Research organisation)
132 Data collector
141 Research initiator
142 Funding agency
199 Other identifications/Acknowledgements (Specify):
Analysis conditions
201 Research Topic (Abstract)
202 Kind of data
211 Units of observation
212 Number of units (Cases)
213 Dimensions of the data
214 Completeness of the study stored
220 Time period covered
221 Time dimensions
222 Definition of total universe (Universe sampled)
223 Sampling procedures
225 Geographical area covered
231 Dates of data collection
232 Method of data collection
233 Type of research instrument
234 Actions to minimise losses (Specify)
235 Data gathering staff
236 Characteristics of data collection situation noted
241 Weighting
299 Other analysis conditions
Re-analysis conditions
301 Present data representation
302 Applicable analysis packages
303 Applicable retrieval systems
304 Information stored in retrieval system
305 Classification scheme applied
311 Language(s) of written material
321 Control operations performed by original investigator
322 Control operations performed by data archive
331 Accessibility
332 Access directing authority
399 Other re-analysis conditions
References to relevant publications/results/studies
401 to 409 Publications/reports by the primary investigator
411 to 419 Other publications (Secondary analysis)
421 to 429 Unpublished papers/reports of interest
431 Results of analysis (Scales, indices etc.)
441 References to related studies
499 Other references (Specify)
Background variables included
501 Basic characteristics
502 Place of birth
503 Residence
504 Housing situation
511 Household characteristics
512 Characteristics of parental family/household
521 Place of work
522 Occupation
531 Income
541 Education
546 Social class
551 Politics
556 Religion
561 Capital assets
562 Consumption of durables
571 Readership, mass media and 'cultural' exposure
576 Organisational membership
599 Other background variables included (specify)
General International Standard Archival Description - ISAD(G)
Rules
2.1. Description from the general to the specific
2.2. Information relevant to the level of description
2.3. Linking of descriptions
2.4. Non-repetition of information
Elements of description
3.1 Identity statement area
3.1.1 Reference codes
3.1.2 Title
3.1.3 Dates of creation of the material in the unit of description
3.1.4 Level of description
3.1.5 Extent of the unit of description
3.2 Context area
3.2.1 Name of creator
3.2.2 Administrative/Biographical history
3.2.3 Dates of accumulation of the unit of description
3.2.4 Custodial history
3.2.5 Immediate source of acquisition
3.3 Content and structure area
3.3.1 Scope and content / abstract
3.3.2 Appraisal, destruction and scheduling information
3.3.3 Accruals
3.3.4 System of arrangement
3.4 Conditions of access and use area
3.4.1 Legal Status
3.4.2 Access conditions
3.4.3 Copyright / Conditions governing reproduction
3.4.4 Language of material
3.4.5 Physical characteristics
3.4.6 Finding aids
3.5 Allied material area
3.5.1 Location of originals
3.5.2 Existence of copies
3.5.3 Related units of description
3.5.4 Associated material
3.5.5 Publication note
3.6 Note area
3.6.1 Note
Footnotes
1. Johansen, Hans Chr. Befolkningsudvikling og
familiestruktur i det 18. århundrede. Odense University Press,
1975. 211 pp.
2. General International Standard Archival Description
3. Here and in the following the description
of the ISAD(G) is based on: "ISAD(G): General International
Standard Archival Description", Ottawa 1994. Unfortunately,
the ISAD(G) is not stable yet, and if you obtain a later version
than the one quoted here, some items may have been added and
others altered.