Discovering Digital Resources: a Workshop for Historians. Draft Workshop Report
A report from the resource discovery workshop organised
by the History Data Service, and held at the University of Essex, 18th and
19th April 1997
1. Introduction and Overview
This report summarises the findings of a workshop organised by the History
Data Service (HDS), and held at the University
of Essex in April 1997. The workshop was one of a series organised under
the auspices of the Arts and Humanities Data
Service (AHDS) and the United Kingdom
Office for Library & Information Networking (UKOLN), which will feed
into the development of the AHDS Integrated Catalogue.
The aim of the workshop was to explore and assess the information requirements
of the historical community for discovering digital resources. The participants
were asked to define a set of information criteria for describing and searching
for historical electronic materials, and to assess the suitability of the existing
HDS catalogue records and search facilities against the agreed criteria.
The workshop participants came
from a wide range of backgrounds, representing three primary groups of stakeholders:
data creators; actual or potential secondary users; archivists and others working
with historical data materials in related fields. Prior to the workshop, participants
were supplied with a number of introductory papers, which were intended to place
the workshop in its wider context.
A provisional timetable was drafted
for the meeting with the proviso that it would be subject to alteration as the
issues arose and the discussion developed. The workshop commenced with a series
of introductory talks designed to set the scene: Sheila Anderson presented an
overview of the aims and objectives of the workshop; Pam Miller described the
existing HDS catalogue system and how it adheres to the Standard Study Description
(SSD); Cressida Chappell introduced the Dublin Core; and Hans-Joergen Marker
summarised his paper describing the General International Standard Archival
Description, ISAD(G). Participants
were asked to consider the type of searches they were likely to undertake and
the type of search strategies they might use to identify electronic material.
This session set the scene for the development of the range and type of information
the participants might wish to search for.
Participants moved on to identify the key elements which they wished to search
on and the elements which they would wish to retrieve information on in the
full catalogue record. Identifying the key elements was a time-consuming and
difficult process. Participants experienced some difficulty in separating out
the elements to be used for searching as opposed to those suitable for retrieval.
They also regarded interface design as a key
issue in the development of a usable and successful catalogue. They recommended
that the interface to the catalogue should be simple and effective and offer
a multi-level approach to searching: the first step should
be limited to a small number of elements, with the option of further,
more sophisticated searches on an additional range of material. The workshop
recommended that these issues be taken into consideration
in the development of the catalogue.
After a limited discussion, participants indicated that they were happy to
accept the existing standard used by the HDS (the SSD) and did not recommend
changing to an alternative standard. However, they did highlight some of the
weaknesses in the SSD and recommended a number of extensions which the HDS will
take forward in its own development work. Using the elements identified and
the elements included in the SSD, the participants proceeded to map these against
the Dublin Core in order to assess its viability as the basis of the AHDS catalogue.
The conclusion reached by the workshop is that the Dublin Core is a workable
base for the AHDS catalogue with some alterations and extensions. We are confident
that the results of the workshop can feed into the development of the AHDS catalogue
and ensure that the needs and requirements of the historical community will
be catered for.
2. Resource Discovery Requirements
2.1 Relevant Standards
The workshop assessed the two main standards relevant to the work of the HDS
- the Standard Study Description (SSD) and the General International Standard
Archival Description (ISAD(G)). The SSD was developed by the Social Science
Data Archives in the 1970s, specifically for machine-readable files. ISAD(G),
which was developed during the early 1990s by an ad hoc commission sponsored
by UNESCO and various national archives, is intended to describe archival material
and is not specifically intended for machine-readable files. The existing HDS
catalogue is based upon the Standard Study Description Scheme, modified for
use with historical data files. Following the introductory sessions participants
were asked to assess the two standards for describing historical data files.
Participants concluded that the existing standard used by the HDS - the SSD
- met their needs, although they did recommend some minor changes and additions.
2.2 Search Elements
After an initial discussion, the participants identified four elements as essential
to the discovery of historical electronic materials - source, geography, time
period and topic. Following further discussion, title and person/organisation
were added to the list of essential elements. Participants had been asked to
distinguish between elements for searching and elements of information that
they would wish to see when the catalogue record was retrieved. However, it
became clear that this would be very difficult. Therefore, what follows - although
primarily listing the elements on which the participants wish to be able to
search in the AHDS catalogue - also includes a range of information that
may be included for retrieval purposes.
2.2.1. Title
The title function would simply involve a search of the title field.
This was agreed to be useful in instances where a searcher knows
the title of the dataset or wishes to search for datasets with particular
words in the title. This element may be of particular use for librarians and
others working in the field of information searching and provision.
2.2.2. Person/organisation
The person/organisation function would entail a search of all the fields
in the catalogue record that contain the names of person(s) or organisation(s)
associated with a dataset. This function
would be of relevance when searching for datasets created by a particular
person or organisation.
2.2.3. Time period
This function would allow for the recovery of all datasets which cover a
given year, or period of years. There was a recommendation
that the time period should, where possible, be linked to the source. For
example, where a dataset has been created from multiple sources, the time
span covered may be a significant period of time but the time period covered
by each individual source may not cover the entire period; the participants
were keen that this should be obvious when looking at the catalogue record.
In addition, participants expressed a desire to include information on periodicity
where this is appropriate to the sources used.
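The link between time period and source can be illustrated with a short sketch. The Python below is illustrative only: the class names, field names and sample record are assumptions made for this example and are not part of the SSD or the HDS catalogue. It shows a record whose overall span is long, while a search for an intermediate period correctly finds no source covering it.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SourceCoverage:
    """Years covered by one source (hypothetical structure)."""
    source: str
    start_year: int
    end_year: int
    periodicity: Optional[str] = None  # recorded only where appropriate


@dataclass
class DatasetRecord:
    title: str
    sources: List[SourceCoverage] = field(default_factory=list)

    def span(self):
        """Overall span: earliest start to latest end (needs at least one source)."""
        return (min(s.start_year for s in self.sources),
                max(s.end_year for s in self.sources))

    def sources_covering(self, start_year, end_year):
        """Sources whose own coverage overlaps the requested period."""
        return [s for s in self.sources
                if s.start_year <= end_year and s.end_year >= start_year]


def search_by_period(records, start_year, end_year):
    """Return (record, matching sources) pairs for a year or range of years."""
    hits = []
    for record in records:
        matching = record.sources_covering(start_year, end_year)
        if matching:
            hits.append((record, matching))
    return hits


# Invented sample record: the overall span is long, but the sources leave a gap.
example = DatasetRecord(
    title="Example parish dataset",
    sources=[
        SourceCoverage("Hearth Tax Returns", 1662, 1674),
        SourceCoverage("Census enumerators' books", 1841, 1851, "decennial"),
    ],
)
```

Matching against each source, rather than against the overall span, is one way of making gaps in coverage obvious to the searcher.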
2.2.4. Geographical location
This function would allow for the recovery of all datasets which cover a
given place at a sufficient level of detail. Two functions were highlighted:
first, the requirement to search on a place name, preferably within a hierarchical
thesaurus; and second, the ability to search and retrieve information on the
lowest spatial unit to which the data may be disaggregated. Thus a search
for Essex should recover all datasets which are indexed by the term Essex,
plus all datasets which are indexed by places within Essex, and all datasets
which include county level data and are indexed by a higher level index term.
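The Essex example can be sketched in code. The following Python is a minimal illustration under stated assumptions: the place hierarchy, the ordering of spatial units and the sample datasets are all invented here, and a real catalogue would draw them from a proper hierarchical thesaurus.

```python
# Hypothetical place hierarchy: child place -> parent place.
PARENT = {
    "Colchester": "Essex",
    "Wivenhoe": "Essex",
    "Essex": "England",
    "Suffolk": "England",
}

# Hypothetical ordering of spatial units, from finest to coarsest.
UNIT_LEVEL = {"parish": 0, "town": 1, "county": 2, "country": 3}

# Spatial unit that each place name itself represents (assumed).
PLACE_UNIT = {
    "Colchester": "town",
    "Wivenhoe": "parish",
    "Essex": "county",
    "Suffolk": "county",
    "England": "country",
}

# Invented datasets: index term plus lowest unit of disaggregation.
DATASETS = [
    {"title": "A", "place": "Essex", "lowest_unit": "parish"},
    {"title": "B", "place": "Colchester", "lowest_unit": "town"},
    {"title": "C", "place": "England", "lowest_unit": "county"},
    {"title": "D", "place": "England", "lowest_unit": "country"},
    {"title": "E", "place": "Suffolk", "lowest_unit": "parish"},
]


def ancestors(place):
    """Yield each higher-level place above `place`."""
    while place in PARENT:
        place = PARENT[place]
        yield place


def search_place(term, datasets=DATASETS):
    """Titles of datasets relevant to `term` under the three rules above."""
    higher = set(ancestors(term))
    term_level = UNIT_LEVEL[PLACE_UNIT[term]]
    hits = []
    for d in datasets:
        place = d["place"]
        if place == term or term in ancestors(place):
            # Indexed by the term itself, or by a place within it.
            hits.append(d["title"])
        elif place in higher and UNIT_LEVEL[d["lowest_unit"]] <= term_level:
            # Indexed at a higher level, but detailed enough to isolate the term.
            hits.append(d["title"])
    return hits
```

In this sketch a search for Essex returns datasets A, B and C: D is excluded because its data cannot be disaggregated below country level, and E because Suffolk lies outside Essex.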
2.2.5. Source
This function would record details of the source or sources from which the
dataset was created. Participants wanted to be able to search for sources with
a multi-level approach, from a generic level (for example, census records) down to the
specific reference number relevant to the particular source. Thus searches
would range from the very general, for example, a search for taxation records,
to the very specific, for example, a search for a PRO or other archival reference
number.
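A multi-level source search of this kind might be sketched as follows. The three level names, the sample records and the reference numbers are placeholders invented for illustration; they do not reflect actual SSD fields or archival references.

```python
# Hypothetical catalogue entries; the reference numbers are placeholders.
RECORDS = [
    {"dataset": "Dataset 1", "source_type": "taxation records",
     "source_name": "Hearth Tax Returns", "reference": "REF-001"},
    {"dataset": "Dataset 2", "source_type": "taxation records",
     "source_name": "Lay Subsidy Rolls", "reference": "REF-002"},
    {"dataset": "Dataset 3", "source_type": "census records",
     "source_name": "Census enumerators' books", "reference": "REF-003"},
]

# Assumed levels, from the most generic to the most specific.
LEVELS = ("source_type", "source_name", "reference")


def search_source(records, **criteria):
    """Match records on whichever source levels the searcher supplies.

    With no criteria every record matches, since nothing has been ruled out.
    """
    unknown = set(criteria) - set(LEVELS)
    if unknown:
        raise ValueError(f"unknown source level(s): {sorted(unknown)}")
    hits = []
    for record in records:
        if all(record[level].lower() == value.lower()
               for level, value in criteria.items()):
            hits.append(record["dataset"])
    return hits
```

A general search such as `search_source(RECORDS, source_type="taxation records")` returns several datasets, while `search_source(RECORDS, reference="REF-003")` narrows the result to one.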
2.2.6. Topic
This function would highlight the central topics relating to the dataset. The
participants recommended that it should incorporate a freetext search of the
entire catalogue record.
The workshop participants expressed great interest in the design of the information
retrieval system. They favoured a system which would combine both simple and
complex interfaces. The simple interface would offer a limited number of search
options, and the more complex interfaces would offer a wide range of search
options which would encompass the full range of fields in the SSD.
2.3 Retrievable Information
The participants recommended that the Standard Study Description scheme as
used by the HDS is sufficient to enable searchers to identify historical datasets
of interest. However, the participants suggested that it might be further improved
by including the following extensions:
1. More information about the relationship between sources and datasets, in particular
the level of transcription and the amount of coding
The workshop participants felt that, since sources are crucial in the context
of the historical disciplines, there should be more information about the
sources and the relationship between sources and datasets. The following types
of information were identified as being particularly useful: information about
archival reference numbers; information about the level of transcription and
compilation; and information about the amount of coding and the process and
method of coding, for example whether a dataset has been pre or post-coded.
2. More information about boundary geographies, spatial units and the granularity
of the data
The workshop participants felt that this information would be crucial to
many users. The issue was, however, deemed to be too large and complex to
discuss in full during the workshop. It was recommended that
a working group should be established to consider this issue.
3. More information about the structure of datasets
The workshop participants felt that information about whether a dataset is,
for example, a relational database would help potential users assess the utility
of a dataset. It was felt that this information might be particularly useful
for users who were interested in using datasets for teaching.
4. More information about the format of datasets, in particular the size of datasets
and the software they are held in
The workshop participants felt that this information would also help potential
users assess the utility of a dataset.
5. More information about the software and versions used to create datasets
The workshop participants felt that this was important information because
it would provide clues about both the nature of a dataset and the ways in
which it might be used.
6. On-line documentation
The workshop participants felt that access to on-line documentation is a development
which would be particularly helpful in allowing potential users to assess the
utility of a dataset.
7. A sample of the data
The workshop participants also felt that a relevant sample of the data would
be particularly helpful in allowing potential users to assess the utility of
a dataset.
2.4 Reactions to the Dublin Core
Once participants had identified the core search and retrieve elements, some
time was spent mapping these against the Dublin Core elements in order to highlight
any difficulties and to recommend any changes and extensions. In general, the
workshop participants were happy to accept the Dublin Core as the basis for
the AHDS catalogue. A number of recommendations were made and these are outlined
below:
2.4.1. Title
Label: TITLE
The name given to the resource by the CREATOR or PUBLISHER.
The TITLE element is unproblematic, and users from the historical community
are likely to interpret this element in exactly the same way as users from
the rest of the arts and humanities community.
2.4.2. Author or Creator
Label: CREATOR
The person(s) or organisation(s) primarily responsible for the intellectual
content of the resource. For example, authors in the case of written documents,
artists, photographers, or illustrators in the case of visual resources.
The CREATOR element is problematic because the creator is defined as the
person(s) or organisation(s) primarily responsible for the intellectual content
of the resource. This element would not be problematic if the creator was
defined as the person(s) or organisation(s) primarily intellectually responsible
for the resource. This subtle difference is important because historical datasets
are mainly transcriptions of original sources. In the case of transcriptions,
the person(s) or organisation(s) who are responsible for 'creating' the dataset
can be held to be intellectually responsible for the dataset, but cannot be
held to be responsible for the intellectual content of the dataset. The person(s)
or organisation(s) who might best be held to be responsible for the intellectual
content of the dataset are instead the person(s) or organisation(s) who created
the original source(s).
The concept of primary intellectual responsibility might also be
problematic in some contexts. Thus it was felt that the CREATOR and CONTRIBUTORS
elements might best be combined into a single element which could encompass
all the person(s) and organisation(s) connected with a resource.
2.4.3. Subject and Keywords
Label: SUBJECT
The topic of the resource, or keywords or phrases that describe the subject
or content of the resource. The intent of the specification of this element
is to promote the use of controlled vocabularies and keywords. This element
might well include scheme-qualified classification data (for example, Library
of Congress Classification Numbers or Dewey Decimal numbers) or scheme-qualified
controlled vocabularies (such as Medical Subject Headings or Art and Architecture
Thesaurus descriptors) as well.
The SUBJECT element is unproblematic, and users from the historical community
are likely to interpret this element in exactly the same way as users from
the rest of the arts and humanities community.
2.4.4. Description
Label: DESCRIPTION
A textual description of the content of the resource, including abstracts
in the case of document-like objects or content descriptions in the case of
visual resources. Future metadata collections might well include computational
content description (spectral analysis of a visual resource, for example)
that may not be embeddable in current network systems. In such a case this
field might contain a link to such a description rather than the description
itself.
The DESCRIPTION element is unproblematic, and users from the historical community
are likely to interpret this element in exactly the same way as users from
the rest of the arts and humanities community.
2.4.5. Publisher
Label: PUBLISHER
The entity responsible for making the resource available in its present form,
such as a publisher, a university department, or a corporate entity. The intent
of specifying this field is to identify the entity that provides access to
the resource.
The PUBLISHER element is problematic because the label publisher is inappropriate
in some contexts. The HDS is a distributor or disseminator, thus it was felt
that the PUBLISHER element needs a qualifier called distributor or disseminator.
2.4.6. Other Contributors
Label: CONTRIBUTORS
Person(s) or organisation(s) in addition to those specified in the CREATOR
element who have made significant intellectual contributions to the resource
but whose contribution is secondary to the individuals or entities specified
in the CREATOR element (for example, editors, transcribers, illustrators,
and convenors).
The CONTRIBUTORS element is not in itself problematic: however, because of
problems with the CREATOR element, it was felt that the two elements might
best be combined into a single element which could encompass all the person(s)
and organisation(s) connected with a resource.
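One way to picture the proposed combined element is as a single list of names, each carrying a role. The Python sketch below is an assumption-laden illustration: the `Agent` and `Record` names, the role labels and the sample names are invented here and are not defined by the Dublin Core or the SSD.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class Agent:
    """A person or organisation connected with a resource (hypothetical)."""
    name: str
    role: str  # e.g. "source creator", "transcriber", "depositor"


@dataclass
class Record:
    title: str
    agents: List[Agent] = field(default_factory=list)


def search_agents(records, name, role: Optional[str] = None):
    """Titles of records with a matching name, optionally limited to one role."""
    needle = name.lower()
    return [
        r.title for r in records
        if any(needle in a.name.lower() and (role is None or a.role == role)
               for a in r.agents)
    ]


# Invented sample data.
RECORDS = [
    Record("Hearth Tax dataset", [Agent("Exchequer", "source creator"),
                                  Agent("J. Smith", "transcriber")]),
    Record("Census sample", [Agent("J. Smith", "depositor")]),
]
```

Keeping the role alongside each name avoids having to decide who holds primary intellectual responsibility, while still letting a searcher restrict a query to, say, transcribers.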
2.4.7. Date
Label: DATE
The date the resource was made available in its present form. The recommended
best practice is an 8 digit number in the form YYYYMMDD as defined by ANSI
X3.30-1985. In this scheme, the date element for the day this is written would
be 19961203, or December 3, 1996. Many other schema are possible, but if used,
they should be identified in an unambiguous manner.
The DATE element is not problematic, if it continues to be defined as the
date on which the resource was made available in its present form. It should,
however, be noted that the Dublin Core Qualifiers which have been proposed
by Jon Knight and Martin Hamilton of the ROADS project set the default date
type as the date on which the resource was first created. It was agreed that
this element might need to be more precisely defined, and it was recognised
that the date on which a resource was made available in its present form is
easy to define, whilst the date on which a resource was first created might
be difficult to define.
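The YYYYMMDD form is straightforward to produce and check mechanically. The short Python sketch below is offered as an illustration; the function names are invented for this example.

```python
from datetime import date


def to_dc_date(d: date) -> str:
    """Format a date as an 8 digit YYYYMMDD string."""
    return f"{d.year:04d}{d.month:02d}{d.day:02d}"


def from_dc_date(text: str) -> date:
    """Parse an 8 digit YYYYMMDD string, rejecting anything else."""
    if len(text) != 8 or not (text.isascii() and text.isdigit()):
        raise ValueError(f"expected 8 digits in the form YYYYMMDD, got {text!r}")
    # date() itself raises ValueError for impossible months or days.
    return date(int(text[:4]), int(text[4:6]), int(text[6:]))
```

For the example given in the element definition, `to_dc_date(date(1996, 12, 3))` gives `"19961203"`. The sketch does not record which kind of date is meant (made available or first created); that distinction would need a separate qualifier.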
2.4.8. Resource Type
Label: TYPE
The category of the resource, such as home page, novel, poem, working paper,
pre-print, technical report, essay, dictionary. It is expected that RESOURCE
TYPE will be chosen from an enumerated list of types. A preliminary set of
such types can be found at the following URL:
http://www.roads.lut.ac.uk/Metadata/DC-ObjectTypes.html
The TYPE element is not in itself problematic; however, it was felt that the
current list of preliminary object types was not appropriate to the needs
of historians and the HDS. It was the view of the workshop participants that
the categorisation `digital resource' would be more suitable than the preliminary
recommended object type `dataset'. It was also the view of the workshop participants
that this element would also need to include more specialised information
about the resource type. The specialised data would include the information
about the structure of datasets which the workshop participants recommended
should be included in the SSD, and the types of information which are already
recorded in section 202 of the SSD (Kind of Data).
2.4.9. Format
Label: FORMAT
The data representation of the resource, such as text/HTML, ASCII, Postscript
file, executable application, or JPEG image. The intent of specifying this
element is to provide information necessary to allow people or machines to
make decisions about the usability of the encoded data (what hardware and
software might be required to display or execute it, for example). As with
RESOURCE TYPE, FORMAT will be assigned from enumerated lists such as registered
Internet Media Types (MIME types). In principle, formats can include physical
media such as books, serials, or other non-electronic media.
The FORMAT element is not particularly problematic, and users from the historical
community are likely to interpret this element in exactly the same way as
users from the rest of the arts and humanities community. It was, however,
also the view of the workshop participants that this element might also include
information about the size of resource, and information about the software
it is held in, if the data representation is not ASCII.
2.4.10. Resource Identifier
Label: IDENTIFIER
String or number used to uniquely identify the resource. Examples for networked
resources include URLs and URNs (when implemented). Other globally-unique
identifiers, such as International Standard Book Numbers (ISBN) or other formal
names would also be candidates for this element.
The IDENTIFIER element is unproblematic, and users from the historical community
are likely to interpret this element in exactly the same way as users from
the rest of the arts and humanities community.
2.4.11. Source
Label: SOURCE
The work, either print or electronic, from which this resource is derived,
if applicable. For example, an HTML encoding of a Shakespearean sonnet might
identify the paper version of the sonnet from which the electronic version
was transcribed.
The SOURCE element is crucial to the historical discipline, because virtually
all historical datasets are derived from one or more original sources. It
was the view of the workshop participants that this element should include
a wide range of information about the original sources which would range from
the generic to the specific. The generic data would include the type of information
which is already recorded by the data source keywords, thus for example it
might specify that the source is taxation records. The specific information
would include the types of information which are already recorded in section
203 of the SSD (Data Sources). For example it might specify that the source
is the Hearth Tax Returns which have the archival reference number XXX.
2.4.12. Language
Label: LANGUAGE
Language(s) of the intellectual content of the resource. Where practical,
the content of this field should coincide with the Z39.53 three character codes
for written languages. See: http://www.sil.org/sgml/nisoLang3-1994.html
The LANGUAGE element is unproblematic, and users from the historical community
are likely to interpret this element in exactly the same way as users from
the rest of the arts and humanities community.
2.4.13. Relation
Label: RELATION
Relationship to other resources. The intent of specifying this element is
to provide a means to express relationships among resources that have formal
relationships to others, but exist as discrete resources themselves. For example,
images in a document, chapters in a book, or items in a collection. A formal
specification of RELATION is currently under development. Users and developers
should understand that use of this element should be currently considered
experimental.
The RELATION element is unproblematic, and users from the historical community
are likely to interpret this element in exactly the same way as users from
the rest of the arts and humanities community.
2.4.14. Coverage
Label: COVERAGE
The spatial locations and temporal duration characteristic of the resource.
Formal specification of COVERAGE is currently under development. Users and
developers should understand that use of this element should be currently
considered experimental.
The COVERAGE element is problematic because it combines both spatial locations
and temporal duration. It was agreed that these are two very important elements
in their own right which need to be separated out. Furthermore, it was the
view of the workshop participants that the spatial location component and
the temporal duration component can each be further sub-divided. The spatial
location component would include both information about the actual places
covered by the data, and information about boundary geographies, spatial units
and the granularity of the data. The temporal duration component would include
information about the time span covered and the periodicity of the data.
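The proposed separation of COVERAGE can be sketched as a small data structure. The Python below is illustrative only: the class names and the dotted qualifier labels such as `COVERAGE.spatial.unit` are assumptions invented for this example, not labels defined by the Dublin Core.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SpatialCoverage:
    places: List[str]                         # actual places covered
    boundary_geography: Optional[str] = None  # e.g. which boundaries apply
    spatial_unit: Optional[str] = None        # lowest unit of disaggregation


@dataclass
class TemporalCoverage:
    start_year: int
    end_year: int
    periodicity: Optional[str] = None


@dataclass
class Coverage:
    spatial: SpatialCoverage
    temporal: TemporalCoverage


def to_elements(c: Coverage) -> dict:
    """Flatten a Coverage into separately labelled elements (hypothetical labels)."""
    elements = {
        "COVERAGE.spatial.place": "; ".join(c.spatial.places),
        "COVERAGE.temporal.span": f"{c.temporal.start_year}-{c.temporal.end_year}",
    }
    optional = {
        "COVERAGE.spatial.boundaries": c.spatial.boundary_geography,
        "COVERAGE.spatial.unit": c.spatial.spatial_unit,
        "COVERAGE.temporal.periodicity": c.temporal.periodicity,
    }
    # Only emit the optional components that have actually been recorded.
    elements.update({k: v for k, v in optional.items() if v is not None})
    return elements


# Invented sample value.
example = Coverage(
    SpatialCoverage(["Essex"], spatial_unit="parish"),
    TemporalCoverage(1841, 1851, "decennial"),
)
```

Holding the spatial and temporal components apart in this way allows each to be searched, displayed and extended independently.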
2.4.15. Rights Management
Label: RIGHTS
The content of this element is intended to be a link (a URL or other suitable
URI as appropriate) to a copyright notice, a rights-management statement,
or perhaps a server that would provide such information in a dynamic way.
The intent of specifying this field is to allow providers a means to associate
terms and conditions or copyright statements with a resource or collection
of resources. No assumptions should be made by users if such a field is empty
or not present.
The RIGHTS element is unproblematic, and users from the historical community
are likely to interpret this element in exactly the same way as users from
the rest of the arts and humanities community.
3. Workshop Recommendations
This section summarises the recommendations which were made at the HDS resource
discovery workshop concerning the SSD, the Dublin Core and the AHDS Integrated
Catalogue.
3.1 Extending the SSD
Although it was recognised that, in the main, the SSD is sufficient to enable
searchers to identify historical datasets of interest, it was also recognised
that the SSD has some weaknesses, and the workshop participants recommended
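the following seven extensions to the SSD:
1. More information about the relationship between sources and datasets, in particular
the level of transcription and the amount of coding
2. More information about boundary geographies, spatial units and the granularity
of the data
3. More information about the structure of datasets
4. More information about the format of datasets, in particular the size of datasets
and the software they are held in
5. More information about the software and versions used to create datasets
6. On-line documentation
7. A sample of the data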
3.2 Refining the Dublin Core
The workshop participants were in general happy to accept the Dublin Core as
a key to the unlocking of more detailed discipline specific resources, and they
were in general happy to accept that the Dublin Core could be used in conjunction
with the SSD. However, it was recommended that the following five refinements
would need to be made:
3.3 Modelling the AHDS Integrated Catalogue
It was recommended that the following six search options, which were identified
as being essential to the discovery of historical electronic materials, should
be included in the AHDS Integrated Catalogue as searchable elements:
1. Title
2. Person/organisation
3. Time period
4. Geographical location
5. Source
6. Topic
It should be noted that although the source search option was identified as
being particularly important, it would be acceptable if this search option was
included in the topic search option. It was also recommended that it should
be possible to combine multiple entries within the search form either with a
Boolean AND or Boolean OR.
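Combining entries in the search form with Boolean AND or OR might work along the following lines. This Python sketch rests on assumptions: the field names, the sample records and the simple substring matching are invented for illustration. Time period is omitted because it would need range comparison rather than text matching.

```python
# Invented sample records using hypothetical field names.
RECORDS = [
    {"title": "Hearth Tax, Essex", "person": "A. Researcher",
     "place": "Essex", "source": "taxation records"},
    {"title": "Census sample", "person": "B. Compiler",
     "place": "Suffolk", "source": "census records"},
]


def matches(record, field, term):
    """Case-insensitive match of one search-form entry against one record."""
    term = term.lower()
    if field == "topic":
        # Topic is treated as a free-text search of the entire record.
        return any(term in str(value).lower() for value in record.values())
    return term in str(record.get(field, "")).lower()


def search(records, criteria, operator="AND"):
    """Combine every entry in `criteria` with Boolean AND or Boolean OR."""
    if operator not in ("AND", "OR"):
        raise ValueError("operator must be 'AND' or 'OR'")
    if not criteria:
        return []  # an empty search form matches nothing
    combine = all if operator == "AND" else any
    return [r["title"] for r in records
            if combine(matches(r, f, t) for f, t in criteria.items())]
```

Because the topic entry searches the whole record, a source term entered under topic still finds the relevant dataset, which is consistent with the note that the source option could be folded into the topic option.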
The HDS would recommend that the AHDS Integrated Catalogue should be developed
along the lines of the Council of European Social Science Data Archives Integrated
Data Catalogue (Cessda IDC), which is a unified collection of mainly European
social science data archive catalogues, which can be searched through one common
interface. The Cessda IDC can be accessed at: http://dastar.essex.ac.uk/Cessda/IDC
The participants expressed a wish to be able to retrieve the information currently
made available through the HDS information retrieval system BIRON
in the AHDS catalogue. They would not wish to retrieve less information than
is currently available to them.
4. Consultation Process
This document is a draft report of the findings of the HDS resource discovery
workshop, and it will be circulated widely for consultation and comment during
June and July 1997. All comments received by 23 July 1997 will
be taken into consideration and incorporated into the
final version of the report, which will be made available by 30 July 1997.
Comments should be submitted to the authors of the report, either by email to
hds@essex.ac.uk, or in writing to:
History Data Service
The Data Archive
University of Essex
Wivenhoe Park
Colchester
CO4 3SQ