History Data Service
 

Digitising History


CHAPTER 3 : FROM SOURCE TO DATABASE

 


3.3 Links between source and database

It is worth stating at the outset that a database is not a historical source. A computer cannot represent all of the characteristics of an original document, and even high-resolution digital images (currently the best copy of an original source in digital form) (Jaritz 1991) do not convey every nuance of the original (paper type, condition, size, smell, texture etc.). As this guide is primarily concerned with character-based database systems, which capture less still, the shortcomings of the database version relative to the original are immediately apparent.

In practice, the shortcomings of a properly designed database are not profound for the historian. Historians have always filtered the crucial elements of information from their sources, and the format in which these data are made available matters relatively little. Although postmodernist critics of history would probably shriek at the very notion of a further subjective representation of an original, highly subjective document (Evans 1997), it is important to recognise that a well-designed and well-implemented database can be one of the most powerful tools for historical enquiry. The key is to establish exactly how the database represents the original, maintaining clarity and avoiding ambiguity in both the explanation and the implementation of this often complex relationship. This section of the guide provides some suggestions on how this can be achieved, and hence how the value of the digital resource can be greatly increased.

3.3.1 Source assessment

Knowing as much about the source as possible is a prime requisite for a successful resource creation project. Apart from anything else, it is an exercise in problem anticipation. For some interesting real-world examples of this, see Harvey and Press 1991; Green 1989 and Millet 1987. It is something of a truism to say that some source types transfer to tabular form rather better than others. One problem that historians creating tabular databases have recounted on numerous occasions is the difficulty of making a real-world source document 'fit' the rigid, artificial model of the desktop data analysis package. Other analysis systems that can deal with unstructured textual data are available, and many historians have turned to such tools when faced with the restrictions of purely tabular databases. For examples where researchers have abandoned the formal relational or tabular database approach, see Woollard and Denley 1996.

Many projects deal with multiple rather than single sources. Resource creators are advised to consider each source on its own merits and to evaluate whether the material is viable for transfer to digital form. This can be treated as a selection process: some materials are deemed suitable for transcription into database form, while others present too many difficulties and should not form part of the database. One of the key dangers to avoid is wasting project resources on a difficult source because it was not identified as a potential problem at the outset.

Historical sources in the real world unfortunately do not always fit neatly into the 'good for tables' or 'bad for tables' categories. Very often a source will in general be suitable for the creation of a highly structured database but will include elements, possibly significant elements, which are not so suitable. If material is deemed suitable for a structured database in general but contains a small proportion of problematic elements, then the project needs to think carefully about how these difficult aspects can be dealt with effectively. One possible approach is to digitise these elements in full but leave them outside the core database application. For an example of how a project has tackled this problem effectively, see Hatton et al. (1997). Provided the main body of the source forms the basis of a viable core database, creators should not become obsessed with incorporating every microelement within it. More often than not, this will simply be impossible anyway. Documenting this filtering process is, however, extremely worthwhile.

3.3.2 Will the source 'fit'?

Modern desktop database applications are extremely sophisticated pieces of software. Despite this sophistication, however, all relational database systems essentially operate on an identical data model. Some systems have extended or altered the functionality of this model, but the core model has remained unchanged for almost twenty years (Date 1994; Harvey and Press 1996, 102-39). The relational model is based on the premise that a database is formed of related tables, in which each row represents an individual item and each column a field of information about those items. This rectangular structure was designed primarily for use in the transaction-based business world, and so its application to real (particularly historical) documents can have a number of significant shortcomings. For the most convincing and consistent attack on the relational model for use with historical sources, see the articles of Manfred Thaller (Thaller 1980; Thaller 1988a; Thaller 1988b; Thaller 1989 and Thaller 1991). It is also important to assess the responses to Thaller's criticisms and to consider those who support the use of relational database systems in historical research (Harvey and Press 1996; Champion 1993; Greenstein 1989).
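
To make the row-and-column model concrete, the following is a minimal sketch in Python using the standard-library sqlite3 module. The table names, column names and sample entries are hypothetical, loosely modelled on a census enumerator's book, and are not taken from any project cited in this guide.

```python
import sqlite3

def build_census_db():
    """Create an in-memory database with two related tables (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE household (
            household_id INTEGER PRIMARY KEY,
            address      TEXT
        );
        CREATE TABLE person (
            person_id    INTEGER PRIMARY KEY,
            household_id INTEGER REFERENCES household(household_id),
            name         TEXT,
            age          INTEGER,
            occupation   TEXT
        );
    """)
    # Invented sample rows: each row is one item, each column one field.
    conn.executemany("INSERT INTO household VALUES (?, ?)",
                     [(1, "4 Mill Lane"), (2, "6 Mill Lane")])
    conn.executemany("INSERT INTO person VALUES (?, ?, ?, ?, ?)",
                     [(1, 1, "John Smith", 42, "weaver"),
                      (2, 1, "Mary Smith", 40, None),
                      (3, 2, "Thomas Hall", 67, "labourer")])
    return conn

def people_at(conn, address):
    """Join the two tables to list the people recorded at one address."""
    rows = conn.execute(
        """SELECT p.name, p.age
           FROM person AS p
           JOIN household AS h ON h.household_id = p.household_id
           WHERE h.address = ?
           ORDER BY p.person_id""",
        (address,))
    return rows.fetchall()

conn = build_census_db()
print(people_at(conn, "4 Mill Lane"))
```

The household_id column is what relates the two tables: the address is stored once per household rather than repeated for every person.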

Despite the limitations of desktop databases, it is eminently possible to create a database that cleanly and effectively represents an original historical source. Before a project embarks on actually designing the database itself, it is good practice to consider what the overall relationship between the original paper source and the database is likely to be. As a quick and dirty method for approaching this issue, a number of questions can be asked:

  • Does the source have a predominant and consistent structure?
  • Are the important items of information relatively well defined, i.e. are they succinct, enclosed elements of information rather than unstructured long textual entries?
  • If there are units of measurement (e.g. currency, time/date, dimensions, weight etc.), can they be easily represented for analysis in a database?
  • Are there any important non-text, or non-printable characters?
  • Is the information itself inconsistent? (Perhaps the most infamous problem is variations in spelling as well as clearly obvious original errors.)
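
The final question above, on inconsistent spelling, is often addressed by keeping the original form exactly as transcribed and recording a standardised form alongside it. A minimal sketch in Python follows; the variant table and the function name are hypothetical illustrations, not an authority list.

```python
# Hypothetical lookup of variant surname spellings to one standard form.
SURNAME_VARIANTS = {
    "smyth": "Smith",
    "smythe": "Smith",
    "smith": "Smith",
    "tayler": "Taylor",
    "taylor": "Taylor",
}

def standardise_surname(original):
    """Return (original, standard); standard is None when no rule applies."""
    key = original.strip().lower()
    return original, SURNAME_VARIANTS.get(key)

for name in ["Smyth", "Tayler", "Wodehouse"]:
    print(standardise_surname(name))
```

Returning the original next to the standardised form means the source reading is never overwritten, and names with no rule are flagged for attention, not silently altered.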

Relational databases are ultimately designed to process structured and consistent data in tabular form. Such software is simply not designed to process large chunks of unstructured text, and in some cases performing transformations or calculations on historical dates or measurements can be very time-consuming. Most modern desktop analysis packages will allow data to be converted into additional derived fields without too much pain, and this can be very beneficial to both the initial and potential secondary analysts (Schürer and Oeppen 1990). Creators must make their own judgement on how far to go in calculating derived fields.
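
As an illustration of a derived field, the sketch below converts sums in pre-decimal sterling (12 pence to the shilling, 20 shillings to the pound) into a single figure in pence, while leaving the transcribed values untouched. The record layout and field names are assumptions made for the example.

```python
PENCE_PER_SHILLING = 12
SHILLINGS_PER_POUND = 20

def to_pence(pounds, shillings, pence):
    """Convert a pre-decimal sterling amount to total pence."""
    return (pounds * SHILLINGS_PER_POUND + shillings) * PENCE_PER_SHILLING + pence

def add_derived_pence(record):
    """Return a copy of the record with a derived 'total_pence' field added."""
    derived = dict(record)
    derived["total_pence"] = to_pence(
        record["pounds"], record["shillings"], record["pence"])
    return derived

# Hypothetical transcribed entry: 2 pounds, 13 shillings and 4 pence.
entry = {"item": "rent", "pounds": 2, "shillings": 13, "pence": 4}
print(add_derived_pence(entry)["total_pence"])  # prints 640
```

Because the derived value sits in its own field, a secondary analyst can check it against the original pounds, shillings and pence or recalculate it on a different basis.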

It should not come as a shock to discover that a significant proportion of the historical databases created thus far have been drawn from broadly tabular sources. Poll books, census enumerators' books and reports, registrar general reports, port books and parish registers are all sources with essentially tabular structures that match relational databases very well. It is, however, possible to convert other highly structured, but non-tabular, sources into tabular form whilst retaining the important aspects of the original data. Any project that is serious about creating a digital edition of a source has to make a value judgement about its viability. For a project-focused critique of the relational approach versus other, more source-oriented systems, see Burt and Beaumont James 1996.

As a broad guide to a source's fitness for tabular database creation, these are some of the qualities in documentary material that may require extra resources or present certain difficulties:

  • Unstructured, lengthy full text. (This is more suitable for 'qualitative' database systems (see Feldman 1995; Kelle 1995; Miles and Huberman 1994), although some relational systems, such as FileMaker Pro and Oracle, support full text.)
  • Complex or ambiguous units of measurement (e.g. dates rendered in medieval nomenclature can be difficult to use or analyse in standard database packages). Using derived fields is the key here, bearing in mind the extra time that creating them may require.
  • Large-scale inconsistencies throughout the source. (Commonly spelling variations, measurement rendering and geographical references.) Section 3.5.2 provides some discussion of these issues.
  • Overly complex structures. Ultimately this manifests itself as databases with large numbers of tables and complex (often many-to-many) relationships. Too many tables can make databases daunting and difficult to use.
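
The many-to-many relationships mentioned in the last point above are usually implemented with a linking table, which is one reason the number of tables grows. A minimal sketch using Python's standard-library sqlite3 module follows; the deed and witness tables and all sample data are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE deed    (deed_id INTEGER PRIMARY KEY, summary TEXT);
    CREATE TABLE witness (witness_id INTEGER PRIMARY KEY, name TEXT);
    -- Linking table: one row per (deed, witness) pairing.
    CREATE TABLE deed_witness (
        deed_id    INTEGER REFERENCES deed(deed_id),
        witness_id INTEGER REFERENCES witness(witness_id),
        PRIMARY KEY (deed_id, witness_id)
    );
""")
# Invented sample data: one deed has two witnesses, one witness attests two deeds.
conn.executemany("INSERT INTO deed VALUES (?, ?)",
                 [(1, "Grant of land"), (2, "Lease of mill")])
conn.executemany("INSERT INTO witness VALUES (?, ?)",
                 [(1, "Robert Clerk"), (2, "Alice Reeve")])
conn.executemany("INSERT INTO deed_witness VALUES (?, ?)",
                 [(1, 1), (1, 2), (2, 1)])

def deeds_witnessed_by(conn, name):
    """List deed summaries for one witness, via the linking table."""
    rows = conn.execute(
        """SELECT d.summary
           FROM deed AS d
           JOIN deed_witness AS dw ON dw.deed_id = d.deed_id
           JOIN witness AS w ON w.witness_id = dw.witness_id
           WHERE w.name = ?
           ORDER BY d.deed_id""",
        (name,))
    return [summary for (summary,) in rows]

print(deeds_witnessed_by(conn, "Robert Clerk"))
```

Even this small case needs three tables and a two-step join to answer a simple question, which suggests how quickly a source with many such relationships can become daunting to query.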
 

© Sean Townsend, Cressida Chappell, Oscar Struijvé 1999

The right of Sean Townsend, Cressida Chappell and Oscar Struijvé to be identified as the Authors of this Work has been asserted by them in accordance with the Copyright, Designs and Patents Act 1988.

All material supplied via the Arts and Humanities Data Service is protected by copyright, and duplication or sale of all or any part of it is not permitted, except that material may be duplicated by you for your personal research use or educational purposes in electronic or print form. Permission for any other use must be obtained from the Arts and Humanities Data Service.

Electronic or print copies may not be offered, whether for sale or otherwise, to any third party.

