History Data Service
 

Digitising History


CHAPTER 3 : FROM SOURCE TO DATABASE

 

Guide to Good Practice Navigation Bar






























































































































































Guide to Good Practice Navigation Bar


3.4 Sketching the database structure

Designing and mapping the database structure is one of the most crucial stages of any database creation exercise. This can, and often is, a lengthy process that combines some degree of technical know-how with intellectual decision-making. This section will not go into the specifics of database design - there are many references that project personnel are strongly advised to consult - instead it is intended to provide some broad suggestions for best practice and some advice on the major issues involved.

Perhaps the best starting point for database design, specific to historians, are chapters 1 to 5 in Harvey and Press (1996). Chapter 5 in particular is the best introduction to historical database design published thus far. For more specialist publications on relational databases, database systems and structured query language (SQL), perhaps the most popular is Date (1994) which has gone through a number of editions over the years. In addition, there is an excellent set of web guides about creating and publishing web databases (Scammell 1999).

These rather technical references are useful reading, since any historian who intends to design a database must combine a thorough knowledge of the source material with a fundamental understanding of the potential and limitations of databases and database systems. Database design is ultimately an exercise in methodology.

3.4.1 Sources and tables

For standard relational databases, the design stage concerns translating an historical source, or collection of sources, into an efficient conceptual data model. To simplify the meaning of this process, it could be adequately described as turning sources into tables. It's worth noting that during this process, subjects from one source (e.g. house, person, ship, goods etc.) may form many tables, and it is perfectly possible that subjects across various sources may be collected into single tables. Even if the source material is not inherently tabular, a table (or more likely a collection of inter-linked tables) is the format in which the desktop database stores the information. What is under the power of the database designer is not so much the generic structure of the information (which always rests on the premise of rows as records and columns as fields), but rather how the information itself is structured within this matrix strait-jacket. Additionally, whether the source is dismantled into a multitude of related tables and decisions taken about data-types, linking fields and standardisation of values are also primary concerns of the designer.

Unlike some social science datasets, historical tables make sense only when they are precisely defined representations of particular subjects of enquiry, whether taken from single or multiple sources. Databases that contain hundreds of fields in order to fit the data all under one roof are often unwieldy to use and are essentially the product of questionable design. In the same way, databases that contain virtually empty tables are equally as frustrating and inefficient. There is a subtle balance to be achieved between too many or too few tables and too many or too few fields. Database designers have to employ their own sense of what is usable and what is not. Remember that the design of the tables ultimately influences whether the database can be used effectively by both the project itself and further researchers (Harvey and Press 1996, 105-18).

As far as sketching an abstract design is concerned, establishing tables can be done via the 'entity modelling' approach, more formally known as entity relationship modelling (ERM). This is an excellent method for sorting out exactly what sources are going to form what tables, or indeed what parts of the source(s) are going to form tables. This method has the advantage of being a 'top-down' approach and therefore allows the historian to establish an abstract overview of their material and then begin establishing the data entities. Once the entities have been decided upon, the historian can begin a closer examination of these 'meta-tables', which may result in new entities being created and data being restructured in order to create the most efficient and usable database schema. For some examples of this procedure in practice, refer to Acun et al. 1994; Harvey and Press 1991; Champion 1993; Pöttler 1994 and Scammell 1997.

It is important to keep as detailed a record as possible of the relationship between the database structure and the documentary sources. A comprehensive description of the design model is an essential aid both to the creator(s) and, of course, secondary users. This is an issue of documentation (Chapter 5), but it is important that it is flagged and recognised at an early stage.

3.4.2 Fields and data types

Within database systems tables are made up from fields (sometimes referred to as variables). Highly structured historical documents can also be said to be formed from discrete continuous elements of information - yet reconciling the database field with the element in a source is not necessarily an easy task. This is generally because database systems do have de facto limitations on the information that any individual field can hold and these are sometimes not obvious. Historians involved in database creation need to be aware of the potential problems that attempting to squeeze inappropriate data into a field can create. Again, this is as much an issue of common sense as it is of adherence to technical limitations.

Probably the most pressing difficulty with historical sources is that some elements of the data may have rather long textual descriptions or some fields may occasionally have more than a single value associated with them which do not warrant the creation of a separate table. Common examples are multiple occupations, marginal or interpolated notes, and maybe even value changes over time (although this is in a sense a separate issue). Standard database systems do not like these characteristics. Although most database systems will allow textual fields up to a limit of 250 characters, filling up fields in this way is bad practice. Some desktop database packages do allow the incorporation of large textual data into essentially tabular structures, although this is by no means a universal feature. Adding extra fields to compensate for the odd extra value or lengthy textual entry results in an inefficient database. For some examples of real world solutions to these problems, see Schürer and Diederiks (1993). There are some useful strategies for dealing with data that fall outside of standard field specifications, particularly with reference to full-text which can be dealt with using memo fields or even storing the text in separate text files. Using codes and standardisation practices can also overcome some of the limitations imposed by using relational databases (see (Section 3.5.2), although potentially they can constitute a significant body of work, the resource implications of which must be considered by any project.

Database fields have another important restriction with regard to data type. Almost all analysis software requires each field in a database to meet a data type specification. Table 1 lists common database data types which may differ slightly from package to package.

Table 1 Common database data types

Type Description
Text Alphanumeric characters, basically anything printable from the keyboard.
Integer Numbers only (unit separators can often be user defined).
Float Numbers with decimal points, often referred to as floating point integers.
Memo Some databases allow full-text memo fields (some cannot be queried however).
BLOB or OLE Some databases allow using Binary Large Objects for inserting multimedia formats such as images or full documents.
Date Date format entries. Usually take the form of dd/mm/yyyy. (21/01/1998).
Currency Essentially as Integer, sometimes with the addition of a currency symbol and decimal places.

The point about data types in relation to historical sources is that they are based on the assumption that the field values will consistently fall within this type. Many database systems will simply fail to work if there is any deviation from this standard. Of course, most source material will not be consistent in this way and database designers must examine their data closely in order to establish where potential pit-falls may occur. By far the most common difficulty occurs with numeric fields. Variables that we may designate as a numeric data type such as 'Age', 'Page Number', 'Population', 'Tonnage', often in real sources contain textual characters. It is usual to find entries such as '5 and a half', '30A', '30569 appx.', or '124tns'. Designers who foresee a need for calculations to be performed on such fields must render them as numeric only.

Good practice with such limitations is to apply consistency that meets with the project objectives and research compass, whilst making documentary notes for future researchers concerning the decisions and judgements made.

 

© Sean Townsend, Cressida Chappell, Oscar Struijvé 1999

The right of Sean Townsend, Cressida Chappell and Oscar Struijvé to be identified as the Authors of this Work has been asserted by them in accordance with the Copyright, Designs and Patents Act 1988.

All material supplied via the Arts and Humanities Data Service is protected by copyright, and duplication or sale of all or any part of it is not permitted, except that material may be duplicated by you for your personal research use or educational purposes in electronic or print form. Permission for any other use must be obtained from the Arts and Humanities Data Service.

Electronic or print copies may not be offered, whether for sale or otherwise, to any third party.

Next Bibliography Back Glossary Contents