History Data Service

Digitising History

CHAPTER 3: FROM SOURCE TO DATABASE

3.5 Transcription and data entry

One of the rather irritating aspects of data creation projects is the fact that at some point someone has to 'create' the data. The term 'irritating' is used because the data input stage is both immensely time consuming and, if the database is to be (largely) error free, demands a degree of attention to detail not previously required. The best database design in the world can be utterly undermined by poor quality data. Although no data collection is perfect, there is a quality threshold below which a database becomes impossible to use. Most researchers can cope with 'dirty data', but it is up to the creators to minimise errors: doing so makes their own research more productive and avoids the need for a massive retrospective correction phase. For a good summary of data entry and validation approaches that supplement this guide, see Harvey and Press 1996, 85-94.

For almost all character-based historical documents (especially those that are hand-written), the only way to transfer information from paper to computer is through typing. Typing leaves an exceptionally large window for human error, and even the most fastidious typists will make mistakes. The transcription process is nothing new to historians, yet transcribing into a database is a subtly different task which generally requires greater accuracy and the use of validation methods. One method for overcoming many of the vexed burdens of historical sources is to apply wide-ranging data standardisation and coding schemes. The application of standards and codes can introduce a high level of accuracy and makes the data more amenable to database system requirements, but such a process can also consume significant resources which small projects may not be able to accommodate.

3.5.1 Transcription methods

Possibly the best approach to transcription into relational databases is to establish some initial hard rules. In other words, a transcription methodology is constructed in order to tackle some of the ambiguities of historical material and render the database as a consistent and logical whole. The exact nature of the rules and approach will very much depend on the assessment of the original document. There are a number of methods that previous historians using computers have employed, and those creating a resource may want to consider some of these options.

For database creation projects where the intended volume of data is particularly large, transcribing directly from the source into the database can present some difficulties. Similarly, where some restructuring of the original source is envisaged and direct literal transcription is not intended, copying straight from the source document can be a frustrating experience. One method that may be of particular use is the application of a transcription template: a highly structured form into which the data collector transcribes data from the original source. Templates generally mirror the structure of the end-database and contain easily legible transcriptions that are most useful for cross-referencing and error checking. The large-scale 1881 Census project carried out by the Mormons adopted this method (GSU 1988), and other smaller projects have also used transcription templates to organise their work (Wakelin and Hussey 1996; Schürer 1996). For larger projects, the advantages of using a transcription template include the following:

  • Allows the data creators to match the template exactly with the database structure.
  • Creators can more easily compare the original source with the transcribed copy. Often computer databases can be cumbersome in this respect.
  • Allows a 'thinking and decision making' period before the data are entered directly into the computer. Historians can use this time to consider and reflect upon their judgements and can use the template as a filter process.
  • When it comes to real data entry, working from a clear template is significantly easier than working from the original.
  • Very useful where there are multiple data collectors or where there are two or three levels of transcription checking.

Another method, which resource-pressured projects might favour, is to use features of the database software itself to assist in data entry and transcription. For example, most database packages allow data items to be entered via a form interface. A project could develop data entry forms reminiscent of the transcription template described above. Forms should be designed to facilitate data entry, allow easy cross-checking and, where there are many tables, allow seamless and intuitive navigation between them.

Quite apart from being easier on the eye than direct data entry into tables, forms can also allow some degree of automated validation. In this way, one can use the restrictiveness of data fields to assist in maintaining accuracy at the data entry stage (Harvey and Press 1996, 130-39). This method can be particularly useful when standardising textual or numerical entries and can be positively invaluable when creating a heavily coded database (see Section 3.5.2). Many popular database systems allow certain fields to accept only values of a particular data type, format or range, sometimes through pull-down menus and look-up tables. For example, one could restrict age fields to accept only numeric values, sex fields to accept male, female, or unknown, and date fields to accept only the format dd/mm/yyyy. It is also possible to apply a more sophisticated level of validation by using look-up tables, hot-key entry and more intelligent input programs. For some instances of such use, see Welling (1993). Although such a method may seem an unnecessary allocation of time and resources, projects should consider very carefully their overall strategy for the data input phase of creation. This is often the forgotten element in resource creation endeavours (probably because of its unglamorous nature). As with most things in database design and creation, effort and thought allocated in the early stages will almost always save time and pay dividends in the long run.
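
To make the idea of automated validation more concrete, the sketch below expresses rules of this kind outside any particular database package. It is a minimal illustration written in Python; the field names, permitted values and the dd/mm/yyyy pattern are assumptions taken from the examples above rather than features of any specific project or product.

    import re

    # Minimal sketch of field-level validation of the kind an entry form
    # might enforce. Field names and rules are illustrative assumptions only.

    ALLOWED_SEX = {"male", "female", "unknown"}
    DATE_PATTERN = re.compile(r"^\d{2}/\d{2}/\d{4}$")  # dd/mm/yyyy

    def validate_record(record):
        """Return a list of problems found in one transcribed record."""
        problems = []
        if not str(record.get("age", "")).isdigit():
            problems.append("age must be a whole number")
        if record.get("sex") not in ALLOWED_SEX:
            problems.append("sex must be 'male', 'female' or 'unknown'")
        if not DATE_PATTERN.match(record.get("date", "")):
            problems.append("date must be in dd/mm/yyyy format")
        return problems

    # A record with a mistyped age ('4o' rather than '40') is flagged before
    # it ever reaches the database table.
    print(validate_record({"age": "4o", "sex": "female", "date": "03/04/1881"}))
    # ['age must be a whole number']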

3.5.2 Codes and standardisation

The literature of historians using computers is, unsurprisingly, well represented by publications that tackle the issue of codes and standards. What is meant by codes and standards, however, very much depends on the context. For many, 'coding' means the mark-up scheme employed when digitising largely textual resources, while 'data standardisation' often refers to a strategy for data formats intended to facilitate data interchange (see Section 4.3.2). This section discusses instead the standardisation of data values themselves and the application of coding schemes to those values as a means of overcoming some of the deficiencies of database packages for historical use.

Some historians have always had a rather difficult time with the concept of information standardisation. To a large extent this has its roots in a belief in the sanctity of the original source and the over-riding necessity never to alter it for the purposes of a particular research agenda - essentially the battle between qualitative and quantitative history. This might explain the popularity of the source-oriented methods pioneered by Manfred Thaller and the KLEIO system, the entire raison d'être of which is to keep the source as intact as possible rather than hack it to pieces, as is often necessary with standard industry software. For those using standard database applications, however, some alteration and editing of the source will almost certainly be necessary if the resource is not to be practically unusable. This is an inescapable fact with the vast majority of historical sources (some exceptions being high quality editions or reports).

Standardisation has many benefits if it is carried out with strong principles and sound guidelines. Many historical projects have had to rationalise their data in order to allow record-linkage (Wrigley and Schofield 1973; Bouchard and Pouyez 1980; Davies 1992; Harvey and Press 1996; King 1992), or to map spatial data into GIS software (Silveira et al. 1995; Piotukh 1996; Southall and Gregory 2000).

No project should feel that standardisation destroys the provenance of their material; this concern is misguided, provided that clear and detailed documentation is kept explaining how and why decisions were taken, so that secondary analysts can follow the methodology. The nucleus of the problem is that research focused on relational databases is best carried out on consistent information, whereas historical documents are usually rife with inconsistencies and errors, and it is natural that in database form these anomalies should be removed. The extent of standardisation is, however, an important decision. The rationalisation of obvious spelling variations, date renderings and units of measurement is justifiable in most cases. When to standardise and when not to is an intellectual decision to be taken either by the single resource creator or by the collective project team (see Wakelin and Hussey 1996, 19). For further references on this issue, including advocacy for post-coding, consult Schürer 1990 and Gervers and McCulloch 1990.
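
As an illustration of the kind of rationalisation discussed above, the following Python sketch maps variant spellings of an occupation onto a single standard form while retaining the verbatim transcription in its own field. The variant list and field names are invented for the example; a real project would derive them from its own sources and record them in the project documentation.

    # Illustrative sketch: standardising spelling variants while keeping the
    # verbatim transcription. The variant list below is hypothetical.

    STANDARD_FORMS = {
        "ag lab": "agricultural labourer",
        "agr labourer": "agricultural labourer",
        "agricultural laborer": "agricultural labourer",
    }

    def standardise_occupation(original):
        """Return both the original entry and its standardised form."""
        key = original.lower().replace(".", "").strip()
        return {
            "occupation_original": original,                      # as transcribed
            "occupation_standard": STANDARD_FORMS.get(key, key),  # rationalised
        }

    print(standardise_occupation("Agr. Labourer"))
    # {'occupation_original': 'Agr. Labourer',
    #  'occupation_standard': 'agricultural labourer'}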

Coding data values is probably the ultimate form of data normalisation. It has been common practice in the broader social sciences for many years, and almost all current UK Government surveys are essentially large collections of coded data. In quantitative analysis, coding is often an essential element of data creation, and numerous large-scale historical projects have regarded the use of coded variables as crucial (Ruggles and Menard 1990). Yet data coding is a resource-intensive process in itself, and projects that deem a coding scheme to be beneficial should be alive to the possible implications. In fact the coding of fields such as occupations is nothing new and dates back to the later part of the nineteenth century (Booth 1892-1897). The historical computing literature also contains a number of articles on this complex problem (Green 1990; Greenstein 1991a; Morris 1990). In a number of projects, coded variables are 'tacked' onto the end of generally textual data tables (Anderson et al. 1979); the advantage of this approach is that it retains the original textual entry alongside its coded counterpart, which can then be cross-referenced for analytical purposes. Coded data collections are heavily reliant on the quality of the accompanying documentation, and most particularly on codebooks. Codebooks are essentially guides to the codes: they map out the classification scheme used, linking an often numerical code with a textual descriptor, which is usually either the original entry from the source or a group classification identifier. Some analysis software (most commonly statistical packages such as SPSS or Stata) will also allow the designer to define code labels within the application, allowing easy look-up and more meaningful analysis.
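
The following sketch shows, again in Python, what such an arrangement might look like in miniature: an entirely invented codebook links numeric codes to group classifications, and the coded variable is stored alongside the original textual entry rather than replacing it.

    # Hedged sketch of a codebook: invented numeric codes mapped to group
    # classifications, with the code attached to the original textual entry.

    CODEBOOK = {
        1: "agriculture",
        2: "textiles",
        3: "domestic service",
        9: "unknown / not stated",
    }
    REVERSE = {label: code for code, label in CODEBOOK.items()}

    def code_entry(original_text, classification):
        """Attach a coded variable to the original entry ('tacked-on' coding)."""
        return {
            "occupation_original": original_text,
            "occupation_code": REVERSE.get(classification, 9),
        }

    row = code_entry("Cotton weaver", "textiles")
    print(row)                               # {'occupation_original': 'Cotton weaver', 'occupation_code': 2}
    print(CODEBOOK[row["occupation_code"]])  # textiles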

Projects that use codes, particularly those of a highly interpretative nature, are encouraged to document their decisions carefully. In addition to analytical benefits there can also be data entry benefits. The use of abbreviated codes can make the data input stage a faster, more efficient process, especially if an 'intelligent' input form is used. Yet the use of codes can move the database further away from the original source, which has a knock-on effect for the quality level required in documentation. If the project envisages returning to the database after a number of years, or if the data are archived for secondary users, making clear the connections between a heavily normalised data source and the original material it has been extracted from is all the more vital.

 

© Sean Townsend, Cressida Chappell, Oscar Struijvé 1999

The right of Sean Townsend, Cressida Chappell and Oscar Struijvé to be identified as the Authors of this Work has been asserted by them in accordance with the Copyright, Designs and Patents Act 1988.

All material supplied via the Arts and Humanities Data Service is protected by copyright, and duplication or sale of all or any part of it is not permitted, except that material may be duplicated by you for your personal research use or educational purposes in electronic or print form. Permission for any other use must be obtained from the Arts and Humanities Data Service.

Electronic or print copies may not be offered, whether for sale or otherwise, to any third party.

