Digitising History CHAPTER 4: FURTHER DATA AND PRESERVATION ISSUES
4.2 Database integrity and practice

Although it is possible to clean and validate a database via data input systems, there is a strong case for retrospective checking, so that the database is processed for potential problems, oddities and errors. Errors can slip through even the most stringent of validation procedures, and very often potential problems are not at all obvious. One of the most common difficulties concerns 'special' characters that database systems can misinterpret as being separate from the data itself. These can create headaches for a project as well as being rather cumbersome for potential secondary analysts.

The essential objective is for creators to ensure, within resource limits, that their data collection is usable and free of extraneous data. Any correction phase should be relatively short, and this guide will highlight some of the most commonplace problems. Indeed, most difficulties arise because of the way in which computers and database systems work rather than from any underlying problem with the database itself. Creators should regard this section as a guide to the final stage of checking their data at a micro level, in order to prepare them for serious research purposes or for archiving at a data archive.

4.2.1 Error checking, back-ups and validation

No database is perfectly clean, but some are genuinely cleaner than others. All creation projects should at least factor in some kind of 'post-input' validation scheme. Errors may have slipped through the initial safety net, the project may decide it would be preferable to change the way an aspect of the data is represented, or there may be unforeseen problems which only show themselves after all the hard work of data input has been completed. In some cases errors can seriously affect whether a data collection is usable, but in most instances they are small and easily fixed.

Databases should be scanned for errors, preferably at regular intervals during the input phase or at a specified time once input is complete. Any project should make a judgement about the suitability of either approach. This author (ST) has found that interval checking is often the best method, simply because it draws attention to problems at an embryonic stage, which allows corrective measures to be applied. In one instance, the discovery that the database software had truncated a field from 80 characters to 10 was, thankfully, made after the input of only 100 records. The fields had to be re-entered, but the cost was minimal because of the policy of regular data checks.

Software can sometimes be unpredictable, and data can be lost or corrupted for a host of unknown reasons. Because of the somewhat unstable nature of both software and hardware systems, all validation procedures should be accompanied by a comprehensive and sensible back-up procedure (Harvey and Press 1996, 37). There have been, and will probably continue to be, horror stories of projects losing all or large chunks of their data, which could not be replaced because no back-up methodology was employed. Some key principles to remember:
Network crashes, software bugs or virus infections can all destroy what is ultimately a volatile resource. Needless to say, it is in the interests of the project and the wider research community that digital resources are protected from such disasters.

Validation should be a systematic process. It is fairly simple, if tedious, to apply some checks to databases using database query functions, whether SQL or forms-based queries (Morrison 1989). Databases can be examined for errors table by table if necessary, though not all fields may be appropriate for checking. As a general rule, fields for which a limited range of values is expected are the most important: these are the ones where anomalies are easy to identify and subsequently correct. Data collections that contain coded variables are also simple to check (if the input stage was not foolproof).

One possible method for relational systems is to construct a query that uses GROUP and COUNT functions (see Harvey and Press 1996, 159). This assembles a list of the values of a particular field and counts the occurrences of each one. Much like a frequency count, it allows outliers or erroneous entries to be identified. Of course, it will not indicate whether a value within the expected range has been typed in error; if the project is keen to iron out all possible mistakes, some cross-corroboration with the original source will be required.

Some problems of data integrity can be tough to spot. Certain characters, and combinations of characters, can prevent database systems from exporting or importing data correctly. It is unfortunate, but hardly surprising, that these special characters often occur in historical sources. In addition, computers use a number of characters, such as spaces, tabs or end-of-lines, that are not always immediately visible. When preparing data for deposit with a data archive, creators are advised to ensure that their data will not present difficulties when being imported into or exported from one software system to another (see Section 4.3.2 for a more detailed discussion). Table 2 describes the main special characters over which database creators are advised to take care. Many of the potential pitfalls can be avoided if it is recognised where and when special characters are used. In particular, special characters can become a problem when data are exported to an ASCII data format. ASCII formats are important for preservation purposes, but database systems can assign special, and originally unintended, meanings to these characters when importing data from ASCII files.

A very common problem is extraneous space characters, which may occur at the beginning or end of a field entry or may have been double typed (i.e. two consecutive spaces). These should be cleaned up, as they can present problems when querying data. For example, an entry in an occupation field such as ' Farmer' (with a leading space) will not be picked up by a query asking for matches of 'Farmer' (no leading space). The origin of such problems is that databases, unless told otherwise, match the query string exactly against the values in the database. Therefore, if tabs or spaces have been entered into a field by accident, a query will not pick these entries up unless it is specifically instructed to ignore such anomalies. One of the best workarounds is to use wildcard values or 'approximate' matching, such as the LIKE operator in SQL; a sketch of such queries is given after Table 2.

Table 2 Common problem characters
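By way of illustration, the checks described above might be expressed in SQL along the following lines. This is a minimal sketch only: the table and field names (census, person_id, occupation) are invented for the example and do not come from the guide, and the exact syntax will vary slightly between database systems.

    -- Frequency count of values in the occupation field: misspellings or
    -- out-of-range codes tend to show up as values with very low counts.
    SELECT occupation, COUNT(*) AS occurrences
    FROM census
    GROUP BY occupation
    ORDER BY occurrences;

    -- Entries with a stray leading, trailing or doubled space will not
    -- match an exact query for 'Farmer'; wildcards can hunt them down.
    SELECT person_id, occupation
    FROM census
    WHERE occupation LIKE ' %'
       OR occupation LIKE '% '
       OR occupation LIKE '%  %';

    -- Approximate matching retrieves 'Farmer', ' Farmer' and 'Farmer '
    -- alike, at the cost of also matching entries such as 'Dairy Farmer'.
    SELECT person_id, occupation
    FROM census
    WHERE occupation LIKE '%Farmer%';

The first query is simply the GROUP and COUNT frequency check; the second and third show the two complementary tactics for whitespace anomalies, finding them so that they can be corrected, and querying tolerantly until they have been.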
Another issue pertinent to historical databases is the use of null (blank) entries and the method by which the database creator indicates a transcription difficulty. The key here is to establish firm rules as part of the overall transcription process (Section 3.3.1). Despite what one might naturally assume, database systems are quite happy dealing with blank entries; indeed, most good systems (especially those based on SQL) recognise a blank field easily. There are, however, a number of examples of historical databases where, instead of using a null entry, the designer has opted to identify blank fields with a special character or string. Common usage includes "-", "---" or "blank". This approach unfortunately has the propensity to create more difficulties than it solves, particularly if there is a general lack of consistency. As a generic guideline, creators are encouraged to leave null entries as null (i.e. empty) rather than using any particular character or string to represent them.

A different, and more complex, issue is the problem of uncertain transcription. Historians using computers are well aware of this dilemma, and a number of approaches have been adopted to address it (GSU 1988). In much the same way as with nulls, creators are strongly advised to establish a framework and methodology for tackling partial or uncertain transcription during the database design stage. The rules have to be in place before transcription begins, with all transcribers aware of them, in order that consistency is maintained. Consistency is vital both for immediate analysts of the data source and for future scholars.

Two common approaches are to use some sort of confidence weighting indicator, or a tag that identifies when a data value has been transcribed with some uncertainty. The first approach attaches a statistical measure of confidence to a particular entry where there has been some difficulty in transcription; an example might be a weight from 0 (no confidence) to 1 (full confidence), as documented by Thaller (1993). In relational database systems this may not be a particularly elegant solution, as it may involve adding mostly redundant fields to a table. The design of a 'transcription look-up table' (a separate table, linked by a unique identifier, with fields indicating the field name and weighting factor) might be a more workable approach, as sketched below.

The use of tags, by contrast, can be a rather inelegant workaround. Databases that use character-based tags to indicate transcription anomalies tend not to work particularly well: question marks or brackets used to draw attention to problematic entries can be confusing, especially where the source itself includes question marks and brackets. If a database is to include a relatively detailed set of indicators of transcription confidence, then this should be thought through carefully and implemented consistently.
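A minimal sketch of the look-up table approach, and of leaving blanks as true nulls, is given below. The table and field names (baptisms, transcription_notes and so on) are invented for the purpose of illustration and are not taken from the guide; the weighting convention follows the 0-to-1 scale mentioned above.

    -- Main data table: blanks are left as genuine NULLs rather than
    -- placeholder strings such as '-' or 'blank'.
    CREATE TABLE baptisms (
        baptism_id  INTEGER PRIMARY KEY,
        surname     VARCHAR(50),
        occupation  VARCHAR(50)
    );

    -- Separate look-up table, linked by the unique identifier, recording
    -- a confidence weighting only for entries that posed a difficulty.
    CREATE TABLE transcription_notes (
        baptism_id  INTEGER REFERENCES baptisms (baptism_id),
        field_name  VARCHAR(30),      -- e.g. 'surname'
        confidence  NUMERIC(3,2)      -- 0 = no confidence, 1 = full confidence
    );

    -- Genuine null entries remain easy to isolate:
    SELECT baptism_id
    FROM baptisms
    WHERE occupation IS NULL;

Because the look-up table holds rows only for problematic entries, the main table carries no redundant confidence fields, and the scheme can be extended (for example with a free-text note field) without disturbing the core data.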
© Sean Townsend, Cressida Chappell, Oscar Struijvé 1999. The right of Sean Townsend, Cressida Chappell and Oscar Struijvé to be identified as the Authors of this Work has been asserted by them in accordance with the Copyright, Designs and Patents Act 1988. All material supplied via the Arts and Humanities Data Service is protected by copyright, and duplication or sale of all or any part of it is not permitted, except that material may be duplicated by you for your personal research use or educational purposes in electronic or print form. Permission for any other use must be obtained from the Arts and Humanities Data Service. Electronic or print copies may not be offered, whether for sale or otherwise, to any third party.