History Data Service
 

Digitising History


CHAPTER 4 : FURTHER DATA AND PRESERVATION ISSUES

 

4.3 Software, formats and preservation standards

All data creation projects will rely on some sort of software application, whether a simple text editor or full-scale relational database management system. The software used has significant implications for phases of the project from data entry to analysis, and may also affect issues of data formats and therefore preservation standards. Sensible decisions about software and data formats will have beneficial effects not only for the project itself but also for the capacity for future reuse of the resource.

This section will provide some broad guidelines for software selection, use of data formats, and preservation criteria. Many of the suggestions are common sense, although decisions concerning the applicability of particular data formats and methods ensuring that a resource is 'preservation friendly' are largely technical issues. It is important to reiterate that a project that has considered these challenges early on will almost certainly be in an excellent position to meet adequate standards. The pressures placed on research and the frequently expedient nature of resource creation mean that the issue of data preservation is often ignored. Preservation is, however, one of the most important aspects of any resource creation project and deserves considerable attention (Anderson 1992; Higgs 1992).

4.3.1 Software selection

The choice of inappropriate software can severely hamper any type of project. Software selection is usually limited by the choices available within a particular institution, as projects are very rarely funded for overheads such as major software purchases. At the time of writing there are many database systems on the market, although the dominance of Microsoft products is clear. Despite this de facto standard, there are many other database applications installed on university systems or researchers' workstations that appear 'fit for purpose'. This section provides some introductory guidelines on the features of database software that are conducive to the successful completion of project objectives.

A primary consideration is the data capacity of any particular system. A project that envisages the creation of a database of upwards of 100,000 records should research the data-handling capacity of its intended software. Even if the hardware platform has adequate resources, some software products place a limit on the number of records in each table. In general, it is spreadsheet software rather than database software that imposes record limits (Microsoft Excel, for example, imposes a record limit). However, some database systems, particularly PC-based ones, do struggle to cope with large and complex databases even if no record limit is enforced. This is due to the resource-intensive nature of desktop operating systems such as Windows 95, Windows 98 and MacOS. Larger scale projects are strongly advised to use a UNIX platform or the Windows NT operating system. Table 3 gives a rudimentary summary of the most popular database applications with their associated operating systems. For examples of smaller scale projects using Dbase, consult Champion (1993) and Hatton et al. (1997). The general rule is that if a project estimates that its database may have significant resource implications, then a UNIX-based software system is the safer option. Naturally, migrating platforms halfway through a project is to be avoided, yet following the guidelines in this chapter should make such a migration possible if it becomes necessary.

Table 3 Operating systems and database software

Small- to Medium-Scale Projects    Larger Scale Projects
(Windows 95/98)                    (UNIX variants and Windows NT)
-------------------------------    ------------------------------
Paradox                            ORACLE
FoxPro                             INGRES
Access                             SQL Server
Dbase                              SAS
File Maker Pro

As well as handling the data, the software must provide the tools that are crucial for the creation and management of the digital resource. These tools can be classified as follows: Database Design, Data Entry, Validation and Data Query. Database software in general supports these elements; it is simply a question of how each particular package approaches individual functions. The most obvious differences in implementations are between SQL-based database systems, non-SQL-based systems, and those (SQL or otherwise) with graphical user interfaces. As a rule, SQL-based systems (particularly of the UNIX variety) tend to be targeted more at large-scale corporate databases and are, as a result, generally more flexible and powerful, with the emphasis on the management system. They are also designed as 'mission critical' applications with associated improvements in robustness. Recent versions of Microsoft Access and SQL Server combine SQL with the functionality of a Windows-based user interface. For those who envisage networking capabilities, systems with ODBC compliance are very useful - particularly for remote access to a central data store. As a broad guideline it is suggested that projects opt for a database system that supports SQL functions and conventions. For an excellent brief article on using SQL in an historical context see Burnard (1989), also Harvey and Press (1993).
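
The four classes of tool named above (design, data entry, validation and query) can be illustrated with a minimal sketch. It assumes SQLite, accessed through Python's standard sqlite3 module, as a stand-in for whichever SQL-based system a project actually uses; the burial table, its columns and the sample rows are hypothetical and chosen only for illustration.

```python
import sqlite3


def build_demo_database() -> sqlite3.Connection:
    """Create a small in-memory database holding one validated table."""
    conn = sqlite3.connect(":memory:")
    # Database design: column types, plus a CHECK constraint that the
    # database itself enforces as a validation rule (hypothetical range).
    conn.execute(
        """
        CREATE TABLE burial (
            id      INTEGER PRIMARY KEY,
            surname TEXT NOT NULL,
            parish  TEXT NOT NULL,
            year    INTEGER NOT NULL CHECK (year BETWEEN 1500 AND 1900)
        )
        """
    )
    # Data entry: parameterised inserts keep values separate from the SQL.
    rows = [
        ("Archer", "St Mary", 1701),
        ("Baker", "St Mary", 1703),
        ("Carter", "St Giles", 1703),
    ]
    conn.executemany(
        "INSERT INTO burial (surname, parish, year) VALUES (?, ?, ?)", rows
    )
    conn.commit()
    return conn


def try_insert(conn, surname, parish, year) -> bool:
    """Validation: return True if the row is accepted, False if rejected."""
    try:
        conn.execute(
            "INSERT INTO burial (surname, parish, year) VALUES (?, ?, ?)",
            (surname, parish, year),
        )
        return True
    except sqlite3.IntegrityError:
        # Raised when a NOT NULL or CHECK constraint is violated.
        return False


def burials_per_parish(conn, year):
    """Data query: count the burials recorded in each parish for one year."""
    cur = conn.execute(
        "SELECT parish, COUNT(*) FROM burial "
        "WHERE year = ? GROUP BY parish ORDER BY parish",
        (year,),
    )
    return cur.fetchall()
```

Because the table definition and the query are expressed in SQL, they should carry over to other SQL-based systems with little change, which is part of the case for choosing software that follows SQL conventions.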

The database software application should support all the types of data that the project intends to digitise and provide a comprehensive array of export formats. For projects that plan to use full text entries, and incorporate them into the database as a whole, only systems that support either full text processing or memo fields are appropriate. In some cases, historians may perceive the need to use more specialised software applications in addition to the core database system. This might ultimately involve migrating the data (or subsets of the data) into other applications, such as statistical software or Geographical Information System (GIS) databases. It therefore pays for the project to investigate in some detail the export formats supported by its database in combination with the import formats of the proposed second application. If such data transfer is likely to be common, then the purchase of a special data migration tool such as Stat/Transfer or DBMS/COPY might be worthwhile. Even with today's sophisticated software, one particular database's export formats may not match the secondary software's import formats. This is quite apart from the fact that a number of problems and anomalies can occur while transferring data from one format to the other. The bottom line is: if in doubt, test, and double test, before the project begins in earnest.
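
One way to 'test, and double test' a transfer route is a round-trip check: export a few deliberately awkward records, import them again, and compare the result with the original. The sketch below assumes a comma-separated export handled with Python's standard csv module; the function names and sample values are hypothetical, and a real test would run through the project's actual export and import tools.

```python
import csv
import io


def export_rows(header, rows) -> str:
    """Write a header and rows as comma-separated text, quoting every field."""
    buf = io.StringIO(newline="")
    writer = csv.writer(buf, quoting=csv.QUOTE_ALL, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()


def import_rows(text):
    """Read comma-separated text back into a header and a list of rows."""
    reader = csv.reader(io.StringIO(text, newline=""))
    header = next(reader)
    return header, [row for row in reader]


def round_trip_ok(header, rows) -> bool:
    """Return True if exporting and re-importing leaves every value intact."""
    got_header, got_rows = import_rows(export_rows(header, rows))
    # CSV carries text only, so compare against the string form of each value.
    expected = [[str(value) for value in row] for row in rows]
    return got_header == list(header) and got_rows == expected


# Awkward cases worth including in a trial: embedded commas and quotes,
# a line break inside a memo-style field, and a code with leading zeros.
SAMPLE_HEADER = ["ref", "name", "note"]
SAMPLE_ROWS = [
    ["00123", "Smith, John", 'known as "Jack"'],
    ["00124", "O'Brien", "first line\nsecond line"],
]
```

A check of this kind only shows that text values survive. Data types, dates and missing values need separate trials, since a delimited text file does not record them.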

4.3.2 The importance of data formats

Very often data formats are the forgotten aspect of database creation projects. Generally little thought goes into data formats beyond that which is relevant to the core database software application. However, data interchange is an important consideration and this depends almost exclusively on the data format in which the data are available. Try importing an SPSS portable file or SAS transport file into either Access or Excel and the point becomes clear. In much the same way, the ASCII data from programs such as KLEIO or Idealist are not necessarily suitable for any other application.

Suitable data formats are primarily important for data interchange and data preservation. Specialist interchange programs such as StanFEP (Thaller 1988a) have been developed in the past, although such systems, despite their sound theory, have not been widely adopted. Creators should consider these issues in relation to their own databases, both as a precursor for future preservation and also as a means to maintain some degree of flexibility with their data source. The key is maintaining a 'neutral' format of the database. A neutral format is one that maintains a level of independence between data and software, allowing digital resources to be preserved even after the original software has become obsolete. Indeed it is because of a lack of thought about independence between data and associated software that so many scholarly digital resources are now being lost.

It is possible to distinguish between two broad classes of data. One is ASCII data, the other is binary data. The importance of this distinction for database creators is that ASCII data are generally easier to share between applications than binary data. This is because ASCII is about the only de facto non-proprietary data standard currently adopted in computing. Almost all software packages will accept ASCII data in one form or the other, whereas the ability to import a particular binary software-specific format varies from application to application. Data creators are encouraged to learn about how ASCII works and to use it as a supported data format for their databases. The import and export of ASCII-based data files is supported by nearly all database systems.
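
As a small illustration of what 'ASCII data' means in practice, the sketch below checks whether an exported file contains only 7-bit ASCII bytes (values 0 to 127). The function names are hypothetical. Accented characters, which are common in historical names, fall outside this range, so a file containing them is not plain ASCII and the character encoding used would need to be documented.

```python
def non_ascii_positions(data: bytes):
    """Return (offset, byte value) pairs for every byte outside 7-bit ASCII."""
    return [(offset, value) for offset, value in enumerate(data) if value > 0x7F]


def is_plain_ascii(data: bytes) -> bool:
    """Return True if every byte in the data is a 7-bit ASCII value."""
    return not non_ascii_positions(data)


# A record using only unaccented letters, digits and punctuation is ASCII.
plain = b"Smith,John,1851\n"

# The same kind of record with an accented letter, here encoded as Latin-1,
# contains one byte above 127 and so is not plain ASCII.
accented = "Struijv\u00e9".encode("latin-1")
```

Running such a check on an export before deposit or transfer gives an early warning that some characters may not survive a move between systems.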

There are subtle differences between DOS-based ASCII and UNIX-based ASCII which present further complications. The characters themselves are drawn from the same ASCII table, but the end of a line is marked differently in each system: DOS and Windows use a carriage return followed by a line feed (CR LF), whereas UNIX uses a line feed alone (LF). Creators should be aware that migrating ASCII data from a UNIX platform to a DOS/Windows platform (and vice versa) requires a conversion program to convert the end-of-line characters. Ultimately, any transfers of this type must be tested to ensure that they have been successful.
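
The conversion itself is simple, as the following sketch suggests. It assumes that DOS/Windows files end each line with a carriage return and line feed (bytes 0x0D 0x0A) and that UNIX files end each line with a line feed alone (0x0A); the function names are hypothetical. Counting the line endings before and after conversion is one way to test that a transfer has succeeded.

```python
def to_unix(data: bytes) -> bytes:
    """Convert DOS/Windows line endings (CR LF) to UNIX line endings (LF)."""
    return data.replace(b"\r\n", b"\n")


def to_dos(data: bytes) -> bytes:
    """Convert UNIX line endings (LF) to DOS/Windows line endings (CR LF)."""
    # Normalise first so that existing CR LF pairs are not doubled to CR CR LF.
    return to_unix(data).replace(b"\n", b"\r\n")


def count_line_endings(data: bytes) -> dict:
    """Count how many lines end in CR LF and how many in LF alone."""
    crlf = data.count(b"\r\n")
    lf_only = data.count(b"\n") - crlf
    return {"crlf": crlf, "lf_only": lf_only}
```

The functions work on bytes rather than text so that nothing else in the file is altered. A file that reports both kinds of line ending has probably been edited on more than one platform and should be checked by hand.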

Resource creators must think carefully about which formats their database can support. For tabular databases the data structure is at least non-problematic. The challenge is whether a complex database system can be rendered into an essentially ASCII form with no loss of intrinsic information. This may have implications for documentation, as guides to reconstructing the database are often necessary. In addition, any special software-specific programs would have to be abandoned and it is therefore important that such programs are not crucial to the usability of the data. Historians should consider the extent to which their data source is completely independent from any particular piece of software.

4.3.3 Preservation standards

The cliometric historian Emmanuel Le Roy Ladurie commented in 1979 that 'Tomorrow's historian will have to be able to programme [sic] a computer in order to survive' (Evans 1997, 19). One could add to this statement that 'computer-based historical research can only flourish if the resources survive'. Preservation has hitherto been the forgotten element of digital resource creation projects - large or small. This section will introduce some important guidelines to best practice for preservation.

Firstly, it would be wrong to see preservation practices as simply a nod to secondary analysts. The creators also have much to gain from the application of good practices. Consider the analogy of producing a paper scholarly volume, all copies of which are primed to self-destruct within two years. Gone, therefore, are the possibilities of returning to the resource for future reference, incorporating the data in new research, or revising findings and updating information as new research progresses. Much of the power and value of digital data resources comes from the ease with which they can be reused in these ways. Material that has been created without any concessions to preservation standards could, however, face being consigned to the digital waste bin within a relatively short period of time. This is as much a concern for the creators of the database as it is for anyone else.

As suggested in Section 4.3.2, ASCII is really the only de facto cross-platform data standard available at the present time. This rather sad fact is indicative of the short-term focus present throughout the computer industry. Data standards other than ASCII represent the current position of a particular software application in the overall market place, or at least the general popularity of a software application among data users. These formats are usually binary in nature and although some of them have been established over ten years or more, there are no guarantees that such resources could be preserved for an unspecified time-span. From a preservation perspective, the problem with a binary format is that it runs the risk of becoming unreadable owing to unforeseen circumstances. It is a reality of life in the computer industry that software companies come and go, software formats become updated (sometimes with no backwards compatibility), applications change their supported formats (how many applications supported HTML three years ago?), and once fashionable solutions to a data problem can be replaced with an improved method leaving supporters of the old solution high and dry.

The HDS has considerable expertise in dealing with formats for historical material. Table 4 indicates the formats that the HDS recognises as being suitable for deposit.

Table 4 Supported database formats for deposit with the HDS

ASCII                          Binary Software Specific
---------------------------    --------------------------
Comma separated variables      Access
Tab delimited variables        Dbase (not strictly binary)
Fixed width                    Paradox
SQL definitions and set-ups    FoxPro
Other delimited variables      Excel
                               Lotus 1-2-3
                               Quattro Pro

The binary formats in the above table simply represent software in common use today; they are not standards in the strictest sense. All of the above applications have gone through a number of versions over recent years. The HDS as a general rule favours the latest versions of the software at any given moment in time. The rationale is that such software-specific formats can easily be migrated to a neutral ASCII format for preservation purposes. Most of the above software formats are generally interchangeable, especially in relation to Microsoft products. In most cases the HDS will maintain both a binary and ASCII form of the database if the initial deposit has been non-ASCII. Resource creators should avoid at all costs the use of very specific software features. These usually include special programs using a proprietary language, macros, menu systems, or other software-specific features that strongly influence the use of the database and cannot be replicated.

Historians should ensure that their databases meet what are essentially de facto data-sharing standards. The best way to achieve this objective is via the use of ASCII data copies, specifically aimed towards a preservation-ready duplicate. Almost all popular desktop database systems will export as ASCII, and SQL-based systems can export tables according to a user-defined specification. Whilst creators may feel uneasy deconstructing their databases in this way, it is really of no consequence as long as the database can be successfully reconstructed. The responsibility for establishing materials that are re-usable lies with the creator. A data archive can only work with what is deposited. Ultimately the role of a data archive is to ensure the long-term usefulness of a resource by the implementation of a preservation strategy which takes account of changing technical regimes and environments. At the outset, however, it is the resource creator who determines the likely life span of the material.
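
A preservation-ready ASCII copy of the kind described above can be sketched as follows: the SQL table definitions are saved alongside one comma-separated file per table, and the database is then rebuilt from those files alone to confirm that reconstruction works. The sketch assumes SQLite, via Python's standard sqlite3 module, as a stand-in for the project's own database system; the function names are hypothetical.

```python
import csv
import io
import sqlite3


def export_database(conn):
    """Return (schema_sql, {table_name: csv_text}) covering every user table."""
    tables = conn.execute(
        "SELECT name, sql FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name"
    ).fetchall()
    # The CREATE TABLE statements are themselves plain text.
    schema = ";\n".join(sql for _, sql in tables) + ";\n"
    dumps = {}
    for name, _ in tables:
        cur = conn.execute(f'SELECT * FROM "{name}"')
        buf = io.StringIO(newline="")
        writer = csv.writer(buf, lineterminator="\n")
        writer.writerow([column[0] for column in cur.description])
        writer.writerows(cur)
        dumps[name] = buf.getvalue()
    return schema, dumps


def rebuild_database(schema, dumps):
    """Reconstruct an in-memory database from the exported text alone."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(schema)
    for name, text in dumps.items():
        reader = csv.reader(io.StringIO(text, newline=""))
        header = next(reader)
        columns = ", ".join(f'"{column}"' for column in header)
        marks = ", ".join("?" for _ in header)
        conn.executemany(
            f'INSERT INTO "{name}" ({columns}) VALUES ({marks})', reader
        )
    conn.commit()
    return conn
```

This sketch does not distinguish missing values from empty strings, and it relies on the column types in the table definitions to turn text back into numbers. Any such conventions belong in the documentation that accompanies the deposited files.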

 

© Sean Townsend, Cressida Chappell, Oscar Struijvé 1999

The right of Sean Townsend, Cressida Chappell and Oscar Struijvé to be identified as the Authors of this Work has been asserted by them in accordance with the Copyright, Designs and Patents Act 1988.

All material supplied via the Arts and Humanities Data Service is protected by copyright, and duplication or sale of all or any part of it is not permitted, except that material may be duplicated by you for your personal research use or educational purposes in electronic or print form. Permission for any other use must be obtained from the Arts and Humanities Data Service.

Electronic or print copies may not be offered, whether for sale or otherwise, to any third party.

