Planning Historical Digitisation Projects

This introductory guide highlights important issues that anyone planning to create a digital resource for research, learning or teaching in history or an associated discipline should address.

Project Management

Digital resource creation projects need to be managed with a degree of formality. A large team of historians, research students and IT specialists can be involved in the project, and good project management is vital in ensuring that work is coordinated and delivered on time.

Project Timetable

The basis for a sound project timetable is a clear understanding of how long each task will take and the order in which they must be completed. The time allocated to each task must be based on realistic estimates of the effort required. For example, a project planning to create a collection of scanned images will need to consider the number of original images and how long an image of a particular type and size takes to: retrieve from storage; prepare for scanning; scan; process (the digital image); and return to storage.
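The arithmetic behind an estimate of this kind can be sketched as follows. Every figure here (the number of images, the minutes per step and the productive hours per day) is a hypothetical placeholder rather than a measured value; a pilot would supply realistic numbers.

```python
# Hypothetical per-image timings in minutes; replace with figures measured in a pilot.
MINUTES_PER_IMAGE = {
    "retrieve from storage": 2.0,
    "prepare for scanning": 1.5,
    "scan": 3.0,
    "process digital image": 2.5,
    "return to storage": 1.0,
}


def estimate_scanning_days(image_count, minutes_per_image, productive_hours_per_day=6.0):
    """Return the estimated number of working days one person needs."""
    minutes_per_item = sum(minutes_per_image.values())
    total_hours = image_count * minutes_per_item / 60
    return total_hours / productive_hours_per_day


# Assumed collection of 1,200 images handled by a single member of staff.
days = estimate_scanning_days(1200, MINUTES_PER_IMAGE)
print(f"Estimated effort: {days:.1f} person-days")
```

Keeping each step as a separate entry makes it easy to see which stage dominates the total and to revise a single figure when a trial shows it was too optimistic.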

Once the main tasks have been identified, and the time needed to complete them has been estimated, a project timetable can be drawn up. The timetable should show how long each task will take, the order in which tasks will be started and finished, and what the members of the project team are doing at any given point in time. Links between tasks should be clearly specified in the project timetable. A timetable like this will help to reveal problems such as staff who are committed for more than 100% of their time, or tasks scheduled to begin before prerequisite tasks have been finished.
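The two problems mentioned above can be checked mechanically once the timetable is written down as data. The sketch below is one possible approach, not a prescribed method; the task names, people, week numbers and time fractions are all invented for illustration.

```python
from collections import defaultdict

# Hypothetical tasks:
# (name, person, start week, end week, fraction of the person's time, prerequisites)
TASKS = [
    ("select sources", "researcher", 1, 4, 0.5, []),
    ("scan images", "assistant", 3, 10, 1.0, ["select sources"]),
    ("transcribe text", "researcher", 3, 12, 0.75, []),
    ("build website", "assistant", 11, 14, 0.5, ["scan images"]),
]


def find_overcommitted(tasks):
    """Return {(person, week): total fraction} for weeks where a person exceeds 100%."""
    load = defaultdict(float)
    for _name, person, start, end, fraction, _prereqs in tasks:
        for week in range(start, end + 1):
            load[(person, week)] += fraction
    return {key: total for key, total in load.items() if total > 1.0}


def find_early_starts(tasks):
    """Return (task, prerequisite) pairs where the task starts before the prerequisite ends."""
    end_week = {name: end for name, _p, _s, end, _f, _pr in tasks}
    return [
        (name, prereq)
        for name, _p, start, _e, _f, prereqs in tasks
        for prereq in prereqs
        if start <= end_week[prereq]
    ]


print(find_overcommitted(TASKS))
print(find_early_starts(TASKS))
```

In this invented timetable the researcher is committed for 125% of their time in weeks 3 and 4, and scanning is scheduled to start before source selection has finished.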

By identifying all the interdependencies between tasks in your project, the critical path of the project can be identified. The critical path is the sequence of tasks that must all be completed on time for the entire project to be completed on time. A delay to any task on the critical path will delay later tasks, and the entire project will fall behind schedule.
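As a minimal sketch, the critical path can be found by working out the earliest finish time of each task from its prerequisites and then following the longest chain backwards. The task names and durations below are hypothetical, and the code assumes the dependencies contain no cycles.

```python
# Hypothetical tasks: name -> (duration in weeks, list of prerequisite tasks)
TASKS = {
    "acquire sources": (4, []),
    "scan images": (8, ["acquire sources"]),
    "transcribe text": (10, ["acquire sources"]),
    "proof-read": (3, ["transcribe text"]),
    "build website": (5, ["scan images", "proof-read"]),
}


def critical_path(tasks):
    """Return (total duration, ordered task names) for the longest chain of dependent tasks."""
    finish = {}    # earliest finish time of each task
    previous = {}  # the prerequisite that determines when each task can start

    def earliest_finish(name):
        if name not in finish:
            duration, prereqs = tasks[name]
            start = 0
            previous[name] = None
            for prereq in prereqs:
                if earliest_finish(prereq) > start:
                    start = earliest_finish(prereq)
                    previous[name] = prereq
            finish[name] = start + duration
        return finish[name]

    # The task that finishes last marks the end of the critical path.
    last = max(tasks, key=earliest_finish)
    path = []
    while last is not None:
        path.append(last)
        last = previous[last]
    return finish[path[0]], path[::-1]


length, path = critical_path(TASKS)
print(length, path)
```

Here scanning is not on the critical path: it could slip by up to five weeks without delaying the website, whereas any delay to transcription or proof-reading delays the whole project.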

Monitoring Progress

A good project timetable is a useful tool for monitoring progress. Regular reports by those responsible for each task can be compared with the progress anticipated in the project timetable, allowing problems to be identified and resolved as early as possible. Progress can only be monitored with up-to-date information, and this is best provided through some formal (but possibly quite simple) framework of meetings and reporting intended to share information about progress between project members.

Project Managers

Although reports may be read by everyone, and meetings attended by everyone, ultimately there must be a person or group who have the clear responsibility and authority to act if the project is not going to plan. Usually, a single project manager is best, supported by a management or advisory committee in the case of larger projects.

Data Development

'Data development' refers to the sequence of technical stages needed to convert an original source into a digital resource. The data development methods a project chooses will depend on the nature of the sources used and the intended purpose of the digital resource. There are, however, a number of stages in data development that any project will need to address.

Acquiring Sources

The acquisition of source material to digitise can be complicated by restrictions on access to and use of the material. Your ability to use material in archives may be restricted by opening hours, travel costs, and archive regulations that preclude scanning or digital photography. It may be more efficient to compromise and work with a facsimile (a photograph, microfilm, published edition etc.) of the original. In either case, you must ensure that you have the rights to create and, if intended, distribute digital versions of the source or facsimile.

Identifying Aspects of Source to Digitise

Digitisation only creates a partial representation of the original source, not an exact copy. The most important aspects of a source to digitise will vary depending on the purpose of the digital resource being created. For example, a project examining the history of journalism will be most interested in capturing computer readable text from newspapers. A project studying the history of the printing trade, however, may be more interested in capturing digital images of the newspaper pages. In the first case a searchable text is created, in the second a visual representation of the original page is created; in both cases some information is not captured in the digitised version of the source. The most appropriate method of digitisation should be selected by considering the purpose, and potential future purposes, of the digital resource.

Digitising Historical Sources

Once the digitisation objectives of a project are clear, the most appropriate data development methods can be selected. Most projects will be working with documentary sources and will be interested in capturing machine-readable text or images of the original documents (although a range of other digitisation methods, such as digitising analogue sound recordings or creating virtual three-dimensional models, are also possible).

Transcription and Optical Character Recognition

Transcription remains the only practical way of capturing computer readable text from handwritten documents. Optical Character Recognition (OCR) is currently most useful for processing large numbers of clear typed documents. Faint text, archaic characters, missing sections and confusing layouts can all slow transcription considerably and are likely to make OCR impracticable.

Neither transcription nor OCR is 100% accurate, so some form of proof-reading should be done. Many OCR software packages provide tools to help with this. Providing digital images of the original source alongside the computer readable text is a good method of allowing the user to do their own proof-reading of the text.

Once the basic text from the source has been digitised, the logical structure of the document, or aspects of its visual layout, can be recorded using markup.

Digital Image Capture

Key decisions to make are the colour depth and resolution to capture. It is also important to consider the amount of metadata that should be provided with each image and how that metadata will be stored.
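Both decisions have a direct effect on storage requirements, because the uncompressed size of an image is the number of pixels multiplied by the bits stored per pixel. The sketch below illustrates that arithmetic; the page dimensions, resolutions and colour depths are example values chosen for illustration, not recommendations.

```python
def uncompressed_size_bytes(width_inches, height_inches, dpi, bits_per_pixel):
    """Return the uncompressed size in bytes of an image scanned at the given settings."""
    pixels_wide = round(width_inches * dpi)
    pixels_high = round(height_inches * dpi)
    return pixels_wide * pixels_high * bits_per_pixel // 8


# Hypothetical 8 x 10 inch original scanned with three different settings.
colour_300 = uncompressed_size_bytes(8, 10, 300, 24)  # 24-bit colour at 300 dpi
colour_600 = uncompressed_size_bytes(8, 10, 600, 24)  # 24-bit colour at 600 dpi
grey_300 = uncompressed_size_bytes(8, 10, 300, 8)     # 8-bit greyscale at 300 dpi

for label, size in [("colour 300 dpi", colour_300),
                    ("colour 600 dpi", colour_600),
                    ("greyscale 300 dpi", grey_300)]:
    print(f"{label}: {size / 1_000_000:.1f} MB")
```

Doubling the resolution quadruples the size of each image, so the choice should be multiplied through the whole collection before storage is budgeted. Compressed file formats will produce smaller files than these figures suggest.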

For more information on working with digital images, read the Visual Arts Data Service's Guide to Good Practice, Creating Digital Resources for the Visual Arts: Standards and Good Practice.

Images of charts, maps and diagrams can be captured as bitmaps, but a better alternative may be to capture vector images of them. Vector images store information about points, lines and polygons.

Organising a Digital Collection

In addition to digitising the individual documents, images etc., a project must consider how best to organise the collection of digitised objects that is created. Some of the main ways a digital resource can be conceived of are described below.

Catalogues, indexes and other finding aids

No direct digitisation of source documents may be necessary, but information must be extracted from the source and organised according to appropriate metadata standards.

Source based databases

Source based databases usually contain small pieces of information extracted from a range of original sources and reorganised as fields in a relational database. Codes and standardisation rules are often applied to make the data easier to use, but these modifications should be included as additions rather than replacements for the original values taken from the sources. Databases can also store images, sound files, long text files and other types of digital data.
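One way to keep standardised codes as additions rather than replacements is to store them in a separate column beside the original value. The following sketch uses Python's built-in sqlite3 module; the table, column names, source references, occupations and codes are all invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE person (
        id INTEGER PRIMARY KEY,
        source_ref TEXT NOT NULL,           -- where the entry was found
        occupation_original TEXT NOT NULL,  -- exactly as written in the source
        occupation_code TEXT                -- standardised code added by the project
    )
    """
)

# Hypothetical entries: two spellings in the sources share one standardised code.
rows = [
    ("Source A, p. 12", "Ag. Lab.", "AGRIC_LABOURER"),
    ("Source A, p. 14", "agricultural labourer", "AGRIC_LABOURER"),
    ("Source B, f. 3", "Blacksmyth", "BLACKSMITH"),
]
conn.executemany(
    "INSERT INTO person (source_ref, occupation_original, occupation_code) VALUES (?, ?, ?)",
    rows,
)

# Searching by code finds both variants, while the original wording is still available.
labourers = conn.execute(
    "SELECT occupation_original FROM person WHERE occupation_code = ? ORDER BY id",
    ("AGRIC_LABOURER",),
).fetchall()
print(labourers)
```

Because the original value is never overwritten, a later user who disagrees with the coding scheme can apply their own without returning to the sources.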

Markup documents

Markup documents have many uses. Specifically in relation to historical documents they can be useful when you want to identify specific items of information in documents such as letters and journals without destroying the original flow of the document (as would happen if the information was entered into a database). For more information about markup see the Oxford Text Archive's Guide to Good Practice, Creating and Documenting Electronic Texts.
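The idea can be illustrated with a short XML fragment read using Python's standard library. The letter text and the element names (letter, p, place, date, person) are invented for this sketch and do not follow any particular published markup scheme; a real project would normally adopt an established one.

```python
import xml.etree.ElementTree as ET

# Hypothetical letter with invented element names.
LETTER = """<letter>
<p>I arrived in <place>York</place> on <date when="1843-05-02">the second of May</date>
and dined with <person>Mr Hartley</person> that evening.</p>
</letter>"""

root = ET.fromstring(LETTER)

# Specific items of information can be pulled out by element name...
places = [element.text for element in root.iter("place")]
dates = [element.get("when") for element in root.iter("date")]

# ...while the running text of the document is still intact.
full_text = "".join(root.itertext())

print(places, dates)
print(full_text.strip())
```

The marked-up items can be listed or searched much like database fields, but the sentence they came from remains readable in its original order.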

Geographical information systems

Geographical Information Systems (GIS) store and analyse spatially referenced information. Scale and resolution are key issues to address, particularly given the possible inaccuracies in historical maps. For historical projects the availability of information that can be related to spatial locations is likely to be a limiting consideration.

Making a Resource Available

The most appropriate way of making a resource available will depend both on the nature of the resource and the target audience. The History Data Service can preserve and distribute any resources deposited with us, but ambitious projects may want to develop their own advanced means of distribution.

Resources aimed at students, large bodies of researchers, or the public will need to have a well thought out method of distribution. Providing information via CD/DVD or setting up a website, possibly with access to downloadable files, are the most popular methods.

While a small static website can be set up quite easily, complex dynamic sites require considerable time and expertise to develop. If you plan a website that will allow users to retrieve information from a database, interact with a map, examine and zoom images, view movies or similar options, you should ensure that you fully understand the technical tasks involved. Projects planning complex websites should include a web development officer in their team, or get quotes from commercial companies.

Piloting

Small scale trials of techniques are useful ways of highlighting unexpected problems and potential pitfalls in data development. Including a formal piloting stage in your project can be worthwhile. It can be used to test the feasibility of different methods of digitisation, or check how your target audience responds to the resource.

Technical Support

The type and level of technical support a project will need will vary. Basic technical support (advice on procuring software and hardware, support for common software packages, network connections etc.) should be provided by your institution. It is important to check exactly what support can in practice be expected. If you plan to rely heavily on your institution for technical support, you should discuss your requirements before the project starts.

If your institution supports particular software packages then, other things being equal, you should use those software packages.

Working within the limits of available support can have advantages. The requirements of your web host can, for example, simplify your choices between servers, scripting languages, database servers and the other elements involved in a complex website.

In many cases those planning the project, or staff hired for the project, will bring important technical skills with them. Where a discrete and relatively small technical task must be completed, a project can consider hiring technical consultants. Setting up a project website, for example, might be contracted out. When staff in the project hold critical skills, it is important to plan for the possibility of illness or staff leaving, as some skills are rare and individuals cannot be easily replaced.

Backup

Backing up data is a basic precautionary step that everybody working with computers should take. Backup copies are an insurance policy against the possibility of your data being lost, damaged or destroyed.

Requirements for a Good Backup Policy

A good backup policy will protect your data from a large range of mishaps. The range of events that you should consider when planning how to back up your data includes:

  • Accidental changes to data
  • Accidental deletion of data
  • Loss of data due to media or software faults
  • Virus infections and interference by hackers
  • Catastrophic events (fire, flood etc.)

A good backup policy should provide protection against all of these threats.

Frequency of Backup

Backups should be made regularly to ensure that they remain up-to-date. The more frequently data is being changed, the more frequently backups should be made. If your data is changing significantly every day you should consider a daily backup, but if you are prepared for, and can afford, redoing a longer period of work then less frequent backup may be appropriate.

As well as backing up frequently, you should keep several backup copies made at different dates. Doing this guards against the danger that your backup copy will incorporate a recent but as yet undiscovered problem from your working copy.
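A simple way to keep several dated copies is to include the date in each backup file name and to delete only the oldest copies once a chosen number has been reached. The sketch below is one possible approach; the function name, the file naming pattern and the default of three copies are assumptions made for illustration.

```python
import shutil
from datetime import date
from pathlib import Path


def make_dated_backup(working_file, backup_dir, today=None, keep=3):
    """Copy working_file into backup_dir under a dated name; keep only the newest `keep` copies."""
    if keep < 1:
        raise ValueError("keep must be at least 1")
    working_file = Path(working_file)
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)

    stamp = (today or date.today()).isoformat()  # for example 2008-04-30
    target = backup_dir / f"{working_file.stem}-{stamp}{working_file.suffix}"
    shutil.copy2(working_file, target)

    # ISO dates sort chronologically, so sorting by name orders the copies by age.
    copies = sorted(backup_dir.glob(f"{working_file.stem}-*{working_file.suffix}"))
    for old_copy in copies[:-keep]:
        old_copy.unlink()
    return target
```

Calling make_dated_backup("letters.csv", "backups") once a week would, with the default setting, leave the three most recent weekly copies in the backups folder. The `today` parameter exists only so that the date can be supplied explicitly, for example when testing.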

Multiple Backup Copies

A backup copy may suffer the same mishaps as the working copy of your data, so it is a good idea to spread the risk by maintaining several backup copies. A minimum of two backup copies should be maintained in addition to your working copy of the data.

Offsite Backups

More serious events, such as a fire in the office, will destroy both the working copy of the data and any backup copies stored at the same location. Some backup copies should be stored 'offsite' (offsite is a relative term, dependent on the level of protection you want).

As well as storing some copies offsite, it is useful to keep a backup copy onsite. This copy can be quickly retrieved and work recommenced if there is a minor mishap, such as the accidental deletion of an important file.

Media

Backup copies should be made on new media. Do not continue to use media once they start to develop faults. Specifically, floppy disks are not a good medium for backup copies. If they are used, they should be replaced often.

Store backup copies on multiple media (e.g. zip disk and CD-ROM) to avoid all your backup copies becoming corrupted by the same drive or disk fault.

Multiple Formats

Store backup copies in both the software formats that you are using and in exported formats (many spreadsheet and database packages can export to delimited text, for example). This will help protect you from subtle faults that can sometimes develop in complicated data formats (such as database file formats) that may not become apparent until after they have been included in both the working copy and the backup copies.
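As an illustration of exporting to delimited text, the sketch below writes one table from an SQLite database to comma-delimited text using Python's standard library. The function name, table and rows are hypothetical, and the table name is assumed to come from the project itself rather than from untrusted input.

```python
import csv
import io
import sqlite3


def export_table_to_csv(conn, table, out_file):
    """Write every row of `table`, preceded by a header row, to out_file as delimited text."""
    # The table name is inserted directly into the query, so it must be a trusted value.
    cursor = conn.execute(f'SELECT * FROM "{table}"')
    writer = csv.writer(out_file)
    writer.writerow([column[0] for column in cursor.description])
    writer.writerows(cursor)


# Hypothetical database with a single small table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO person (name) VALUES (?)", [("Ann",), ("Ben",)])

buffer = io.StringIO()
export_table_to_csv(conn, "person", buffer)
print(buffer.getvalue())
```

To write to disk instead of memory, open the output with open(path, "w", newline="", encoding="utf-8") as the csv module documentation advises. The exported text can be read without the database software, which is what makes it a useful second format.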

Institutional Backup Policy

Projects should never assume that their institution's policies will be appropriate to their needs. Always check.

  • Institutions may maintain backups for a limited period
  • Institutions may only provide backups to protect against complete loss of data, and not individual users losing data
  • Institutions may not back up all data held on their network

Many organisations advise their users to make their own backups of critical data. This is good advice and should be followed.

Check Your Backup!

A backup that does not actually work is of no use at all. Always test your backup procedures to ensure that your backup can be retrieved and is useable.
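Part of such a test can be automated by comparing checksums of the working file and the retrieved backup copy. The sketch below uses SHA-256 from Python's standard library; the function names are illustrative. A matching checksum shows only that the copy can be read and is identical to the working file, so it is still worth opening the retrieved copy in the software you normally use.

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 checksum of a file, reading it in chunks to limit memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_matches(working_file, backup_file):
    """Return True if the backup exists and is byte-for-byte identical to the working file."""
    backup_file = Path(backup_file)
    return backup_file.is_file() and sha256_of(working_file) == sha256_of(backup_file)
```

The comparison is only meaningful immediately after the backup is made, or against a checksum recorded at that time, because the working copy will change as work continues.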

Backup is not Preservation

A backup copy is an exact copy of the version of the data you are working on. If your working copy becomes unuseable, you should be able to start using your backup copy immediately, on the same computers, using the same software.

In contrast, a preservation version of the data is designed to mitigate the effects of rapid technology change that might otherwise make the data unuseable within a few years.

Preservation

Whereas backup protects data from damage and destruction in the short term, digital preservation deals with the long-term survival of data, ensuring data remains useable as hardware and software changes. Successful digital preservation requires an organisational framework that can ensure digital resources remain useable as technology changes.

You can deposit data with the History Data Service for preservation and dissemination.


  Page last updated 30 April 2008
© Copyright 2003-2012 University of Essex. All rights reserved.