Planning historical digitisation projects
This introductory guide highlights important issues that anyone planning to
create a digital resource for research, learning or teaching in history or an
associated discipline should address.
Digital resource creation projects need to be managed with a degree of formality.
A large team of historians, research students and IT specialists can be involved
in the project, and good project management is vital in ensuring that work is
coordinated and delivered on time.
Project Timetable
The basis for a sound project timetable is a clear understanding of how long
each task will take and the order in which they must be completed. The time
allocated to each task must be based on realistic estimates of the
effort required. For example, a project planning to create a collection of scanned
images will need to consider the number of original images and how long an image
of a particular type and size takes to: retrieve from storage; prepare for scanning;
scan; process (the digital image); and return to storage.
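The scanning example above can be turned into a rough effort estimate. The following is a minimal sketch only: the per-step timings, the collection size of 5,000 images and the six productive hours per day are all hypothetical figures, and a real project should replace them with timings measured on a sample of its own material.

```python
from dataclasses import dataclass


@dataclass
class StepTimes:
    """Minutes per image for each handling step (all values hypothetical)."""
    retrieve: float
    prepare: float
    scan: float
    process: float
    return_to_storage: float

    def minutes_per_image(self) -> float:
        return (self.retrieve + self.prepare + self.scan
                + self.process + self.return_to_storage)


def estimate_working_days(image_count: int, times: StepTimes,
                          productive_hours_per_day: float = 6.0) -> float:
    """Total effort for the whole collection, expressed in working days."""
    total_minutes = image_count * times.minutes_per_image()
    return total_minutes / 60 / productive_hours_per_day


# Assumed timings for one type and size of image
times = StepTimes(retrieve=2.0, prepare=1.0, scan=3.0,
                  process=2.5, return_to_storage=1.5)
days = estimate_working_days(5000, times)
print(f"{times.minutes_per_image():.1f} minutes per image, "
      f"about {days:.0f} working days")
```

Repeating the calculation for each type and size of image in the collection gives a total that can be entered into the project timetable.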
Once the main tasks have been identified, and the time needed to complete them
has been estimated, a project timetable can be drawn up. The timetable should
show how long each task will take, the order in which tasks will be started
and finished, and what the members of the project team are doing at any given
point in time. Links between tasks should be clearly specified in the project
timetable. A timetable like this will help to reveal problems such as staff
who are committed for more than 100% of their time, or tasks scheduled to begin
before prerequisite tasks have been finished.
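Both kinds of problem can be checked mechanically once the timetable is written down as data. The sketch below assumes a hypothetical timetable format (start week, end week, person, share of that person's time, prerequisite tasks); the task names and figures are invented for illustration.

```python
from collections import defaultdict

# Hypothetical timetable:
# task -> (start_week, end_week, person, share_of_time, prerequisites)
tasks = {
    "retrieve": (1, 4, "assistant", 0.5, []),
    "scan": (3, 10, "assistant", 0.75, ["retrieve"]),
    "transcribe": (5, 12, "student", 1.0, []),
}


def find_problems(tasks):
    problems = []
    # A task should not start until every prerequisite has finished.
    for name, (start, _end, _person, _share, prereqs) in tasks.items():
        for prereq in prereqs:
            prereq_end = tasks[prereq][1]
            if start <= prereq_end:
                problems.append(
                    f"{name} starts in week {start}, "
                    f"before {prereq} ends in week {prereq_end}")
    # Add up each person's commitments week by week.
    load = defaultdict(float)
    for _name, (start, end, person, share, _prereqs) in tasks.items():
        for week in range(start, end + 1):
            load[(person, week)] += share
    for (person, week), total in sorted(load.items()):
        if total > 1.0:
            problems.append(
                f"{person} is committed for {total:.0%} of week {week}")
    return problems


problems = find_problems(tasks)
for problem in problems:
    print(problem)
```

With these invented figures the check reports that scanning is scheduled to begin before retrieval ends, and that the assistant is over-committed in weeks 3 and 4.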
Once all the interdependencies between tasks in your project have been
identified, the critical path of the project can be determined. The critical
path is the sequence of tasks that must all be completed on time for the entire
project to be completed on time. A delay to any task on the critical path will
delay later tasks, and the entire project will fall behind schedule.
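As an illustration, the critical path can be found by working out the earliest possible finish time of each task and then tracing back from the task that finishes last. This is a minimal sketch with hypothetical tasks and durations; it assumes the dependencies contain no cycles, and project management software will normally do the same calculation for you.

```python
from functools import lru_cache

# Hypothetical tasks: name -> (duration in weeks, prerequisite tasks)
tasks = {
    "acquire sources": (4, []),
    "pilot": (2, ["acquire sources"]),
    "scan": (10, ["pilot"]),
    "transcribe": (14, ["pilot"]),
    "build website": (6, ["pilot"]),
    "load and test": (3, ["scan", "transcribe", "build website"]),
}


@lru_cache(maxsize=None)
def earliest_finish(name):
    """Week in which a task can finish if nothing is delayed."""
    duration, prereqs = tasks[name]
    return duration + max((earliest_finish(p) for p in prereqs), default=0)


def critical_path():
    # Start from the task that finishes last, then repeatedly step back
    # to whichever prerequisite finishes latest.
    current = max(tasks, key=earliest_finish)
    path = [current]
    while tasks[current][1]:
        current = max(tasks[current][1], key=earliest_finish)
        path.append(current)
    return list(reversed(path))


path = critical_path()
print(" -> ".join(path), f"({earliest_finish(path[-1])} weeks)")
```

In this invented example transcription, not scanning or web development, determines the overall length of the project, so it is the task whose progress most needs watching.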
Monitoring Progress
A good project timetable is a useful tool for monitoring progress. Regular
reports by those responsible for each task can be compared with the progress
anticipated in the project timetable, allowing problems to be identified and
resolved as early as possible. Progress can only be monitored with up-to-date
information, and this is best provided through some formal (but possibly quite
simple) framework of meetings and reporting intended to share information about
progress between project members.
Project Managers
Although reports may be read by everyone, and meetings attended by everyone,
ultimately there must be a person or group who have the clear responsibility
and authority to act if the project is not going to plan. Usually, a single
project manager is best, supported by a management or advisory committee in
the case of larger projects.
'Data development' refers to the sequence of technical stages needed
to convert an original source into a digital resource. The data
development methods a project chooses will depend on the nature
of the sources used and the intended purpose of the digital resource.
There are, however, a number of stages in data development that
any project will need to address.
Acquiring Sources
The acquisition of source material to digitise can be complicated
by restrictions on access to and use of the material. Your ability
to use material in archives may be restricted by opening hours,
travel costs, and archive regulations that preclude scanning or
digital photography. It may be more efficient to compromise and
work with a facsimile (a photograph, microfilm, published edition
etc.) of the original. In either case, you must ensure that you
have the rights to create and, if intended, distribute digital versions
of the source or facsimile.
Identifying Aspects of Source to Digitise
Digitisation only creates a partial representation of
the original source, not an exact copy. The most important aspects
of a source to digitise will vary depending on the purpose of the
digital resource being created. For example, a project examining
the history of journalism will be most interested in capturing computer-readable
text from newspapers. A project studying the history of
the printing trade, however, may be more interested in capturing
digital images of the newspaper pages. In the first case a searchable
text is created, in the second a visual representation of the original
page is created; in both cases some information is not captured
in the digitised version of the source. The most appropriate method
of digitisation should be selected by considering the purpose,
and potential future purposes, of the digital resource.
Digitising Historical Sources
Once the digitisation objectives of a project are clear, the most
appropriate data development methods can be selected. Most projects
will be working with documentary sources and will be interested in
capturing machine-readable text or images of the original documents
(although a range of other digitisation methods, such as digitising
analogue sound recordings or creating virtual three-dimensional
models, are also possible).
Transcription and Optical Character Recognition
Transcription remains the only practical way of capturing computer-readable
text from handwritten documents. Optical Character Recognition
(OCR) is currently most useful for processing large numbers of clear
typed documents. Faint text, archaic characters, missing sections
and confusing layouts can all slow transcription considerably and
are likely to make OCR impracticable.
Neither transcription nor OCR is 100% accurate, so some form of
proof-reading should be done. Many OCR software packages provide
tools to help with this. Providing digital images of the original
source alongside the computer-readable text is a good method of
allowing the user to do their own proof-reading of the text.
Once the basic text from the source has been digitised, the logical
structure of the document, or aspects of its visual layout, can
be recorded using markup.
Digital Image Capture
Key decisions to make are the colour depth and resolution to capture.
It is also important to consider the amount of metadata that should
be provided with each image and how that metadata will be stored.
For more information on working with digital images read the Visual
Arts Data Service's Guide to Good Practice,
Creating Digital Resources for the Visual Arts: Standards and Good
Practice.
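Colour depth and resolution together determine how much storage each image needs, which is worth estimating before capture begins. The sketch below calculates uncompressed sizes only; the 8 x 10 inch original and the chosen settings are hypothetical, and compressed file formats will produce smaller files.

```python
def uncompressed_bytes(width_in, height_in, ppi, bits_per_pixel):
    """Size of an uncompressed bitmap for an original of the given size."""
    width_px = round(width_in * ppi)
    height_px = round(height_in * ppi)
    return width_px * height_px * bits_per_pixel // 8


# Hypothetical 8 x 10 inch original captured at different settings
settings = {
    "1-bit bitonal, 300 ppi": (300, 1),
    "8-bit greyscale, 300 ppi": (300, 8),
    "24-bit colour, 300 ppi": (300, 24),
    "24-bit colour, 600 ppi": (600, 24),
}
sizes = {label: uncompressed_bytes(8, 10, ppi, bits)
         for label, (ppi, bits) in settings.items()}
for label, size in sizes.items():
    print(f"{label}: {size / 1_000_000:.1f} MB")
```

Doubling the resolution quadruples the size of each image, so the choice has a direct effect on storage, backup and delivery costs across a large collection.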
Images of charts, maps and diagrams can be captured as bitmaps,
but a better alternative may be to capture vector images of them.
Vector images store information about points, lines and polygons.
Organising a Digital Collection
In addition to digitising the individual documents, images etc.,
a project must consider how best to organise the collection of digitised
objects that is created. Some of the main ways a digital resource
can be conceived of are described below.
Catalogues, indexes and other finding aids
No direct digitisation of source documents may be necessary,
but information must be extracted from the source and organised
according to appropriate metadata standards.
Source based databases
Source based databases usually contain small pieces of information
extracted from a range of original sources and reorganised as
fields in a relational database. Codes and standardisation rules
are often applied to make the data easier to use, but these modifications
should be included as additions rather than replacements for the
original values taken from the sources. Databases can also store
images, sound files, long text files and other types of digital
data.
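The principle of adding standardised values alongside the original ones can be sketched with a small relational table. The table layout, the occupation codes and the source references below are hypothetical, and SQLite is used only because it ships with Python.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE person (
        id INTEGER PRIMARY KEY,
        source_ref TEXT NOT NULL,           -- where the entry was found
        occupation_original TEXT NOT NULL,  -- exactly as written in the source
        occupation_code TEXT                -- standardised value added later
    )
""")

# Hypothetical standardisation rules, keyed on the lower-cased original
OCCUPATION_CODES = {
    "ag lab": "AGRICULTURAL LABOURER",
    "agric. labourer": "AGRICULTURAL LABOURER",
    "taylor": "TAILOR",
}

rows = [
    ("census 1851, f.12", "Ag Lab"),
    ("parish register, p.3", "Taylor"),
    ("census 1851, f.14", "Agric. labourer"),
]
for source_ref, original in rows:
    code = OCCUPATION_CODES.get(original.strip().lower())
    conn.execute(
        "INSERT INTO person (source_ref, occupation_original, occupation_code)"
        " VALUES (?, ?, ?)",
        (source_ref, original, code),
    )

# Search on the standardised code, but report what the source actually says.
labourers = conn.execute(
    "SELECT occupation_original FROM person"
    " WHERE occupation_code = ? ORDER BY id",
    ("AGRICULTURAL LABOURER",),
).fetchall()
print(labourers)
```

Because the original spelling is never overwritten, the coding scheme can be revised later without returning to the sources.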
Markup documents
Markup documents have many uses. Specifically in relation to
historical documents they can be useful when you want to identify
specific items of information in documents such as letters and
journals without destroying the original flow of the document
(as would happen if the information was entered into a database).
For more information about markup see the Oxford Text Archive's
Guide to Good Practice, Creating
and Documenting Electronic Texts.
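A short illustration of the idea: names, places and dates are tagged inside a sentence, and can then be extracted without disturbing the running text. The letter and the element names (loosely modelled on common text-encoding practice) are assumptions made for this sketch; a real project should follow a published markup scheme.

```python
import xml.etree.ElementTree as ET

# Invented extract from a letter, with items of interest marked up in place
letter = """<letter>
  <p>I met <persName>Mr Smith</persName> in <placeName>York</placeName> on
  <date when="1843-05-02">the second of May</date>, and he spoke of the harvest.</p>
</letter>"""

root = ET.fromstring(letter)

# The tagged items can be pulled out like fields in a database...
people = [el.text for el in root.iter("persName")]
places = [el.text for el in root.iter("placeName")]
dates = [el.get("when") for el in root.iter("date")]

# ...while the original flow of the text is still recoverable.
plain_text = " ".join("".join(root.itertext()).split())

print(people, places, dates)
print(plain_text)
```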
Geographical information systems
Geographical Information Systems (GIS) store and analyse spatially
referenced information. Scale and resolution are key issues to
address, particularly given the possible inaccuracies in historical
maps. For historical projects the availability of information
that can be related to spatial locations is likely to be a limiting
consideration.
Making a Resource Available
The most appropriate way of making a resource available will depend
both on the nature of the resource and the target audience. The
History Data Service can preserve and distribute any resources deposited
with us, but ambitious projects may want to develop their own advanced
means of distribution.
Resources aimed at students, large bodies of researchers, or the
public will need to have a well thought out method of distribution.
Providing information via CD/DVD or setting up a website, possibly
with access to downloadable files, are the most popular methods.
While a small static website can be set up quite easily, complex
dynamic sites require considerable time and expertise to develop.
If you plan a website that will allow users to retrieve information
from a database, interact with a map, examine and zoom images, view
movies or similar options, you should ensure that you fully understand
the technical tasks involved. Projects planning complex websites
should include a web development officer in their team, or get quotes
from commercial companies.
Piloting
Small scale trials of techniques are useful ways of highlighting
unexpected problems and potential pitfalls in data development.
Including a formal piloting stage in your project can be worthwhile.
It can be used to test the feasibility of different methods of digitisation,
or check how your target audience responds to the resource.
The type and level of technical support a project will need will
vary. Basic technical support (advice on procuring software and
hardware, support for common software packages, network connections
etc.) should be provided by your institution. It is important to
check exactly what support can in practice be expected. If you plan
to rely heavily on your institution for technical support, you should
discuss your requirements before the project starts.
If your institution supports particular software packages then,
other things being equal, you should use those software packages.
Working within the limits of available support can have advantages.
The requirements of your web host can, for example, simplify your
choices between servers, scripting languages, database servers and
the other elements involved in a complex website.
In many cases those planning the project, or staff hired for the
project, will bring important technical skills with them. Where
a discrete and relatively small technical task must be completed,
a project can consider hiring technical consultants. Setting up a
project website, for example, might be contracted out. When staff
in the project hold critical skills, it is important to plan for
the possibility of illness or staff leaving, as some skills are
rare and individuals cannot be easily replaced.
Backing up data is a basic precautionary step that everybody working
with computers should take. Backup copies are an insurance policy
against the possibility of your data being lost, damaged or destroyed.
Requirements for a Good Backup Policy
A good backup policy will protect your data from a large range
of mishaps. The range of events that you should consider when planning
how to back up your data includes:
- Accidental changes to data
- Accidental deletion of data
- Loss of data due to media or software faults
- Virus infections and interference by hackers
- Catastrophic events (fire, flood etc.)
A good backup policy should provide protection against all of these
threats.
Frequency of Backup
Backups should be made regularly to ensure that they remain up-to-date.
The more frequently data is being changed the more frequently backups
should be made. If your data is changing significantly every day
you should consider a daily backup, but if you are prepared and
can afford to redo a longer period of work then less frequent backup
may be appropriate.
As well as backing up frequently, you should keep several backup
copies made at different dates. Doing this guards against the danger
that your backup copy will incorporate a recent, but as yet undiscovered,
problem from your working copy.
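Keeping several dated copies is easy to automate. The following is a minimal sketch, assuming a single working file and a hypothetical naming scheme of name-YYYY-MM-DD; the function name, file names and dates are invented, `keep` is assumed to be at least 1, and the demonstration runs in a temporary directory rather than on real data.

```python
import shutil
import tempfile
from datetime import date, timedelta
from pathlib import Path


def make_dated_backup(source: Path, backup_dir: Path, today: date,
                      keep: int = 5) -> Path:
    """Copy source into backup_dir under a dated name, keeping `keep` copies."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    target = backup_dir / f"{source.stem}-{today.isoformat()}{source.suffix}"
    shutil.copy2(source, target)
    # ISO dates sort chronologically, so sorting by name gives oldest first.
    copies = sorted(backup_dir.glob(f"{source.stem}-*{source.suffix}"))
    for old in copies[:-keep]:
        old.unlink()
    return target


# Demonstration: five daily backups, retaining the three most recent
with tempfile.TemporaryDirectory() as tmp:
    work = Path(tmp) / "parish.csv"
    work.write_text("id,name\n1,Smith\n")
    backups = Path(tmp) / "backups"
    for offset in range(5):
        day = date(2024, 1, 1) + timedelta(days=offset)
        make_dated_backup(work, backups, day, keep=3)
    remaining = sorted(p.name for p in backups.iterdir())

print(remaining)
```

How many dated copies to keep is a judgement about how long a problem in the working copy might go unnoticed.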
Multiple Backup Copies
A backup copy may suffer the same mishaps as the working copy of
your data, so it is a good idea to spread the risk by maintaining
several backup copies. A minimum of two backup copies should be
maintained in addition to your working copy of the data.
Offsite Backups
More serious events, such as a fire in the office, will destroy
both the working copy of the data and any backup copies stored at
the same location. Some backup copies should be stored 'offsite'
(offsite is a relative term, dependent on the level of protection
you want).
As well as storing some copies offsite, it is useful to keep a
backup copy onsite. This copy can be quickly retrieved and work
recommenced if there is a minor mishap, such as the accidental deletion
of an important file.
Media
Backup copies should be made on new media. Do not continue to use
media once they start to develop faults. Specifically, floppy disks
are not a good medium for backup copies. If they are used, they
should be replaced often.
Store backup copies on multiple media (e.g. zip disk and CD-ROM)
to avoid all your backup copies becoming corrupted by the same drive
or disk fault.
Multiple Formats
Store backup copies in both the software formats that you are using
and in exported formats (many spreadsheets and database packages
can export to delimited text, for example). This will help protect
you from subtle faults that can sometimes develop in complicated
data formats (such as database file formats) that may not become
apparent until after they have been included in both the working
copy and the backup copies.
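As an illustration of an exported format, a database table can be written out as comma-delimited text. This sketch uses SQLite and an invented table purely as an example; most database and spreadsheet packages offer an equivalent export option in their own menus. The table name is inserted directly into the query, so it should only ever come from the project itself, never from user input.

```python
import csv
import io
import sqlite3


def export_table_to_csv(conn, table, out):
    """Write every row of `table` to `out` as delimited text with a header."""
    cursor = conn.execute(f'SELECT * FROM "{table}"')
    writer = csv.writer(out)
    writer.writerow([column[0] for column in cursor.description])
    writer.writerows(cursor)


# Hypothetical working database
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE burial (id INTEGER PRIMARY KEY, surname TEXT, year INTEGER)")
conn.executemany(
    "INSERT INTO burial (surname, year) VALUES (?, ?)",
    [("Smith", 1843), ("O'Neill", 1851)],
)

# An in-memory buffer stands in for the backup file in this demonstration.
buffer = io.StringIO()
export_table_to_csv(conn, "burial", buffer)
exported = buffer.getvalue()
print(exported)
```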
Institutional Backup Policy
Projects should never assume that their institution's policies
will be appropriate to their needs. Always check.
- Institutions may maintain backups for a limited period
- Institutions may only provide backups to protect against complete
loss of data, and not individual users losing data
- Institutions may not back up all data held on their network
Many organisations advise their users to make their own backups
of critical data. This is good advice and should be followed.
Check Your Backup!
A backup that does not actually work is of no use at all. Always
test your backup procedures to ensure that your backup can be retrieved
and is useable.
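One simple check is to restore a file from the backup and compare its checksum with that of the working copy. The sketch below is an assumption about how such a check might be scripted, with invented file names, and runs on temporary files. A matching checksum shows only that the copy is byte-for-byte identical; it is still worth opening the restored file in the software you normally use.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Checksum of a file, read in chunks so large files fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_matches(original: Path, restored: Path) -> bool:
    return sha256_of(original) == sha256_of(restored)


# Demonstration with one good copy and one truncated copy
with tempfile.TemporaryDirectory() as tmp:
    original = Path(tmp) / "data.csv"
    original.write_text("id,name\n1,Smith\n")
    good_copy = Path(tmp) / "restored.csv"
    shutil.copy2(original, good_copy)
    damaged_copy = Path(tmp) / "damaged.csv"
    damaged_copy.write_text("id,name\n1,Sm")
    good_ok = backup_matches(original, good_copy)
    damaged_ok = backup_matches(original, damaged_copy)

print(good_ok, damaged_ok)
```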
Backup is not Preservation
A backup copy is an exact copy of the version of the data you are
working on. If your working copy becomes unuseable, you should be
able to start using your backup copy immediately, on the same computers,
using the same software.
In contrast, a preservation version of the data is designed to
mitigate the effects of rapid technology change that might otherwise
make the data unuseable within a few years.
Whereas backup protects data from damage and destruction in the
short term, digital preservation deals with the long-term survival
of data, ensuring data remains useable as hardware and software
changes. Successful digital preservation requires an organisational
framework that can ensure digital resources remain useable as technology
changes.
You can deposit data
with the History Data Service for preservation and dissemination.