A place in history: a guide to using GIS in historical research


CHAPTER 7: SPATIAL ANALYSES OF STATISTICAL DATA IN GIS

 


7.2 What makes spatially referenced data special?

Spatially-referenced data have special characteristics that offer both advantages and disadvantages to the researcher wanting to perform statistical analysis. The main advantage is that they allow us to ask questions such as 'where does this occur?', 'how does this pattern vary across the study area?', 'how does an event at this location affect surrounding locations?', and 'do areas with high rates of one variable also have high rates of another?'. Conventional statistical techniques such as correlation and regression tend to produce a summary statistic that quantifies the strength of a relationship within a dataset or between two or more sets of variables, for example, a correlation coefficient of 0.8 between X and Y. This approach is termed global or whole-map analysis and is undesirable from a GIS perspective because it ignores the impact of space by simply providing a summary of the average relationship over the whole study area. In reality any relationship is likely to vary over space, and when analysing spatially-referenced data we should attempt to use techniques that allow us to examine not simply what the average relationship is over the whole map, but how the relationship varies across the map. This is termed local analysis (Fotheringham 1997). Some examples of how this can be done are given in section 7.3. For now it is sufficient to make the point that techniques that explicitly incorporate the spatial component of the data allow us to develop a more sophisticated understanding of the data, whereas ignoring the effects of space, as many traditional techniques do, limits the understanding that can be gained.
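The difference between global and local analysis can be illustrated with a minimal sketch. The observations below are invented for illustration, location is reduced to a single easting, and the simple moving window is an assumption chosen for brevity; it is not the specific technique of Fotheringham (1997) or of section 7.3.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical observations as (easting, x, y).
# In the west y rises with x; in the east y falls with x.
observations = [
    (0, 1, 1), (1, 2, 2), (2, 3, 3), (3, 4, 4),
    (10, 1, 4), (11, 2, 3), (12, 3, 2), (13, 4, 1),
]

# Global (whole-map) analysis: one summary statistic for the whole study area.
global_r = pearson([o[1] for o in observations], [o[2] for o in observations])

def local_r(centre, bandwidth):
    """Correlation using only observations within `bandwidth` of `centre`."""
    nearby = [o for o in observations if abs(o[0] - centre) <= bandwidth]
    return pearson([o[1] for o in nearby], [o[2] for o in nearby])

# Local analysis: the same statistic recomputed for different parts of the map.
west_r = local_r(1.5, 2)
east_r = local_r(11.5, 2)

print("global:", global_r, "west:", west_r, "east:", east_r)
```

In this invented dataset the global coefficient is zero, suggesting no relationship, while the local coefficients show a perfect positive relationship in the west and a perfect negative one in the east.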

There are four major disadvantages that make spatially-referenced data special. These are: data quality and error propagation; spatial autocorrelation; the modifiable areal unit problem (MAUP); and the ecological fallacy. Many of the sources of data-quality problems in spatially-referenced data were discussed in Chapter 4. Here it is sufficient to say that there is always error, inaccuracy, and uncertainty in the spatial component of data. When data are combined through overlay or similar operations the error and uncertainty become cumulative, a process known as error propagation. GIS software packages do not handle this well, as their data model only allows a single, precisely defined representation for each spatial feature, so the possible impact of error must always be considered.
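A rough sense of how error accumulates can be gained by simulation. The sketch below assumes that the same feature has been captured in two layers, each with independent, normally distributed positional error along one axis; the standard deviations of 10 m and 20 m are invented for illustration. Under those assumptions the mismatch between the two layers is larger than the error in either layer alone.

```python
import random
from math import sqrt
from statistics import pstdev

random.seed(42)  # fixed seed so the sketch is repeatable

SD_LAYER_A = 10.0  # hypothetical positional error of layer A, in metres
SD_LAYER_B = 20.0  # hypothetical positional error of layer B, in metres
N = 20000          # number of simulated features

# For each simulated feature, draw an independent error for each layer and
# record the resulting mismatch between the two layers after overlay.
offsets = []
for _ in range(N):
    a_error = random.gauss(0.0, SD_LAYER_A)
    b_error = random.gauss(0.0, SD_LAYER_B)
    offsets.append(a_error - b_error)

simulated_sd = pstdev(offsets)

# For independent errors the variances add, so the combined standard
# deviation is the square root of the sum of the squared deviations.
expected_sd = sqrt(SD_LAYER_A ** 2 + SD_LAYER_B ** 2)

print("simulated:", round(simulated_sd, 2), "expected:", round(expected_sd, 2))
```

Real positional errors are rarely independent or normally distributed, so this should be read as an indication of the direction of the effect rather than as a way of estimating it.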

Many statistical techniques make the assumption that the observations used in the study are independent pieces of evidence. This is usually referred to as being 'independently random'. If we are studying the locations of the data to try to find a geographical pattern then we are working on the principle that the locations are being influenced by some underlying cause that varies over space, so that observations close together tend to be more alike than observations far apart. Data of this kind are referred to as being spatially autocorrelated, and this invalidates the assumption of independent randomness. Spatial autocorrelation is similar to temporal autocorrelation but is more complicated, as it operates in many directions simultaneously. The degree of spatial autocorrelation in a dataset may be quantified, as is described later in this chapter.
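One widely used measure of spatial autocorrelation is Moran's I. It is sketched here only as an illustration; whether it matches the measures described later in the chapter is an assumption. The zone values and the simple row of six adjacent zones are invented, and neighbouring zones are given a weight of 1.

```python
def morans_i(values, neighbour_pairs):
    """Moran's I for zone values, given pairs (i, j) of neighbouring zones.

    Each pair is treated as symmetric with weight 1, so it contributes
    twice to both the cross-product term and the sum of weights.
    """
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    cross = 0.0  # sum over i, j of w_ij * dev_i * dev_j
    s0 = 0.0     # sum of all weights
    for i, j in neighbour_pairs:
        cross += 2 * dev[i] * dev[j]
        s0 += 2
    return (n / s0) * cross / sum(d * d for d in dev)

# Six hypothetical zones in a row; each zone neighbours the next one.
chain = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]

clustered = [1, 1, 1, 5, 5, 5]    # similar values sit next to each other
alternating = [1, 5, 1, 5, 1, 5]  # every zone differs from its neighbours

i_clustered = morans_i(clustered, chain)
i_alternating = morans_i(alternating, chain)

print("clustered:", i_clustered, "alternating:", i_alternating)
```

The clustered arrangement gives a value of about 0.6 (positive autocorrelation) and the alternating arrangement gives -1.0 (negative autocorrelation), even though both contain exactly the same set of values.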

Many socio-economic datasets, for example the census, are published as totals for administrative units. The boundaries of these units are, to all intents and purposes, arbitrary, defined as a result of politics and inertia rather than because they say anything meaningful about the population that they are sub-dividing. The question this raises is: if the arrangement of the administrative units were changed, would the results of any analyses based on them also change? Openshaw and Taylor (1979) vividly demonstrated that they could. They compared data on the percentage of the population voting Republican in the 1968 Congressional election with the percentage of the population aged over 60 for 99 counties in Iowa. By aggregating these data to six regions using different arrangements they could produce correlations from -0.99 to 0.99 and almost any result in between. In other words, the results of the analysis were totally dependent on the way in which the data were aggregated. Similar phenomena have also been demonstrated using regression analysis (Fotheringham and Wong 1991), and they occur as a result of two effects combining. The first, the scale effect, simply means that as data are aggregated they become increasingly averaged or smoothed as extreme values are merged with more normal areas. The second is connected to the actual arrangement of the boundaries, which can lead to results being 'gerrymandered' in a similar way to election results. Taken together these effects are referred to as the modifiable areal unit problem (MAUP).
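The effect of changing the zoning can be reproduced on a very small scale. The sketch below uses eight invented base units arranged in a 2 x 4 grid, not the Iowa data, and assumes that every unit has the same population so that a zone's value is the simple mean of its units. The same units are aggregated into four zones in two different ways.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical base units on a 2 x 4 grid, numbered row by row:
#   0 1 2 3
#   4 5 6 7
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [5, 7, 7, 9, 1, 3, 3, 5]

# Two ways of grouping the same units into four contiguous zones.
zoning_a = [(0, 4), (1, 5), (2, 6), (3, 7)]  # vertical pairs (columns)
zoning_b = [(0, 1), (2, 3), (4, 5), (6, 7)]  # horizontal pairs (half rows)

def aggregate(values, zoning):
    """Mean value per zone, assuming equal populations in every unit."""
    return [sum(values[u] for u in zone) / len(zone) for zone in zoning]

r_units = pearson(x, y)
r_a = pearson(aggregate(x, zoning_a), aggregate(y, zoning_a))
r_b = pearson(aggregate(x, zoning_b), aggregate(y, zoning_b))

print("base units:", r_units, "zoning A:", r_a, "zoning B:", r_b)
```

With these values zoning A produces a strong positive correlation (about 0.95) and zoning B a negative one (-0.6), although neither the underlying data nor the number of zones has changed.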

At one level the MAUP is highly worrying, as it means that the results of any analysis using spatially aggregate data are suspect and could simply be a product of the administrative units used to analyse the data. Some statistical and GIS-based approaches to the MAUP have been suggested (see Openshaw and Rao 1995; Fotheringham et al. 2000), but these are either statistically or geographically complex and are not entirely satisfactory. A more pragmatic approach is to avoid additional aggregation by using the data in as close to their raw form as possible, while interpreting the results of any analysis with the possible impact of modifiable areal units in mind.

The ecological fallacy is closely related to the modifiable areal unit problem. It deals with the fact that relationships found in aggregate data may not apply at the household or individual level. For example, if areas with high rates of unemployment also have high crime rates, it is a mistake to conclude that a person who is a criminal is more likely to be unemployed than employed. This has been known since the 1950s, yet the increasing potential to analyse spatially aggregate data in GIS means that there is a temptation to ignore it. There have been attempts to find mathematical or statistical solutions to the ecological fallacy but, as with the MAUP, these provide computationally and statistically complex ways into the problem without actually resolving it.
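The fallacy can be made concrete with an invented population. The sketch below assumes three areas of ten people each, constructed so that no unemployed person is an offender; all figures and names are hypothetical. It compares the correlation between area rates with the correlation between the same two attributes measured on individuals.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def make_area(size, n_unemployed, n_offenders):
    """Hypothetical area of (unemployed, offender) flags, one per person.

    The first n_unemployed people are unemployed and not offenders; the
    next n_offenders are employed offenders. Nobody is both.
    """
    people = []
    for k in range(size):
        unemployed = 1 if k < n_unemployed else 0
        offender = 1 if n_unemployed <= k < n_unemployed + n_offenders else 0
        people.append((unemployed, offender))
    return people

areas = [make_area(10, 1, 1), make_area(10, 3, 3), make_area(10, 5, 5)]

# Aggregate level: one unemployment rate and one offending rate per area.
unemployment_rates = [sum(p[0] for p in a) / len(a) for a in areas]
offending_rates = [sum(p[1] for p in a) / len(a) for a in areas]
area_r = pearson(unemployment_rates, offending_rates)

# Individual level: the same two attributes, person by person.
individuals = [p for a in areas for p in a]
individual_r = pearson([p[0] for p in individuals],
                       [p[1] for p in individuals])

print("area level:", area_r, "individual level:", individual_r)
```

At the area level the two rates are perfectly correlated, yet at the individual level the relationship is negative, because in this constructed population every offender is employed.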

 

 


© Ian Gregory 2002

The right of Ian Gregory to be identified as the Author of this Work has been asserted by him in accordance with the Copyright, Designs and Patents Act 1988.

All material supplied via the Arts and Humanities Data Service is protected by copyright, and duplication or sale of all or any part of it is not permitted, except that material may be duplicated by you for your personal research use or educational purposes in electronic or print form. Permission for any other use must be obtained from the Arts and Humanities Data Service.

Electronic or print copies may not be offered, whether for sale or otherwise, to any third party.

