Geography 570
Graduate GIS Seminar

Spatial Stats I: The Modifiable Areal Unit Problem

The modifiable areal unit problem (MAUP) is endemic to all spatially aggregated data. It consists of two interrelated parts. First, there is uncertainty about what constitutes the objects of spatial study, identified as the scale and aggregation problem. Second, there are the implications this holds for the methods of analysis commonly applied to zonal data, and for the continued use of a normal science paradigm that can neither cope with the problem nor admit to its existence. The following notes have been derived from the references listed below (some of the text is quoted without attribution). See also Lisa Oliver's notes on MAUP.

The scale effect is the tendency, within a system of modifiable areal units, for different statistical results to be obtained from the same set of data when the information is grouped at different levels of spatial resolution (e.g., enumeration areas, census tracts, cities, regions).
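
The scale effect can be illustrated with a small, entirely hypothetical data set. The sketch below (Python, standard library only) builds a transect of 16 base units, gives two variables a shared trend plus opposing local deviations, and recomputes the correlation after averaging the units into 8 and then 4 contiguous zones. The values, the function names and the choice of averaging are assumptions made for illustration; they are not taken from any of the studies cited in these notes.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def aggregate(values, size):
    """Average consecutive runs of `size` base units into one zone."""
    return [sum(values[i:i + size]) / size for i in range(0, len(values), size)]

# Hypothetical transect of 16 base units: a shared trend along the transect
# plus a local deviation (+3, +3, -3, -3, ...) that pushes x and y in
# opposite directions.
local = [3 if (i // 2) % 2 == 0 else -3 for i in range(16)]
x = [i + d for i, d in enumerate(local)]
y = [i - d for i, d in enumerate(local)]

# Same data, three levels of spatial resolution (16, 8 and 4 zones).
r_by_zone_count = {}
for size in (1, 2, 4):
    xa, ya = aggregate(x, size), aggregate(y, size)
    r_by_zone_count[len(xa)] = pearson(xa, ya)

for zones, r in r_by_zone_count.items():
    print(f"{zones:2d} zones: r = {r:.3f}")
```

With these made-up values the correlation is roughly 0.41 at both 16 and 8 zones and reaches 1 at 4 zones, where the local deviations cancel within every zone. Nothing about the underlying data has changed; only the resolution has.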

The aggregation or zoning effect is the variability in statistical results obtained within a set of modifiable units as a function of the various ways these units can be grouped at a given scale, and not as a result of the variation in the size of those areas. (See figure.)
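
A minimal sketch of the zoning effect, again using invented data: a 4 x 4 grid of cells is grouped into four zones of four cells each in three different ways (by rows, by columns, by quadrants). The scale is identical in all three cases; only the arrangement of the boundaries differs. The grid values and helper names are assumptions chosen so the arithmetic is easy to follow.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def zone_means(grid, zones):
    """Mean of the grid values over the cells in each zone."""
    return [sum(grid[r][c] for r, c in cells) / len(cells) for cells in zones]

# Hypothetical 4 x 4 study area: x increases with both row and column index,
# y increases with row index but decreases with column index.
N = 4
x = [[r + c for c in range(N)] for r in range(N)]
y = [[r - c for c in range(N)] for r in range(N)]

# Three zonings at the same scale: 4 zones of 4 cells each.
zonings = {
    "rows": [[(r, c) for c in range(N)] for r in range(N)],
    "columns": [[(r, c) for r in range(N)] for c in range(N)],
    "quadrants": [
        [(r, c) for r in range(r0, r0 + 2) for c in range(c0, c0 + 2)]
        for r0 in (0, 2) for c0 in (0, 2)
    ],
}

r_by_zoning = {
    name: pearson(zone_means(x, zones), zone_means(y, zones))
    for name, zones in zonings.items()
}
for name, r in r_by_zoning.items():
    print(f"{name:9s} r = {r:+.2f}")
```

For this grid the row zoning yields r = +1, the column zoning r = -1, and the quadrant zoning r = 0, all from the same sixteen cells. It is a contrived case, but it shows how a zoning could be chosen to produce almost any desired result.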

The second problem follows from the uncertainty in choosing zonal units. Different areal arrangements of the same data produce different results, so we cannot claim that the results of spatial studies are independent of the units being used, and the task of obtaining valid generalizations or comparable results becomes extraordinarily difficult. (Consider a comparative study among Vancouver, Toronto and Montreal--some of the differences may be due to the way the data are aggregated in each city.)

MAUP therefore consists of two problems--one statistical and the other geographical, and it is difficult to isolate the effects of one from the other.

Consider the typical approach taken in a multiple regression analysis: using a set of independent variables, the analyst attempts to produce the 'best' fit to the dependent variable using routines such as stepwise regression, logit regression, etc. However, the results obtained may be entirely dependent on the spatial units being used--by altering the number and arrangement of the spatial units, a completely different set of results could be obtained. Traditionally we only concern ourselves with the first problem--producing the best fit given a particular set of statistical tools--and ignore the second problem--are the spatial units being used the most appropriate ones for the problem at hand? As Openshaw (1996) states: "The MAUP will disappear once geographers know what the areal objects they wish to study are."

We should recognize that two distinct types of spatial units are commonly used in geographic analysis: artificial and natural units. Census data, collected for individuals but aggregated and represented as areas, present a major problem in interpretation, and cannot be treated in the same way as areal data such as soil type, which are both collected and represented as areas. One could say that physical geographers are therefore somewhat immune to the MAUP, but the classification of a landscape into elements such as the toe of a slope, a ridge, etc., is not without ambiguity in definition, and therefore physical geographers do need to be aware of MAUP. In particular, the scale effect is very much a concern in many studies (e.g., drainage basins: first order, second order, etc.).

Spatial analysis of socioeconomic and health data often requires aggregation into arbitrary areal units. Such aggregation occurs for the protection of individual privacy and for the computation of rates. For example, average income cannot be determined, like temperature, at a given point, but must be obtained from aggregated data to be meaningful. Moreover, the larger the spatial unit, the more statistically stable the observed rate, but that stability is obtained at the expense of greater spatial ambiguity. As such, we must often deal with these issues.
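
The stability side of this trade-off can be made concrete under a simple binomial model, which is an assumption introduced here and not something the sources specify. Under that model the standard error of an observed rate is sqrt(p(1 - p)/n), where p is the underlying rate and n the population of the unit. The function name and the figures below are illustrative only.

```python
import math

def rate_standard_error(p, n):
    """Standard error of an observed rate under a binomial model (an assumption)."""
    return math.sqrt(p * (1 - p) / n)

# Hypothetical underlying rate of 5% observed in units of increasing population.
underlying_rate = 0.05
standard_errors = {
    n: rate_standard_error(underlying_rate, n) for n in (100, 1_000, 10_000)
}
for n, se in standard_errors.items():
    print(f"population {n:>6}: standard error = {se:.4f}")
```

A hundredfold increase in population reduces the standard error by a factor of ten, but the larger unit also says less about where within it the events occurred.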

Why should it be of concern to everyone involved in the analysis of spatial data?

The MAUP has been known since the early 1930s, when, in a study of the scale effect in census data, the authors noted that "a relatively high correlation might conceivably occur by census tracts when the traits so studied were completely dissociated [in the ultimate possessors of those traits]... individuals or families" (Gehlke and Biehl 1934). This particular effect of MAUP is more formally known as the ecological fallacy, and is one of the serious problems which follow from the MAUP.
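
The situation Gehlke and Biehl describe can be reproduced with a toy example. In the sketch below, three hypothetical census tracts of ten people each are constructed so that no individual holds both of two traits, yet the share of residents holding each trait rises together from tract to tract. All counts and names are invented for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def make_tract(n_a_only, n_b_only, size=10):
    """Build a tract as a list of (has_trait_a, has_trait_b) pairs.

    Nobody is given both traits, so they are dissociated in individuals.
    """
    neither = size - n_a_only - n_b_only
    return [(1, 0)] * n_a_only + [(0, 1)] * n_b_only + [(0, 0)] * neither

# Hypothetical tracts: both traits become more common from tract to tract.
tracts = [make_tract(1, 1), make_tract(3, 3), make_tract(5, 5)]

# Individual level: correlate the two traits across all 30 people.
people = [person for tract in tracts for person in tract]
r_individual = pearson([a for a, _ in people], [b for _, b in people])

# Tract level: correlate the proportion of residents holding each trait.
share_a = [sum(a for a, _ in tract) / len(tract) for tract in tracts]
share_b = [sum(b for _, b in tract) / len(tract) for tract in tracts]
r_tract = pearson(share_a, share_b)

print(f"individual-level r = {r_individual:+.2f}")
print(f"tract-level r      = {r_tract:+.2f}")
```

At the tract level the two traits are perfectly correlated (r = +1), while among individuals the correlation is negative (about -0.43). Inferring the individual relationship from the tract figures would be exactly the ecological fallacy.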

However, other problems may also affect any study using aggregated data, such as Simpson's Paradox. If the value of one variable varies in correlation with another (e.g., areas of high unemployment are often found in areas with a high number of a particular socio-economic characteristic), then it may be impossible to obtain a reliable estimate of the true correlation between the two variables (Table 3.6).
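
Simpson's Paradox refers to an association that holds within each group of observations but weakens or reverses when the groups are pooled. The following sketch uses two invented groups of areas (the numbers are hypothetical and are not those of Table 3.6) in which x and y are perfectly negatively related within each group but strongly positively related once the groups are combined.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def correlate(pairs):
    """Correlation between the first and second element of each (x, y) pair."""
    return pearson([x for x, _ in pairs], [y for _, y in pairs])

# Hypothetical (x, y) observations for two groups of areas.
groups = {
    "group 1": [(1, 3), (2, 2), (3, 1)],
    "group 2": [(6, 8), (7, 7), (8, 6)],
}

r_within = {name: correlate(pairs) for name, pairs in groups.items()}
pooled = [pair for pairs in groups.values() for pair in pairs]
r_pooled = correlate(pooled)

for name, r in r_within.items():
    print(f"{name}: r = {r:+.2f}")
print(f"pooled:  r = {r_pooled:+.2f}")
```

Within each group r = -1, yet the pooled correlation is about +0.81, because the difference between the groups dominates the relationship within them.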

Given that many policy decisions are made on the basis of statistical associations obtained from the analysis of spatial data (e.g., funding for multicultural activities to neighbourhoods on the basis of percentage ethnic population living in the neighbourhood), much more attention needs to be paid to the problem (Figure 2.1 from Wrigley et al., 1996).

Using the MAU effect we can create zonings with particular statistical aims in mind--as illustrated in figure 4.1 (Openshaw, 1996).

The question is: why does MAUP occur?

Could it be:

That the aggregation effects are at least partly a result of methodological considerations relating to the appropriateness of the statistics chosen and their application? (That is, statistical assumptions are violated in the application of the particular statistic, and that violation causes the aggregation effect to appear; aggregation in and of itself doesn't cause the problem.)

That the spatial process being examined at one scale may not be the same spatial process examined at another scale, and therefore the concept of an aggregation effect is inherently flawed? Many processes also do not scale linearly, which can create analytical problems. Consider hierarchy theory in ecology (emergent processes).

Amrhein (1995) used random data with known distributions (no spatial correlation) to study the statistical aspects of the aggregation problem. He observed that aggregation does not affect the mean, even after hundreds of simulations with aggregations ranging from 10,000 observations down to 9 spatial units. The variances did vary following aggregation, but the differences could be explained almost entirely by accounting for the sample size (10,000 down to 9). However, he did conclude that "populations with very high variances are more likely to generate aggregation effects related to zonation than are populations with very low variances" (p. 113).
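
The flavour of this result can be reproduced with a much simplified simulation. The sketch below is not Amrhein's design: it assumes 10,000 independent normal draws (mean 50, standard deviation 10, both arbitrary), groups them into equal-sized zones, and compares the mean and variance of the zone means with what sample size alone would predict.

```python
import random
import statistics

def zone_means(values, n_zones):
    """Average equal-sized consecutive groups of observations into zones."""
    size = len(values) // n_zones
    return [statistics.fmean(values[i * size:(i + 1) * size]) for i in range(n_zones)]

# 10,000 independent draws with no spatial structure (hypothetical N(50, 10)).
# Because the draws are independent, grouping consecutive values is
# equivalent to a random zoning.
rng = random.Random(42)
observations = [rng.gauss(50, 10) for _ in range(10_000)]
overall_mean = statistics.fmean(observations)
overall_variance = statistics.pvariance(observations)

results = {}
for n_zones in (1_000, 100, 10):
    means = zone_means(observations, n_zones)
    per_zone = len(observations) // n_zones
    results[n_zones] = {
        "mean": statistics.fmean(means),
        "variance": statistics.pvariance(means),
        # Variance expected from sample size alone: sigma^2 / observations per zone.
        "expected_variance": overall_variance / per_zone,
    }

print(f"all observations: mean = {overall_mean:.3f}, variance = {overall_variance:.3f}")
for n_zones, stats in results.items():
    print(
        f"{n_zones:5d} zones: mean = {stats['mean']:.3f}, "
        f"variance = {stats['variance']:.3f}, "
        f"expected = {stats['expected_variance']:.3f}"
    )
```

With equal-sized zones the mean of the zone means equals the overall mean, and the variance of the zone means should track the overall variance divided by the number of observations per zone, with more scatter when there are only a few zones.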

With real data, however, the situation can be very different. The variability in the statistic obtained can be considerable, as illustrated in Table 2.1 (Wrigley et al., 1996)--a demonstration of both the scale effect and the aggregation effect. Figure 2.2 shows what typically happens to correlation coefficients.

As Fotheringham and Wong (1991) state: "[T]he correlation coefficient for variables of absolute measurement increases when areal units are aggregated contiguously, but there is little equivalent trend for ratio or percentage variables. It is very easy to see why the correlation coefficient should increase as the level of data aggregation increases--the process involves a smoothing effect (by averaging or summing), so that the variation of a variable tends to decrease as aggregation increases. As the correlation coefficient is calculated as:

$$r = \frac{\mathrm{cov}(x,y)}{S_x S_y}$$

where cov(x,y) is the covariance of x and y, and Sx and Sy represent the standard deviations of x and y, respectively; when the variances of x and y decrease, the correlation coefficient will increase if the covariance between x and y is relatively stable." Amrhein (1995, 117) observed the same trend: "the trend here is systematic so one may say that correlation coefficients increase in range with decreasing number of zones to the point that the range and standard deviation of the values that might be observed converges systematically on the possible range of values" (i.e., for r, -1 to +1).
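
One way to see the smoothing argument is through a deliberately simple model, which is an assumption introduced here and not something taken from the sources. Suppose each observation i in zone z is x_i = u_z + e_i and y_i = u_z + f_i, where u_z is a component shared by both variables with variance \tau^2, and e_i and f_i are noise terms with variance \sigma^2 that are independent of each other and of u_z. For individual observations, and for means over zones of m observations, the correlations are then:

```latex
r_{\text{individual}}
  = \frac{\operatorname{cov}(x, y)}{S_x S_y}
  = \frac{\tau^2}{\tau^2 + \sigma^2},
\qquad
r_{\text{zone}}
  = \frac{\operatorname{cov}(\bar{x}, \bar{y})}{S_{\bar{x}} S_{\bar{y}}}
  = \frac{\tau^2}{\tau^2 + \sigma^2 / m}
```

Under these assumptions averaging leaves the covariance at \tau^2 while shrinking each variance, so r rises towards 1 as m grows. Where the covariance also shrinks with aggregation, as it would for data with no shared zone-level component, no such increase should be expected.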

Amrhein observed that regression coefficients appear to be sensitive to aggregation scale effects, and that these effects increase as the aggregation procedure reduces the number of reporting units. Some authors have reported that neither the correlation coefficient nor the slope estimate of the regression increased monotonically with aggregation, while others found that the variations of both decreased as the scale increased. Parameter estimates for intensive or percentage variables appear to be less sensitive to different zoning criteria or to zoning effect than those for extensive (total) variables. Green and Flowerdew illustrated the results of one study. These results are probably a function of the underlying data, and the nature of the spatial autocorrelation in it (noting that spatial autocorrelation in the independent variables can produce a different effect on the parameters than does spatial autocorrelation in the dependent variable).

Going back to the root cause of the 'problem.'

I think that the confusing sets of results are a reflection of the unknown nature of the underlying area homogeneity or grouping effects--the fact that geographical areas are made up not of random groupings of individuals / households, but of individuals / households that tend to be more alike than those in different areas. Three main classes of models have been identified:

Grouping models, in which similar individuals / households choose, or are constrained, to locate in the same area / group, either when those groups are formed or through migrations. That is to say, some process has operated and / or continues to operate such that individuals / households are not randomly assigned to areas. Simply said: a tendency for people with similar attributes to choose to live near each other. (A tendency for plants with similar ecological requirements to be located in 'communities'.) (The typical Chinatown.)

Group-dependent models, in which individuals / households in the same area / group are subject to similar external influences. For example, there may be some 'contextual' variable affecting all individuals in the area. Alternatively, some common influence may have operated in the past, the effects of which are still felt. Simply: the effects of other characteristics of the area, which may or may not be available for analysis. (The rain shadow effects felt in the Okanagan Valley, and the dryland communities that result.) (Cancer rates related to Hanford.)

Feedback models, in which individuals / households interact with each other and influence each other, and the frequency / strength of such interaction is likely to be greater between individuals in the same area / group than between individuals in different areas. Simply: a tendency for people living nearby to interact and as a result to develop common characteristics. (The 'British Estates' in West Vancouver.)

These models could all be operating, and be operating at different scales. Therefore, attempting to achieve a perfect understanding of the reasons why MAUP occurs may be impossible. These models describe different ways in which spatial (auto)correlation may be acting on the variables of interest. Ultimately, neighbourhoods are composed of unique combinations of behavioral, social, political, economic and physical environments, and no combination of statistical manipulations may be able to unpack such a complex set of 'actors.'


Some recent research directions

Recent research into understanding MAUP and attempting to model its effects on statistical interpretations has concentrated on incorporating such conceptual models into the statistical analysis. Wrigley et al. (1996) have used grouping variables--selected on the basis of an understanding of the socio-economic history of the region--in an attempt to adjust the observed parameters obtained through a statistical analysis at an aggregated level. (Ideally we would work only with individual-level data, since at any aggregation level there will be scale and zoning effects; however, since we typically don't have such data, we should attempt to 'correct' the derived parameters, given that we know they will be incorrect in some manner.) See Table 3.3 from Green and Flowerdew (1996) for an example of an attempt to account for spatial autocorrelation.

These models may also apply differently in different areas, resulting in a very complex set of possible relations, as illustrated in Jones and Duncan's figures (5.1-5.4). They used such an approach in their analysis of educational attainment (figure 5.6). (Although the analysis looks at a non-spatial problem, the fact that 'context' was found to be of such importance is the key point here--what we observe now is a function of what was present there before, and unless we take that into consideration we cannot adequately judge our findings.)

As we can see, the MAUP is a very complex problem that will vex geographers, ecologists, economists, etc., for the foreseeable future.

We must also consider that the scale, grain and extent of the region being examined can play a significant role in the results (a discussion of those issues from a landscape ecology perspective can be found here). Notes on spatial autocorrelation--the process that underlies the MAUP--can be found here.


A final question

Should we strive so hard to remove the geography from our analysis? If we remove the 'local' from the data, have we not lost a very important component of the data?

Figures referred to in the notes:

References

Amrhein, C. 1995. Searching for the elusive aggregation effect: Evidence from statistical simulations. Environment and Planning A, 27(1): 105.

Gehlke, C. and Biehl, K. 1934. Certain effects of grouping upon the size of the correlation coefficient in census tract material. Journal of the American Statistical Association, 29: 169-170.

Green, M. and Flowerdew, R. 1996. New evidence on the modifiable areal unit problem. Pages 41-54 in P. Longley and M. Batty (eds) Spatial analysis: modelling in a GIS environment. Cambridge: GeoInformation International.

Jones, K. and Duncan, C. 1996. People and places: The multilevel model as a general framework for the quantitative analysis of geographical data. Pages 79-104 in Longley and Batty.

Martin, D. 1991. Geographic Information Systems and their socioeconomic applications. London: Routledge.

Openshaw, S. 1996. Developing GIS-relevant zone-based spatial analysis methods. Pages 55-73 in P. Longley and M. Batty (eds) Spatial analysis: modelling in a GIS environment. Cambridge: GeoInformation International.

Wrigley, N., Holt, T., Steel, D., and Tranmer, M. 1996. Analysing, modelling, and resolving the ecological fallacy. Pages 25-40 in P. Longley and M. Batty (eds) Spatial analysis: modelling in a GIS environment. Cambridge: GeoInformation International.


Some additional references:

Spatial Analysis

Bailey, Trevor C. and Anthony C. Gatrell. 1995. Interactive Spatial Data Analysis. Longman Scientific and Technical, Harlow, Essex, England.

Berthold, Michael and David J. Hand. 1999. Intelligent Data Analysis: An Introduction. Springer-Verlag, Berlin.

Chou, Yue-Hong. 1997. Exploring Spatial Analysis in Geographic Information Systems. Onward Press, Santa Fe.

Cliff, A. D. and Ord, J. K. 1981. Spatial processes--models and applications. Pion, London.

Cressie, N. A. 1991. Statistics for spatial data. John Wiley & Sons, Inc., New York.

DeMers, Michael N. 2002. GIS Modeling in Raster. John Wiley & Sons, Inc., New York.

Fischer, Manfred, Henk J. Scholten and David Unwin (eds.) 1996. Spatial Analytical Perspectives on GIS. Taylor & Francis Ltd., London.

Longley, Paul A., Sue M. Brooks, Rachael McDonnell and Bill MacMillan (eds.) 1998. Geocomputation: A Primer. John Wiley & Sons, New York.

Miller, Harvey J. and Jiawei Han (eds.) 2001. Geographic Data Mining and Knowledge Discovery. Research Monographs in Geographic Information Systems. Taylor and Francis Inc., New York.

Mitchell, Andy. 1999. The ESRI Guide to GIS Analysis. Volume 1: Geographic Patterns and Relationships. ESRI Press, New York.

Ripley, B. D. 1981. Spatial statistics. John Wiley & Sons, Inc., New York.

Social Data Analysis

Dale, Angela, Ed Fieldhouse, and Clare Holdsworth. 2000. Analyzing Census Microdata. Oxford University Press Inc., New York.

Openshaw, Stan (ed.) 1995. Census Users’ Handbook. GeoInformation International, Glasgow, Scotland.

Walsh, Stephen J. and Kelley A. Crews-Meyer (eds.) 2002. Linking People, Place, and Policy: A GIScience Approach. Kluwer Academic Publishers, Boston.

Ecology

Dale, Mark et al. 2002. Conceptual and mathematical relationships among methods for spatial analysis. Ecography 25: 558-577.

Dungan, J. L. et al. 2002. A balanced view of scale in spatial statistical analysis. Ecography 25: 626-640.

Legendre, Pierre et al. 2002. The consequences of spatial structure for the design and analysis of ecological field surveys. Ecography 25: 601-615.

Rossi, R. E. et al. 1992. Geostatistical tools for modeling and interpreting ecological spatial dependence. Ecological Monographs 62: 277-314.



Pages designed Dec 2002 and maintained in 2006 by Brian Klinkenberg