Instructor: Brian Klinkenberg

Office: Room 209
Office Hours: Tues 12:30-1:30
Wed 12:00-1:00

Lab Help: Jose Aparicio

Office: Room 240D

Computer Lab: Room 239

Lab 3: Geographically weighted regression

Fragstats was developed to analyze patterns in areal (patch) data. Although it has been used principally with environmental data, the methodology can also be applied to urban landscapes (e.g., Luck and Wu, 2002). It could be used in interesting ways in crime research (e.g., the relation between the complexity of the urban landscape and crime: are more crimes committed in areas with more complex urban landscapes [those containing a mix of land uses], and can we therefore use informed planning to ameliorate potential high-crime areas?) and in health research (e.g., the relation between the incidence of Lyme disease and the nature of the urban/rural boundary: is the incidence of Lyme disease greater in areas where the landscape is more fragmented?).

Not all of the data that we are interested in are associated with polygons (or raster cells), of course. While we can always aggregate people and events (e.g., car accidents, break and enters, or cancer cases can be aggregated into totals and rates within an area), the modifiable areal unit problem (MAUP) creates potential ambiguities in any interpretation of the results. Therefore, it is sometimes better to work directly with the point data. Doing so opens up several new avenues of spatial analysis, and in this lab you will be exploring one of the newer techniques: geographically weighted regression (GWR).
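The ambiguity that the MAUP introduces can be seen in a very small example. The sketch below is in Python and uses made-up event locations along a hypothetical 10 km street; the function name zone_counts and the zone boundaries are assumptions chosen for illustration. The same eight events are counted in two zones, first with the boundary at 5 km and then with the boundary at 4 km.

```python
from collections import Counter

def zone_counts(xs, boundaries):
    """Count events per zone; zone i covers [boundaries[i], boundaries[i + 1])."""
    counts = Counter()
    for x in xs:
        for i in range(len(boundaries) - 1):
            if boundaries[i] <= x < boundaries[i + 1]:
                counts[i] += 1
                break
    return [counts[i] for i in range(len(boundaries) - 1)]

# Hypothetical event locations (km from the west end of the street).
events = [1.0, 4.2, 4.5, 4.8, 5.1, 5.4, 5.7, 9.0]

even_split = zone_counts(events, [0, 5, 10])     # boundary at 5 km
shifted_split = zone_counts(events, [0, 4, 10])  # boundary moved to 4 km

print(even_split)     # [4, 4]
print(shifted_split)  # [1, 7]
```

With the first zoning the two areas look identical; with the second, one area appears to have seven times as many events as the other, even though no event has moved. Working with the points themselves avoids having to choose between the two pictures.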

GWR is the spatial extension of aspatial regression analysis, but is much more. Traditional regression analysis assumes that global statistics adequately describe the local relations that might exist in the data. For example, consider looking at the relation between housing prices and the floor space, lot size, etc. of houses in the city of Vancouver. While we could develop a 'global' model that describes the relation between those variables, knowing what you do about housing prices in the city of Vancouver (e.g., that a house of similar dimensions, age, lot size, etc. on the east side of Vancouver will sell for hundreds of thousands of dollars less than an identical house on the west side of Vancouver), the utility of such a model when looking at neighbourhood-level housing issues would be very doubtful. Nonetheless, for decades such models have been the norm in real estate research.

Similarly, consider relating rates of crime or disease to environmental conditions: local conditions can be much more important than any global relation that might be discovered through a traditional aspatial statistical approach. GWR allows us to explore the local relations amongst a set of variables, and to examine the results spatially using ArcGIS. In this lab we will explore the relation between a child's social skills and a small set of variables related to the child and to their neighbourhood. These data were collected some years ago; the exact locations of the participants have been modified in order to preserve the anonymity of the respondents. Information on the Early Development Instrument can be found here.
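To make the contrast between a global and a local fit concrete, here is a minimal Python sketch of the idea behind GWR with a single explanatory variable. It is not the ArcGIS implementation: the Gaussian kernel, the function names (weighted_line, local_line) and the synthetic data are all assumptions made for illustration. Twenty hypothetical points lie along a line; in the western half y = 2x, and in the eastern half y = -x. A local fit weights each observation by its distance from the target point, with the bandwidth set to the distance to the k-th nearest neighbour.

```python
import math

def weighted_line(x, y, w):
    """Weighted least-squares intercept and slope of y on x."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    slope = sxy / sxx
    return my - slope * mx, slope

def local_line(coords, x, y, target, k):
    """Fit a line centred on coords[target] using distance-decay weights.

    The bandwidth is the distance to the k-th nearest neighbour, so the
    kernel adapts to the local density of points (assumed Gaussian form).
    """
    tx, ty = coords[target]
    dists = [math.hypot(cx - tx, cy - ty) for cx, cy in coords]
    bandwidth = sorted(dists)[k]  # sorted(dists)[0] is the target itself
    w = [math.exp(-0.5 * (d / bandwidth) ** 2) for d in dists]
    return weighted_line(x, y, w)

# Synthetic data: the relation between x and y differs from west to east.
coords = [(float(i), 0.0) for i in range(20)]
x = [float(i % 5 + 1) for i in range(20)]
y = [2.0 * xi if i < 10 else -1.0 * xi for i, xi in enumerate(x)]

_, global_slope = weighted_line(x, y, [1.0] * 20)  # equal weights = global fit
_, west_slope = local_line(coords, x, y, target=0, k=3)
_, east_slope = local_line(coords, x, y, target=19, k=3)

print(round(global_slope, 2), round(west_slope, 2), round(east_slope, 2))
```

The global slope works out to 0.5, a value that describes neither half of the study area, while the two local slopes come out close to 2 and -1 respectively. Mapping such local parameters is what the rest of this lab is about.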

A database has been prepared for you that has 489 rows (individuals) and the following variables:

Phys_sc
(A value [0-100] that reflects the child's physical abilities.)
Soc_sc
(A value [0-100] that reflects the child's social skills.)
Lang_sc
(A value [0-100] that reflects the child's language abilities.)
LoneParent
(The % of the neighbourhood families that are lone-parent families.)
Fam4
(The % of the neighbourhood families that have 4 or more members.)
Immigrant
(The % of the neighbourhood families that are immigrants.)
RecImmig
(The % of the neighbourhood immigrants that are recent immigrants [< 5 years in Canada].)
ESL
(The % of the neighbourhood that does not have English or French as their first language.)
Childcare
(The % of the families that spend 30 or more hours on childcare.)
VisMinority
(The % of the neighbourhood that belongs to a visible minority.)
Income1000 and Income
(The average income in the neighbourhood [divided by 1000 for the Income1000 variable].)

Some of the variables are related to the child (Phys_sc, Soc_sc and Lang_sc), while others relate to the neighbourhood (i.e., the Enumeration Area or EA) in which the child lives. The neighbourhood-level variables were selected on the basis that those variables should capture some of the important socio-economic characteristics that might relate to a child's language abilities.

Download the data from G479 -- Lab 3. You should find four files: a shapefile that contains the census data (EAplus.shp), a shapefile of the outline of Vancouver, a shapefile showing the major roads (mroads_van.shp), and a shapefile with the data related to each child (help_scores.shp, which holds their scores plus the census variables associated with the EA each child's home falls within). Open the files in ArcMap.

In order to identify which variables should be included in our GWR analysis we can use several of the new tools provided by ESRI. In the following four steps, use the help_scores shapefile (as the Input Features):

1) Identify the neighbourhood variables that appear to contribute to a child's social skills score by using ArcMap's Spatial Statistics / Modeling Spatial Relationships / Exploratory Regression tool. The Dependent Variable will be Soc_sc (the child's social skills score), while the Candidate Explanatory Variables will be ESL, Income1000, LoneParent, RecImmig and VisMinority. The 'most important variables' will be those associated with the model that has the highest AdjR2 value as well as the lowest AICc. Specify an Output Report File and an Output Report Table, but accept all of the defaults related to the Search Criteria. Open the Output Report File (*.txt) in order to review the results of the analysis.

2) Using the most important variables identified in step 1, use the Exploratory Regression tool again to identify the best set of variables to use in the GWR analysis. Soc_sc will still be the dependent variable, but this time include the child's language skills score (Lang_sc) as one of the Candidate Explanatory Variables. Again, the 'most important variables' will be those associated with the model that has the highest AdjR2 value as well as the lowest AICc. Specify an Output Report File and an Output Report Table, but accept all of the defaults related to the Search Criteria. Open the Output Report File (*.txt) in order to review the results of the analysis.

3) Using the most important variables identified in step 2, use the Ordinary Least Squares tool in order to determine the statistics associated with the best set of variables. Use U_ID as the Unique ID Field, and specify an appropriate Output Feature Class and an Output Report File. In addition, under Additional Options, ensure that both a Coefficient Output Table and a Diagnostic Output Table are produced.

4) Using the same set of variables identified in step 2 (and used above in step 3), use the Geographically weighted regression tool in order to explore the spatial nature of the relations. The Kernel Type should be set to Adaptive, the Bandwidth Method to Bandwidth_Parameter, and the Number of Neighbors to 30.
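Steps 1 and 2 ask you to prefer the model with the highest AdjR2 and the lowest AICc. The Python sketch below shows why both statistics penalize extra variables. It uses the textbook formulas for a least-squares model; the AICc that ArcGIS reports may be computed with a different formula, so only the comparison between models, not the absolute values, should be expected to carry over. The sample size (489) is the number of rows in the lab database, but the sums of squares for the two models are hypothetical.

```python
import math

def adjusted_r2(r2, n, k):
    """Adjusted R-squared for n observations and k explanatory variables."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)

def aicc(rss, n, k):
    """Small-sample corrected AIC for a least-squares model.

    p counts the estimated parameters: k slopes, the intercept and the
    error variance. Constant terms shared by all models are omitted.
    """
    p = k + 2
    aic = n * math.log(rss / n) + 2 * p
    return aic + (2 * p * (p + 1)) / (n - p - 1)

n = 489          # rows in the lab database
tss = 100000.0   # hypothetical total sum of squares

# Hypothetical residual sums of squares for two candidate models.
rss_small, k_small = 90000.0, 2   # two explanatory variables
rss_large, k_large = 89900.0, 5   # five explanatory variables

for label, rss, k in [("small", rss_small, k_small), ("large", rss_large, k_large)]:
    r2 = 1.0 - rss / tss
    print(label, round(r2, 4), round(adjusted_r2(r2, n, k), 4), round(aicc(rss, n, k), 2))
```

In this made-up case the larger model has the slightly higher R2, but the smaller model has both the higher adjusted R2 and the lower AICc: the three extra variables do not reduce the residuals enough to justify their inclusion.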

In order to help in your interpretation of the GWR results we will use another of ESRI's new tools, the Spatial Statistics / Mapping Clusters / Grouping Analysis tool, this time using the EAplus shapefile (since we want to group the EAs into 'neighbourhoods'). [Note: In order to exclude those EAs that have suppressed data, use a Definition Query that keeps only the EAs with an Income > 0, which is easier to implement than excluding those with an Income of 0.] The variables to include in your grouping analysis should be those identified as the most important in step 1; the number of groups to identify will be 4. The unique ID will be U_ID. Set the Spatial Constraints to K_Nearest_Neighbors, the Distance Method to Euclidean, and the Number of Neighbors to 8. Provide a name for the Optional Output Report File, and click on the check box beside Evaluate Optimal Number of Groups. When mapping the results of the GWR analysis you could use the groups in order to help you interpret the GWR parameter values.

Some general guidelines to follow:

1) Do not use the default classification scheme (Natural Breaks (Jenks)) for your maps. When mapping parameters such as the GWR parameter values, which may have a range that encompasses 0 (e.g., -2.3 to 5.4), you should use a manual scheme with three classes that allows you to clearly distinguish the positive parameters from the negative parameters (with one class centered around 0).

2) You should explore the results of the GWR analysis by mapping all of the relevant output variables. By doing so you may be able to identify consistent trends or patterns in the results. You may also find it helpful to display the GWR values over a choropleth map (EAplus) showing the distribution of loneparent, recimmig, etc., in order to see if that can help in interpreting the results.
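The three-class manual scheme described in guideline 1 can be worked out before you open the classification dialog. The Python sketch below is one way to do so; the function names, the hypothetical parameter values and the tolerance of 0.5 (the half-width of the class centered on 0) are assumptions, and you should choose a tolerance that suits the parameter you are mapping.

```python
def manual_breaks(values, tolerance):
    """Upper class limits for three classes: negative, near zero, positive.

    tolerance is a hypothetical half-width for the class centred on 0.
    """
    low, high = min(values), max(values)
    if not (low < -tolerance < tolerance < high):
        raise ValueError("values must extend beyond both -tolerance and +tolerance")
    return [-tolerance, tolerance, high]

def class_label(value, tolerance):
    """Assign one value to a class, treating each upper limit as inclusive."""
    if value <= -tolerance:
        return "negative"
    if value <= tolerance:
        return "near zero"
    return "positive"

# Hypothetical GWR parameter values spanning the range used in guideline 1.
coefficients = [-2.3, -0.4, 0.1, 0.9, 5.4]

breaks = manual_breaks(coefficients, tolerance=0.5)
labels = [class_label(c, 0.5) for c in coefficients]

print(breaks)  # [-0.5, 0.5, 5.4]
print(labels)  # ['negative', 'near zero', 'near zero', 'positive', 'positive']
```

The three break values are what you would type in as the manual class limits, and the labels show which features would fall in each class.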

To be handed in on Wednesday, Feb 13th at the beginning of the lab period: A two-page (typed, double-spaced) non-technical explanation of geographically-weighted regression, a discussion of the GWR results (include a brief comparison of the GWR results to the OLS results), two maps illustrating some of your results (e.g., showing the spatial distribution of the parameters), and a two-page (typed, double-spaced) discussion of how GWR could be used in a variety of different contexts.

To help you in your discussion on GWR, and on how GWR could be used in a variety of contexts, here is a page that has references that discuss GWR (http://ncg.nuim.ie/ncg/GWR/refs.htm), and here are the results of a Google Scholar search on GWR. Also available are a paper that describes how to map the results of GWR and links to videos that discuss GWR (presented by Stewart Fotheringham).

Definitions for census variables can be found in The 2001 Census Handbook Reference (pdf). The complete census questionnaire (pdf) for 2001 is also a valuable reference since it contains the questions themselves. Statistics Canada publishes many documents that describe its products in detail (here); you can also search for a specific document.

Notes: I have included two income variables: one [Income] contains the 'raw' average income values in dollars, and the other [Income1000] contains the average income divided by 1000. The reason for doing this simple variable transformation is that GWR does not provide standardized regression coefficients. That is, when interpreting the output parameters, it helps if the input variables all have similar ranges. Since the other census variables, and the readiness-to-learn variables, all have ranges of 0 to 100, transforming income into a variable that has a similar range [0 to 82] will make it much easier for you to interpret the relative contributions of each variable to the predicted value (i.e., a 'unit' of each independent variable is about equal from a strict numerical viewpoint). However, when mapping the variables you might as well map the actual average income values, since those will be easier to interpret.
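The effect of dividing income by 1000 can be checked with a few lines of Python. The sketch below fits a simple one-variable least-squares line twice, once with income in dollars and once with income in thousands of dollars; the five income and score values are hypothetical and the function name ols_line is an assumption made for illustration.

```python
import math

def ols_line(x, y):
    """Ordinary least-squares intercept and slope of y on x."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sxy / sxx
    return my - slope * mx, slope

# Hypothetical neighbourhood incomes (dollars) and child scores (0-100).
income = [32000.0, 45000.0, 51000.0, 67000.0, 82000.0]
score = [61.0, 58.0, 66.0, 70.0, 73.0]
income1000 = [v / 1000.0 for v in income]

intercept_raw, slope_raw = ols_line(income, score)
intercept_k, slope_k = ols_line(income1000, score)

# The rescaled slope is exactly 1000 times the raw slope; the fit is the same.
print(slope_raw, slope_k)
print(math.isclose(slope_k, 1000.0 * slope_raw, rel_tol=1e-9))
```

Rescaling changes only the units in which the parameter is expressed: the intercept and the predicted values are unchanged, but the slope for Income1000 is on a scale that can be compared with the slopes of the percentage variables.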

It is unlikely that we can predict a significant proportion of a child's learning abilities using the variables we have at hand. As I mentioned in class, when working with socio-economic variables and real people, the amount of 'explanation' a statistical model can produce is typically fairly limited, especially when working with a multi-level model such as this (i.e., we are exploring the relation between an individual's characteristics and the influence the neighbourhood may have on that individual). Typically a neighbourhood's characteristics are expected to explain about 2-4% of the variance in an individual's characteristics. Nonetheless, it is useful to explore the utility of a methodology such as GWR in this lab since it can potentially contribute significantly to the understanding of the relations among spatially related variables.