Instructor: Brian Klinkenberg

Office: Room 209
Office Hours: Tues 12:30-1:30
Wed 12:00-1:00

Lab Help: Jose Aparicio

Office: Room 240D

Computer Lab: Room 239


 

 

Introduction to Multiple Regression and GWR

Understanding geographically-weighted regression (GWR) requires getting your mind around several concepts, multiple regression and spatial interpolation being the two most important ones. However, it also involves Monte Carlo simulation, the use of advanced statistical parameters, and much more. For the purposes of this course you are not expected to become experts in the technique, although it is expected that you conceptually understand the method and can describe in your own words what it does and why you might want to use it. (A page that describes linear regression, one that describes a variety of analytical techniques.) My presentation on regression, a short but comprehensive description.

In order to provide a simple explanation of the method, I will use an example related to health and GIS. Across many countries it has been found that there is a direct relation between income (as represented by GNP) and the health status (as represented by life expectancy) of an individual (the example below is obviously a nonlinear example, but I will assume a linear relation in my discussion below):

However, recall one of the 'take home' messages from the landscape ecology lectures: when observing a pattern, it is always important to consider the processes that might be responsible for creating that pattern. That is, what determines a person's income, and what plays a role in determining life expectancy? How do all of those factors, combined, produce the pattern that is observed? Understanding all of that is beyond the scope of this course. However, we can examine one small element of that complex set of relations: income and the socio-economic determinants of income.

Let us consider the determinants of a person's income. We know that age is obviously a factor (older people earn more simply because they have been in the workforce longer), as well as educational achievement (it is well known that each degree you earn adds to your income potential) and family status, among others. How can we go about determining how much each of these different factors contributes, in aggregate, to a person's income? The most commonly used method to examine multiple relations among variables, wherein a functional relation between the dependent variable and a group of independent variables is assumed, is multiple regression.

First--we need to consider bivariate linear regression. Consider income as the dependent variable, and age, educational achievement and family status as the independent variables. We can determine the relation between each of the independent variables and the dependent variable one at a time.

What if we observe, after conducting several bivariate regressions (e.g., age versus income, education versus income, family status versus income), that each of the independent variables exhibits a significant relation with the dependent variable? What should we do? How can we consider the combined influence of the independent variables on the dependent variable, while at the same time taking into consideration any correlation among the independent variables (e.g., age and family status are obviously highly correlated)? That is where multiple regression can play a role.

Consider, for purposes of explanation, that we perform a bivariate regression using age (independent variable) and income (dependent variable). Using the predicted values for income (income = intercept + b * age) we can determine the income residuals (actual income - predicted income).
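The step above can be sketched in a few lines of Python. This is only an illustration: the age and income figures are made up, and `fit_line` is a hypothetical helper written here for ordinary least squares with one independent variable.

```python
from statistics import mean

def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Hypothetical data: age in years, income in thousands of dollars.
age = [25, 30, 35, 40, 45, 50, 55, 60]
income = [32, 41, 45, 52, 61, 60, 72, 75]

intercept, slope = fit_line(age, income)

# Predicted income = intercept + b * age
predicted = [intercept + slope * a for a in age]

# Residual = actual income - predicted income
residuals = [actual - pred for actual, pred in zip(income, predicted)]

print(f"income = {intercept:.2f} + {slope:.2f} * age")
print("residuals:", [round(r, 2) for r in residuals])
```

The residuals are the part of each person's income that age alone does not account for. With an intercept in the model, they sum to zero (up to rounding).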

We can then perform a second bivariate regression using educational achievement (as the independent variable) and the income residuals (as the dependent variable). The resultant prediction tells us how much of the variation in income that remains after accounting for age can be explained by educational achievement (that is, independently of the relation between age and income already captured in the first regression). We can continue this process for every independent variable.
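A minimal sketch of this two-step idea follows, again with made-up numbers and a hypothetical `fit_line` helper. Note that it mirrors the conceptual procedure described here; standard multiple regression software estimates all coefficients jointly rather than one variable at a time, so its results will generally differ when the independent variables are correlated.

```python
from statistics import mean

def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def sum_sq(values):
    """Sum of squared deviations from the mean."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)

# Hypothetical data (income in thousands of dollars, education in years).
age = [25, 30, 35, 40, 45, 50, 55, 60]
education = [12, 16, 12, 14, 18, 12, 16, 18]
income = [32, 41, 45, 52, 61, 60, 72, 75]

# Step 1: regress income on age and keep the residuals.
a1, b1 = fit_line(age, income)
income_residuals = [y - (a1 + b1 * x) for x, y in zip(age, income)]

# Step 2: regress those residuals on educational achievement.
a2, b2 = fit_line(education, income_residuals)
remaining = [r - (a2 + b2 * e) for e, r in zip(education, income_residuals)]

ss_total = sum_sq(income)
share_age = 1 - sum_sq(income_residuals) / ss_total
share_education = (sum_sq(income_residuals) - sum_sq(remaining)) / ss_total

print(f"explained by age: {share_age:.1%}")
print(f"additionally explained by education: {share_education:.1%}")
```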

However, it should be obvious that to perform an analysis such as this would be a complex process, since we should conduct the analyses using each of the independent variables in every possible combination (e.g., run the bivariate regression between family status and income first, then use age and the residuals left from the family status regression, and then use educational achievement, and then reorder the independent variables and run the routines again). We would then select that order/combination that produced the "best" explanation of the relation between all of the independent variables and the dependent variable. Multiple regression routines do this selection process for us (although it must be admitted that there are many different ways of determining "best", so the field of multiple regression is very broad).
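To give a feel for how quickly the bookkeeping grows, the sketch below runs the residual-by-residual procedure for every ordering of three independent variables and reports the share of income variation explained under each. Everything here is illustrative: the data are invented, household size is used as a hypothetical numeric stand-in for family status, and "best" is taken to mean the largest share of variance explained, which is only one of many possible criteria.

```python
from itertools import permutations
from statistics import mean

def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def sequential_r2(y, predictors, order):
    """Fit each predictor in turn to the running residuals and
    return the overall share of variance in y that is explained."""
    residuals = list(y)
    for name in order:
        x = predictors[name]
        a, b = fit_line(x, residuals)
        residuals = [r - (a + b * xi) for r, xi in zip(residuals, x)]
    y_bar = mean(y)
    ss_total = sum((yi - y_bar) ** 2 for yi in y)
    ss_resid = sum(r ** 2 for r in residuals)
    return 1 - ss_resid / ss_total

# Hypothetical data.
income = [32, 41, 45, 52, 61, 60, 72, 75]
predictors = {
    "age": [25, 30, 35, 40, 45, 50, 55, 60],
    "education": [12, 16, 12, 14, 18, 12, 16, 18],
    "household_size": [1, 1, 2, 4, 3, 4, 2, 2],
}

# Three variables give 3! = 6 orderings; ten variables would give 3,628,800.
results = {order: sequential_r2(income, predictors, order)
           for order in permutations(predictors)}

for order, r2 in sorted(results.items(), key=lambda item: -item[1]):
    print(" -> ".join(order), f"{r2:.3f}")

best_order = max(results, key=results.get)
```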

Now, consider doing a multiple regression that is also sensitive to the geographic relations among the variables. That is, unlike traditional aspatial regression, wherein every individual (case) is included in a single regression analysis with equal weight, geographically-weighted regression uses a methodology similar to that used in inverse-distance weighted spatial interpolation, and says that nearby individuals (cases) should influence the regression equation more than far away individuals (cases).
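The weighting idea can be sketched as follows. This is a simplified illustration, not the algorithm of any particular GWR program: it assumes a Gaussian distance-decay kernel with a fixed bandwidth (both are choices made here for the example), uses a single independent variable, and works on invented data in which the age-income relation is deliberately steeper in the east than in the west.

```python
import math

def gaussian_weights(focal, locations, bandwidth):
    """Weight each case by its distance from the focal location."""
    fx, fy = focal
    return [math.exp(-0.5 * (math.hypot(x - fx, y - fy) / bandwidth) ** 2)
            for x, y in locations]

def weighted_fit_line(x, y, w):
    """Weighted least squares for y = intercept + slope * x."""
    w_sum = sum(w)
    x_bar = sum(wi * xi for wi, xi in zip(w, x)) / w_sum
    y_bar = sum(wi * yi for wi, yi in zip(w, y)) / w_sum
    sxx = sum(wi * (xi - x_bar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - x_bar) * (yi - y_bar) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

# Hypothetical cases spaced one unit apart along an east-west line.
locations = [(float(i), 0.0) for i in range(10)]
age = [25, 40, 55, 30, 50, 25, 40, 55, 30, 50]
true_slope = [0.5] * 5 + [1.5] * 5          # west half, then east half
income = [20 + s * a for s, a in zip(true_slope, age)]

bandwidth = 1.5  # assumed; defines how quickly "near" fades into "far"

# One local regression per case, each weighted around that case's location.
local_slopes = []
for focal in locations:
    w = gaussian_weights(focal, locations, bandwidth)
    _, slope = weighted_fit_line(age, income, w)
    local_slopes.append(slope)

print([round(s, 2) for s in local_slopes])
```

A single aspatial regression would report one slope for all ten cases. The local regressions instead return a slope for each location, and those slopes change as the focal point moves from west to east.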

Doing so will result in a much more complex decision process than that observed with multiple regression, since not only do we have all of the complications associated with standard statistical techniques, we also have to consider how to define "near" and "far" (and consider the modifiable areal unit problem, MAUP), and how to determine the significance of the results when performing so many analyses (one for every individual, rather than one analysis for all of the cases). That is where Monte Carlo analysis comes into play, as well as the use of sophisticated statistical parameters.

A simple example of a Monte Carlo analysis: say you deal yourself a poker hand with four aces and want to know how likely it would be to deal yourself another hand with four aces. One way to determine how likely it would be is to deal yourself 99 hands (shuffling the deck after each deal). If, in those 99 hands, you never dealt yourself another hand with four aces, then the estimated chance of observing a hand with four aces would be 1 (the one hand with four aces) / [99 hands, none with four aces + the one hand with four aces] = 1/100 = 1%.
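The poker experiment is easy to simulate. The sketch below assumes five-card hands and a standard 52-card deck, and the helper names are invented for the example. It counts how many of the 99 simulated hands contain four aces and then applies the same (count + 1) / (99 + 1) calculation.

```python
import random

def deal_hand(rng, hand_size=5):
    """Shuffle a fresh 52-card deck and deal one hand."""
    deck = [(rank, suit) for rank in range(1, 14) for suit in "CDHS"]
    return rng.sample(deck, hand_size)  # sampling without replacement

def has_four_aces(hand):
    """Rank 1 is treated as the ace."""
    return sum(1 for rank, _ in hand if rank == 1) == 4

rng = random.Random(42)  # fixed seed so the run is repeatable
n_simulations = 99

# Number of simulated hands that match the observed hand (four aces).
as_extreme = sum(has_four_aces(deal_hand(rng)) for _ in range(n_simulations))

# The observed hand is counted alongside the simulated ones.
p_value = (as_extreme + 1) / (n_simulations + 1)
print(f"{as_extreme} of {n_simulations} simulated hands had four aces")
print(f"estimated chance: {p_value:.2%}")
```

With only 99 simulations the estimate can never fall below 1%, no matter how rare the event really is. Running more simulations lowers that floor, which is why the number of simulations sets the smallest significance level that can be reported.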

When GWR uses Monte Carlo analysis, the program randomly allocates the case attributes to the cases. Imagine two lists: houses [the cases fixed in space] and the $ value of the house [an attribute of a house; the case attributes]. Prior to the random shuffling we might observe that houses of the same $ value tend to be clustered together. Thus, the variance in housing values within a given distance would be small [e.g., all houses in a neighbourhood tend to be of similar $ value, and therefore the variance in $ values within each neighbourhood would be small]. We then randomly assign the $ values to the houses. After shuffling the $ values around we observe that variation in $ values within each neighbourhood is much greater. After doing this random shuffling 99 times, we observe that the variance in the $ values is least when the houses have their true $ values assigned to them. Therefore, there is a 1% chance that such an assignment of $ values to houses would arise by chance (i.e., geography matters). On the other hand, if we observed that 50 times out of the 99 trials the variance in $ values within a neighbourhood is actually smaller than that observed when the true $ values are assigned to the houses, we would then say that the odds are roughly 50% (51 of the 100 arrangements, counting the observed one) that such an arrangement of houses/$ values could arise by chance (i.e., geography doesn't matter).
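The shuffling procedure can be sketched as follows. This is a simplified illustration rather than what a GWR program actually computes: neighbourhoods are fixed labels instead of distance bands, the test statistic is the average within-neighbourhood variance, and the house values (in thousands of dollars) are invented so that similar values cluster together.

```python
import random
from statistics import mean, pvariance

def mean_within_variance(values, neighbourhoods):
    """Average, over neighbourhoods, of the variance in house values."""
    groups = {}
    for value, hood in zip(values, neighbourhoods):
        groups.setdefault(hood, []).append(value)
    return mean(pvariance(group) for group in groups.values())

# Hypothetical data: 4 neighbourhoods with 5 houses each.
neighbourhoods = [hood for hood in "ABCD" for _ in range(5)]
values = [200, 210, 205, 195, 215,
          400, 410, 390, 405, 395,
          600, 590, 610, 605, 595,
          800, 810, 790, 805, 795]

rng = random.Random(0)  # fixed seed so the run is repeatable
observed = mean_within_variance(values, neighbourhoods)

n_permutations = 99
as_small = 0
for _ in range(n_permutations):
    shuffled = values[:]      # the houses stay put ...
    rng.shuffle(shuffled)     # ... and the $ values are reassigned at random
    if mean_within_variance(shuffled, neighbourhoods) <= observed:
        as_small += 1

# Count the observed arrangement alongside the shuffled ones.
p_value = (as_small + 1) / (n_permutations + 1)
print(f"observed within-neighbourhood variance: {observed:.1f}")
print(f"shuffles with variance as small as observed: {as_small}")
print(f"pseudo p-value: {p_value:.2f}")
```

Because the invented values are so strongly clustered, a random shuffle essentially never produces a within-neighbourhood variance as small as the observed one, so the pseudo p-value sits at the 1% floor. Replacing `values` with an unclustered list should push it much higher.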

Some papers that explore the issues of the relation between Health and Economic Status: