Discussion
Table of Contents
i. Rationale of Methodology
ii. Interpretation of Results
a. GWR
b. Linear Regression
c. Southern California and British Columbia
iii. Sources of Error and Future Potential
Rationale of Methodology
First I discuss the rationale behind using a Geographically Weighted Regression before discussing some of the results.
Before running a regression, we should test whether our data are clustered. Spatial clustering gives weight to the argument that the dispersal of these tunicates is not entirely random, and that some factor is affecting where they propagate. If tunicate presence were entirely randomly distributed, it would be prudent to skip the regression analysis, as it is unlikely that any particular factor is affecting their spread, unless that factor happens to occur at the same locations.
Once we have established that clustering is occurring and that the tunicates and factors are not randomly distributed, we can turn to regression analysis to examine what might be causing the clustering.
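As a rough illustration of the clustering check described above, the following sketch computes a global Moran's I and a permutation-based pseudo p-value. It is not the tool used in this study: the binary distance-band weights, the bandwidth, the function names and the toy transect data are all assumptions made for the example.

```python
import numpy as np


def morans_i(coords, values, bandwidth):
    """Global Moran's I using binary distance-band weights."""
    coords = np.asarray(coords, dtype=float)
    x = np.asarray(values, dtype=float)
    n = x.size
    # Pairwise Euclidean distances between observation locations.
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # Two points are neighbours if they lie within the bandwidth;
    # a point is never its own neighbour.
    w = (dist <= bandwidth).astype(float)
    np.fill_diagonal(w, 0.0)
    z = x - x.mean()
    return (n / w.sum()) * (w * np.outer(z, z)).sum() / (z ** 2).sum()


def permutation_test(coords, values, bandwidth, n_perm=999, seed=0):
    """One-sided pseudo p-value for positive spatial autocorrelation."""
    rng = np.random.default_rng(seed)
    observed = morans_i(coords, values, bandwidth)
    # Shuffle the values over the fixed locations to simulate randomness.
    simulated = np.array([
        morans_i(coords, rng.permutation(values), bandwidth)
        for _ in range(n_perm)
    ])
    p_value = (1 + np.sum(simulated >= observed)) / (n_perm + 1)
    return observed, p_value


# Hypothetical example: 20 transects spaced 1 unit apart along a straight coast.
coords = [(i, 0.0) for i in range(20)]
clustered = [1] * 10 + [0] * 10      # all presences at one end
alternating = [1, 0] * 10            # presences evenly interleaved

i_clustered, p_clustered = permutation_test(coords, clustered, bandwidth=1.0)
i_alternating = morans_i(coords, alternating, bandwidth=1.0)
print(i_clustered, p_clustered, i_alternating)
```

Values of I well above the random expectation of -1/(n - 1) suggest clustering, while strongly negative values suggest dispersion. A small pseudo p-value indicates that the observed pattern is unlikely under random placement.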
Regression methods are used to relate one or more explanatory (independent) variables to a response (dependent) variable. Spatial regression expands upon this by giving a geographic reference to the analysis and exploring local relationships amongst sets of variables.
Ordinary regression assumes that relationships are the same everywhere in the study area, which is not necessarily true. This is where geographically weighted regression becomes useful, as it models spatial variation in a regression relationship. Observations are weighted by proximity: points that are closer together have more influence on each other than points that are further apart. This contrasts with simple linear regression, which weighs every observation equally regardless of where it is located.
A geographically weighted regression ("GWR") is essentially a local spatial regression that is run at each location or transect of interest. The number of regressions in the model is equal to the number of observations; in our case it is 624. The GWR model reports both local and global regressions, and the spatial approach rests on the premise that closer observations are more alike than distant observations. Therein lies the fundamental difference between a spatial regression, such as our GWR model, and a simple linear regression.
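For reference, the local regression described above is commonly written as follows. The notation is an assumption introduced here rather than taken from the study's software output: (u_i, v_i) are the coordinates of transect i, d_ij is the distance between transects i and j, and h is a bandwidth. The Gaussian kernel is one common choice of weighting and may differ from the kernel actually used.

```latex
% Local model at transect i, with n = 624 observations in this study
\[
y_i = \beta_0(u_i, v_i) + \sum_{k=1}^{p} \beta_k(u_i, v_i)\, x_{ik} + \varepsilon_i,
\qquad i = 1, \dots, n
\]

% Weighted least-squares estimate of the coefficients at transect i
\[
\hat{\boldsymbol{\beta}}(u_i, v_i)
  = \left( X^{\top} W_i X \right)^{-1} X^{\top} W_i \, \mathbf{y},
\qquad W_i = \operatorname{diag}(w_{i1}, \dots, w_{in})
\]

% Gaussian distance-decay weights (assumed kernel) with bandwidth h
\[
w_{ij} = \exp\!\left( -\frac{d_{ij}^{2}}{2h^{2}} \right)
\]
```

Setting every weight w_ij to 1 recovers the ordinary global regression, which is why the global model can be seen as a special case of the local one.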
It should now be evident how a spatial regression model benefits this study. Take, for example, the regression of a house's value on its number of rooms. A 5-room house on a beachfront could be worth double a 10-room house in the suburbs. A global linear regression would show a negative relationship between the number of rooms and home value in this case, but a spatial regression model would account for where the homes are located.
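The housing example can be made concrete with a small sketch. The house values, room counts, locations, Gaussian kernel and bandwidth below are all made up for illustration, and the helper functions are hypothetical rather than part of any GWR package.

```python
import numpy as np


def weighted_ols(X, y, w):
    """Weighted least squares: solve (X'WX) b = X'Wy for b."""
    Xw = X * w[:, None]
    return np.linalg.solve(Xw.T @ X, Xw.T @ y)


def local_coefficients(coords, x, y, bandwidth):
    """Fit one distance-weighted regression per observation (GWR-style)."""
    coords = np.asarray(coords, dtype=float)
    X = np.column_stack([np.ones(len(x)), x])
    y = np.asarray(y, dtype=float)
    betas = []
    for centre in coords:
        d = np.linalg.norm(coords - centre, axis=1)
        w = np.exp(-0.5 * (d / bandwidth) ** 2)   # Gaussian kernel (assumed)
        betas.append(weighted_ols(X, y, w))
    return np.array(betas)   # columns: intercept, slope


# Made-up data: four beachfront houses and four suburban houses, 100 km apart.
rooms = np.array([3, 4, 5, 6, 8, 9, 10, 12], dtype=float)
value = np.array([1150, 1200, 1250, 1300, 500, 550, 600, 700], dtype=float)
coords = [(0, 0)] * 4 + [(100, 0)] * 4

# Global fit: every house gets the same weight.
X = np.column_stack([np.ones(len(rooms)), rooms])
global_slope = weighted_ols(X, value, np.ones(len(rooms)))[1]

# Local fits: each house is compared mainly with its own neighbourhood.
local_slopes = local_coefficients(coords, rooms, value, bandwidth=10.0)[:, 1]
print(global_slope, local_slopes)
```

In this toy data the global slope is negative, because the small beachfront houses are the expensive ones, while every local slope is positive: within each neighbourhood, more rooms go with a higher value.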
As an example with regards to tunicate intensity, let us assume we were relating tunicates to human population, and that tunicate presence was strongly affected by the length of coastline. If there were a section where the coastline was very straight and the population was high, yet the number of tunicates was low, a simple linear regression model would attribute the low number of tunicates in that transect to the high population. A geographically weighted multiple regression would account for this issue, and would furthermore treat transects near this transect as more similar to it than transects that are further away.
Running a linear regression model augments the results of the GWR model so that global and local effects can be compared. For example, suppose marinas are well distributed along the coast with a slightly higher concentration in San Francisco Bay, and a certain set of tunicates is present only in the bay. The global regression would recognize a slight correlation. The spatial regression would instead identify which local parameters in the San Francisco Bay area, such as the increased number of marinas, may be driving the presence of tunicate species, while giving less weight to areas further from that concentrated region.
Interpretation of Results
GWR
Now that we have established why a GWR model is important and why the methodology was carried out the way it was, we can run through some of the results and offer interpretations.
If we take the collection of cluster analyses, we notice there is clearly clustering occurring in California in the hot spot, multi-distance spatial autocorrelation and Moran's index results. The most evident is the clustering of tunicate species into four distinct regions: northern California, San Francisco Bay, Los Angeles and San Diego. The presence of marinas, bays and higher populations follows roughly the same regions of clustering, although these factors are much more dispersed along the coastline, as their presence is less isolated.
From these results, it is reasonable to conclude that tunicates are clustered, that this is not occurring by random chance, and that something is likely affecting where these species cluster. Overall, there are enough observations over space and time to consider this conclusion significant.
Having established that clustering exists, we now try to answer what factors might be causing it. The northern coast of California seems to be most strongly correlated with the presence of marinas and population density, which may seem odd, as neither population nor the number of marinas is particularly high in the region relative to the rest of California. However, these results can be justified because the Geographically Weighted Regression model looks at that region's local spatial characteristics.
The northern half of the coast of California is a very isolated region with little population, few bays or marinas, and less coastline than the southern half past San Francisco. Hence, when there is an isolated area with a marked increase in tunicates, ports and bays (mainly near the town of Eureka), GWR would recognize this more readily than a linear regression would, even if the numbers are lower than in other areas of the coast. A linear regression would see that there is indeed some increase in the factors, but nothing particularly special. A GWR model would recognize that, in an area of great isolation, a small increase in the number of tunicates is more significant than a larger increase in an area already surrounded by a high presence of tunicates, bays and population. Hence, the northern coast's isolation causes the GWR to strongly relate the presence of a few tunicates to a few marinas and a slight increase in population density, because the surrounding areas are so barren in these respects.
The San Francisco Bay area is, in contrast, a region with a high concentration of all of the factors. The GWR model indeed attributes a high correlation between the number of invasives and the bays, and notes that the large amount of winding coastline in the region also has a strong effect. What is unexpected, however, is that the presence of marinas and a high population has little to no effect on the tunicates according to the GWR model. One of the fundamental problems with the population polygon, and indeed with the transect analysis, is that not all of the transects have a population polygon attached. As several of the tunicate observations occur in the middle of the bay, which is naturally devoid of any population, some of these may fall completely outside any 5 km transect that connects with land, which would give them a population value of 0. However, most of the transects in the bay area do have a population value, so this issue cannot be solely responsible for the relationship.
A similar explanation to that of the northern coast may apply: because the densities of the factors are naturally so high around the San Francisco Bay region, the high populations are already expected, or are perhaps too low for the level of tunicate density in the region. Globally, there would likely be a correlation between population and tunicates, but within this high-density area the local correlation may not be high enough to be significant. This is akin to an area of housing that already has high value, where changes that would globally increase a unit's value, such as adding rooms, would show no correlation, as the area already has a high number of rooms and a high value. The same explanation works for the low correlation with marinas.
San Diego and Los Angeles are, surprisingly, not correlated in the same way. Los Angeles seems to be strongly affected by the length of coastline and population, while San Diego is affected more by marinas and population. Similar explanations with the GWR model can be used to explain these factors.
We could try to interpret how the results for bays and marinas differ, as bays seem to have the strongest effect in San Francisco Bay and marinas a greater effect along scattered portions of the coastline, San Diego and Eureka (northern coast). This is likely simply because there are more points defined as "bays" in San Francisco Bay and more points defined as "marinas" along the coastline. Given that the definition and separation of these two layers was somewhat arbitrary and based on intuition, the difference should not be read into too deeply.
Looking at the GWR text output summary, we can conclude that GWR improves the model, as the local R-square is higher than the global value, hinting that spatial relations play a role. The significance of this result is more dubious: the ANOVA F-statistic is fairly high, indicating that the overall model is significant, but the individual factor p-values are also high, indicating that most of the variables fail the Monte Carlo significance test.
This can be due to a number of explanations:
- The use of a Gaussian model, which treats the data as continuous, instead of a Poisson model, which works for count data and would have been more appropriate for this study.
- The ports volume layer, which seems to have a disproportionately high effect because only a few of the 624 transects have a major port, and some of these transects can have a very high value for port trade because there is no weighting variable to scale the volume to a relative and reasonable level.
- The individual factors are insignificant when correlated with tunicates, but when run together produce a significant global result.
First, as the results indicate, the logistic model does run, but fails to produce any results. Hence it is impossible, given the dataset we have to work with, to isolate the effect of running the tables under a Gaussian model, and we must stay with that model.
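To illustrate the difference between the Gaussian model used here and the count model that would have been preferable, the sketch below fits both to synthetic counts. It is a minimal global (non-spatial) example under stated assumptions: the data are simulated, the variable names are hypothetical, and the Poisson fit uses a plain iteratively reweighted least squares loop rather than the routine of any particular GWR program.

```python
import numpy as np


def fit_gaussian(X, y):
    """Ordinary least squares (identity link, Gaussian errors)."""
    return np.linalg.lstsq(X, y, rcond=None)[0]


def fit_poisson(X, y, n_iter=50, tol=1e-10):
    """Log-link Poisson regression via iteratively reweighted least squares."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    beta = np.zeros(X.shape[1])
    beta[0] = np.log(y.mean())        # assumes column 0 is the intercept
    for _ in range(n_iter):
        eta = X @ beta
        mu = np.exp(eta)
        z = eta + (y - mu) / mu       # working response
        XtW = X.T * mu                # IRLS weights equal mu under a log link
        new_beta = np.linalg.solve(XtW @ X, XtW @ z)
        converged = np.max(np.abs(new_beta - beta)) < tol
        beta = new_beta
        if converged:
            break
    return beta


# Simulated data: tunicate counts per transect driven by a marina count.
rng = np.random.default_rng(0)
marinas = rng.integers(0, 5, size=200).astype(float)
counts = rng.poisson(np.exp(0.2 + 0.4 * marinas)).astype(float)

X = np.column_stack([np.ones_like(marinas), marinas])
beta_gaussian = fit_gaussian(X, counts)   # effect on the count scale
beta_poisson = fit_poisson(X, counts)     # effect on the log-count scale
print(beta_gaussian, beta_poisson)
```

The Gaussian coefficients describe an additive change in the count, while the Poisson coefficients describe a multiplicative change and can never predict a negative count, which is why the latter suits presence counts better.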
In the GWR with no ports, the results are largely the same as when GWR was run with the ports, except for a few minor differences that have already been explained in the results section of this study. Hence we conclude that ports are not a major confounding variable, despite our suspicions. GWR is likely factoring in the distance from these ports rather than only the transects containing the ports themselves: a transect closer to a high-volume port is more affected than one further away, which reduces the bias that ports may introduce. A simple linear regression model would not account for this, and the bias would become more apparent.
The interspecies analysis shows that the species are, naturally, highly correlated spatially with each other's presence. This is especially true between Ciona and Styela, where the global t-statistic is very high. Both of these species occur in regions very close to one another, and plotting their observations shows that their patterns of colonization and spread are very similar.
Didemnum is the species most responsible for the high correlations in the northern coast, due to its propagation in that region. While all the tunicates are weighted the same, removing Didemnum from the analysis may produce more balanced and expected results, with a much lower correlation in the northern coast. The difference between Didemnum and the other two species, Styela and Ciona, is even more obvious when looking at the GWR statistics, as the global R values for Styela and Ciona are similar and are both much higher than that of Didemnum. However, the local R-square values are similar amongst the three, indicating that GWR balances out their differences and that each of the invasive species alone is influenced ("regressed") by some factor in the model. Ultimately, combining all three tunicate species into one layer is best for an overall analysis.
Another side issue to note is that the "direction" of regression - which variable is the independent "x" and which is the dependent "y" - can greatly affect the regression results, as seen in the huge differences in GWR between Ciona and Styela in the results section.
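The simplest case of this asymmetry can be written down for a global simple regression. The notation is introduced here for illustration: r is the correlation between x and y, and s_x and s_y are their standard deviations. This is not a derivation of the GWR differences reported in the results section.

```latex
\[
\hat{b}_{y \mid x} = r \, \frac{s_y}{s_x},
\qquad
\hat{b}_{x \mid y} = r \, \frac{s_x}{s_y},
\qquad
\hat{b}_{y \mid x} \, \hat{b}_{x \mid y} = r^{2}
\]
```

The two slopes are reciprocals of each other only when the correlation is perfect (|r| = 1), so swapping the roles of the two species generally yields a different fitted relationship even though the correlation itself is unchanged.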
Overall, we can conclude that at least some of these factors are affecting the spread of tunicate invasives, although in what manner varies by spatial location and particular factor.
Linear Regression
The linear regression results obtained with Excel closely mirror the global regression results in the GWR and will not be discussed in depth here, as they are self-explanatory. Overall, the best predictor is port volume, followed by bays and marinas, then coastal length, and finally population. It is interesting that population has the lowest R value, as it is commonly the only variable deemed significant in the GWR model. F-tests in the linear regression model show that the majority of the variables are highly significant, from which we conclude that there is generally some correlation between these independent factors and the presence of tunicate species.
Southern California & British Columbia
The Southern California models show a strong relation between abundance and species richness, which is to be expected. This hints that the invasives are not in strong competition with each other: where one species thrives and is highly abundant, a greater number of species tends to be found in the same area. Favorable habitat and conditions for invasive tunicates are a greater indicator of their level of success than the number of potential "competitors" amongst other tunicate species.
Finally, the British Columbia results use a markedly different analysis method to reach a similar result in that region: there is clustering of tunicates, and they are affected by a certain factor. In this particular analysis, it is the presence of various types of boat launches that shows a strong correlation with the presence of tunicates, although other factors were not analyzed for comparison and contrast.
Sources of Errors and Future Potential Studies
- In the Southern California data set, it would be interesting to also note native species, or relative number of invasives instead of noting all the species.
- Given the limited and fixed number of sites in the Southern California analysis, a GWR model cannot be run; more data is needed.
- The distinctions between bays and marinas are vague and there is a large amount of overlap; either more official data that separates the two is needed, or eliminating the separation altogether in a future study may be prudent.
- Furthermore, the bays and marinas were manually geocoded and hence the data are likely incomplete, potentially inaccurate and possibly biased, depending on which sites and sources one uses to get such data.
- Neither a logistic nor a Poisson model could be run in this analysis, which may lead to errors. Future studies should accommodate this with either better data sets or more accommodating programs.
- The type of transect analysis that is run in this study excludes quite a number of points because they are not close enough to be within each transect. Perhaps expanding the transect size or coverage in a future study can present a more complete picture.
- Line smoothing of the coast may have eliminated important features that may have affected the coast line length variable.
- Some of these correlations seem self-fulfilling, as many of the tunicates were looked for in marinas and bays, so of course one would be more likely to find them there. Having a data set with both presence AND absence of tunicates may be more accurate and revealing.
- Lack of absence data also affects the British Columbia model.
- In the British Columbia cluster analysis, more vectors beyond just boats should be included in future analyses.
In the future, the ultimate objective is to use these correlations to predict where these invasive tunicates are most likely to spread, building a map of the highest invasive "threat zones". For example, since there is likely a correlation between invasive tunicates and areas with boat traffic, we can infer that areas with boats but without these nonindigenous tunicates are "at risk" of invasion, and mark them as hot zones.
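The "at risk" rule in the boat traffic example could be sketched as a simple filter. The record layout, field names and threshold below are hypothetical, and because the current data lack true absence records, a tunicate count of 0 here only means that none were recorded.

```python
def flag_at_risk(transects, min_boats=1):
    """Return ids of transects with boat activity but no recorded tunicates."""
    return [
        t["id"]
        for t in transects
        if t["boats"] >= min_boats and t["tunicates"] == 0
    ]


# Hypothetical transect records; the field names are illustrative only.
transects = [
    {"id": "T1", "boats": 12, "tunicates": 3},   # already invaded
    {"id": "T2", "boats": 7, "tunicates": 0},    # vector present, no tunicates
    {"id": "T3", "boats": 0, "tunicates": 0},    # no vector
]
at_risk = flag_at_risk(transects)
print(at_risk)
```

A fuller version would replace the hard threshold with the fitted regression coefficients, so that each transect receives a graded risk score rather than a yes/no flag.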
