Methodology
Table of Contents
i. Introduction
ii. Data Collection
iii. Data Preparation
iv. Data Analysis
Introduction
This project uses several regression methods to examine two sets of tunicate data. The first set combines presence observations over time for Styela clava, Ciona savignyi and Didemnum sp. along the California coastline. The second set comes from a paper that surveyed 27 locations in southern California over a period of six years to analyze changes in biodiversity and abundance. The second data set differs significantly from the first: it measures population density and species richness, and it adds a temporal element to the study, which is why two sets of analyses were done for the project. In addition, the second data set covers only a small geographic area, whereas the first covers the entire California coastline.
Both data sets will be placed into 5km transects drawn along the coast. Each transect will be joined with several variables (marinas, harbors, port volume, population density and length of coastline within the transect), after which cluster analyses and linear, spatial and geographically weighted regressions will be run along the transects.
To augment this study, we refer to another study (Cathryn Clarke Murray, 2008), which uses HotSpot cluster analyses to assess the patterns of tunicates in British Columbia. As this project focuses solely on the California coast and uses regression analysis, it is worthwhile to compare and contrast the methodologies and geographic regions of the two studies and present them as a joint result for further discussion and research.
Data Collection
In this study, we identify the study area as the coast line of California, and isolate the following as possible factors which may be correlated with the propagation of invasive tunicates:
Tunicate (Dependent Variable) data
- Tunicate presence along the California coastline, to be treated as sightings
- Sources:
- Paper articles - Lambert, Bullard, etc.
- Websites such as OBIS, GBIF and San Francisco Bay's Exotics Guide
These sightings were collected and geocoded. Where a source gave only a place name, the location was identified as closely as possible and its latitude and longitude were taken from Google Maps. Complete references to all of the papers and databases used to locate these sightings are listed under references.
To obtain the most complete picture and capture as many potential relationships as possible, little restraint was exercised in selecting which sightings to incorporate. Observations from as early as 1990 were recorded even if they do not resurface in later data collections, because we cannot be certain that the tunicate species are no longer present. Furthermore, if too many data sources were excluded, it would be difficult to perform an analysis along the entire coastline.
The second data set contains tunicate data only for the southern California region, concentrated near Los Angeles and San Diego. It comes from Lambert and Lambert (2006), who recorded the presence and abundance of 14 tunicate species at 27 sites between 1994 and 2000. How this data was prepared is described later on this page under "Data Preparation".
The following are three maps, one for each species, and a fourth map showing all of the species as one layer.




For British Columbia, Murray's tunicate data came from personal sightings recorded with GPS and would likely be a more accurate assessment:

Invasive tunicate sightings in BC, Murray 2008
Independent Variable data
- California Coastline
- Major ports and their traffic volume by tonnage
This was assessed by identifying all the major California ports that handle industrial trade and then downloading the data set containing their yearly trade levels. We may infer that the larger the trade tonnage, the more ships pass through or the larger the vessels are. In either case, more ballast water would be carried through, which may be positively correlated with invasive tunicates. The port locations were then geocoded with Google Maps.
The list of ports and their traffic volumes can be seen here.
- Population density polygons of California.
This was sourced from a Geography Network layer, which was saved as a local copy and used for analysis. The density is not a continuous gradient but a series of polygons, so values can change very sharply from one polygon to the next.

- Bays & Marinas
No single data set was located for all of California, so this data was compiled from various web sites that list the names and locations of ports and marinas. For this study, "bays" were separated out as generally lower-traffic areas that can support small vessels and where boats dock only occasionally.
Note these distinctions are vague and there is a large amount of overlap between the two layers. Their separation is based strictly on intuition and guesswork.
- BC Marinas: Separated by Anchorages, Marinas, Small Crafts and Moorages
After inputting all of the data into a layer, we produce a few regional maps just to get a sense of the study area and the presence of the factors we are analyzing:

Major Ports in California
California with marinas, seaports, population and invasive presence data
Upon visual inspection, it already seems as though the aquatic invasives (seen in red) are clustered in three distinct regions with varying concentrations.
For British Columbia, the following data set was produced:

British Columbia's boats, moorages and anchorages. Murray, 2008
Data Preparation
The next step is to set up transects with which we can analyze the invasive species.
First, the coast line of California is smoothed in ArcMap to remove noise and excessive curvatures of the landscape, creating a more linear data set for easier analysis.
Next, the coastline is converted into a raster of 5km², which is subsequently converted into 5km² polygons where each square serves as an individual transect onto which attributes can be joined. We then join each of these transects with attributes that appear along the transect:
- Ciona sightings (point data, # per transect)
- Styela sightings (point data, # per transect)
- Didemnum sightings (point data, # per transect)
- Tunicate sightings (point data, Ciona + Styela + Didemnum, # per transect)
- Marinas (point data, # per transect)
- Bays (point data, # per transect)
- Ports with traffic (point data, volume of annual trade)
- Population (seen as a relative value: each polygon is given a score of 0-10, and whichever polygon is closest to the transect gives its relative population to the transect)
We note that not all of the data can be included: with the exception of the population polygons, point data has to fall within a transect to be part of it. Fortunately, the vast majority of the points do meet this criterion, as can be seen when we overlay the aquatic invasive sightings on the transects:
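The point-in-transect join described above can be sketched as follows. This is only an illustration, not the ArcMap workflow used in the project: it assumes the points are already in a projected coordinate system with units of metres, treats each transect as an axis-aligned square cell 5 km on a side (an assumption; adjust to the actual cell size), and uses hypothetical function names and made-up coordinates.

```python
from math import floor

CELL_SIZE_M = 5000.0  # assumed cell side length in metres


def cell_index(x, y, cell_size=CELL_SIZE_M):
    """Return the (column, row) of the grid cell containing point (x, y)."""
    return (floor(x / cell_size), floor(y / cell_size))


def count_points_per_cell(points, transect_cells, cell_size=CELL_SIZE_M):
    """Count points per transect cell.

    Points that fall outside every transect cell are not assigned
    anywhere; they are only tallied in the `dropped` counter.
    """
    counts = {cell: 0 for cell in transect_cells}
    dropped = 0
    for x, y in points:
        cell = cell_index(x, y, cell_size)
        if cell in counts:
            counts[cell] += 1
        else:
            dropped += 1
    return counts, dropped


# Hypothetical coordinates, for illustration only (not project data).
transects = {(0, 0), (1, 0), (2, 1)}
sightings = [(1200.0, 300.0), (4999.0, 4999.0), (5100.0, 10.0), (20000.0, 20000.0)]
counts, dropped = count_points_per_cell(sightings, transects)
```

Reporting the number of dropped points alongside the counts makes it easy to check how much data is lost because it falls outside the transects.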

Data preparation for the second data set involves taking the total number of observations and the number of species per year, and then averaging the changes in each from 1994 to 2000. By aggregating these changes in both species richness and abundance, we obtain a relative value for how each factor changed over the six years at each of the 27 sites. The Excel file used to calculate this data can be downloaded here.
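One possible reading of this averaging step is the mean of the changes between consecutive survey years at a site. The sketch below assumes one value per survey year per site; the function name and the numbers are hypothetical and are not taken from Lambert and Lambert (2006) or from the project's Excel file.

```python
def mean_change(values_by_year):
    """Average of the differences between consecutive survey years."""
    years = sorted(values_by_year)
    if len(years) < 2:
        raise ValueError("need at least two survey years")
    diffs = [values_by_year[later] - values_by_year[earlier]
             for earlier, later in zip(years, years[1:])]
    return sum(diffs) / len(diffs)


# Hypothetical species counts at one site, for illustration only.
richness = {1994: 5, 1995: 6, 1996: 6, 1997: 8, 2000: 11}
change = mean_change(richness)
```

Note that the consecutive differences telescope, so this average always equals (last value - first value) / (number of intervals). Under this reading, the result depends only on the first and last surveys, which is worth keeping in mind when interpreting the values.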
These were then plotted in ArcGIS as follows:

Tunicate Biodiversity values

Tunicate Abundance values
Data Analysis (Results on the next page)
Cluster Analyses
These test whether there is any significant clustering. We run three analyses in ArcGIS:
- Spatial Statistics: HotSpot Analyses with rendering.
HotSpot analysis tells us whether features with high or low values cluster. If a feature has a high value and its neighbors' values are all high, then it is a hotspot. When the local summaries are much different than expected, we have a statistically significant hotspot.
- Multi-Distance Spatial Cluster Analyses
This summarizes spatial dependence over a range of distances, illustrating how spatial clustering changes as neighbourhood sizes change.
- Spatial Autocorrelation: Moran's I index
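To make two of these statistics concrete, here is a minimal sketch of global Moran's I and the Getis-Ord Gi* hot spot score in their textbook forms. It is not the ArcGIS implementation: it assumes the transects form a simple chain along the coast, uses binary neighbour weights (1 for transects within `reach` steps, 0 otherwise), and the function names and sighting counts are hypothetical.

```python
from math import sqrt


def chain_weights(n, reach=1, include_self=False):
    """Binary weights for n transects in a line: 1 if within `reach` steps."""
    return [[1.0 if abs(i - j) <= reach and (include_self or i != j) else 0.0
             for j in range(n)] for i in range(n)]


def morans_i(x, w):
    """Global Moran's I for values x and spatial weights matrix w."""
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    s0 = sum(sum(row) for row in w)  # sum of all weights
    num = sum(w[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n))
    den = sum(d * d for d in dev)
    return (n / s0) * (num / den)


def getis_ord_gi_star(x, w):
    """Gi* z-score per feature; w should include each feature itself."""
    n = len(x)
    mean = sum(x) / n
    s = sqrt(sum(v * v for v in x) / n - mean * mean)
    scores = []
    for i in range(n):
        sw = sum(w[i])
        sw2 = sum(wij * wij for wij in w[i])
        num = sum(w[i][j] * x[j] for j in range(n)) - mean * sw
        den = s * sqrt((n * sw2 - sw * sw) / (n - 1))
        scores.append(num / den)
    return scores


# Hypothetical sighting counts for ten consecutive transects.
counts = [0, 0, 1, 0, 6, 8, 7, 0, 1, 0]
i_stat = morans_i(counts, chain_weights(len(counts)))
gi = getis_ord_gi_star(counts, chain_weights(len(counts), include_self=True))
```

A positive Moran's I indicates that similar values tend to sit next to each other overall, while a large positive Gi* score flags an individual transect whose neighbourhood total is higher than expected, which is the hotspot idea described above.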
In the British Columbia data set, the cluster analysis methodology differs, consisting of:
- Nearest Neighbor Hierarchical analysis
- Single kernel density analysis with
- Dual kernel density analysis
Comparing the results of these analyses, carried out with CrimeStat, against the results of the California analyses, which were performed with GWR, should allow us to distinguish the strengths and differences of these methods and reveal the different patterns of tunicate abundance in the two regions.
Regression Analyses
Geographically Weighted Regression
We now can run a Geographically Weighted Regression with the GWR software, as we have concluded there is clustering of the invasive species.
To prepare the data, we convert the transect polygons with the joined attributes into points, use Hawth's Tools to add X and Y coordinates to each transect to correspond with latitude and longitude, and then export the file to a .dat file (renamed from .dbf -> .csv in Excel) to run in GWR.
The first GWR model uses tunicate invasives as the dependent variable, with marinas, bays, population, port volume and coastline length as the independent variables that may be correlated with the spatial properties of tunicates.
The parameters used are as follows:
- Adaptive Kernel
- Spherical (Lat/Long)
-
- Gaussian
Note: Poisson is preferred, as it works for count data whereas Gaussian is for continuous data. However, a Poisson analysis was impossible due to data limitations and the limitations with how the program calculates GWR. This will be discussed further in the discussion.
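For readers unfamiliar with these settings, the following is a rough sketch of what a Gaussian GWR with an adaptive kernel and spherical distances does. It is not the GWR software's implementation. It assumes a single predictor, great-circle (haversine) distances, and a bi-square weighting function whose bandwidth is the distance to the k-th nearest transect; the kernel function is an assumption, since it is not stated here. All names and data are hypothetical.

```python
from math import asin, cos, radians, sin, sqrt


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two lat/long points."""
    p1, p2 = radians(lat1), radians(lat2)
    dlat, dlon = p2 - p1, radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(p1) * cos(p2) * sin(dlon / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))


def bisquare_weights(dists, k):
    """Adaptive kernel: bandwidth is the distance to the k-th nearest point
    (counting the location itself); points at or beyond it get weight 0."""
    bandwidth = sorted(dists)[k - 1]
    return [(1 - (d / bandwidth) ** 2) ** 2 if d < bandwidth else 0.0
            for d in dists]


def local_fit(x, y, w):
    """Weighted least squares for y = a + b*x; returns (a, b)."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    b = sxy / sxx
    return my - b * mx, b


def gwr(coords, x, y, k):
    """Fit one local regression per location; returns [(intercept, slope)]."""
    fits = []
    for lat_i, lon_i in coords:
        d = [haversine_km(lat_i, lon_i, lat_j, lon_j) for lat_j, lon_j in coords]
        fits.append(local_fit(x, y, bisquare_weights(d, k)))
    return fits


# Hypothetical transects: the relationship differs between south and north.
coords = [(float(lat), -117.0) for lat in range(32, 40)]
marinas = [1, 2, 3, 4, 1, 2, 3, 4]
tunicates = [3, 6, 9, 12, 1, 2, 3, 4]
fits = gwr(coords, marinas, tunicates, k=4)
```

Because each location gets its own coefficients, the slope near the southern end differs from the slope near the northern end in this made-up data, which is the kind of spatial variation a single global regression would average away.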
A second GWR model was run, where the same parameters were kept but ports with industrial trade were excluded. This is because there are very few transects with ports in the data set, so including them might skew the results by giving too much weight to the few transects where ports exist.
A third GWR model was run with the same parameters as the first, but we used a Logistic model instead of a Gaussian model and replaced the number of tunicates with a binary value that indicates presence or absence in a transect.
Finally, a fourth set of GWR models was run, in which the dependent variable of combined tunicates was replaced with Ciona, Didemnum or Styela observations alone. The independent variables for each were the same as in the first model, except that observations of the other two species were added to test for interspecies effects. For example, if the dependent variable was Ciona, the independent variables would include Didemnum and Styela.
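The variable sets for these per-species models can be written out explicitly. The sketch below only builds the lists of variables; the column names are hypothetical stand-ins for the transect attributes and do not come from the project files.

```python
SPECIES = ("Ciona", "Didemnum", "Styela")
BASE_PREDICTORS = ("marinas", "bays", "population", "port_volume", "coastline_length")


def species_model_specs(species=SPECIES, base=BASE_PREDICTORS):
    """Map each species (dependent variable) to its independent variables:
    the base predictors plus the observations of the other two species."""
    return {
        target: list(base) + [other for other in species if other != target]
        for target in species
    }


specs = species_model_specs()
```

For instance, `specs["Ciona"]` holds the five base predictors followed by Didemnum and Styela, matching the example in the text.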
Linear Regression models
To compare the results with GWR, linear regression models were also run on both the first (coastline) data set and the second (southern California) data set. Several regressions were run, relating invasives to the different independent factors in the first set, and abundance and biodiversity to different independent factors in the second set.
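As a point of comparison with the local models, a global linear regression fits one set of coefficients to all transects at once. The following is a minimal sketch with a single predictor, using the standard least-squares formulas; the function name and the values are hypothetical and are not project results.

```python
def ols_fit(x, y):
    """Ordinary least squares for y = a + b*x; returns (a, b, r_squared)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = my - b * mx
    # R^2: share of the variance in y explained by the fitted line.
    ss_res = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return a, b, 1 - ss_res / ss_tot


# Hypothetical values: marinas per transect vs. tunicate sightings per transect.
marinas = [0, 1, 2, 3, 4]
sightings = [1, 3, 4, 8, 9]
intercept, slope, r2 = ols_fit(marinas, sightings)
```

The R-squared value gives a single global measure of fit that can be set against the local fits from GWR when comparing the two approaches.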
