There are THREE exercises in this lab, set aside in green boxes like this one. Two involve R, and one involves a seperate shiny app (link below) To work through Exercise 1, you’ll need the same files as last week:
SeaLions.csv
andSeaOtters.csv
(also available on Blackboard). For Exercise 2, you will also need to downloadBC_SeaOtters.csv
.
FOR SUBMISSION Please Upload a word document with answers to: Exercise 1: parts 1-3; Exercise 2: part 3; Exercise 3: parts 1 and 2 in the Recitation 5 (R pt 2) on Blackboard by midnight, Friday, Feb 23.
R Project digression: For
importing data “best practice” would be to create a new R project in its
own file folder and then move data files (.csv
‘s) into that
folder. R looks for files in it’s project’s directory. This is your
projects whole world. If you ask it to find a file located in
’Downloads’, R will say something like
no such file exists in working directory
.
For loading data, there are several options. The cleanest is probably to use the following script
But that will work only if you are in the right directory! That is if the working directory, given by:
## [1] "C:/Users/Elie/teaching/EFB370/labs/labII.2"
Here are all the files in my working directory:
## [1] "BC_SeaOtters.csv" "RandomnessAndLinearModels.html"
## [3] "RandomnessAndLinearModels.Rmd" "SeaLions.csv"
## [5] "SeaOtters.csv"
Is the same as the location of the files. More details available upon request!
An alternative point-and-click way to import data is by going to
File>Import Dataset>From Text (base)
and navigate to
the the files in your library (they are probably in your downloads
folder). Then click Import
.
This will create objects in your Rstudio Environment (to the
right>>>) Called SeaOtters
and
SeaLions
. You can also view the data by clicking on the tab
SeaOtters
above^^^.
#head() is a useful command for viewing/understanding data structure. This will show you the first 6 rows of a dataframe.
head(SeaLions)
## Island Weight Length Sex
## 1 Chirpoev 15.5 93 M
## 2 Chirpoev 29.0 106 F
## 3 Chirpoev 35.5 112 M
## 4 Chirpoev 32.0 107 M
## 5 Chirpoev 32.0 105 M
## 6 Chirpoev 33.5 111 M
There are columns called Island
, Weight
,
Length
, and Sex
in SeaLions
Part I: Fitting linear models
Sea Lion Body Sizes
You can extract the length and weight of the sea lion pups and plot the Weight against Length as follows:
#This creates new strings for columns in SeaLions. See environment to the right>>>
Length <- SeaLions$Length
Weight <- SeaLions$Weight
plot(Length, Weight)
But, a better, cleaner way to do this is with the
“~” (tilde) syntax. Here we’re putting the “response” variable (Weight)
against the “predictor” (Length), and the data frame is
SeaLions
:
NOTE: You can skip the section below as it is non-essential. But arguably “fun”.
ggplot
and installing packages
Or, if you like, you can use the fancy-shmancy ggplot
function to make plots. But you have to install a new “library” called
ggplot2
. To install a library, you use the
install.packages()
function:
You only need to install a package once on your computer. After that, it will always be there.
Alternatively, you can use the R
interface to install
new packages. To do so:
- Click the
Packages
tab in the bottom right window ofRstudio
and clickInstall
- … or … go to
Tools > Ìnstall Packages
in the menu bar. - In the blank “packages” line, start typing the name of the package
you want.
RStudio
should automatically fill this in with suggestions. Use a comma to separate the names of multiple package if installing more than one simultaneously. - Click
Install
. This will automatically put the correct code in the console for you.
From now on, whenever you need that package loaded you use the
library
function. You only have to do this once each time
you open R
, after that, a package stays in your library for
the rest of your session.
If you receive prewritten code from someone and open the file in
RStudio
, the software will automatically identify packages
called by the library
funtion. If you do not have these
packages installed, Rstudio
will ask you if you’d like to
install them, which can save you time. However, you will still need to
run the library
function on each package before you can use
them.
Now - finally - you have access to the funky / fancy library of ggplotting functions.
The ggplot function allows for some more flexible plotting, but (at least initialy) with more complicated syntax. For example, the equivalent of the plot above is:
Now it is easy to, for example, add sex:
And to do other nifty tricks. But for now, sticking to the so-called base plots is easier.
Fitting a linear model
Look closely at the plot
command above:
This is the exact syntax you need to fit a linear model with
the lm()
function, which fits a linear regression. That
means it finds the “best” values for a model that looks likes this:
\[Y_i = \alpha + \beta\, X_i + \epsilon_i\]
Where \(Y\) is the response variable, \(X\) is the predictor, \(\alpha\) is the intercept, \(\beta\) is the slope, and \(\epsilon\) is the “residual” variation, i.e. a random variable: \({\cal N}(0, \sigma^2)\).
##
## Call:
## lm(formula = Weight ~ Length, data = SeaLions)
##
## Coefficients:
## (Intercept) Length
## -59.5108 0.8468
If you just run this, you will see the output, which is just the estimate of the intercept and the slope (i.e. the Length effect). Think about how to interpret these numbers.
However, there is actually a lot more information in this linear model fit. But to get to it, you need to save the linear model output into a new object. Thus:
Now you can learn more by asking R to give you a summary:
##
## Call:
## lm(formula = Weight ~ Length, data = SeaLions)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.6832 -1.7238 0.0668 1.9294 7.7447
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -59.51078 2.22651 -26.73 <2e-16 ***
## Length 0.84685 0.02024 41.84 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.663 on 496 degrees of freedom
## Multiple R-squared: 0.7792, Adjusted R-squared: 0.7788
## F-statistic: 1750 on 1 and 496 DF, p-value: < 2.2e-16
Most important, the standard errors, p-values (especially on the slope) and R^2 values.
There are usually a few questions in a statistical model.
Is there a relationship between particular variables? The answer to that is in the p-value (is it less than 0.05). In this case YES IT IS! (Note: 2e-16 means \(2 \times 10^{-16}\), so that’s the probability that we would see a slope that strong if there were no relationship).
How good is the fitted model? This is provided by the \(R^2\) value, which in this case is 0.7 (1.0 is a perfect fit). Specifically it is the proportion of the variance explained by the model.
Finally, if there IS a significant relationship, how strong is that relationship? For that we need the point estimate on the Length effect (0.846 kg/cm) AND a 95% confidence interval. To get a 95% Confidence Intervals you need to add and subtract 2 x S.E. to the Estimate. Here, that’s probably easiest to do by “hand”, i.e.:
## [1] 0.80627
## [1] 0.88723
So the point estimate and 95% CI are 0.847 (0.806, 0.887).
You can also extract those numbers from the linear model fit as follows:
## [1] 0.8063681 0.8873311
Plotting the fit
If you have a linear model, you can easily plot the fit:
Note, abline
is a function that adds an intercept-slope
line to any existing plot (“a” - intercept, “b” - slope).
Here are some ways to make this a bit prettier:
plot(Weight ~ Length, data = SeaLions, pch = 19, col = rgb(0,0,0,.2))
abline(sealion.fit, col = "red", lwd = 2)
The pch = 19
makes filled dots,
col = rgb(0,0,0,.2)
is a way to make transparent black
colors, lwd = 2
is the line thickness,
col = "red"
.
Lab Exercise 1
You can subset the data into males and females with the following code:
Use these subsetted data and:
- Plot the relationship between Weight and Length for each of males and females
- Fit a model and add the regression line to the plot for males/females.
- Report the estimate at 95% C.I. of the slope coefficient for males and females. Based on these results, do they have different growth curves?
Part II: Linear models for exponential growth (WHAT!?)
Load the sea otter data and plot it:
It is quite easy to make plots on a log scale, by the way, which can be useful:
Simple linear Model
Note, we take a “model fit” (which is a complicated thing) and put it
into a new object with name Model1
. The output of
Model1
is just the estimated parameters for the intercept
and the “Year effect” (i.e. slope of the curve).
##
## Call:
## lm(formula = count ~ year, data = SeaOtters)
##
## Coefficients:
## (Intercept) year
## -65751.54 33.19
Again, there is a lot more information if you ask for a summary:
##
## Call:
## lm(formula = count ~ year, data = SeaOtters)
##
## Residuals:
## Min 1Q Median 3Q Max
## -152.74 -78.18 -58.79 34.71 417.48
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -65751.54 6158.81 -10.68 2.19e-10 ***
## year 33.19 3.08 10.78 1.83e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 142.4 on 23 degrees of freedom
## Multiple R-squared: 0.8347, Adjusted R-squared: 0.8275
## F-statistic: 116.1 on 1 and 23 DF, p-value: 1.83e-10
Of importance to us is the Estimate
(same as above) and
the Std. Error
which quantifies how precise our estimate
is. Two standard errors is (roughly) the same as a 95%
Confidence Interval. Thus, the slope on this fit is 33.2 new
sea otters / year ± 6.18. This is definitely bigger than 0! We can also
see that because the p-value (under column
Pr(>|t|)
) is very very very very small.
We can plot this linear model on our data:
Log Transformation
We kind of knew the growth was not linear (otherwise this week’s topic wouldn’t have been called Exponential Growth). But how can we use a linear model to fit an exponential growth function?
With a simple transformation!
\[\log(Y_i) = \alpha + \beta\, X_i + \epsilon_i\]
How does this relate to an exponential growth model?
\[e^{\log(Y_i)} = e^{\alpha + \beta\, X_i + \epsilon_i}\] \[Y_t = N_0 e^{\beta\, X_t + \epsilon_t}\]
We just replaced \(e^\alpha\) with \(N_0\), and \(i\) with \(t\). But \(\beta\) is EXACTLY the growth rate \(r\) (and \(e^\beta = \lambda\)). The \(\epsilon_i\) hanging on the end is a bit of stochasticity.
Anyways, doing this with our linear modeling tools is easy. We just
wrap count
in a log.
##
## Call:
## lm(formula = log(count) ~ year, data = SeaOtters)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.191084 -0.062944 -0.005104 0.055518 0.231704
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.402e+02 4.732e+00 -29.63 <2e-16 ***
## year 7.325e-02 2.367e-03 30.95 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1094 on 23 degrees of freedom
## Multiple R-squared: 0.9766, Adjusted R-squared: 0.9755
## F-statistic: 958.1 on 1 and 23 DF, p-value: < 2.2e-16
Note how our statistics changed. Specifically: our \(\beta\) estimate (which is also \(r\)) is now 0.07325, which is a good estimate of the intrinsic growth rate. Even better, we know have a standard error around that estimate: 0.00237, which means our growth rate can be written as: \(\widehat{r} = (7.32 \pm 0.47)\times 10^{-2}\). We can convert that to an annual growth rate in R as well (do NOT worry about understanding the following code:
r.hat <- summary(logotter.fit)$coefficients[2,1]
r.sd <- summary(logotter.fit)$coefficients[2,2]
exp(c(r.hat, r.hat-2*r.sd, r.hat+2*r.sd))
## [1] 1.076001 1.070920 1.081106
So our estimate of annual population growth is 7.6% (95% Confidence Interval 7.1%-8.1%). Note how much higher the \(R^2\) value is! This is (by that measure also) a much better fit.
To visualize the fit:
Much better!
Let’s make our final graph a little nicer
plot(log(count) ~ year, main = 'Sea Otter Population',
col = 'brown', pch = 4, data = SeaOtters, lwd = 2)
abline(logotter.fit, col = "blue", lwd = 2)
The commands we’re using within the plot
function
include setting the title of the graph (main =
), changing
the color of the symbols and the line (col =
), and changing
the shape of the symbols (pch =
). There also many more
alterations you can make to get your graph perfect - but that can be a
deep dark R
abbit hole!
Finally, we can also draw the fit on non-log transformed data as follows:
plot(count ~ year, main = 'Sea Otter Population',
col = 'brown', pch = 4, data = SeaOtters, lwd = 2)
curve(exp(logotter.fit$coef[1] +
logotter.fit$coef[2] * x), col = "blue", add = TRUE, lwd = 2)
Lab Exercise 2
In the next homework assignment (due next Thursday, March 2) you will be asked to analyze the population growth rate of sea otters in British Columbia, the population just to the north of Washington state. The data are here in a file called
BC_SeaOtters
.For this exercise all you need to do is (1) Load the data, (2) plot the data, and (3) report how many were released, in what year, how many are there in the final year of data.
In the homework, you will estimate the growth rate with confidence intervals as well, but you don’t need to worry about that for the lab.
NOTE: You can skip the section below as it is non-essential. But arguably “fun”.
More ggplot
plotting
It can be a bit fussy to plot the confidence interval around our
prediction. Below, we include a little snippet of code which does just
that using ggplot
. You do not need to understand this code
(there’s actually quite a bit going on to get it to work right), but the
output is pretty.
require(ggplot2)
ggplot(SeaOtters, aes(year, count)) + geom_point() +
stat_smooth(method = "glm",
method.args = list(family = gaussian(link = "log")))
## `geom_smooth()` using formula = 'y ~ x'
This really looks like a nice fit!
Lab Exercise 3
The final exercise is NOT in R! It is just some practice experimneting with the stochastic population simulator here: https://egurarie.shinyapps.io/StochasticGrowth. The goal will be to see how sensitive the sea otter success is to the initial population released.
In the Demographic Stochasticity simulator, change the probability of giving birth to 0.3, and the probability of dying at 0.2 (these numbers approximate sea otter population growth). Set \(t_{max}\) to 40 years. Increase the number of simulations to 1000.
Report the probability of extinction due to demographic stochasticity and the final median population size if only (a) 2, (b) 10 and (c) 60 sea otters were released.In the Environmental Stochasticity simulater, set the mean population growth rate (\(\mu_r\)) to 1% (i.e. 0.01), the \(t_{max}\) to 40, the number of otters released (\(N_0\)) to 60, and the number of simulations to 1000. Again, this (very) roughly simulates sea otter population growth.
Report the probability of extinction due to environmental stochasticity and the final median population size if the standard deviation on the growth rate (\(\sigma_r\)) varies between: (a) 0.05, (b) 0.1, (c) 1, and (d) 2.