There are TWO exercises in this lab, set aside in green boxes like this one. To work through the exercises in this lab, you’ll need to download the SeaLions.csv and SeaOtters.csv datasets as last week by following the links (also available on Blackboard). For Exercise 2, you will also need to download BC_SeaOtters.csv. If you’ve forgotten how, follow the instructions from the previous lab.

SeaOtters <- read.csv("SeaOtters.csv")
SeaLions <- read.csv("SeaLions.csv")

Part I: Fitting linear models

Sea Lion Body Sizes

You can extract the length and weight of the sea lion pups and plot the Weight against Length as follows:

Length <- SeaLions$Length
Weight <- SeaLions$Weight
plot(Length, Weight)

But, a better, cleaner way to do this is with the “~” (tilde) syntax. Here we’re putting the “response” variable (Weight) against the “predictor” (Length), and the data frame is SeaLions:

plot(Weight ~ Length, data = SeaLions)

NOTE: You can skip the section below as it is non-essential. But arguably “fun”.

ggplot and installing packages

Or, if you like, you can use the fancy-shmancy ggplot function to make plots. But you have to install a new “library” called ggplot2. To install a library, you use the install.packages() function:

install.packages("ggplot2")

You only need to install a package once on your computer. After that, it will always be there.

Alternatively, you can use the R interface to install new packages. To do so:

  • Click the Packages tab in the bottom right window of Rstudio and click Install
  • … or … go to Tools > Ìnstall Packages in the menu bar.
  • In the blank “packages” line, start typing the name of the package you want. RStudio should automatically fill this in with suggestions. Use a comma to separate the names of multiple package if installing more than one simultaneously.
  • Click Install. This will automatically put the correct code in the console for you.

From now on, whenever you need that package loaded you use the library function. You only have to do this once each time you open R, after that, a package stays in your library for the rest of your session.

library(ggplot2)

If you receive prewritten code from someone and open the file in RStudio, the software will automatically identify packages called by the library funtion. If you do not have these packages installed, Rstudio will ask you if you’d like to install them, which can save you time. However, you will still need to run the library function on each package before you can use them.

Now - finally - you have access to the funky / fancy library of ggplotting functions.

require(ggplot2)

The ggplot function allows for some more flexible plotting, but (at least initialy) with more complicated syntax. For example, the equivalent of the plot above is:

ggplot(SeaLions, aes(x = Length, y = Weight)) + geom_point()

Now it is easy to, for example, add sex:

ggplot(SeaLions, aes(x = Length, y = Weight, color = Sex)) + geom_point()

And to do other nifty tricks. But for now, sticking to the so-called base plots is easier.

Fitting a linear model

Look closely at the plot command above:

plot(Weight ~ Length, data = SeaLions)

This is the exact syntax you need to fit a linear model with the lm() function, which fits a linear regression. That means it finds the “best” values for a model that looks likes this:

\[Y_i = \alpha + \beta\, X_i + \epsilon_i\]

Where \(Y\) is the response variable, \(X\) is the predictor, \(\alpha\) is the intercept, \(\beta\) is the slope, and \(\epsilon\) is the “residual” variation, i.e. a random variable: \({\cal N}(0, \sigma^2)\).

lm(Weight ~ Length, data = SeaLions)
## 
## Call:
## lm(formula = Weight ~ Length, data = SeaLions)
## 
## Coefficients:
## (Intercept)       Length  
##    -59.5108       0.8468

If you just run this, you will see the output, which is just the estimate of the intercept and the slope (i.e. the Length effect). Think about how to interpret these numbers.

However, there is actually a lot more information in this linear model fit. But to get to it, you need to save the linear model output into a new object. Thus:

sealion.fit <- lm(Weight ~ Length, data = SeaLions)

Now you can learn more:

summary(sealion.fit)
## 
## Call:
## lm(formula = Weight ~ Length, data = SeaLions)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.6832 -1.7238  0.0668  1.9294  7.7447 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -59.51078    2.22651  -26.73   <2e-16 ***
## Length        0.84685    0.02024   41.84   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.663 on 496 degrees of freedom
## Multiple R-squared:  0.7792, Adjusted R-squared:  0.7788 
## F-statistic:  1750 on 1 and 496 DF,  p-value: < 2.2e-16

Most important, the standard errors, p-values (especially on the slope) and R^2 values.

There are usually a few questions in a statistical model.

  1. Is there a relationship between particular variables? The answer to that is in the p-value (is it less than 0.05). In this case YES IT IS! (Note: 2e-16 means \(2 \times 10^{-16}\), so that’s the probability that we would see a slope that strong if there were no relationship).

  2. How good is the fitted model? This is provided by the \(R^2\) value, which in this case is 0.7 (1.0 is a perfect fit). Specifically it is the proportion of the variance explained by the model.

  3. Finally, if there IS a significant relationship, how strong is that relationship? For that we need the point estimate on the Length effect (0.846 kg/cm) AND a 95% confidence interval. To get a 95% Confidence Intervals you need to add and subtract 2 x S.E. to the Estimate. Here, that’s probably easiest to do by “hand”, i.e.:

(CI.low <- 0.84675 - 2*0.02024)
## [1] 0.80627
(CI.high <- 0.84675 + 2*0.02024)
## [1] 0.88723

So the point estimate and 95% CI are 0.847 (0.806, 0.887).

You can also extract those numbers from the linear model fit as follows:

summary(sealion.fit)$coef[2,1] + c(-2,2) * summary(sealion.fit)$coef[2,2]
## [1] 0.8063681 0.8873311

Plotting the fit

If you have a linear model, you can easily plot the fit:

plot(Weight ~ Length, data = SeaLions)
abline(sealion.fit)

Note, abline is a function that adds an intercept-slope line to any existing plot (“a” - intercept, “b” - slope).

Here are some ways to make this a bit prettier:

plot(Weight ~ Length, data = SeaLions, pch = 19, col = rgb(0,0,0,.2))
abline(sealion.fit, col = "red", lwd = 2)

The pch = 19 makes filled dots, col = rgb(0,0,0,.2) is a way to make transparent black colors, lwd = 2 is the line thickness, col = "red".

Lab Exercise 1

You can subset the data into males and females with the following code:

SeaLion.males <- subset(SeaLions, Sex == "M")
SeaLion.females <- subset(SeaLions, Sex == "F")

Use these subsetted data and:

  1. Plot the relationship between Weight and Length for each of males and females
  2. Fit a model and add the regression line to the plot.
  3. Report the estimate at 95% C.I. of the slope coefficient for males and females. Based on these results, do they have different growth curves?

Part II: Linear models for exponential growth (WHAT!?)

Load the sea otter data and plot it:

plot(count ~ year, data = SeaOtters)

It is quite easy to make plots on a log scale, by the way, which can be useful:

plot(count ~ year, data = SeaOtters, log = "y")

Simple linear Model

otter.fit <- lm(count ~ year, data = SeaOtters)

Note, we take a “model fit” (which is a complicated thing) and put it into a new object with name Model1. The output of Model1 is just the estimated parameters for the intercept and the “Year effect” (i.e. slope of the curve).

otter.fit
## 
## Call:
## lm(formula = count ~ year, data = SeaOtters)
## 
## Coefficients:
## (Intercept)         year  
##   -65751.54        33.19

Again, there is a lot more information if you ask for a summary:

summary(otter.fit)
## 
## Call:
## lm(formula = count ~ year, data = SeaOtters)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -152.74  -78.18  -58.79   34.71  417.48 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -65751.54    6158.81  -10.68 2.19e-10 ***
## year            33.19       3.08   10.78 1.83e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 142.4 on 23 degrees of freedom
## Multiple R-squared:  0.8347, Adjusted R-squared:  0.8275 
## F-statistic: 116.1 on 1 and 23 DF,  p-value: 1.83e-10

Of importance to us is the Estimate (same as above) and the Std. Error which quantifies how precise our estimate is. Two standard errors is (roughly) the same as a 95% Confidence Interval. Thus, the slope on this fit is 33.2 new sea otters / year ± 6.18. This is definitely bigger than 0! We can also see that because the p-value (under column Pr(>|t|)) is very very very very small.

We can plot this linear model on our data:

plot(count ~ year, data = SeaOtters)
abline(otter.fit)

Log Transformation

We kind of knew the growth was not linear (otherwise this week’s topic wouldn’t have been called Exponential Growth). But how can we use a linear model to fit an exponential growth function?

With a simple transformation!

\[\log(Y_i) = \alpha + \beta\, X_i + \epsilon_i\]

How does this relate to an exponential growth model?

\[e^{\log(Y_i)} = e^{\alpha + \beta\, X_i + \epsilon_i}\] \[Y_t = N_0 e^{\beta\, X_t + \epsilon_t}\]

We replaced \(e^\alpha\) with \(N_0\), and \(i\) with \(t\). But \(\beta\) is EXACTLY the growth rate \(r\) (and \(e^\beta = \lambda\)). The \(\epsilon_i\) hanging on the end is a bit of stochasticity.

Anyways, doing this with our linear modeling tools is easy. We just wrap count in a log.

logotter.fit <- lm(log(count) ~ year, data = SeaOtters)
summary(logotter.fit)
## 
## Call:
## lm(formula = log(count) ~ year, data = SeaOtters)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.191084 -0.062944 -0.005104  0.055518  0.231704 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.402e+02  4.732e+00  -29.63   <2e-16 ***
## year         7.325e-02  2.367e-03   30.95   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1094 on 23 degrees of freedom
## Multiple R-squared:  0.9766, Adjusted R-squared:  0.9755 
## F-statistic: 958.1 on 1 and 23 DF,  p-value: < 2.2e-16

Note how our statistics changed. Specifically: our \(\beta\) estimate (which is also \(r\)) is now 0.07325, which is a good estimate of the intrinsic growth rate. Even better, we now have a standard error around that estimate: 0.00237, which means our growth rate can be written as: \(\widehat{r} = (7.32 \pm 0.47)\times 10^{-2}\). We can convert that to an annual growth rate in R as well (do NOT worry about understanding the following code:

r.hat <- summary(logotter.fit)$coefficients[2,1]
r.sd <- summary(logotter.fit)$coefficients[2,2]
exp(c(r.hat, r.hat-2*r.sd, r.hat+2*r.sd))
## [1] 1.076001 1.070920 1.081106

So our estimate of annual population growth is 7.6% (95% Confidence Interval 7.1%-8.1%). Note how much higher the \(R^2\) value is! This is (by that measure also) a much better fit.

To visualize the fit:

plot(log(count) ~ year, data = SeaOtters)
abline(logotter.fit)

Much better!

Let’s make our final graph a little nicer

plot(log(count) ~ year, main = 'Sea Otter Population', 
     col = 'brown', pch = 4, data = SeaOtters, lwd = 2)
abline(logotter.fit, col = "blue", lwd = 2)

The commands we’re using within the plot function include setting the title of the graph (main =), changing the color of the symbols and the line (col =), and changing the shape of the symbols (pch =). There also many more alterations you can make to get your graph perfect - but that can be a deep dark Rabbit hole!

Finally, we can also draw the fit on non-log transformed data as follows:

plot(count ~ year, main = 'Sea Otter Population', 
     col = 'brown', pch = 4, data = SeaOtters, lwd = 2)
curve(exp(logotter.fit$coef[1] + 
            logotter.fit$coef[2] * x), col = "blue", add = TRUE, lwd = 2)

Lab Exercise 2

Analyze the population growth rate of sea otters in British Columbia, the population just to the north of Washington state. The data are here in a file called BC_SeaOtters.

  1. Load the data
  2. plot the data,
  3. report how many were released, in what year, how many are there in the final year of data,
  4. estimate the growth rate with confidence intervals.

NOTE: You can skip the section below as it is non-essential. But arguably “fun”.

More ggplot plotting

It can be a bit fussy to plot the confidence interval around our prediction. Below, we include a little snippet of code which does just that using ggplot. You do not need to understand this code (there’s actually quite a bit going on to get it to work right), but the output is pretty.

require(ggplot2)

ggplot(SeaOtters, aes(year, count)) + geom_point() + 
    stat_smooth(method = "glm", 
                method.args = list(family = gaussian(link = "log")))
## `geom_smooth()` using formula = 'y ~ x'

This really looks like a nice fit!