Fitting Models and Stochasticity

EFB 370: Recitation II.2

Dr. Gurarie and Colton Moyer

2024-02-22

There are THREE exercises in this lab, set aside in green boxes like this one. Two involve R, and one involves a seperate shiny app (link below) To work through Exercise 1, you’ll need the same files as last week: SeaLions.csv and SeaOtters.csv (also available on Blackboard). For Exercise 2, you will also need to download BC_SeaOtters.csv.

FOR SUBMISSION Please Upload a word document with answers to: Exercise 1: parts 1-3; Exercise 2: part 3; Exercise 3: parts 1 and 2 in the Recitation 5 (R pt 2) on Blackboard by midnight, Friday, Feb 23.

R Project digression: For importing data “best practice” would be to create a new R project in its own file folder and then move data files (.csv‘s) into that folder. R looks for files in it’s project’s directory. This is your projects whole world. If you ask it to find a file located in ’Downloads’, R will say something like no such file exists in working directory.

For loading data, there are several options. The cleanest is probably to use the following script

SeaLions <- read.csv("SeaLions.csv")
SeaOtters <- read.csv("SeaOtters.csv")

But that will work only if you are in the right directory! That is if the working directory, given by:

getwd()
## [1] "C:/Users/Elie/teaching/EFB370/labs/labII.2"

Here are all the files in my working directory:

list.files()
## [1] "BC_SeaOtters.csv"               "RandomnessAndLinearModels.html"
## [3] "RandomnessAndLinearModels.Rmd"  "SeaLions.csv"                  
## [5] "SeaOtters.csv"

Is the same as the location of the files. More details available upon request!

An alternative point-and-click way to import data is by going to File>Import Dataset>From Text (base) and navigate to the the files in your library (they are probably in your downloads folder). Then click Import.

This will create objects in your Rstudio Environment (to the right>>>) Called SeaOtters and SeaLions. You can also view the data by clicking on the tab SeaOtters above^^^.

#head() is a useful command for viewing/understanding data structure. This will show you the first 6 rows of a dataframe.
head(SeaLions)
##     Island Weight Length Sex
## 1 Chirpoev   15.5     93   M
## 2 Chirpoev   29.0    106   F
## 3 Chirpoev   35.5    112   M
## 4 Chirpoev   32.0    107   M
## 5 Chirpoev   32.0    105   M
## 6 Chirpoev   33.5    111   M

There are columns called Island, Weight, Length, and Sex in SeaLions

Part I: Fitting linear models

Sea Lion Body Sizes

You can extract the length and weight of the sea lion pups and plot the Weight against Length as follows:

#This creates new strings for columns in SeaLions. See environment to the right>>>
Length <- SeaLions$Length
Weight <- SeaLions$Weight
plot(Length, Weight)

But, a better, cleaner way to do this is with the “~” (tilde) syntax. Here we’re putting the “response” variable (Weight) against the “predictor” (Length), and the data frame is SeaLions:

plot(Weight ~ Length, data = SeaLions)

NOTE: You can skip the section below as it is non-essential. But arguably “fun”.

ggplot and installing packages

Or, if you like, you can use the fancy-shmancy ggplot function to make plots. But you have to install a new “library” called ggplot2. To install a library, you use the install.packages() function:

install.packages("ggplot2")

You only need to install a package once on your computer. After that, it will always be there.

Alternatively, you can use the R interface to install new packages. To do so:

  • Click the Packages tab in the bottom right window of Rstudio and click Install
  • … or … go to Tools > Ìnstall Packages in the menu bar.
  • In the blank “packages” line, start typing the name of the package you want. RStudio should automatically fill this in with suggestions. Use a comma to separate the names of multiple package if installing more than one simultaneously.
  • Click Install. This will automatically put the correct code in the console for you.

From now on, whenever you need that package loaded you use the library function. You only have to do this once each time you open R, after that, a package stays in your library for the rest of your session.

library(ggplot2)

If you receive prewritten code from someone and open the file in RStudio, the software will automatically identify packages called by the library funtion. If you do not have these packages installed, Rstudio will ask you if you’d like to install them, which can save you time. However, you will still need to run the library function on each package before you can use them.

Now - finally - you have access to the funky / fancy library of ggplotting functions.

require(ggplot2)

The ggplot function allows for some more flexible plotting, but (at least initialy) with more complicated syntax. For example, the equivalent of the plot above is:

ggplot(SeaLions, aes(x = Length, y = Weight)) + geom_point()

Now it is easy to, for example, add sex:

ggplot(SeaLions, aes(x = Length, y = Weight, color = Sex)) + geom_point()

And to do other nifty tricks. But for now, sticking to the so-called base plots is easier.

Fitting a linear model

Look closely at the plot command above:

plot(Weight ~ Length, data = SeaLions)

This is the exact syntax you need to fit a linear model with the lm() function, which fits a linear regression. That means it finds the “best” values for a model that looks likes this:

\[Y_i = \alpha + \beta\, X_i + \epsilon_i\]

Where \(Y\) is the response variable, \(X\) is the predictor, \(\alpha\) is the intercept, \(\beta\) is the slope, and \(\epsilon\) is the “residual” variation, i.e. a random variable: \({\cal N}(0, \sigma^2)\).

lm(Weight ~ Length, data = SeaLions)
## 
## Call:
## lm(formula = Weight ~ Length, data = SeaLions)
## 
## Coefficients:
## (Intercept)       Length  
##    -59.5108       0.8468

If you just run this, you will see the output, which is just the estimate of the intercept and the slope (i.e. the Length effect). Think about how to interpret these numbers.

However, there is actually a lot more information in this linear model fit. But to get to it, you need to save the linear model output into a new object. Thus:

sealion.fit <- lm(Weight ~ Length, data = SeaLions)

Now you can learn more by asking R to give you a summary:

summary(sealion.fit)
## 
## Call:
## lm(formula = Weight ~ Length, data = SeaLions)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.6832 -1.7238  0.0668  1.9294  7.7447 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -59.51078    2.22651  -26.73   <2e-16 ***
## Length        0.84685    0.02024   41.84   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.663 on 496 degrees of freedom
## Multiple R-squared:  0.7792, Adjusted R-squared:  0.7788 
## F-statistic:  1750 on 1 and 496 DF,  p-value: < 2.2e-16

Most important, the standard errors, p-values (especially on the slope) and R^2 values.

There are usually a few questions in a statistical model.

  1. Is there a relationship between particular variables? The answer to that is in the p-value (is it less than 0.05). In this case YES IT IS! (Note: 2e-16 means \(2 \times 10^{-16}\), so that’s the probability that we would see a slope that strong if there were no relationship).

  2. How good is the fitted model? This is provided by the \(R^2\) value, which in this case is 0.7 (1.0 is a perfect fit). Specifically it is the proportion of the variance explained by the model.

  3. Finally, if there IS a significant relationship, how strong is that relationship? For that we need the point estimate on the Length effect (0.846 kg/cm) AND a 95% confidence interval. To get a 95% Confidence Intervals you need to add and subtract 2 x S.E. to the Estimate. Here, that’s probably easiest to do by “hand”, i.e.:

(CI.low <- 0.84675 - 2*0.02024)
## [1] 0.80627
(CI.high <- 0.84675 + 2*0.02024)
## [1] 0.88723

So the point estimate and 95% CI are 0.847 (0.806, 0.887).

You can also extract those numbers from the linear model fit as follows:

summary(sealion.fit)$coef[2,1] + c(-2,2) * summary(sealion.fit)$coef[2,2]
## [1] 0.8063681 0.8873311

Plotting the fit

If you have a linear model, you can easily plot the fit:

plot(Weight ~ Length, data = SeaLions)
abline(sealion.fit)

Note, abline is a function that adds an intercept-slope line to any existing plot (“a” - intercept, “b” - slope).

Here are some ways to make this a bit prettier:

plot(Weight ~ Length, data = SeaLions, pch = 19, col = rgb(0,0,0,.2))
abline(sealion.fit, col = "red", lwd = 2)

The pch = 19 makes filled dots, col = rgb(0,0,0,.2) is a way to make transparent black colors, lwd = 2 is the line thickness, col = "red".

Lab Exercise 1

You can subset the data into males and females with the following code:

SeaLion.males <- subset(SeaLions, Sex == "M")
SeaLion.females <- subset(SeaLions, Sex == "F")

Use these subsetted data and:

  1. Plot the relationship between Weight and Length for each of males and females
  2. Fit a model and add the regression line to the plot for males/females.
  3. Report the estimate at 95% C.I. of the slope coefficient for males and females. Based on these results, do they have different growth curves?

Part II: Linear models for exponential growth (WHAT!?)

Load the sea otter data and plot it:

plot(count ~ year, data = SeaOtters)

It is quite easy to make plots on a log scale, by the way, which can be useful:

plot(count ~ year, data = SeaOtters, log = "y")

Simple linear Model

otter.fit <- lm(count ~ year, data = SeaOtters)

Note, we take a “model fit” (which is a complicated thing) and put it into a new object with name Model1. The output of Model1 is just the estimated parameters for the intercept and the “Year effect” (i.e. slope of the curve).

otter.fit
## 
## Call:
## lm(formula = count ~ year, data = SeaOtters)
## 
## Coefficients:
## (Intercept)         year  
##   -65751.54        33.19

Again, there is a lot more information if you ask for a summary:

summary(otter.fit)
## 
## Call:
## lm(formula = count ~ year, data = SeaOtters)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -152.74  -78.18  -58.79   34.71  417.48 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -65751.54    6158.81  -10.68 2.19e-10 ***
## year            33.19       3.08   10.78 1.83e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 142.4 on 23 degrees of freedom
## Multiple R-squared:  0.8347, Adjusted R-squared:  0.8275 
## F-statistic: 116.1 on 1 and 23 DF,  p-value: 1.83e-10

Of importance to us is the Estimate (same as above) and the Std. Error which quantifies how precise our estimate is. Two standard errors is (roughly) the same as a 95% Confidence Interval. Thus, the slope on this fit is 33.2 new sea otters / year ± 6.18. This is definitely bigger than 0! We can also see that because the p-value (under column Pr(>|t|)) is very very very very small.

We can plot this linear model on our data:

plot(count ~ year, data = SeaOtters)
abline(otter.fit)

Log Transformation

We kind of knew the growth was not linear (otherwise this week’s topic wouldn’t have been called Exponential Growth). But how can we use a linear model to fit an exponential growth function?

With a simple transformation!

\[\log(Y_i) = \alpha + \beta\, X_i + \epsilon_i\]

How does this relate to an exponential growth model?

\[e^{\log(Y_i)} = e^{\alpha + \beta\, X_i + \epsilon_i}\] \[Y_t = N_0 e^{\beta\, X_t + \epsilon_t}\]

We just replaced \(e^\alpha\) with \(N_0\), and \(i\) with \(t\). But \(\beta\) is EXACTLY the growth rate \(r\) (and \(e^\beta = \lambda\)). The \(\epsilon_i\) hanging on the end is a bit of stochasticity.

Anyways, doing this with our linear modeling tools is easy. We just wrap count in a log.

logotter.fit <- lm(log(count) ~ year, data = SeaOtters)
summary(logotter.fit)
## 
## Call:
## lm(formula = log(count) ~ year, data = SeaOtters)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.191084 -0.062944 -0.005104  0.055518  0.231704 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.402e+02  4.732e+00  -29.63   <2e-16 ***
## year         7.325e-02  2.367e-03   30.95   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1094 on 23 degrees of freedom
## Multiple R-squared:  0.9766, Adjusted R-squared:  0.9755 
## F-statistic: 958.1 on 1 and 23 DF,  p-value: < 2.2e-16

Note how our statistics changed. Specifically: our \(\beta\) estimate (which is also \(r\)) is now 0.07325, which is a good estimate of the intrinsic growth rate. Even better, we know have a standard error around that estimate: 0.00237, which means our growth rate can be written as: \(\widehat{r} = (7.32 \pm 0.47)\times 10^{-2}\). We can convert that to an annual growth rate in R as well (do NOT worry about understanding the following code:

r.hat <- summary(logotter.fit)$coefficients[2,1]
r.sd <- summary(logotter.fit)$coefficients[2,2]
exp(c(r.hat, r.hat-2*r.sd, r.hat+2*r.sd))
## [1] 1.076001 1.070920 1.081106

So our estimate of annual population growth is 7.6% (95% Confidence Interval 7.1%-8.1%). Note how much higher the \(R^2\) value is! This is (by that measure also) a much better fit.

To visualize the fit:

plot(log(count) ~ year, data = SeaOtters)
abline(logotter.fit)

Much better!

Let’s make our final graph a little nicer

plot(log(count) ~ year, main = 'Sea Otter Population', 
     col = 'brown', pch = 4, data = SeaOtters, lwd = 2)
abline(logotter.fit, col = "blue", lwd = 2)

The commands we’re using within the plot function include setting the title of the graph (main =), changing the color of the symbols and the line (col =), and changing the shape of the symbols (pch =). There also many more alterations you can make to get your graph perfect - but that can be a deep dark Rabbit hole!

Finally, we can also draw the fit on non-log transformed data as follows:

plot(count ~ year, main = 'Sea Otter Population', 
     col = 'brown', pch = 4, data = SeaOtters, lwd = 2)
curve(exp(logotter.fit$coef[1] + 
            logotter.fit$coef[2] * x), col = "blue", add = TRUE, lwd = 2)

Lab Exercise 2

In the next homework assignment (due next Thursday, March 2) you will be asked to analyze the population growth rate of sea otters in British Columbia, the population just to the north of Washington state. The data are here in a file called BC_SeaOtters.

For this exercise all you need to do is (1) Load the data, (2) plot the data, and (3) report how many were released, in what year, how many are there in the final year of data.

In the homework, you will estimate the growth rate with confidence intervals as well, but you don’t need to worry about that for the lab.

NOTE: You can skip the section below as it is non-essential. But arguably “fun”.

More ggplot plotting

It can be a bit fussy to plot the confidence interval around our prediction. Below, we include a little snippet of code which does just that using ggplot. You do not need to understand this code (there’s actually quite a bit going on to get it to work right), but the output is pretty.

require(ggplot2)

ggplot(SeaOtters, aes(year, count)) + geom_point() + 
    stat_smooth(method = "glm", 
                method.args = list(family = gaussian(link = "log")))
## `geom_smooth()` using formula = 'y ~ x'

This really looks like a nice fit!

Lab Exercise 3

The final exercise is NOT in R! It is just some practice experimneting with the stochastic population simulator here: https://egurarie.shinyapps.io/StochasticGrowth. The goal will be to see how sensitive the sea otter success is to the initial population released.

  1. In the Demographic Stochasticity simulator, change the probability of giving birth to 0.3, and the probability of dying at 0.2 (these numbers approximate sea otter population growth). Set \(t_{max}\) to 40 years. Increase the number of simulations to 1000.
    Report the probability of extinction due to demographic stochasticity and the final median population size if only (a) 2, (b) 10 and (c) 60 sea otters were released.

  2. In the Environmental Stochasticity simulater, set the mean population growth rate (\(\mu_r\)) to 1% (i.e. 0.01), the \(t_{max}\) to 40, the number of otters released (\(N_0\)) to 60, and the number of simulations to 1000. Again, this (very) roughly simulates sea otter population growth.
    Report the probability of extinction due to environmental stochasticity and the final median population size if the standard deviation on the growth rate (\(\sigma_r\)) varies between: (a) 0.05, (b) 0.1, (c) 1, and (d) 2.