To work through this lab, you’ll need to download the
SeaOtters.csv
dataset by following the links (also available on Blackboard). For the exercise, you’ll also need to downloadBC_SeaOtters.csv
. If you’ve forgotten how, follow the instructions from the previous lab.The TLDR version of this lab is:
lm(log(Y)~X)
will give you the fitted estimates of an exponential model.
SeaOtters <- read.csv("SeaOtters.csv")
Load the sea otter data. It has two columns: year
and
count
:
SeaOtters
and plot the data. There are several ways to do this (as you might remember). ALl the following lines of code to the same thing:
## plot(SeaOtters$year, SeaOtters$count)
## plot(SeaOtters$count~SeaOtters$year)
look closely at the at code and understand the difference in syntax …
basically that plot(X,Y)
is the same as
plot(Y~X)
. Given our goal of modeling, the best version is
this:
plot(count~year, data = SeaOtters)
This says (in English) plot count (response) against year (the predictor) using the data frame SeaOtters.
It is quite easy to make plots on a log scale, by the way, which can be useful:
plot(count ~ year, data = SeaOtters, log = "y")
NOTE: You can skip the section below as it is non-essential. But arguably “fun”.
ggplot
Or, if you like, you can use the fancy-shmancy ggplot
function to make plots. But you have to install a new “library” called
ggplot2
. To install a library, you use the
install.packages()
function:
install.packages("ggplot2")
You only need to install a package once on your computer. After that, it will always be there.
Alternatively, you can use the R
interface to install
new packages. To do so:
Packages
tab in the bottom right window of
Rstudio
and click Install
Tools > Ìnstall Packages
in the menu
bar.RStudio
should automatically fill this in with
suggestions. Use a comma to separate the names of multiple package if
installing more than one simultaneously.Install
. This will automatically put the correct
code in the console for you.From now on, whenever you need that package loaded you use the
library
function. You only have to do this once each time
you open R
, after that, a package stays in your library for
the rest of your session.
library(ggplot2)
If you receive prewritten code from someone and open the file in
RStudio
, the software will automatically identify packages
called by the library
funtion. If you do not have these
packages installed, Rstudio
will ask you if you’d like to
install them, which can save you time. However, you will still need to
run the library
function on each package before you can use
them. 0 Now - finally - you have access to the funky /
fancy library of ggplotting functions.
The ggplot function allows for some more flexible plotting, but (at least initialy) with more complicated syntax. For example, the equivalent of the plot above is:
ggplot(SeaOtters, aes(x = year, y = count)) + geom_point()
Now it is easy to add other features. Unfortunately, this data set only has the two columns, but you can still have some fun. For example:
ggplot(SeaOtters, aes(x = year, y = count, color = count, size = year)) + geom_point()
Let’s first start by fitting a straightforward linear model, which is to say we’re looking for values of \(\alpha\) (intercept) and \(\beta\) (slope) in this equation:
\[Y_i = \alpha + \beta\, X_i + \epsilon_i\].
where \(X\) is a predictor, and \(Y\) is a response, and \(\epsilon\) is unexplained (residual) variation. Remember, that the syntax is exactly the same as for plotting:
otter.fit <- lm(count ~ year, data = SeaOtters)
Note, we take a “model fit” (which is a complicated thing) and we put
it into a new object called otter.fit
. The output is just
the estimated parameters for the intercept and the “Year effect”
(i.e. slope of the curve).
otter.fit
##
## Call:
## lm(formula = count ~ year, data = SeaOtters)
##
## Coefficients:
## (Intercept) year
## -65751.54 33.19
Again, there is a lot more information if you ask for a summary:
summary(otter.fit)
##
## Call:
## lm(formula = count ~ year, data = SeaOtters)
##
## Residuals:
## Min 1Q Median 3Q Max
## -152.74 -78.18 -58.79 34.71 417.48
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -65751.54 6158.81 -10.68 2.19e-10 ***
## year 33.19 3.08 10.78 1.83e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 142.4 on 23 degrees of freedom
## Multiple R-squared: 0.8347, Adjusted R-squared: 0.8275
## F-statistic: 116.1 on 1 and 23 DF, p-value: 1.83e-10
Of importance to us is the Estimate
(same as above) and
the Std. Error
which quantifies how precise our estimate
is. Two standard errors is (roughly) the same as a 95%
Confidence Interval. Thus, the slope on this fit is 33.2 new
sea otters / year ± 6.18. This is definitely bigger than 0! We can also
see that because the p-value (under column
Pr(>|t|)
) is very very very very small.
We can plot this linear model on our data:
plot(count ~ year, data = SeaOtters)
abline(otter.fit)
What do you observe about this fit!? Is it a good one?
We kind of knew the growth was not linear (otherwise this week’s topic wouldn’t have been called Exponential Growth). But how can we use a linear model to fit an exponential growth function?
With a simple transformation!
\[\log(Y_i) = \alpha + \beta\, X_i + \epsilon_i\]
How does this relate to an exponential growth model?
\[e^{\log(Y_i)} = e^{\alpha + \beta\, X_i + \epsilon_i}\] \[Y_t = N_0 e^{\beta\, X_t + \epsilon_t}\]
We replaced \(e^\alpha\) with \(N_0\), and \(i\) with \(t\). But \(\beta\) is EXACTLY the growth rate \(r\) (and \(e^\beta = \lambda\)). The \(\epsilon_i\) hanging on the end is a bit of randomness
Anyways, doing this with our linear modeling tools is easy. We just
wrap count
in a log()
function.
logotter.fit <- lm(log(count) ~ year, data = SeaOtters)
summary(logotter.fit)
##
## Call:
## lm(formula = log(count) ~ year, data = SeaOtters)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.191084 -0.062944 -0.005104 0.055518 0.231704
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.402e+02 4.732e+00 -29.63 <2e-16 ***
## year 7.325e-02 2.367e-03 30.95 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1094 on 23 degrees of freedom
## Multiple R-squared: 0.9766, Adjusted R-squared: 0.9755
## F-statistic: 958.1 on 1 and 23 DF, p-value: < 2.2e-16
Note how our statistics changed. Specifically: our \(\beta\) estimate (which is also \(r\)) is now 0.07325, which is a good estimate of the intrinsic growth rate. Even better, we now have a standard error around that estimate: 0.00237, which means our growth rate can be written as: \(\widehat{r} = (7.32 \pm 0.47)\times 10^{-2}\). We can convert that to an annual growth rate in R as well (do NOT worry about understanding the following code:
r.hat <- summary(logotter.fit)$coefficients[2,1]
r.sd <- summary(logotter.fit)$coefficients[2,2]
exp(c(r.hat, r.hat-2*r.sd, r.hat+2*r.sd))
## [1] 1.076001 1.070920 1.081106
So our estimate of annual population growth is 7.6% (95% Confidence Interval 7.1%-8.1%). Note how much higher the \(R^2\) value is! This is (by that measure also) a much better fit.
To visualize the fit:
plot(log(count) ~ year, data = SeaOtters)
abline(logotter.fit)
Much better!
Let’s make our final graph a little nicer
plot(log(count) ~ year, main = 'Sea Otter Population',
col = 'brown', pch = 4, data = SeaOtters, lwd = 2)
abline(logotter.fit, col = "blue", lwd = 2)
The commands we’re using within the plot
function
include setting the title of the graph (main =
), changing
the color of the symbols and the line (col =
), and changing
the shape of the symbols (pch =
). There also many more
alterations you can make to get your graph perfect - but that can be a
deep dark R
abbit hole!
Finally, we can also draw the fit on non-log transformed data as follows:
plot(count ~ year, main = 'Sea Otter Population',
col = 'brown', pch = 4, data = SeaOtters, lwd = 2)
curve(exp(logotter.fit$coef[1] +
logotter.fit$coef[2] * x), col = "blue", add = TRUE, lwd = 2)
Analyze the population growth rate of sea otters in British Columbia, the population just to the north of Washington state. The data are here in a file called
BC_SeaOtters
.
- Load the data
- plot the data,
- report how many were released, in what year, how many are there in the final year of data,
- estimate the growth rate with confidence intervals.
NOTE: You can skip the section below as it is non-essential. But arguably “fun”.
ggplot
plottingIt can be a bit fussy to plot the confidence interval around our
prediction. Below, we include a little snippet of code which does just
that using ggplot
. You do not need to understand this code
(there’s actually quite a bit going on to get it to work right), but the
output is pretty.
require(ggplot2)
ggplot(SeaOtters, aes(year, count)) + geom_point() +
stat_smooth(method = "glm",
method.args = list(family = gaussian(link = "log")))
## `geom_smooth()` using formula = 'y ~ x'