First …. packages!
packages <- c("ggplot2", "ggmap", "plyr", "magrittr", "rstan", "coda", "lattice", "mapview")
sapply(packages, require, char = TRUE)
## ggplot2 ggmap plyr magrittr rstan coda lattice mapview
## TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
Beautiful lilac flowering data are here: ftp://ftp.ncdc.noaa.gov/pub/data/paleo/phenology/north_america_lilac.txt
I tidied the data and posted them to the website here:
ll <- read.csv("data/LilacLocations.csv")
lp <- read.csv("data/LilacPhenology.csv")
The lilac locations have latitudes, longitudes and elevations:
head(ll)
## X ID STATION State Lat Long Elev Nobs
## 1 1 20309 APACHE POWDER COMPANY AZ 31.54 -110.16 1125.00 9
## 2 2 20958 BOWIE AZ 32.20 -109.29 1145.12 6
## 3 3 21101 BURRUS RANCH AZ 35.16 -111.32 2073.17 9
## 4 4 21614 CHILDS AZ 34.21 -111.42 807.93 6
## 5 5 21634 CHINLE AZ 36.09 -109.32 1688.41 10
## 6 6 21749 CIBECUE AZ 34.03 -110.27 1615.85 8
The lilac phenology contains, importantly, the day of blooming (our response variable) across years at each station.
head(lp)
## STATION YEAR SPECIES LEAF BLOOM
## 1 20309 1957 2 NA 67
## 2 20309 1958 2 NA 90
## 3 20309 1959 2 NA 92
## 4 20309 1960 2 NA 79
## 5 20309 1961 2 NA 78
## 6 20309 1962 2 NA 88
Map all these data with mapview():
require(mapview); require(sf)
ll.sf <- st_as_sf(ll, coords = c("Long","Lat"), crs = st_crs(4326))
mapview(ll.sf)
#basemap1 <- get_map(location = c(-112, 38), maptype = "terrain", zoom=4)
#ggmap(basemap1) + geom_point(data = ll, mapping = aes(x = Long, y = Lat, colour = Elev, size = Nobs))
Pick out a few widely scattered stations (try this and all of the following with your own set!)
Stations <- c(456624, 426357, 354147, 456974)
subset(ll, ID %in% Stations)
## X ID STATION State Lat Long Elev Nobs
## 779 779 354147 IMNAHA OR 45.34 -116.50 564.02 30
## 894 894 426357 OAK CITY UT 39.23 -112.20 1547.26 32
## 986 986 456624 PORT ANGELES WA 48.07 -123.26 60.98 28
## 994 994 456974 REPUBLIC WA 48.39 -118.44 792.68 35
and tidy the data:
lilacs <- subset(lp, STATION %in% Stations)
lilacs <- lilacs[!is.na(lilacs$BLOOM),]
Plot some trends over time, with a “smoother” in ggplot:
require(ggplot2)
ggplot(lilacs, aes(x=YEAR, y=BLOOM)) +
geom_point() +
facet_grid(.~STATION) +
stat_smooth()
Note: It is easy and seductive to make figures like the above, but an important problem with any black-box method (like stat_smooth()) is that the actual tool is obscured. In this case, the smoother is “loess” (localized regression). Replicating the figure in base plot is a good way to confirm that you know what’s actually going on.
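If you run the loop below interactively, it may help to set up a 2 by 2 plotting grid first so all four station plots appear together (an optional convenience, not part of the original code):
par(mfrow = c(2, 2))  # arrange the next four base plots in a 2 x 2 grid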
for(s in Stations)
{
years <- 1955:2000
mydata <- subset(lilacs, STATION==s)
with(mydata, plot(YEAR, BLOOM, main=s, xlim=range(years)))
myloess <- loess(BLOOM~YEAR, data=mydata)
myloess.predict <- predict(myloess, data.frame(YEAR = years), se=TRUE)
with(myloess.predict, lines(years, fit))
with(myloess.predict, lines(years, fit - 2*se.fit, col="grey"))
with(myloess.predict, lines(years, fit + 2*se.fit, col="grey"))
}
Here are our selected stations:
ll.mystations <- (ll.sf %>% subset(ID %in% Stations))[,"STATION"]
mapview(ll.mystations, map.types = "Esri.WorldPhysical")
It looks like the blooming time of lilacs since 1980 has started happening earlier and earlier. We propose a hockeystick model, i.e.: \[ X_{i,j}(T) = \begin{cases} \beta_0 + \epsilon_{ij} & T < \tau \\ \beta_0 + \beta_1(T - \tau) + \epsilon_{ij} & T \geq \tau \end{cases} \]
Encode the model in R (note the ifelse() function):
getX.hat <- function(T, beta0, beta1, tau)
ifelse(T < tau, beta0, beta0 + beta1 * (T-tau))
Guess some values, and see how it looks. We’ll focus on just one station, and make things easier by extracting that station’s data:
l1 <- subset(lilacs, STATION == Stations[1])
p <- c(120, -2, 1980, 1)
plot(BLOOM~YEAR, pch = 19, col="grey", data = l1)
lines(l1$YEAR, getX.hat(l1$YEAR, beta0=120, beta1 = -2, tau = 1980), col=2, lwd=2)
Ok, before we run off to analyze this model, let’s first encode a simple linear regression in STAN.
The basic idea of using RSTAN is: you write the model in STAN’s own modeling language, which has a pseudo-R, pseudo-C syntax; compile it; and then sample from it. The specifications and examples are given in the very useful manual here: http://mc-stan.org/manual.html
Here is an example of the linear model, coded in STAN:
data {
int<lower=0> n; // number of observations
vector[n] y; // Y Vector
vector[n] x; // X Vector
}
parameters {
real beta0; // intercept
real beta1; // slope
real<lower=0> sigma; // standard deviation
}
model {
//Declare variable
vector[n] yhat;
//Priors
sigma ~ inv_gamma(0.001, 0.001);
beta0 ~ normal(0,1e2);
beta1 ~ normal(0,1e2);
//model
yhat = beta0 + beta1 * (x - mean(x));
y ~ normal(yhat, sigma);
}
There are several pieces (called “blocks”) here to go over.
The data block declares the sample size n explicitly, because it is used later in the model specification.
In the model block, the yhat variable is one more vector that must be defined, and all variables must be declared at the beginning of the block. You could also define yhat inside a loop (for (i in 1:n) yhat[i] = beta0 + beta1 * (x[i] - mean(x));) to illustrate the loop syntax, but in fact STAN is quite good at vectorization, and that loop can be replaced simply with: yhat = beta0 + beta1 * (x - mean(x));
In principle STAN code is straightforward, but there are always traps and pitfalls that take a while to debug (and involve a fair amount of looking at the manual and googling errors).
Note that the priors I selected were fairly uninformative. The inverse gamma distribution with low values of \(\alpha\) and \(\beta\) is a common choice for a scale parameter such as the variance or standard deviation, which must be positive.
The big step (the SLOW step) is the transformation of the STAN specification to C++ code, and then the compilation of that code. This is done with the stan_model command. In theory you could do this in-line, but the best way is to save the STAN code (e.g., save the STAN code above as lm.stan in your working directory) and call it from R as follows.
require(rstan)
lm_stan <- stan_model(file = "lm.stan", auto_write = TRUE)
This takes a few minutes to perform on my computer. Compiled versions (of all the Stan models in this lab) can be found here: http://faculty.washington.edu/eliezg/teaching/StatR503/stanmodels/
To actually run the MCMC on the data and the model you use the sampling command. A few things to note: Data must be a named list, with variable names and types corresponding strictly to the definitions in the data block of the stan model.
l1 <- subset(lilacs, STATION == Stations[1])
Data <- list(n = nrow(l1), x = l1$YEAR, y = l1$BLOOM)
lm.fit <- sampling(lm_stan, data = Data)
##
## SAMPLING FOR MODEL 'lm' NOW (CHAIN 1).
## Chain 1:
## Chain 1: Gradient evaluation took 1.9e-05 seconds
## Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0.19 seconds.
## Chain 1: Adjust your expectations accordingly!
## Chain 1:
## Chain 1:
## Chain 1: Iteration: 1 / 2000 [ 0%] (Warmup)
## Chain 1: Iteration: 200 / 2000 [ 10%] (Warmup)
## Chain 1: Iteration: 400 / 2000 [ 20%] (Warmup)
## Chain 1: Iteration: 600 / 2000 [ 30%] (Warmup)
## Chain 1: Iteration: 800 / 2000 [ 40%] (Warmup)
## Chain 1: Iteration: 1000 / 2000 [ 50%] (Warmup)
## Chain 1: Iteration: 1001 / 2000 [ 50%] (Sampling)
## Chain 1: Iteration: 1200 / 2000 [ 60%] (Sampling)
## Chain 1: Iteration: 1400 / 2000 [ 70%] (Sampling)
## Chain 1: Iteration: 1600 / 2000 [ 80%] (Sampling)
## Chain 1: Iteration: 1800 / 2000 [ 90%] (Sampling)
## Chain 1: Iteration: 2000 / 2000 [100%] (Sampling)
## Chain 1:
## Chain 1: Elapsed Time: 0.043052 seconds (Warm-up)
## Chain 1: 0.04444 seconds (Sampling)
## Chain 1: 0.087492 seconds (Total)
## Chain 1:
##
## SAMPLING FOR MODEL 'lm' NOW (CHAIN 2).
## Chain 2:
## Chain 2: Gradient evaluation took 8e-06 seconds
## Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 0.08 seconds.
## Chain 2: Adjust your expectations accordingly!
## Chain 2:
## Chain 2:
## Chain 2: Iteration: 1 / 2000 [ 0%] (Warmup)
## Chain 2: Iteration: 200 / 2000 [ 10%] (Warmup)
## Chain 2: Iteration: 400 / 2000 [ 20%] (Warmup)
## Chain 2: Iteration: 600 / 2000 [ 30%] (Warmup)
## Chain 2: Iteration: 800 / 2000 [ 40%] (Warmup)
## Chain 2: Iteration: 1000 / 2000 [ 50%] (Warmup)
## Chain 2: Iteration: 1001 / 2000 [ 50%] (Sampling)
## Chain 2: Iteration: 1200 / 2000 [ 60%] (Sampling)
## Chain 2: Iteration: 1400 / 2000 [ 70%] (Sampling)
## Chain 2: Iteration: 1600 / 2000 [ 80%] (Sampling)
## Chain 2: Iteration: 1800 / 2000 [ 90%] (Sampling)
## Chain 2: Iteration: 2000 / 2000 [100%] (Sampling)
## Chain 2:
## Chain 2: Elapsed Time: 0.052391 seconds (Warm-up)
## Chain 2: 0.042902 seconds (Sampling)
## Chain 2: 0.095293 seconds (Total)
## Chain 2:
##
## SAMPLING FOR MODEL 'lm' NOW (CHAIN 3).
## Chain 3:
## Chain 3: Gradient evaluation took 8e-06 seconds
## Chain 3: 1000 transitions using 10 leapfrog steps per transition would take 0.08 seconds.
## Chain 3: Adjust your expectations accordingly!
## Chain 3:
## Chain 3:
## Chain 3: Iteration: 1 / 2000 [ 0%] (Warmup)
## Chain 3: Iteration: 200 / 2000 [ 10%] (Warmup)
## Chain 3: Iteration: 400 / 2000 [ 20%] (Warmup)
## Chain 3: Iteration: 600 / 2000 [ 30%] (Warmup)
## Chain 3: Iteration: 800 / 2000 [ 40%] (Warmup)
## Chain 3: Iteration: 1000 / 2000 [ 50%] (Warmup)
## Chain 3: Iteration: 1001 / 2000 [ 50%] (Sampling)
## Chain 3: Iteration: 1200 / 2000 [ 60%] (Sampling)
## Chain 3: Iteration: 1400 / 2000 [ 70%] (Sampling)
## Chain 3: Iteration: 1600 / 2000 [ 80%] (Sampling)
## Chain 3: Iteration: 1800 / 2000 [ 90%] (Sampling)
## Chain 3: Iteration: 2000 / 2000 [100%] (Sampling)
## Chain 3:
## Chain 3: Elapsed Time: 0.053043 seconds (Warm-up)
## Chain 3: 0.040843 seconds (Sampling)
## Chain 3: 0.093886 seconds (Total)
## Chain 3:
##
## SAMPLING FOR MODEL 'lm' NOW (CHAIN 4).
## Chain 4:
## Chain 4: Gradient evaluation took 5.8e-05 seconds
## Chain 4: 1000 transitions using 10 leapfrog steps per transition would take 0.58 seconds.
## Chain 4: Adjust your expectations accordingly!
## Chain 4:
## Chain 4:
## Chain 4: Iteration: 1 / 2000 [ 0%] (Warmup)
## Chain 4: Iteration: 200 / 2000 [ 10%] (Warmup)
## Chain 4: Iteration: 400 / 2000 [ 20%] (Warmup)
## Chain 4: Iteration: 600 / 2000 [ 30%] (Warmup)
## Chain 4: Iteration: 800 / 2000 [ 40%] (Warmup)
## Chain 4: Iteration: 1000 / 2000 [ 50%] (Warmup)
## Chain 4: Iteration: 1001 / 2000 [ 50%] (Sampling)
## Chain 4: Iteration: 1200 / 2000 [ 60%] (Sampling)
## Chain 4: Iteration: 1400 / 2000 [ 70%] (Sampling)
## Chain 4: Iteration: 1600 / 2000 [ 80%] (Sampling)
## Chain 4: Iteration: 1800 / 2000 [ 90%] (Sampling)
## Chain 4: Iteration: 2000 / 2000 [100%] (Sampling)
## Chain 4:
## Chain 4: Elapsed Time: 0.052797 seconds (Warm-up)
## Chain 4: 0.044999 seconds (Sampling)
## Chain 4: 0.097796 seconds (Total)
## Chain 4:
This takes under a second to perform on my machine.
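As an aside (an optional tweak, not used in the run above), the chain-by-chain progress output can be suppressed by passing refresh = 0 to sampling():
lm.fit <- sampling(lm_stan, data = Data, refresh = 0)  # same fit, quiet output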
A brief summary of the results:
lm.fit
## Inference for Stan model: lm.
## 4 chains, each with iter=2000; warmup=1000; thin=1;
## post-warmup draws per chain=1000, total post-warmup draws=4000.
##
## mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
## beta0 116.30 0.03 1.69 112.94 115.14 116.31 117.45 119.59 3041 1
## beta1 -0.60 0.00 0.14 -0.88 -0.70 -0.60 -0.51 -0.33 3735 1
## sigma 9.11 0.02 1.32 6.99 8.15 8.96 9.92 12.09 3045 1
## lp__ -76.24 0.03 1.22 -79.37 -76.81 -75.94 -75.35 -74.80 1825 1
##
## Samples were drawn using NUTS(diag_e) at Wed Oct 12 17:55:32 2022.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
You can see from the output that the regression coefficients are around 116.3 and -0.6, with a residual standard deviation of about 9.1. It is easy (of course) to compare these results with a straightforward OLS:
summary(lm(BLOOM ~ I(YEAR-mean(YEAR)), data = l1))
##
## Call:
## lm(formula = BLOOM ~ I(YEAR - mean(YEAR)), data = l1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.565 -5.131 -1.802 3.396 19.382
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 116.3571 1.6717 69.606 < 2e-16 ***
## I(YEAR - mean(YEAR)) -0.6052 0.1374 -4.405 0.000162 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.846 on 26 degrees of freedom
## Multiple R-squared: 0.4273, Adjusted R-squared: 0.4053
## F-statistic: 19.4 on 1 and 26 DF, p-value: 0.0001617
The estimate of standard deviation is given by:
summary(lm(BLOOM ~ I(YEAR-mean(YEAR)), data = l1))$sigma
## [1] 8.845546
Should be a pretty good match.
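To put the two fits side by side, one quick option (a sketch using rstan’s summary() method) is to pull the posterior means and standard deviations directly from the fitted object and compare them with the lm() table above:
summary(lm.fit)$summary[c("beta0", "beta1", "sigma"), c("mean", "sd")]  # posterior means and sds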
To apply the coda family of diagnostic tools, you need to extract the chains from the STAN fitted object and re-create them as an mcmc.list. This is done with the As.mcmc.list command.
require(coda)
lm.chains <- As.mcmc.list(lm.fit, pars = c("beta0", "beta1", "sigma"))
plot(lm.chains)
You can see that the chains seem fairly well behaved, i.e. they converge and the densities are consistent between chains. Here is a nicer looking plot of the posterior distributions:
require(lattice)
densityplot(lm.chains)
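Beyond eyeballing the chains, the coda package also provides numerical diagnostics; a minimal sketch:
gelman.diag(lm.chains)    # potential scale reduction factors (should be close to 1)
effectiveSize(lm.chains)  # effective sample size for each parameter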
Ok, now we will try to fit this hockeystick model to one of these datasets. Note that the version below basically adjusts the lm.stan model with just a few tweaks, adding that important step function via the STAN function int_step (which is 0 when its argument is less than or equal to 0, and 1 when its argument is greater than 0):
data {
int<lower=0> n; // number of observations
vector[n] y; // Y Vector
vector[n] x; // X Vector
}
parameters {
real beta0; // intercept
real beta1; // slope
real<lower=min(x), upper=max(x)> tau; // changepoint
real<lower=0> sigma; // standard deviation
}
model {
vector[n] yhat;
// Priors
sigma ~ inv_gamma(0.001, 0.001);
beta0 ~ normal(0,1000);
beta1 ~ normal(0, 1000);
tau ~ uniform(min(x), max(x));
// Define the mean
for(i in 1:n)
yhat[i] = beta0 + int_step(x[i]-tau)*beta1*(x[i] - tau);
// likelihood / probability model
y ~ normal(yhat, sigma);
}
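For intuition, int_step() behaves like the following small R helper (a hypothetical analogue, shown only to confirm that the mean in the model block is the same hockeystick as getX.hat()):
int_step <- function(z) as.numeric(z > 0)  # 0 for z <= 0, 1 for z > 0
# beta0 + int_step(x - tau) * beta1 * (x - tau) reproduces getX.hat(x, beta0, beta1, tau)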
Compile:
hockeystick_stan <- stan_model(file = "stanmodels/hockeystick.stan")
hockeystick.fit <- sampling(hockeystick_stan, Data)
##
## SAMPLING FOR MODEL 'hockeystick' NOW (CHAIN 1).
## Chain 1:
## Chain 1: Gradient evaluation took 2.9e-05 seconds
## Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0.29 seconds.
## Chain 1: Adjust your expectations accordingly!
## Chain 1:
## Chain 1:
## Chain 1: Iteration: 1 / 2000 [ 0%] (Warmup)
## Chain 1: Iteration: 200 / 2000 [ 10%] (Warmup)
## Chain 1: Iteration: 400 / 2000 [ 20%] (Warmup)
## Chain 1: Iteration: 600 / 2000 [ 30%] (Warmup)
## Chain 1: Iteration: 800 / 2000 [ 40%] (Warmup)
## Chain 1: Iteration: 1000 / 2000 [ 50%] (Warmup)
## Chain 1: Iteration: 1001 / 2000 [ 50%] (Sampling)
## Chain 1: Iteration: 1200 / 2000 [ 60%] (Sampling)
## Chain 1: Iteration: 1400 / 2000 [ 70%] (Sampling)
## Chain 1: Iteration: 1600 / 2000 [ 80%] (Sampling)
## Chain 1: Iteration: 1800 / 2000 [ 90%] (Sampling)
## Chain 1: Iteration: 2000 / 2000 [100%] (Sampling)
## Chain 1:
## Chain 1: Elapsed Time: 0.145522 seconds (Warm-up)
## Chain 1: 0.131413 seconds (Sampling)
## Chain 1: 0.276935 seconds (Total)
## Chain 1:
##
## SAMPLING FOR MODEL 'hockeystick' NOW (CHAIN 2).
## Chain 2:
## Chain 2: Gradient evaluation took 1.1e-05 seconds
## Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 0.11 seconds.
## Chain 2: Adjust your expectations accordingly!
## Chain 2:
## Chain 2:
## Chain 2: Iteration: 1 / 2000 [ 0%] (Warmup)
## Chain 2: Iteration: 200 / 2000 [ 10%] (Warmup)
## Chain 2: Iteration: 400 / 2000 [ 20%] (Warmup)
## Chain 2: Iteration: 600 / 2000 [ 30%] (Warmup)
## Chain 2: Iteration: 800 / 2000 [ 40%] (Warmup)
## Chain 2: Iteration: 1000 / 2000 [ 50%] (Warmup)
## Chain 2: Iteration: 1001 / 2000 [ 50%] (Sampling)
## Chain 2: Iteration: 1200 / 2000 [ 60%] (Sampling)
## Chain 2: Iteration: 1400 / 2000 [ 70%] (Sampling)
## Chain 2: Iteration: 1600 / 2000 [ 80%] (Sampling)
## Chain 2: Iteration: 1800 / 2000 [ 90%] (Sampling)
## Chain 2: Iteration: 2000 / 2000 [100%] (Sampling)
## Chain 2:
## Chain 2: Elapsed Time: 0.129066 seconds (Warm-up)
## Chain 2: 0.109979 seconds (Sampling)
## Chain 2: 0.239045 seconds (Total)
## Chain 2:
##
## SAMPLING FOR MODEL 'hockeystick' NOW (CHAIN 3).
## Chain 3:
## Chain 3: Gradient evaluation took 1.7e-05 seconds
## Chain 3: 1000 transitions using 10 leapfrog steps per transition would take 0.17 seconds.
## Chain 3: Adjust your expectations accordingly!
## Chain 3:
## Chain 3:
## Chain 3: Iteration: 1 / 2000 [ 0%] (Warmup)
## Chain 3: Iteration: 200 / 2000 [ 10%] (Warmup)
## Chain 3: Iteration: 400 / 2000 [ 20%] (Warmup)
## Chain 3: Iteration: 600 / 2000 [ 30%] (Warmup)
## Chain 3: Iteration: 800 / 2000 [ 40%] (Warmup)
## Chain 3: Iteration: 1000 / 2000 [ 50%] (Warmup)
## Chain 3: Iteration: 1001 / 2000 [ 50%] (Sampling)
## Chain 3: Iteration: 1200 / 2000 [ 60%] (Sampling)
## Chain 3: Iteration: 1400 / 2000 [ 70%] (Sampling)
## Chain 3: Iteration: 1600 / 2000 [ 80%] (Sampling)
## Chain 3: Iteration: 1800 / 2000 [ 90%] (Sampling)
## Chain 3: Iteration: 2000 / 2000 [100%] (Sampling)
## Chain 3:
## Chain 3: Elapsed Time: 0.140746 seconds (Warm-up)
## Chain 3: 0.117407 seconds (Sampling)
## Chain 3: 0.258153 seconds (Total)
## Chain 3:
##
## SAMPLING FOR MODEL 'hockeystick' NOW (CHAIN 4).
## Chain 4:
## Chain 4: Gradient evaluation took 1.7e-05 seconds
## Chain 4: 1000 transitions using 10 leapfrog steps per transition would take 0.17 seconds.
## Chain 4: Adjust your expectations accordingly!
## Chain 4:
## Chain 4:
## Chain 4: Iteration: 1 / 2000 [ 0%] (Warmup)
## Chain 4: Iteration: 200 / 2000 [ 10%] (Warmup)
## Chain 4: Iteration: 400 / 2000 [ 20%] (Warmup)
## Chain 4: Iteration: 600 / 2000 [ 30%] (Warmup)
## Chain 4: Iteration: 800 / 2000 [ 40%] (Warmup)
## Chain 4: Iteration: 1000 / 2000 [ 50%] (Warmup)
## Chain 4: Iteration: 1001 / 2000 [ 50%] (Sampling)
## Chain 4: Iteration: 1200 / 2000 [ 60%] (Sampling)
## Chain 4: Iteration: 1400 / 2000 [ 70%] (Sampling)
## Chain 4: Iteration: 1600 / 2000 [ 80%] (Sampling)
## Chain 4: Iteration: 1800 / 2000 [ 90%] (Sampling)
## Chain 4: Iteration: 2000 / 2000 [100%] (Sampling)
## Chain 4:
## Chain 4: Elapsed Time: 0.141175 seconds (Warm-up)
## Chain 4: 0.126034 seconds (Sampling)
## Chain 4: 0.267209 seconds (Total)
## Chain 4:
This takes a little bit longer to fit, but still just on the order of seconds.
Look at the estimates:
hockeystick.fit
## Inference for Stan model: hockeystick.
## 4 chains, each with iter=2000; warmup=1000; thin=1;
## post-warmup draws per chain=1000, total post-warmup draws=4000.
##
## mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
## beta0 123.13 0.04 1.84 119.45 121.93 123.13 124.35 126.78 2459 1
## beta1 -1.99 0.02 0.55 -3.11 -2.29 -1.94 -1.62 -1.10 997 1
## tau 1979.98 0.07 2.51 1974.64 1978.57 1980.25 1981.61 1984.29 1452 1
## sigma 7.12 0.02 1.11 5.35 6.35 6.98 7.73 9.55 2129 1
## lp__ -66.56 0.05 1.67 -70.61 -67.35 -66.21 -65.35 -64.51 989 1
##
## Samples were drawn using NUTS(diag_e) at Wed Oct 12 17:55:38 2022.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
They look quite reasonable and converged!
Let’s assess the fits:
require(lattice)
hockeystick.mcmc <- As.mcmc.list(hockeystick.fit, c("beta0", "beta1","tau","sigma"))
xyplot(hockeystick.mcmc)
densityplot(hockeystick.mcmc)
Great fits, nicely behaved chains. And you can see how the final distributions are pretty normal in the end.
The conclusion is that there was likely a structural change in the blooming of this lilac time series occurring somewhere around 1980 (between 1974 and 1984 for sure).
We can see what the point estimate of this model fit might look like, using the getX.hat() function above:
getX.hat <- function(T, beta0, beta1, tau)
ifelse(T < tau, beta0, beta0 + beta1 * (T-tau))
plot(BLOOM ~ YEAR, pch=19, col=rgb(0,0,0,.5), cex=2, data = l1)
(p.hat <- summary(hockeystick.mcmc)$quantiles[,"50%"])
## beta0 beta1 tau sigma
## 123.132467 -1.935706 1980.250508 6.981890
lines(1950:2000, getX.hat(1950:2000, p.hat['beta0'], p.hat['beta1'], p.hat['tau']), lwd=2, col="darkred")
A nice fit. You could also sample from the MCMC results to draw a 95% credible band around the fitted curve, though that is a bit trickier to do compactly. Something like this:
par(bty="l")
hockeystick.chains <- as.array(hockeystick.mcmc)
years <- 1950:2000
hockey.sims <- matrix(0, nrow=length(years), ncol=1e4)
for(i in 1:1e4)
{
b0 <- sample(hockeystick.chains[,"beta0",], 1)
b1 <- sample(hockeystick.chains[,"beta1",], 1)
tau <- sample(hockeystick.chains[,"tau",], 1)
hockey.sims[,i] <- getX.hat(years, b0, b1, tau)
}
CIs <- apply(hockey.sims, 1, quantile, p=c(0.025, 0.5, 0.975))
plot(BLOOM ~ YEAR, pch=19, col=rgb(0,0,0,.5), cex=2, data = l1)
lines(years, CIs[1,], col="red", lwd=2, lty=3)
lines(years, CIs[2,], col="red", lwd=2)
lines(years, CIs[3,], col="red", lwd=2, lty=3)
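Note that the band above summarizes uncertainty in the fitted mean curve. To get an interval for new observations you would also add the residual noise, sampling sigma from its chain as well; a sketch (following the same independent-resampling shortcut as the loop above):
hockey.pred <- matrix(0, nrow=length(years), ncol=1e4)
for(i in 1:1e4)
{
  b0 <- sample(hockeystick.chains[,"beta0",], 1)
  b1 <- sample(hockeystick.chains[,"beta1",], 1)
  tau <- sample(hockeystick.chains[,"tau",], 1)
  sigma <- sample(hockeystick.chains[,"sigma",], 1)
  # add Gaussian observation noise around the mean curve
  hockey.pred[,i] <- getX.hat(years, b0, b1, tau) + rnorm(length(years), 0, sigma)
}
PIs <- apply(hockey.pred, 1, quantile, p=c(0.025, 0.975))
lines(years, PIs[1,], col="blue", lty=3)
lines(years, PIs[2,], col="blue", lty=3)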
There is a way to fit the “hockey-stick” model to these data using the non-linear least squares function nls in R. Compare the output of the MCMC to these results:
lilacs.nls <- nls( BLOOM ~ getX.hat(YEAR, beta0, beta1, tau),
start=list(beta0=120, beta1=-2,tau=1980),
data = l1)
summary(lilacs.nls)
##
## Formula: BLOOM ~ getX.hat(YEAR, beta0, beta1, tau)
##
## Parameters:
## Estimate Std. Error t value Pr(>|t|)
## beta0 123.1875 1.7183 71.691 < 2e-16 ***
## beta1 -1.9555 0.5356 -3.651 0.00121 **
## tau 1980.1831 2.6049 760.182 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.873 on 25 degrees of freedom
##
## Number of iterations to convergence: 2
## Achieved convergence tolerance: 7.425e-09
The nls function fits the curve by minimizing the sum of squared residuals, which is equivalent to fitting a model with Gaussian errors, something we can also do by obtaining an MLE directly. The likelihood function is:
hockeystick.ll <- function(par, time, x){
x.hat <- getX.hat(time, par["beta0"], par["beta1"], par["tau"])
residual <- x - x.hat
-sum(dnorm(residual, 0, par["sigma"], log=TRUE))
}
par0 <- c(beta0 = 120, beta1 = -2, tau=1980, sigma = 7)
fit.ll <- optim(par0, hockeystick.ll, time = l1$YEAR, x = l1$BLOOM)
fit.ll$par
## beta0 beta1 tau sigma
## 120.437179 -2.133036 1981.999999 6.427344
Brief pro-tip … since \(\sigma\) is positive, you’re almost always better off estimating \(\log(\sigma)\), e.g.
hockeystick.ll <- function(par, time, x){
x.hat <- getX.hat(time, par["beta0"], par["beta1"], par["tau"])
residual <- x - x.hat
-sum(dnorm(residual, 0, exp(par["logsigma"]), log=TRUE))
}
par0 <- c(beta0 = 120, beta1 = -2, tau=1980, logsigma = 1)
fit.ll <- optim(par0, hockeystick.ll, time = l1$YEAR, x = l1$BLOOM)
fit.ll$par
## beta0 beta1 tau logsigma
## 123.266231 -1.976351 1980.347135 1.862857
“Better off” in this case means: fewer errors and a faster solution. It is also a quick way to impose the natural bound \((0, \infty)\). The estimate, in the end, is the same:
exp(fit.ll$par['logsigma'])
## logsigma
## 6.442117
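As a further check (a sketch, not part of the original output), optim() can also return the Hessian of the negative log-likelihood; the square roots of the diagonal of its inverse give approximate standard errors to set against the nls() table and the Bayesian posterior spreads:
fit.ll <- optim(par0, hockeystick.ll, time = l1$YEAR, x = l1$BLOOM, hessian = TRUE)
se <- sqrt(diag(solve(fit.ll$hessian)))  # approximate s.e.'s for beta0, beta1, tau, logsigma
round(se, 2)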
Let’s now try to fit a hierarchical model, in which we acknowledge that there are lots of sites with variability across them, and we want to see how variable the main timing parameters (flat intercept \(\beta_0\), decreasing slope \(\beta_1\)) are across locations.
In this hierarchical model, we specify across-site distributions for the site-specific parameters, thus:
\[Y_{ij} = f(X_j | \beta_{0,i}, \beta_{1,i}, \tau, \sigma)\]
Where \(i\) refers to each site, and \(j\) refers to each year observation.
\[ \beta_0 \sim {\cal N}(\mu_{\beta 0}, \sigma_{\beta 0}); \beta_1 \sim {\cal N}(\mu_{\beta 1}, \sigma_{\beta 1})\]
We now have six hyperparameters: the mean and standard deviation across all sites for each of the two regression parameters, plus the overall changepoint and the residual standard deviation.
The complete STAN code to fit this model is below:
data {
int<lower=0> n; // number of total observations
int<lower=1> nsites; // number of sites
int<lower=1,upper=nsites> site[n]; // vector of site IDs
real y[n]; // flowering date
real x[n]; // year
}
parameters {
real mu_beta0; // mean intercept
real<lower=0> sd_beta0; // s.d. of intercept
real mu_beta1; // mean slope
real<lower=0> sd_beta1; // s.d. of slope
// overall changepoint and s.d.
real<lower = min(x), upper = max(x)> tau;
real<lower=0> sigma;
vector[nsites] beta0;
vector[nsites] beta1;
}
model {
vector[n] yhat;
// Priors
sigma ~ inv_gamma(0.001, 0.001);
mu_beta0 ~ normal(0,1e4);
sd_beta0 ~ inv_gamma(0.001, 0.001);
mu_beta1 ~ normal(0, 10);
sd_beta1 ~ inv_gamma(0.001, 0.001);
tau ~ uniform(min(x), max(x));
for (i in 1:nsites){
beta0[i] ~ normal(mu_beta0, sd_beta0);
beta1[i] ~ normal(mu_beta1, sd_beta1);
}
// The model!
for (i in 1:n)
yhat[i] = beta0[site[i]] + int_step(x[i]-tau) * beta1[site[i]] * (x[i] - tau);
// likelihood / probability model
y ~ normal(yhat, sigma);
}
Several things to note here:
The data block now includes a vector of site locations site, and an extra important variable nsites that is simply the number of unique sites.
The parameters block includes the parameters above, but will also keep track of the individual beta0 and beta1 values that are estimated [at no extra cost] for each site.
The model block specifies that the parameter values at each site are drawn from the normal distributions, and then obtains the mean function (yhat) referencing the parameter values that are particular to each site.
The STAN code here is saved under the filename hhs.stan.
Compile:
hhs_stan <- stan_model(file = "stanmodels/hhs.stan")
Let’s randomly pick 30 sites from among those with the most observations (at least 20 each), and plot them:
# only lilacs, non NA
lp2 <- subset(lp, !is.na(BLOOM) & SPECIES == 2)
somestations <- names(which(table(lp2$STATION)>20)) %>% sample(30)
lilacs <- subset(lp2, STATION %in% somestations) %>%
merge(ll, by.x = "STATION", by.y = "ID")
A map:
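The code for this map was not echoed; a minimal sketch, reusing the ll.sf object built earlier, might look like this:
mapview(subset(ll.sf, ID %in% somestations), map.types = "Esri.WorldPhysical")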
First, we have to define the data, recalling that every variable declared in the data block of the model must be present:
Data <- with(lilacs, list(x = YEAR, y = BLOOM, n = length(YEAR),
site = STATION %>% factor %>% as.integer,
nsites = length(unique(STATION))))
This dataset is longer: in my sample of sites, there are over 700 rows. Also, the model is more complex and the sampling will take more time, so for now I run only 2 chains.
load(file="stanmodels/hhs.stanmodel")
hhs.fit <- sampling(hhs_stan, Data, iter = 2000, chains = 2)
##
## SAMPLING FOR MODEL 'hhs' NOW (CHAIN 1).
## Chain 1:
## Chain 1: Gradient evaluation took 0.000147 seconds
## Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 1.47 seconds.
## Chain 1: Adjust your expectations accordingly!
## Chain 1:
## Chain 1:
## Chain 1: Iteration: 1 / 2000 [ 0%] (Warmup)
## Chain 1: Iteration: 200 / 2000 [ 10%] (Warmup)
## Chain 1: Iteration: 400 / 2000 [ 20%] (Warmup)
## Chain 1: Iteration: 600 / 2000 [ 30%] (Warmup)
## Chain 1: Iteration: 800 / 2000 [ 40%] (Warmup)
## Chain 1: Iteration: 1000 / 2000 [ 50%] (Warmup)
## Chain 1: Iteration: 1001 / 2000 [ 50%] (Sampling)
## Chain 1: Iteration: 1200 / 2000 [ 60%] (Sampling)
## Chain 1: Iteration: 1400 / 2000 [ 70%] (Sampling)
## Chain 1: Iteration: 1600 / 2000 [ 80%] (Sampling)
## Chain 1: Iteration: 1800 / 2000 [ 90%] (Sampling)
## Chain 1: Iteration: 2000 / 2000 [100%] (Sampling)
## Chain 1:
## Chain 1: Elapsed Time: 10.8029 seconds (Warm-up)
## Chain 1: 5.43867 seconds (Sampling)
## Chain 1: 16.2416 seconds (Total)
## Chain 1:
##
## SAMPLING FOR MODEL 'hhs' NOW (CHAIN 2).
## Chain 2:
## Chain 2: Gradient evaluation took 0.000118 seconds
## Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 1.18 seconds.
## Chain 2: Adjust your expectations accordingly!
## Chain 2:
## Chain 2:
## Chain 2: Iteration: 1 / 2000 [ 0%] (Warmup)
## Chain 2: Iteration: 200 / 2000 [ 10%] (Warmup)
## Chain 2: Iteration: 400 / 2000 [ 20%] (Warmup)
## Chain 2: Iteration: 600 / 2000 [ 30%] (Warmup)
## Chain 2: Iteration: 800 / 2000 [ 40%] (Warmup)
## Chain 2: Iteration: 1000 / 2000 [ 50%] (Warmup)
## Chain 2: Iteration: 1001 / 2000 [ 50%] (Sampling)
## Chain 2: Iteration: 1200 / 2000 [ 60%] (Sampling)
## Chain 2: Iteration: 1400 / 2000 [ 70%] (Sampling)
## Chain 2: Iteration: 1600 / 2000 [ 80%] (Sampling)
## Chain 2: Iteration: 1800 / 2000 [ 90%] (Sampling)
## Chain 2: Iteration: 2000 / 2000 [100%] (Sampling)
## Chain 2:
## Chain 2: Elapsed Time: 11.7383 seconds (Warm-up)
## Chain 2: 7.088 seconds (Sampling)
## Chain 2: 18.8263 seconds (Total)
## Chain 2:
Note that the output of hhs.fit shows estimates for the six hyperparameters of interest, as well as values of the two parameters (beta0 and beta1) estimated for EACH site. We won’t show those here, but here are the main ones:
print(hhs.fit, pars = c("mu_beta0", "sd_beta0", "mu_beta1","sd_beta1", "tau", "sigma"))
## Inference for Stan model: hhs.
## 2 chains, each with iter=2000; warmup=1000; thin=1;
## post-warmup draws per chain=1000, total post-warmup draws=2000.
##
## mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff
## mu_beta0 120.06 0.13 4.46 111.30 117.15 120.01 122.94 128.59 1206
## sd_beta0 24.77 0.08 3.39 19.14 22.37 24.44 26.91 32.44 1603
## mu_beta1 -0.42 0.01 0.11 -0.64 -0.49 -0.41 -0.34 -0.22 426
## sd_beta1 0.23 0.01 0.11 0.05 0.16 0.22 0.30 0.47 120
## tau 1973.60 0.17 3.21 1964.98 1972.23 1974.43 1975.55 1978.48 356
## sigma 9.49 0.01 0.26 9.01 9.32 9.48 9.66 10.08 643
## Rhat
## mu_beta0 1.00
## sd_beta0 1.00
## mu_beta1 1.00
## sd_beta1 1.00
## tau 1.00
## sigma 1.01
##
## Samples were drawn using NUTS(diag_e) at Wed Oct 12 17:56:29 2022.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
Visualizing the chains:
hhs.mcmc <- As.mcmc.list(hhs.fit, pars = c("mu_beta0", "sd_beta0", "mu_beta1","sd_beta1", "tau", "sigma"))
xyplot(hhs.mcmc)
densityplot(hhs.mcmc)
You might (depending on your specific results) see that the question of “convergence” is a little bit more open here. But certainly the wide variation in the mean intercept \(\beta_0\) is notable, as is the uniformly negative slope \(\beta_1\).
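One way to look at the site-to-site spread directly (a sketch using rstan’s default plotting method for stanfit objects) is an interval plot of the site-level slopes:
plot(hhs.fit, pars = "beta1")  # posterior intervals of beta1 for each site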
We’re going to do one more twist on the model, and that is to try to narrow the rather large variation in the intercept by using our latitude and elevation information. In this model, the intercept (\(\beta_0\)) will be explained by elevation and latitude: the higher and further north you are, the later flowers tend to bloom. So in this model, everything will be the same as above, except that rather than have \(\beta_0 \sim {\cal N}(\mu, \sigma)\), it is going to be a regression model:
\[\beta_0 \sim {\cal N}({\bf X} \beta, \sigma)\]
where \({\bf X}\) represents a design matrix of the form:
\[\left[\begin{array}{ccc} 1 & E_1 & L_1 \\ 1 & E_2 & L_2 \\ 1 & E_3 & L_3 \\ ... & ... & ...\\ \end{array}\right]\]
where \(E\) and \(L\) represent elevation and latitude.
Coding this adds a few quirks to the STAN code, but surprisingly not many more lines. See the complete code here: https://terpconnect.umd.edu/~egurarie/teaching/Biol709/stanmodels/hhswithcovar.stan.
## hhswc_stan <- stan_model("hhswithcovar.stan")
Here is a quick way to get a covariate matrix. Note that I center the covariates, so that the intercept is the mean value and the slopes give the number of days PER degree latitude and PER 1 km of elevation:
covars <- with(lilacs, model.matrix(BLOOM ~ I(Lat - mean(Lat)) + I((Elev - mean(Elev))/1000)))
head(covars)
## (Intercept) I(Lat - mean(Lat)) I((Elev - mean(Elev))/1000)
## 1 1 -8.921981 -0.1042402
## 2 1 -8.921981 -0.1042402
## 3 1 -8.921981 -0.1042402
## 4 1 -8.921981 -0.1042402
## 5 1 -8.921981 -0.1042402
## 6 1 -8.921981 -0.1042402
Which we now include in our Data list:
Data <- with(lilacs,
list(x = YEAR, y = BLOOM, n = length(YEAR),
site = STATION %>% factor %>% as.integer,
nsites = length(unique(STATION)),
covars = covars, ncovars = ncol(covars)))
And sample:
hhswc.fit <- sampling(hhswc_stan, Data, chains = 4, iter = 2000)
pars <- names(hhswc.fit)[1:8]
hhswc.mcmc <- As.mcmc.list(hhswc.fit, pars)
summary(hhswc.mcmc)
##
## Iterations = 501:1000
## Thinning interval = 1
## Number of chains = 3
## Sample size per chain = 500
##
## 1. Empirical mean and standard deviation for each variable,
## plus standard error of the mean:
##
## Mean SD Naive SE Time-series SE
## mu_beta0[1] 132.6370 1.2912 0.033339 0.084821
## mu_beta0[2] 4.3794 0.5781 0.014927 0.020174
## mu_beta0[3] 29.5097 2.6927 0.069525 0.222257
## sd_beta0 2.6039 0.4218 0.010891 0.030113
## mu_beta1 -0.5833 0.1760 0.004545 0.018254
## sd_beta1 0.1882 0.1021 0.002636 0.012836
## tau 1977.4526 2.9510 0.076195 0.281821
## sigma 8.9201 0.2245 0.005797 0.005526
##
## 2. Quantiles for each variable:
##
## 2.5% 25% 50% 75% 97.5%
## mu_beta0[1] 130.18851 131.7818 132.5883 133.4834 135.2406
## mu_beta0[2] 3.24255 3.9790 4.3807 4.7492 5.5361
## mu_beta0[3] 24.24777 27.7886 29.5411 31.0419 35.0936
## sd_beta0 1.94076 2.2864 2.5657 2.8613 3.5581
## mu_beta1 -0.97284 -0.6978 -0.5485 -0.4604 -0.3066
## sd_beta1 0.04364 0.1058 0.1762 0.2490 0.4150
## tau 1972.55455 1975.2656 1976.8866 1979.8852 1982.3789
## sigma 8.49819 8.7766 8.9151 9.0659 9.3666
xyplot(hhswc.mcmc)
densityplot(hhswc.mcmc)
Note that this model provides much more consistent, converging results! This is because more data were leveraged, and the response is clearly a strong one. In the output above, the “base” mean bloom day is around 133, the latitude effect is about +4.4 days per degree latitude, and the elevation effect is about +30 days for every 1000 m of elevation. The slope is still clearly negative, and the mean transition year is somewhere in the mid-to-late 1970s.
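For a concrete sense of scale (a back-of-the-envelope calculation using the posterior medians printed above, roughly 4.4 days per degree latitude and 29.5 days per km of elevation), a station 5 degrees further north and 500 m higher than the sample average would be expected to bloom roughly:
4.4 * 5 + 29.5 * 0.5   # about 37 days later, all else equal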