
How to model just about anything
(but especially habitat)

EFB 390: Wildlife Ecology and Management

Dr. Elie Gurarie

September 28, 2023


Super fast primer on statistical modeling

Everything you need to know to do 95% of all wildlife modeling in less than an hour and FOUR (or FIVE) easy steps!!

I. Linear modeling

II. Multivariate modeling

III. Model selection

IV. Generalized linear modeling

  • Poisson; Binomial

V. Prediction


Step I: Linear modeling

... is a very general method for quantifying relationships among variables.

X_i is called:

  • covariate
  • independent variable
  • explanatory variable

Y_i is the property we are interested in modeling:

  • response variable
  • dependent variable

Note: There actually can be interest in wildlife studies to have models for length and weight, since length is easy to measure (e.g. from drones), but weight tells us more about physical condition and energetics.

Steller sea lion (Eumetopias jubatus) pups.


Linear Models

Deterministic:

Y_i = a + b X_i

a - intercept; b - slope

Probabilistic:

Y_i = α + β X_i + ϵ_i

α - intercept; β - slope; ϵ - randomness!: ϵ_i ~ N(0, σ)
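A minimal R sketch of what the probabilistic version means, with made-up values of α, β, and σ (an illustration, not the pup data):

```
# Simulate data from Y_i = alpha + beta * X_i + eps_i, with eps_i ~ N(0, sigma)
# (alpha, beta, sigma and the Length range are made up for illustration)
set.seed(1)
alpha <- -49; beta <- 0.75; sigma <- 3
Length <- runif(100, 95, 125)                     # 100 hypothetical lengths (cm)
Weight <- alpha + beta * Length + rnorm(100, 0, sigma)
plot(Length, Weight)                              # points scatter around a straight line
```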


Fitting linear models is very easy in R!

Point Estimate

This command fits a model:

lm(Weight ~ Length, data = pups)
##
## Call:
## lm(formula = Weight ~ Length, data = pups)
##
## Coefficients:
## (Intercept) Length
## -49.1422 0.7535

So for each 1 cm of length, add another 754 grams, i.e. β̂ = 0.754.

my_model <- lm(Weight ~ Length, data = pups)  # store the fitted model
plot(Weight ~ Length, data = pups)
abline(my_model)

The abline() call adds a line with intercept a and slope b to the figure.


Statistical inference

Statistical inference is the science / art of observing something from a portion of a population and making statements about the entire population.

In practice - this is done by taking data and estimating parameters of a model. (This is also called fitting a model).

Two related goals:

  1. obtaining a point estimate and a confidence interval (precision) of the parameter estimate.
  2. Assessing whether particular (combinations of) factors, i.e. models, provide any explanatory power.

This is (almost always) done using Maximum Likelihood Estimation, i.e. an algorithm searches through possible values of the parameters that make the model MOST LIKELY (have the highest probability) given the data.
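A quick sketch of what "most likely" means in R, assuming the pup weight-length model from the following slides (the hand-coded likelihood is just for illustration):

```
# The fitted model's (maximized) log-likelihood
my_model <- lm(Weight ~ Length, data = pups)
logLik(my_model)

# A hand-coded log-likelihood for any candidate intercept a, slope b, and sd s
ll <- function(a, b, s) sum(dnorm(pups$Weight, a + b * pups$Length, s, log = TRUE))

# At the fitted coefficients the log-likelihood is (essentially) as high as it gets;
# nudging the slope away from the estimate makes it drop
ll(coef(my_model)[1], coef(my_model)[2], sigma(my_model))
ll(coef(my_model)[1], coef(my_model)[2] + 0.1, sigma(my_model))
```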

Another gratuitous sea lion picture.


All models have these pieces:

Y=f(X|Θ)

  • Y - response | dependent variable. The thing we want to model / predict / understand. The effect (maybe).
  • X - predictor(s) | independent variable(s) | covariate(s). The thing(s) that "explain(s)" Y. The cause (maybe).

  • f - the model structure. This includes: some deterministic functional form (linear? periodic? polynomial? exponential?) AND some probabilistic assumptions, i.e. a way to characterize the variability / randomness / unpredictability of the process.

  • Θ - the parameters of the model. There are usually some parameters associated with the predictors, and some associated with the random bit.


Goals (Art / Science) of Modeling

Y=f(X|Θ)

1. Model fitting

What are the best Θ values given f,X,Y?

Fitting the model = estimating the parameters.

Usually according to some criterion (almost always Maximum Likelihood).

2. Model selection

Which is the best of a set of models f_1, f_2, f_3, given X and Y?

Different models usually vary by which particular variables go into X, but can also vary by functional form and distributional assumptions.

Use some criterion (e.g. AIC) to "select" the best model, which balances how many parameters you estimated versus how good the fit is.


Whoa! What is "Maximum Likelihood"!?

[Photos: two squirrel color morphs, "Oakie" and "Orange"]

Q: What is the "best model" for squirrel morph distribution?


Y=f(X|Θ)

Linear model

Two ways to write the same thing:

Y_i = α + β X_i + ϵ_i

Y_i ~ N(α + β X_i, σ)

  • Response Y - Weight

  • Predictor X - Length

  • Parameters Θ: α; β; σ

  • Function f(): Normal distribution N()


Statistical output

Y_i ~ N(α + β X_i, σ)

```
## 
## Call:
## lm(formula = Weight ~ Length, data = pups %>% subset(Island == 
##     "Raykoke"))
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -7.498 -1.718  0.023  1.764  7.276 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -49.14222    5.75796  -8.535 1.81e-13 ***
## Length        0.75345    0.05193  14.510  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.761 on 98 degrees of freedom
## Multiple R-squared:  0.6824,    Adjusted R-squared:  0.6791 
## F-statistic: 210.5 on 1 and 98 DF,  p-value: < 2.2e-16
```

1. Point estimates and confidence intervals

Intercept (α): -49.14 ± 11.5

Slope (β): 0.75 ± 0.104

2. Is the model a good one?

p-values are very very small, in particular for slope

Proportion of variance explained is high:

R² = 0.68
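In R, the point estimates and (approximate) confidence intervals can be pulled straight from the fitted model; a sketch, assuming the Raykoke subset used above:

```
my_model <- lm(Weight ~ Length, data = subset(pups, Island == "Raykoke"))
coef(my_model)      # point estimates of the intercept and slope
confint(my_model)   # 95% confidence intervals (roughly estimate +/- 2 standard errors)
summary(my_model)   # the full table shown above
```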


Models and Hypotheses

Every p-value is a Hypothesis test.

             Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)   -49.142       5.758   -8.535         0
Length          0.753       0.052   14.510         0
  • First hypothesis test: H0 intercept = 0
  • Second hypothesis: H0 slope = 0

Both null-hypotheses strongly rejected.


But other variables might influence pup size

Lots of competing models with different main and interaction effects.


Linear modeling with a discrete factor

Y_ij = α + β_i + ϵ_ij

i is the index of sex (Male or Female), so there are two "Sex effects" - β1 and β2 representing the effect of the sex group; j is the index of the individual within each sex group i.

lm(Weight ~ Sex, data = pups)

term         estimate  std.error  statistic  p.value
(Intercept)    30.151      0.317     95.119   <2e-16
SexMale         6.149      0.429     14.337   <2e-16

The intercept here is the mean female weight.

Note - this is very similar to a t-test comparing two means (baby stats).
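That similarity can be checked directly; a sketch, where var.equal = TRUE matches the equal-variance assumption of the linear model:

```
# The Sex-only linear model ...
summary(lm(Weight ~ Sex, data = pups))

# ... gives the same t statistic and p-value as a classic two-sample t-test
t.test(Weight ~ Sex, data = pups, var.equal = TRUE)
```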


Linear modeling with multiple factors

Very easy to extend this to more complicated models!

Y_ijk = α + β_i Island_ijk + γ_j Sex_ijk + ϵ_ijk

lm(Weight ~ Island + Sex, data = pups)

term            estimate  std.error  statistic  p.value
(Intercept)        31.04       0.54      57.62   <1e-16
IslandChirpoev     -2.23       0.67      -3.34    0.001
IslandLovushki     -0.84       0.67      -1.26    0.21
IslandRaykoke       0.14       0.67       0.21    0.83
IslandSrednova     -1.50       0.67      -2.24    0.03
SexMale             6.14       0.42      14.47    1e-16

Analysis of Variance (ANOVA)

ANOVA is a technique for assessing which effects in a model are significant. Each row tests the null hypothesis that the corresponding effect's coefficients are all zero.

In this model, we include an interaction, asking: "Do different Islands have different patterns among Sexes? (and vice versa)"

lm(Weight ~ Island * Sex, data = pups)

Analysis of Variance Table

Response: Weight
            Df  Sum Sq Mean Sq  F value    Pr(>F)    
Island       4   443.3   110.8   5.0114 0.0005763 ***
Sex          1  4623.9  4623.9 209.0758 < 2.2e-16 ***
Island:Sex   4    71.4    17.9   0.8075 0.5207439    
Residuals  488 10792.6    22.1                       
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Non-significant interaction term

Interpretation:

  • Differences between SEXES very significant (very very small p-value)
  • Differences among ISLANDS very significant (small p-value)
  • SEX differences among ISLANDS consistent (large interaction p-value)
  • ISLANDS differences between SEXES consistent (large interaction p-value)
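The ANOVA table above comes from wrapping the fitted model in anova(); a sketch, assuming the pups data:

```
# Fit main effects plus their interaction, then ask for the (sequential) ANOVA table
island_sex_model <- lm(Weight ~ Island * Sex, data = pups)
anova(island_sex_model)
```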

Combining continuous and categorical variables

Exploratory plot

It looks like, maybe, there are different body proportions for MALES and FEMALES.

ANOVA table confirms our suspicion!

Analysis of Variance Table

Response: Weight
            Df  Sum Sq Mean Sq  F value    Pr(>F)    
Length       1 12413.8 12413.8 1957.969 < 2.2e-16 ***
Sex          1   257.3   257.3   40.582 4.321e-10 ***
Length:Sex   1   128.1   128.1   20.208 8.662e-06 ***
Residuals  494  3132.0     6.3                       
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Highly significant interaction term.


Step III: Model Selection

ANOVA is helpful for "nested" models, where each one is a subset of another, more complex one. For comparing a set of competing, non-nested models, we use AIC.

ΔAIC table

Model  Formula                  k   R2     logLik   AIC     dAIC
M0     1 (null)                 1   0.000  -1569.5  3143.1  835.8
M1     Island                   5   0.028  -1562.5  3137.0  829.7
M2     Sex                      2   0.293  -1483.2  2972.4  665.1
M3     Length                   2   0.779  -1193.4  2392.8   85.5
M4     Length + Sex             3   0.795  -1174.5  2357.0   49.7
M6     Length * Sex             4   0.803  -1164.5  2339.0   31.7
M5     Length + Sex + Island    7   0.811  -1155.0  2325.9   18.6
M7     Length * Sex + Island    8   0.818  -1144.6  2307.3    0.0
M8     Length * Sex * Island   20   0.824  -1137.1  2316.1    8.8

Degrees of freedom k:

  • Number of estimated parameters. Measure of complexity.

Coefficient of determination R²:

  • Percent variation explained. It ALWAYS increases the more complex the model.
  • It is always zero for the NULL model.

log-likelihood log(L):

  • Total probability score of model. It ALWAYS increases the more complex the model.

Akaike Information Criterion:

  • AIC = -2 log(L) + 2k
  • A measure of model quality.
  • Smaller is better. It starts getting bigger if the model complexity gets too high.
  • The lowest AIC value is the "best" model.
  • (but within 2 ΔAIC is pretty much equivalent to best)
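One way such a ΔAIC table can be built in R; a sketch using just a few of the pup models (the model set here is abbreviated for illustration):

```
# Fit a handful of competing models for pup weight
M0 <- lm(Weight ~ 1, data = pups)                 # null (intercept-only) model
M3 <- lm(Weight ~ Length, data = pups)
M4 <- lm(Weight ~ Length + Sex, data = pups)
M7 <- lm(Weight ~ Length * Sex + Island, data = pups)

# AIC for each model, and Delta-AIC relative to the best (lowest) value
aic_table <- AIC(M0, M3, M4, M7)
aic_table$dAIC <- aic_table$AIC - min(aic_table$AIC)
aic_table[order(aic_table$dAIC), ]
```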

AIC in action: What predicts ungulate body size?

Quality (Nitrogen)? or Type (browse/grass)?


Caribou spring migrations

Remarkable temporal synchrony at a continental scale.


Could the synchrony be driven by global weather drivers?

Pacific Decadal Oscillation, Arctic Oscillation, North Atlantic Oscillation: determine whether the winter is wet & snowy or dry & cold.


ΔAIC Table 1: Departure time

... driven by LARGE climate oscillations.


ΔAIC Table 2: Arrival time

... completely independent of climate!


Step IV: Generalized linear modeling

Normal Model

Y_i ~ Normal(α_0 + β_1 X_i, σ)

Models continuous data with a "normal-like" distribution.

Binomial model

Y_i ~ Bernoulli( exp(α + β X_i) / (1 + exp(α + β X_i)) )

There's some probability of something happening that depends on the predictor X.

Bernoulli just means the data are all 0 or 1.

This models presence/absence, dead/alive, male/female, or other response variables with two possible outcomes.
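The expression inside Bernoulli() is the inverse-logit (logistic) function, which squeezes α + βX onto the (0, 1) probability scale; in R it is available as plogis(). A tiny sketch with made-up α and β:

```
# Probability of "presence" as a function of X, for illustrative alpha and beta
alpha <- 2.7; beta <- -0.13
X <- seq(0, 40, by = 1)
p <- plogis(alpha + beta * X)   # identical to exp(a + b*X) / (1 + exp(a + b*X))
plot(X, p, type = "l", ylim = c(0, 1))
```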


What factors predict occurrence of Solea solea larvae?

Sampled in the estuary of the Tejo river in Portugal

  • Lots of environmental factors in data
depth  temp  salinity  transp  gravel  large_sand  fine_sand    mud  presence
  3.0    20        30      15    3.74       13.15      11.93  71.18         0
  2.6    18        29      15    1.94        4.99       5.43  87.63         0
  2.6    19        30      15    2.88        8.98      16.85  71.29         1
  2.1    20        29      15   11.06       11.96      21.95  55.03         0
  3.2    20        30      15    9.87       28.60      19.49  42.04         0
  3.5    20        32       7   32.45        7.39       9.43  50.72         0

Presence of Solea solea against salinity

Modeling is EXACTLY the same as linear regression except:

  • glm - for generalized linear model (instead of lm)
  • family = 'binomial' is the instruction to fit the logistic regression

glm(presence ~ salinity, family = 'binomial')

             Estimate  Std. Error  z value  Pr(>|z|)
(Intercept)     2.661       0.902    2.951     0.003
salinity       -0.130       0.035   -3.716     0.000

Clearly - Solea solea presence is very significantly negatively related to salinity.


Out of this model we can make predictions
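A sketch of how such predictions are generated with predict(); the data frame name solea is hypothetical, and type = "response" returns probabilities rather than log-odds:

```
# Fit and store the logistic regression (the data frame name is hypothetical)
solea_fit <- glm(presence ~ salinity, family = 'binomial', data = solea)

# Predicted probability of presence across a range of salinities
new_data <- data.frame(salinity = seq(0, 40, length = 100))
p_hat <- predict(solea_fit, newdata = new_data, type = "response")
plot(new_data$salinity, p_hat, type = "l", ylim = c(0, 1),
     xlab = "salinity", ylab = "probability of presence")
```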


ΔAIC analysis - and coefficients

Model  Formula                  k  logLik   AIC  dAIC
M9     salinity + gravel        3   -33.2  72.5   0.0
M2     salinity                 2   -34.3  72.6   0.1
M7     temp + salinity          3   -34.0  74.0   1.5
M5     depth + salinity         3   -34.1  74.3   1.8
M11    depth + temp + salinity  4   -33.9  75.8   3.3
M0     depth                    2   -38.1  80.1   7.6
M4     depth + temp             3   -38.0  81.9   9.4
M6     depth + gravel           3   -38.0  82.0   9.5
M10    depth + temp + gravel    4   -37.8  83.7  11.2
M1     temp                     2   -43.3  90.6  18.1
M3     gravel                   2   -43.7  91.3  18.8
M8     temp + gravel            3   -43.3  92.6  20.1

Salinity is clearly among the more important covariates (it appears in each of the top 4 models).


FLASHBACK: how the caribou Resource Selection Function was selected


Poisson regression

Y_i ~ Poisson(λ = exp(α + β X_i))

  • We are counting something ... the data are whole numbers between 0 and ∞.
  • λ is a density; densities vary across habitat types (covariate X).


Field flags

Did flag densities vary with region?

Approximate areas:

region                        area
North                         82 m²
South                         82 m²
Perimeter                     196 m²
Sampling square (hula hoop)   0.5 m²


Count data

Lots of 0's, some 1's, and just one 2 count.

##       Region
## Count  North  Perimeter  South
##     0     15         12      9
##     1      1          1      6
##     2      0          0      1

Fitting models

glm(count ~ region, family = 'poisson')

Exact same syntax as before, except the "family" is Poisson.

                 Estimate  Std. Error  z value  Pr(>|z|)
(Intercept)        -2.773       1.000   -2.773     0.006
RegionPerimeter     0.208       1.414    0.147     0.883
RegionSouth         2.079       1.061    1.961     0.050

The intercept here is "North"; the p-values compare each region with North. So South has (borderline) significantly more flags.

ΔAIC table

              df    AIC
Null.model     1  53.47
Region.model   3  49.15

Model that includes Region has lower AIC
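A sketch of that comparison in R (the data frame name flags is hypothetical):

```
# Intercept-only (null) model vs. the Region model for the flag counts
null_model   <- glm(count ~ 1,      family = 'poisson', data = flags)
region_model <- glm(count ~ region, family = 'poisson', data = flags)
AIC(null_model, region_model)   # the lower AIC favors the Region model
```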


Making predictions

Region     area     fit  se.fit  l.hat  l.low  l.high  d.hat  d.low  d.high  N.hat  N.low  N.high
South        82  -0.693   0.354  0.500  0.247   1.014  1.000  0.493   2.028   82.0   40.4   166.3
North        82  -2.773   1.000  0.063  0.008   0.462  0.125  0.017   0.924   10.2    1.4    75.8
Perimeter   196  -2.565   1.000  0.077  0.010   0.568  0.154  0.021   1.137   30.2    4.1   222.9
  • fit and se.fit are on the log scale, so they need to be transformed via exp into intensities λ.

  • l.hat is the Poisson intensity λ of the sampling square (hula hoop), which we turn into an actual density by dividing by its area 0.5 m2.

  • d.hat (and d.low and d.high) are the density estimates & confidence intervals, which we then turn into our numerical predictions by multiplying by area.
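A sketch of the back-transformation chain for one row (the South numbers from the table above; the ±2 SE interval is an approximation):

```
# South: predicted log-intensity and its standard error from the Poisson model
fit    <- -0.693
se.fit <-  0.354

l.hat  <- exp(fit)               # intensity per hula hoop  ~ 0.50
l.low  <- exp(fit - 2 * se.fit)  # ~ 0.25
l.high <- exp(fit + 2 * se.fit)  # ~ 1.01

d.hat  <- l.hat / 0.5            # density per m^2 (hoop area = 0.5 m^2)  ~ 1.0
N.hat  <- d.hat * 82             # predicted total flags in South (82 m^2) ~ 82
c(d.hat = d.hat, N.hat = N.hat)
```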

Total estimate

N̂ = 122.4 (95% C.I.: 71.4 – 173.4)

Pretty darned good! The true values were 92 total [58 S, 29 N, 13 perimeter].


Take-aways on (linear, statistical) modeling

  1. Linear modeling separates patterns (the model) from "randomness" (unexplained variation).

  2. We structure our models to have a response variable and one or more predictors or covariates.

  3. Depending on the response variable, a different family is chosen:

    • if continuous and symmetric: Normal family
    • if two values (presence/absence, dead/alive): Binomial family
    • if count data: Poisson family.
  4. An important task is Model selection, identifying which model is "best"

    • Best means "explains the most variation without overfitting"
    • Very common criterion is AIC.
  5. Once a model is "selected", we can:

    • analyze the results by seeing the effect sizes (magnitude of coefficients, aka slopes) and directions (signs of coefficients)
    • make inferential predictions by "spreading" our model over a larger landscape.
  6. Well over 90% of habitat modeling is done this way!


extra slides

Some comments on linear models

Y_i = α + β X_i + ϵ_i

  1. ϵ_i is unexplained variation or residual variance. It is often POORLY/WRONGLY referred to as "error". It is a random variable, NOT a parameter.

  2. A better, more sophisticated way to think of this model is not to focus on isolating the residual variance, but to treat the whole process as a random variable: Y_i ~ N(α + β X_i, σ). This is better because (a) the three parameters (α, β, σ) are more clearly visible, and (b) it can be "generalized": for example, the Normal distribution can be replaced by a Bernoulli distribution (for binary data), a Poisson distribution (for count data), etc.

  3. α + β X_i is the predictor, or the "modeled" portion. There can be any number of variables in the predictor and they can have different powers, so Y_i ~ N(α + β X_i + γ Z_i + δ X_i² + ν X_i Z_i, σ) is also a linear model.
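In R formula syntax, that last model could be written as follows (a sketch; Y, X, Z, and my_data are hypothetical names):

```
# Still "linear" - linear in the parameters - despite the square and the interaction
lm(Y ~ X + Z + I(X^2) + X:Z, data = my_data)
```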
