Everything you need to know to do 95% of all wildlife modeling in less than an hour and FOUR (or FIVE) easy steps!!
I. Linear modeling
II. Multivariate modeling
III. Model selection
IV. Generalized linear modeling
V. Prediction
... is a very general method to quantifying relationships among variables.
Xi - is called:
Yi - is the property we are interested in modeling:
Note: There actually can be interest in wildlife studies to have models for length and weight, since length is easy to measure (e.g. from drones), but weight tells us more about physical condition and energetics.
Steller sea lion (Eumatopias jubatus) pups.
Yi=a+bXi
a - intercept; b - slope
Yi=α+βXi+ϵi
α - intercept; β - slope; ϵ - randomness!: ϵi∼N(0,σ)
Point Estimate
This command fits a model:
lm(Weight ~ Length, data = pups)
## ## Call:## lm(formula = Weight ~ Length, data = pups)## ## Coefficients:## (Intercept) Length ## -49.1422 0.7535
So for each 1 cm of length, add another 754 grams, i.e. (\widehat{\beta} = 0.754)
plot(Weight ~ Length, data = pups)abline(my_model)
The abline
puts a line, with intercept a
and slope b
onto a figure.
Statistical inference is the science / art of observings something from a portion of a population and making statements about the entire population.
In practice - this is done by taking data and estimating parameters of a model. (This is also called fitting a model).
Two related goals:
This is (almost always) done using Maximum Likelihood Estimation, i.e. an algorithm searches through possible values of the parameters that make the model MOST LIKELY (have the highest probability) given the data.
Another gratuitous sea lion picture.
Y=f(X|Θ)
X - predictor(s) | independent variable(s) | covariate(s). The thing(s) that "explain(s)" Y. The cause (maybe).
f - the model structure. This includes: some deterministic functional form form (linear? periodic? polynomial? exponential?) AND some probabilistic assumptions, i.e. a way to characterize the variability / randomness / unpredictability of the process.
Θ - the parameters of the model. There are usually some parameters associated with the predictors, and some associated with the random bit.
Y=f(X|Θ)
What are the best Θ values given f,X,Y?
Fitting the model = estimating the parameters.
Usually according to some criterion (almost always Maximum Likelihood.
Y=f(X|Θ)
What are the best Θ values given f,X,Y?
Fitting the model = estimating the parameters.
Usually according to some criterion (almost always Maximum Likelihood.
What are the best of a set of models f1, f2, f3 given X and Y?
Different models usually vary by what particular variables go into X, but can also vary by functional form and distribution assumptions
Use some Criterion (e.g. AIC) to "select" the best model, which balances how many parameters you estimated verses how good the fit is.
Q: What is the "best model" for squirrel morph distribution?
Y=f(X|Θ)
Two ways to write the same thing:
Yi=α+βXi+ϵi Yi∼N(α+βXi,σ)
Response Y - Weight
Predictor X - Length
Parameters Θ: α; β; σ
Function f(): Normal distribution N()
Yi∼N(α+βXi,σ)
``` ## ## Call: ## lm(formula = Weight ~ Length, data = pups %>% subset(Island == ## "Raykoke")) ## ## Residuals: ## Min 1Q Median 3Q Max ## -7.498 -1.718 0.023 1.764 7.276 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) -49.14222 5.75796 -8.535 1.81e-13 *** ## Length 0.75345 0.05193 14.510 < 2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 2.761 on 98 degrees of freedom ## Multiple R-squared: 0.6824, Adjusted R-squared: 0.6791 ## F-statistic: 210.5 on 1 and 98 DF, p-value: < 2.2e-16 ```
Yi∼N(α+βXi,σ)
``` ## ## Call: ## lm(formula = Weight ~ Length, data = pups %>% subset(Island == ## "Raykoke")) ## ## Residuals: ## Min 1Q Median 3Q Max ## -7.498 -1.718 0.023 1.764 7.276 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) -49.14222 5.75796 -8.535 1.81e-13 *** ## Length 0.75345 0.05193 14.510 < 2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 2.761 on 98 degrees of freedom ## Multiple R-squared: 0.6824, Adjusted R-squared: 0.6791 ## F-statistic: 210.5 on 1 and 98 DF, p-value: < 2.2e-16 ```
Intercept ( α ): −49.14±11.5
Slope ( β ): 0.75±0.104
p-values are very very small, in particular for slope
Proportion of variance explained is high:
R2=0.68
Every p-value is a Hypothesis test.
Estimate | Std. Error | t value | Pr(>|t|) | |
---|---|---|---|---|
(Intercept) | -49.142 | 5.758 | -8.535 | 0 |
Length | 0.753 | 0.052 | 14.510 | 0 |
Both null-hypotheses strongly rejected.
Lots of competing models with different main and interaction effects.
Yij=α+βi+ϵij
i is the index of sex (Male or Female), so there are two "Sex effects" - β1 and β2 representing the effect of the sex group; j is the index of the individual within each sex group i.
lm(Weight ~ Sex, data = pups)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 30.151 | 0.317 | 95.119 | <2e-16 |
SexMale | 6.149 | 0.429 | 14.337 | <2e-16 |
Intercept here means mean female weight.
Note - this is very similar to a t-test comparing two means (baby stats).
Very easy to extend this to more complicated models!
Yijk=α+βiIslandijk+γjSexijk+ϵijk
lm(Weight ~ Island + Sex, data = pups)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 31.04 | 0.54 | 57.62 | <1e-16 |
IslandChirpoev | -2.23 | 0.67 | -3.34 | 0.001 |
IslandLovushki | -0.84 | 0.67 | -1.26 | 0.21 |
IslandRaykoke | 0.14 | 0.67 | 0.21 | 0.83 |
IslandSrednova | -1.50 | 0.67 | -2.24 | 0.03 |
SexMale | 6.14 | 0.42 | 14.47 | 1e-16 |
Is a technique for seeing which effect in a model is significant. Each row tests a hypothesis that the effect coefficients are non-zero.
In this model, we include an interaction, asking: "Do different Islands have different patterns among Sexes? (and vice versa)"
lm(Weight ~ Island * Sex, data = pups)
Analysis of Variance Table Response: Weight Df Sum Sq Mean Sq F value Pr(>F) Island 4 443.3 110.8 5.0114 0.0005763 *** Sex 1 4623.9 4623.9 209.0758 < 2.2e-16 *** Island:Sex 4 71.4 17.9 0.8075 0.5207439 Residuals 488 10792.6 22.1 --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Non-significant interaction term
Exploratory plot
Exploratory plot
Analysis of Variance Table Response: Weight Df Sum Sq Mean Sq F value Pr(>F) Length 1 12413.8 12413.8 1957.969 < 2.2e-16 *** Sex 1 257.3 257.3 40.582 4.321e-10 *** Length:Sex 1 128.1 128.1 20.208 8.662e-06 *** Residuals 494 3132.0 6.3 --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Highly significant interaction term.
ANOVA is helpful for "nested" models, where each one is a subset of another more complex one. For comparing a set of competing,non-nested models, we use .
Model | k | R2 | logLik | AIC | dAIC | |
---|---|---|---|---|---|---|
M0 | 1 | 1 | 0.000 | -1569.5 | 3143.1 | 835.8 |
M1 | Island | 5 | 0.028 | -1562.5 | 3137.0 | 829.7 |
M2 | Sex | 2 | 0.293 | -1483.2 | 2972.4 | 665.1 |
M3 | Length | 2 | 0.779 | -1193.4 | 2392.8 | 85.5 |
M4 | Length + Sex | 3 | 0.795 | -1174.5 | 2357.0 | 49.7 |
M6 | Length * Sex | 4 | 0.803 | -1164.5 | 2339.0 | 31.7 |
M5 | Length + Sex + Island | 7 | 0.811 | -1155.0 | 2325.9 | 18.6 |
M7 | Length * Sex + Island | 8 | 0.818 | -1144.6 | 2307.3 | 0.0 |
M8 | Length * Sex * Island | 20 | 0.824 | -1137.1 | 2316.1 | 8.8 |
Degrees of freedom k:
Coefficient of determination R2:
log-likelihood log(L):
Akaike Information Criterion:
Quality (Nitrogen)? or Type (browse/grass)?
Remarkable temporal synchrony at a continental scale.
Pacific Decadal Oscillation, Arctic Oscillation, North Atlantic Oscillation: determine whether the winter is wet & snowy or dry & cold.
... driven by LARGE climate oscillations.
... completely independent of climate!
Yi∼Normal(α0+β1Xi,σ) Models continuous data with a "normal-like" distribution.
Yi∼Normal(α0+β1Xi,σ) Models continuous data with a "normal-like" distribution.
Yi∼Bernoulli(exp(α+βXi)1+exp(α+βXi))
There's some probability of something happening that depends on the predictor X.
Bernoulli just means the data are all 0 or 1.
This models presence/absence, dead/alive, male/female other response variables with 2 possible outcomes.
Sampled in the estuary of the Tejo river in Portugal
depth | temp | salinity | transp | gravel | large_sand | fine_sand | mud | presence |
---|---|---|---|---|---|---|---|---|
3.0 | 20 | 30 | 15 | 3.74 | 13.15 | 11.93 | 71.18 | 0 |
2.6 | 18 | 29 | 15 | 1.94 | 4.99 | 5.43 | 87.63 | 0 |
2.6 | 19 | 30 | 15 | 2.88 | 8.98 | 16.85 | 71.29 | 1 |
2.1 | 20 | 29 | 15 | 11.06 | 11.96 | 21.95 | 55.03 | 0 |
3.2 | 20 | 30 | 15 | 9.87 | 28.60 | 19.49 | 42.04 | 0 |
3.5 | 20 | 32 | 7 | 32.45 | 7.39 | 9.43 | 50.72 | 0 |
Modeling is EXACTLY the same as linear regression except:
glm
- for generalized linear model (instead of lm
)family = 'binomial'
is the instruction to fit the logistic regressionglm(presence ~ salinity, family ='binomial')
Estimate | Std. Error | z value | Pr(>|z|) | |
---|---|---|---|---|
(Intercept) | 2.661 | 0.902 | 2.951 | 0.003 |
salinity | -0.130 | 0.035 | -3.716 | 0.000 |
Clearly - Solea solea presence is very significantly negatively related to salinity.
Model | k | logLik | AIC | dAIC | |
---|---|---|---|---|---|
M9 | salinity + gravel | 3 | -33.2 | 72.5 | 0.0 |
M2 | salinity | 2 | -34.3 | 72.6 | 0.1 |
M7 | temp + salinity | 3 | -34.0 | 74.0 | 1.5 |
M5 | depth + salinity | 3 | -34.1 | 74.3 | 1.8 |
M11 | depth + temp + salinity | 4 | -33.9 | 75.8 | 3.3 |
M0 | depth | 2 | -38.1 | 80.1 | 7.6 |
M4 | depth + temp | 3 | -38.0 | 81.9 | 9.4 |
M6 | depth + gravel | 3 | -38.0 | 82.0 | 9.5 |
M10 | depth + temp + gravel | 4 | -37.8 | 83.7 | 11.2 |
M1 | temp | 2 | -43.3 | 90.6 | 18.1 |
M3 | gravel | 2 | -43.7 | 91.3 | 18.8 |
M8 | temp + gravel | 3 | -43.3 | 92.6 | 20.1 |
Salinity clearly among the more important covariates (in the top 4 models).
Yi∼Poisson(λ=exp(α+βXi))
Did flag densities vary with region?
Approximate areas:
region | area |
---|---|
North: | 82 m2 |
South: | 82 m2 |
Perimeter: | 196 m2 |
-- | -- |
Sampling square (hula hoop) | 0.5 m2 |
Lots of 0's, some 1's, and just one 2 count.
## Region## Count North Perimeter South## 0 15 12 9## 1 1 1 6## 2 0 0 1
glm(count ~ region, family = 'poisson')
Exact same syntax as before, except the "family" is Poisson.
Lots of 0's, some 1's, and just one 2 count.
## Region## Count North Perimeter South## 0 15 12 9## 1 1 1 6## 2 0 0 1
Estimate | Std. Error | z value | Pr(>|z|) | |
---|---|---|---|---|
(Intercept) | -2.773 | 1.000 | -2.773 | 0.006 |
RegionPerimeter | 0.208 | 1.414 | 0.147 | 0.883 |
RegionSouth | 2.079 | 1.061 | 1.961 | 0.050 |
The intercept here is "North", the p-values compare with North. So South has - borderline - significantly more
df | AIC | |
---|---|---|
Null.model | 1 | 53.47 |
Region.model | 3 | 49.15 |
Model that includes Region has lower AIC
Region | area | fit | se.fit | l.hat | l.low | l.high | d.hat | d.low | d.high | N.hat | N.low | N.high |
---|---|---|---|---|---|---|---|---|---|---|---|---|
South | 82 | -0.693 | 0.354 | 0.500 | 0.247 | 1.014 | 1.000 | 0.493 | 2.028 | 82.0 | 40.4 | 166.3 |
North | 82 | -2.773 | 1.000 | 0.063 | 0.008 | 0.462 | 0.125 | 0.017 | 0.924 | 10.2 | 1.4 | 75.8 |
Perimeter | 196 | -2.565 | 1.000 | 0.077 | 0.010 | 0.568 | 0.154 | 0.021 | 1.137 | 30.2 | 4.1 | 222.9 |
fit and se.fit are in the log scale, so they need to be transformed via exp to intensities λ.
l.hat
is the Poisson intensity λ of the sampling square (hula hoop), which we turn into an actual density by dividing by its area 0.5 m2.
d.hat
(and d.low
and d.high
) are the density estimates & confidence intervals, which we then turn into our numerical predictions by multiplying by area.
ˆN=122.4(95%C.I.:71.4−173.4)
pretty darned good! The true values were 92 total [58 S, 29 N, 13 perimeter]
Linear modeling separates patterns (the model) from "randomness" (unexplained variation).
We structure our models to have a response variable and one or more predictors or covariates.
Depending on the reponse variable, a different family is chosen:
An important task is Model selection, identifying which model is "best"
Once a model is "selected", we can:
Well over 90% of habitat modeling is done this way!
Yi∼α+βXi+ϵi
Yi∼α+βXi+ϵi
ϵi is unexplained variation or residual variance. It is often POORLY/WRONGLY referred to as "error". It is a random variable, NOT a parameter
A better, more sophisticated way to think of this model is not to focus on isolating the residual variance, but that the whole process is a random variable: Yi∼N(α+βXi,σ) This is better because: (a) the three parameters ( α,β,σ ) are more clearly visible, (b) it can be "generalized". For example the Normal distribution can be a Bernoulli distribution (for binary data), or a Poisson distribution for count data, etc.
Yi∼α+βXi+ϵi
ϵi is unexplained variation or residual variance. It is often POORLY/WRONGLY referred to as "error". It is a random variable, NOT a parameter
A better, more sophisticated way to think of this model is not to focus on isolating the residual variance, but that the whole process is a random variable: Yi∼N(α+βXi,σ) This is better because: (a) the three parameters ( α,β,σ ) are more clearly visible, (b) it can be "generalized". For example the Normal distribution can be a Bernoulli distribution (for binary data), or a Poisson distribution for count data, etc.
α+βXi is the predictor, or the "modeled" portion. There can be any number of variables in the predictor and they can have different powers, so: Yi∼N(α+βXi+γZi+δX2i+νXiZi,σ) is also a linear model.
Everything you need to know to do 95% of all wildlife modeling in less than an hour and FOUR (or FIVE) easy steps!!
I. Linear modeling
II. Multivariate modeling
III. Model selection
IV. Generalized linear modeling
V. Prediction
Keyboard shortcuts
↑, ←, Pg Up, k | Go to previous slide |
↓, →, Pg Dn, Space, j | Go to next slide |
Home | Go to first slide |
End | Go to last slide |
Number + Return | Go to specific slide |
b / m / f | Toggle blackout / mirrored / fullscreen mode |
c | Clone slideshow |
p | Toggle presenter mode |
t | Restart the presentation timer |
?, h | Toggle this help |
s | Toggle scribble toolbox |
o | Tile View: Overview of Slides |
Esc | Back to slideshow |
Everything you need to know to do 95% of all wildlife modeling in less than an hour and FOUR (or FIVE) easy steps!!
I. Linear modeling
II. Multivariate modeling
III. Model selection
IV. Generalized linear modeling
V. Prediction
... is a very general method to quantifying relationships among variables.
Xi - is called:
Yi - is the property we are interested in modeling:
Note: There actually can be interest in wildlife studies to have models for length and weight, since length is easy to measure (e.g. from drones), but weight tells us more about physical condition and energetics.
Steller sea lion (Eumatopias jubatus) pups.
Yi=a+bXi
a - intercept; b - slope
Yi=α+βXi+ϵi
α - intercept; β - slope; ϵ - randomness!: ϵi∼N(0,σ)
Point Estimate
This command fits a model:
lm(Weight ~ Length, data = pups)
## ## Call:## lm(formula = Weight ~ Length, data = pups)## ## Coefficients:## (Intercept) Length ## -49.1422 0.7535
So for each 1 cm of length, add another 754 grams, i.e. (\widehat{\beta} = 0.754)
plot(Weight ~ Length, data = pups)abline(my_model)
The abline
puts a line, with intercept a
and slope b
onto a figure.
Statistical inference is the science / art of observings something from a portion of a population and making statements about the entire population.
In practice - this is done by taking data and estimating parameters of a model. (This is also called fitting a model).
Two related goals:
This is (almost always) done using Maximum Likelihood Estimation, i.e. an algorithm searches through possible values of the parameters that make the model MOST LIKELY (have the highest probability) given the data.
Another gratuitous sea lion picture.
Y=f(X|Θ)
X - predictor(s) | independent variable(s) | covariate(s). The thing(s) that "explain(s)" Y. The cause (maybe).
f - the model structure. This includes: some deterministic functional form form (linear? periodic? polynomial? exponential?) AND some probabilistic assumptions, i.e. a way to characterize the variability / randomness / unpredictability of the process.
Θ - the parameters of the model. There are usually some parameters associated with the predictors, and some associated with the random bit.
Y=f(X|Θ)
What are the best Θ values given f,X,Y?
Fitting the model = estimating the parameters.
Usually according to some criterion (almost always Maximum Likelihood.
Y=f(X|Θ)
What are the best Θ values given f,X,Y?
Fitting the model = estimating the parameters.
Usually according to some criterion (almost always Maximum Likelihood.
What are the best of a set of models f1, f2, f3 given X and Y?
Different models usually vary by what particular variables go into X, but can also vary by functional form and distribution assumptions
Use some Criterion (e.g. AIC) to "select" the best model, which balances how many parameters you estimated verses how good the fit is.
Q: What is the "best model" for squirrel morph distribution?
Y=f(X|Θ)
Two ways to write the same thing:
Yi=α+βXi+ϵi Yi∼N(α+βXi,σ)
Response Y - Weight
Predictor X - Length
Parameters Θ: α; β; σ
Function f(): Normal distribution N()
Yi∼N(α+βXi,σ)
``` ## ## Call: ## lm(formula = Weight ~ Length, data = pups %>% subset(Island == ## "Raykoke")) ## ## Residuals: ## Min 1Q Median 3Q Max ## -7.498 -1.718 0.023 1.764 7.276 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) -49.14222 5.75796 -8.535 1.81e-13 *** ## Length 0.75345 0.05193 14.510 < 2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 2.761 on 98 degrees of freedom ## Multiple R-squared: 0.6824, Adjusted R-squared: 0.6791 ## F-statistic: 210.5 on 1 and 98 DF, p-value: < 2.2e-16 ```
Yi∼N(α+βXi,σ)
``` ## ## Call: ## lm(formula = Weight ~ Length, data = pups %>% subset(Island == ## "Raykoke")) ## ## Residuals: ## Min 1Q Median 3Q Max ## -7.498 -1.718 0.023 1.764 7.276 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) -49.14222 5.75796 -8.535 1.81e-13 *** ## Length 0.75345 0.05193 14.510 < 2e-16 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 2.761 on 98 degrees of freedom ## Multiple R-squared: 0.6824, Adjusted R-squared: 0.6791 ## F-statistic: 210.5 on 1 and 98 DF, p-value: < 2.2e-16 ```
Intercept ( α ): −49.14±11.5
Slope ( β ): 0.75±0.104
p-values are very very small, in particular for slope
Proportion of variance explained is high:
R2=0.68
Every p-value is a Hypothesis test.
Estimate | Std. Error | t value | Pr(>|t|) | |
---|---|---|---|---|
(Intercept) | -49.142 | 5.758 | -8.535 | 0 |
Length | 0.753 | 0.052 | 14.510 | 0 |
Both null-hypotheses strongly rejected.
Lots of competing models with different main and interaction effects.
Yij=α+βi+ϵij
i is the index of sex (Male or Female), so there are two "Sex effects" - β1 and β2 representing the effect of the sex group; j is the index of the individual within each sex group i.
lm(Weight ~ Sex, data = pups)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 30.151 | 0.317 | 95.119 | <2e-16 |
SexMale | 6.149 | 0.429 | 14.337 | <2e-16 |
Intercept here means mean female weight.
Note - this is very similar to a t-test comparing two means (baby stats).
Very easy to extend this to more complicated models!
Yijk=α+βiIslandijk+γjSexijk+ϵijk
lm(Weight ~ Island + Sex, data = pups)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 31.04 | 0.54 | 57.62 | <1e-16 |
IslandChirpoev | -2.23 | 0.67 | -3.34 | 0.001 |
IslandLovushki | -0.84 | 0.67 | -1.26 | 0.21 |
IslandRaykoke | 0.14 | 0.67 | 0.21 | 0.83 |
IslandSrednova | -1.50 | 0.67 | -2.24 | 0.03 |
SexMale | 6.14 | 0.42 | 14.47 | 1e-16 |
Is a technique for seeing which effect in a model is significant. Each row tests a hypothesis that the effect coefficients are non-zero.
In this model, we include an interaction, asking: "Do different Islands have different patterns among Sexes? (and vice versa)"
lm(Weight ~ Island * Sex, data = pups)
Analysis of Variance Table Response: Weight Df Sum Sq Mean Sq F value Pr(>F) Island 4 443.3 110.8 5.0114 0.0005763 *** Sex 1 4623.9 4623.9 209.0758 < 2.2e-16 *** Island:Sex 4 71.4 17.9 0.8075 0.5207439 Residuals 488 10792.6 22.1 --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Non-significant interaction term
Exploratory plot
Exploratory plot
Analysis of Variance Table Response: Weight Df Sum Sq Mean Sq F value Pr(>F) Length 1 12413.8 12413.8 1957.969 < 2.2e-16 *** Sex 1 257.3 257.3 40.582 4.321e-10 *** Length:Sex 1 128.1 128.1 20.208 8.662e-06 *** Residuals 494 3132.0 6.3 --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Highly significant interaction term.
ANOVA is helpful for "nested" models, where each one is a subset of another more complex one. For comparing a set of competing,non-nested models, we use .
Model | k | R2 | logLik | AIC | dAIC | |
---|---|---|---|---|---|---|
M0 | 1 | 1 | 0.000 | -1569.5 | 3143.1 | 835.8 |
M1 | Island | 5 | 0.028 | -1562.5 | 3137.0 | 829.7 |
M2 | Sex | 2 | 0.293 | -1483.2 | 2972.4 | 665.1 |
M3 | Length | 2 | 0.779 | -1193.4 | 2392.8 | 85.5 |
M4 | Length + Sex | 3 | 0.795 | -1174.5 | 2357.0 | 49.7 |
M6 | Length * Sex | 4 | 0.803 | -1164.5 | 2339.0 | 31.7 |
M5 | Length + Sex + Island | 7 | 0.811 | -1155.0 | 2325.9 | 18.6 |
M7 | Length * Sex + Island | 8 | 0.818 | -1144.6 | 2307.3 | 0.0 |
M8 | Length * Sex * Island | 20 | 0.824 | -1137.1 | 2316.1 | 8.8 |
Degrees of freedom k:
Coefficient of determination R2:
log-likelihood log(L):
Akaike Information Criterion:
Quality (Nitrogen)? or Type (browse/grass)?
Remarkable temporal synchrony at a continental scale.
Pacific Decadal Oscillation, Arctic Oscillation, North Atlantic Oscillation: determine whether the winter is wet & snowy or dry & cold.
... driven by LARGE climate oscillations.
... completely independent of climate!
Yi∼Normal(α0+β1Xi,σ) Models continuous data with a "normal-like" distribution.
Yi∼Normal(α0+β1Xi,σ) Models continuous data with a "normal-like" distribution.
Yi∼Bernoulli(exp(α+βXi)1+exp(α+βXi))
There's some probability of something happening that depends on the predictor X.
Bernoulli just means the data are all 0 or 1.
This models presence/absence, dead/alive, male/female other response variables with 2 possible outcomes.
Sampled in the estuary of the Tejo river in Portugal
depth | temp | salinity | transp | gravel | large_sand | fine_sand | mud | presence |
---|---|---|---|---|---|---|---|---|
3.0 | 20 | 30 | 15 | 3.74 | 13.15 | 11.93 | 71.18 | 0 |
2.6 | 18 | 29 | 15 | 1.94 | 4.99 | 5.43 | 87.63 | 0 |
2.6 | 19 | 30 | 15 | 2.88 | 8.98 | 16.85 | 71.29 | 1 |
2.1 | 20 | 29 | 15 | 11.06 | 11.96 | 21.95 | 55.03 | 0 |
3.2 | 20 | 30 | 15 | 9.87 | 28.60 | 19.49 | 42.04 | 0 |
3.5 | 20 | 32 | 7 | 32.45 | 7.39 | 9.43 | 50.72 | 0 |
Modeling is EXACTLY the same as linear regression except:
glm
- for generalized linear model (instead of lm
)family = 'binomial'
is the instruction to fit the logistic regressionglm(presence ~ salinity, family ='binomial')
Estimate | Std. Error | z value | Pr(>|z|) | |
---|---|---|---|---|
(Intercept) | 2.661 | 0.902 | 2.951 | 0.003 |
salinity | -0.130 | 0.035 | -3.716 | 0.000 |
Clearly - Solea solea presence is very significantly negatively related to salinity.
Model | k | logLik | AIC | dAIC | |
---|---|---|---|---|---|
M9 | salinity + gravel | 3 | -33.2 | 72.5 | 0.0 |
M2 | salinity | 2 | -34.3 | 72.6 | 0.1 |
M7 | temp + salinity | 3 | -34.0 | 74.0 | 1.5 |
M5 | depth + salinity | 3 | -34.1 | 74.3 | 1.8 |
M11 | depth + temp + salinity | 4 | -33.9 | 75.8 | 3.3 |
M0 | depth | 2 | -38.1 | 80.1 | 7.6 |
M4 | depth + temp | 3 | -38.0 | 81.9 | 9.4 |
M6 | depth + gravel | 3 | -38.0 | 82.0 | 9.5 |
M10 | depth + temp + gravel | 4 | -37.8 | 83.7 | 11.2 |
M1 | temp | 2 | -43.3 | 90.6 | 18.1 |
M3 | gravel | 2 | -43.7 | 91.3 | 18.8 |
M8 | temp + gravel | 3 | -43.3 | 92.6 | 20.1 |
Salinity clearly among the more important covariates (in the top 4 models).
Yi∼Poisson(λ=exp(α+βXi))
Did flag densities vary with region?
Approximate areas:
region | area |
---|---|
North: | 82 m2 |
South: | 82 m2 |
Perimeter: | 196 m2 |
-- | -- |
Sampling square (hula hoop) | 0.5 m2 |
Lots of 0's, some 1's, and just one 2 count.
## Region## Count North Perimeter South## 0 15 12 9## 1 1 1 6## 2 0 0 1
glm(count ~ region, family = 'poisson')
Exact same syntax as before, except the "family" is Poisson.
Lots of 0's, some 1's, and just one 2 count.
## Region## Count North Perimeter South## 0 15 12 9## 1 1 1 6## 2 0 0 1
Estimate | Std. Error | z value | Pr(>|z|) | |
---|---|---|---|---|
(Intercept) | -2.773 | 1.000 | -2.773 | 0.006 |
RegionPerimeter | 0.208 | 1.414 | 0.147 | 0.883 |
RegionSouth | 2.079 | 1.061 | 1.961 | 0.050 |
The intercept here is "North", the p-values compare with North. So South has - borderline - significantly more
df | AIC | |
---|---|---|
Null.model | 1 | 53.47 |
Region.model | 3 | 49.15 |
Model that includes Region has lower AIC
Region | area | fit | se.fit | l.hat | l.low | l.high | d.hat | d.low | d.high | N.hat | N.low | N.high |
---|---|---|---|---|---|---|---|---|---|---|---|---|
South | 82 | -0.693 | 0.354 | 0.500 | 0.247 | 1.014 | 1.000 | 0.493 | 2.028 | 82.0 | 40.4 | 166.3 |
North | 82 | -2.773 | 1.000 | 0.063 | 0.008 | 0.462 | 0.125 | 0.017 | 0.924 | 10.2 | 1.4 | 75.8 |
Perimeter | 196 | -2.565 | 1.000 | 0.077 | 0.010 | 0.568 | 0.154 | 0.021 | 1.137 | 30.2 | 4.1 | 222.9 |
fit and se.fit are in the log scale, so they need to be transformed via exp to intensities λ.
l.hat
is the Poisson intensity λ of the sampling square (hula hoop), which we turn into an actual density by dividing by its area 0.5 m2.
d.hat
(and d.low
and d.high
) are the density estimates & confidence intervals, which we then turn into our numerical predictions by multiplying by area.
ˆN=122.4(95%C.I.:71.4−173.4)
pretty darned good! The true values were 92 total [58 S, 29 N, 13 perimeter]
Linear modeling separates patterns (the model) from "randomness" (unexplained variation).
We structure our models to have a response variable and one or more predictors or covariates.
Depending on the reponse variable, a different family is chosen:
An important task is Model selection, identifying which model is "best"
Once a model is "selected", we can:
Well over 90% of habitat modeling is done this way!
Yi∼α+βXi+ϵi
Yi∼α+βXi+ϵi
ϵi is unexplained variation or residual variance. It is often POORLY/WRONGLY referred to as "error". It is a random variable, NOT a parameter
A better, more sophisticated way to think of this model is not to focus on isolating the residual variance, but that the whole process is a random variable: Yi∼N(α+βXi,σ) This is better because: (a) the three parameters ( α,β,σ ) are more clearly visible, (b) it can be "generalized". For example the Normal distribution can be a Bernoulli distribution (for binary data), or a Poisson distribution for count data, etc.
Yi∼α+βXi+ϵi
ϵi is unexplained variation or residual variance. It is often POORLY/WRONGLY referred to as "error". It is a random variable, NOT a parameter
A better, more sophisticated way to think of this model is not to focus on isolating the residual variance, but that the whole process is a random variable: Yi∼N(α+βXi,σ) This is better because: (a) the three parameters ( α,β,σ ) are more clearly visible, (b) it can be "generalized". For example the Normal distribution can be a Bernoulli distribution (for binary data), or a Poisson distribution for count data, etc.
α+βXi is the predictor, or the "modeled" portion. There can be any number of variables in the predictor and they can have different powers, so: Yi∼N(α+βXi+γZi+δX2i+νXiZi,σ) is also a linear model.