class: center, middle, white, title-slide

.title[
# How to model just about anything (Part I)
]
.subtitle[
## EFB 390: Wildlife Ecology and Management
]
.author[
### Dr. Elie Gurarie
]
.date[
### September 30, 2025
]

---

<!-- https://bookdown.org/yihui/rmarkdown/xaringan-format.html -->

## Super fast primer on statistical modeling

Everything you need to know to do 95% of all wildlife modeling in less than an hour and **FOUR** (or **FIVE**) easy steps!!

.pull-left.large[

**I.** Linear modeling

**II.** Multivariate modeling

**III.** Model selection
]

.pull-right.large[

**IV.** Generalized linear modeling
  - Poisson; Binomial

**V.** Prediction
]

---

.pull-left-70[

# **Step I:** Linear modeling

... is a very general method for quantifying relationships among variables.

.pull-left[
<!-- -->
]

.pull-right[

`\(X_i\)` - is called:
- covariate
- independent variable
- explanatory variable

`\(Y_i\)` - is the property we are interested in modeling:
- response variable
- dependent variable

.small[Note: Wildlife studies often do want a model relating **length** and **weight**, since **length** is easy to measure (e.g. from drones), but **weight** tells us more about physical condition and energetics.]
]
]

.pull-right-30[

Steller sea lion (*Eumetopias jubatus*) pups.
]

---

# Linear Models

.pull-left[

#### Deterministic:

`$$Y_i = a + bX_i$$`

`\(a\)` - intercept; `\(b\)` - slope
]

.pull-right[

#### Probabilistic:

`$$Y_i = \alpha + \beta X_i + \epsilon_i$$`

`\(\alpha\)` - intercept; `\(\beta\)` - slope; `\(\epsilon\)` - **randomness!**: `\(\epsilon_i \sim {\cal N}(0, \sigma)\)`
]

.pull-left[
<!-- -->
]

.pull-right[
<!-- -->
]

---

# Fitting linear models is very easy in R!

.pull-left[

**Point Estimate**

These commands fit a model, store it as `my_model`, and print it:

.small[

``` r
# Fit the linear model and keep it for later use (e.g. plotting)
my_model <- lm(Weight ~ Length, data = pups)
my_model
```

```
## 
## Call:
## lm(formula = Weight ~ Length, data = pups)
## 
## Coefficients:
## (Intercept)       Length  
##    -49.1422       0.7535  
```
]

So for **each 1 cm** of length, add another **754 grams**, i.e.
`\(\widehat{\beta} = 0.754\)`
]

.pull-right[

``` r
plot(Weight ~ Length, data = pups)
abline(my_model)
```

<!-- -->

The `abline` function draws a line with intercept `a` and slope `b` onto a figure. Given a fitted model, it uses the estimated intercept and slope.
]

---

.pull-left-60[

# Statistical inference

**Statistical inference** is the *science / art* of observing *something* from a **portion of a population** and making statements about the **entire population**.

In practice - this is done by taking **data** and **estimating parameters** of a **model**. (This is also called *fitting* a model).

Two related goals:

1. Obtaining a **point estimate** and a **confidence interval** (precision) of the parameter estimate.

2. Assessing whether particular (combinations of) factors, i.e. **models**, provide any **explanatory power**.

This is (almost always) done using **Maximum Likelihood Estimation**, i.e. an algorithm searches through possible values of the parameters for the ones that make the model **MOST LIKELY** given the data (the ones under which the observed data have the highest probability).
]

.pull-right-40[

.small[Another gratuitous sea lion picture.]
]

---

# All models have these pieces:

`$$\Huge Y = f({\bf X} | \bf{\Theta})$$`

- **Y** - response | dependent variable. The thing we want to model / predict / understand. The **effect** (maybe).

- **X** - predictor(s) | independent variable(s) | covariate(s). The thing(s) that "explain(s)" **Y**. The **cause** (maybe).

- **f** - the model structure. This includes: some **deterministic functional form** (*linear? periodic? polynomial? exponential?*) AND some **probabilistic assumptions**, i.e. a way to characterize the variability / randomness / unpredictability of the process.

- `\(\bf \Theta\)` - the parameters of the model. There are usually some parameters associated with the **predictors**, and some associated with the **random bit**.

---

# Goals (Art / Science) of Modeling

`$$\Huge Y = f({\bf X} | \bf{\Theta})$$`

.pull-left[

### 1. Model fitting

.darkred[
What are the **best** `\(\bf \Theta\)` values given `\(f, {\bf X}, Y\)`?
]

**Fitting the model** = **estimating the parameters**. Usually according to some criterion (almost always **Maximum Likelihood**).
]

--

.pull-right[

### 2. Model selection

.darkred[
Which is the **best** of a set of models `\(f_1\)`, `\(f_2\)`, `\(f_3\)` given `\(\bf X\)` and `\(Y\)`?
]

Different models *usually* vary by what particular variables go into **X**, but can also vary by **functional form** and **distribution assumptions**.

Use some **Criterion** (e.g. AIC) to "select" the best model, which balances **how many parameters you estimated** versus **how good the fit is**.
]

---

# Whoa! What is "Maximum Likelihood"!?

.pull-left[

### Oakie

]

.pull-right[

### Orange

]

.red.center.large[**Q:** What is the "best model" for squirrel morph distribution?]

---

## Data and Models

.pull-left[

### Data / observations: `\(X_{ij}\)`

| island | a. Orange | b. Oakie|
|---|---|---|
| squirrel 1: | `\(X_{a,1}\)` = 1 | `\(X_{b,1}\)` = 1 |
| squirrel 2: | `\(X_{a,2}\)` = 1 | `\(X_{b,2}\)` = 0 |

- 1 = light morph
- 0 = dark morph
]

.pull-right[

### Models

| | model | k|
|---|:--|:---|
| M1: | `\(P(X_{ij} = 1) = p = 0.5\)` | 1 |
| M2: | `\(p = 0.75\)` | 1 |
| M3: | `\(p_a = 1\)`; `\(p_b = .5\)` | 2 |
| M4: | `\(p_{a,1} = 1;\,\,\, p_{b,1} = 1 \\ p_{a,2} = 1;\,\,\, p_{b,2} = 0\)` | 4 |

**very important to keep track of the number of parameters!**
]

---

## Likelihood (of a model)

.large[

**Product** of **probabilities** of **data** given **model**.

`$$\Large {\cal L}(model) = \prod_{i = 1}^n \text{Pr}(X_i | model)$$`

- We **never** care about the **absolute** value of the likelihood!
- Only the *relative* value of the likelihood.
]

---

## Four different Squirrel Models:

Data: `$$X_{a1} = 1;\, X_{a2} = 1;\, X_{b1} = 1;\, X_{b2} = 0$$`

| model | | likelihood | |
|---|---|---|---|
| M1 | `\(p = 0.5\)` | `\({1\over2} \times {1\over2} \times {1\over2} \times {1\over2}\)` | 0.0625 |
| M2 | `\(p = 0.75\)` | `\({3\over4} \times {3\over4} \times {3\over4} \times {1\over4}\)` | 0.1055 |
| M3 | `\(p_a = 1\)`; `\(p_b = .5\)` | `\(1\times1\times{1\over2}\times{1\over2}\)` | 0.25 |
| M4 | `\(p_{a,1} = 1;\, p_{b,1} = 1; \, p_{a,2} = 1;\, p_{b,2} = 0\)` | `\(1\times1\times1\times1\)` | 1 |

`$$\cal{L}(M4) > \cal{L}(M3) > \cal{L}(M2) > \cal{L}(M1)$$`

.center[**M4** has the highest likelihood! But is this a useful model?]

---

## A(kaike) Information Criterion

A good fit is great! But it is useless if it uses too much information (too many parameters). This is *overfitting*. **One parameter per data point is TOO MANY parameters!**

.pull-left[

Hirotugu Akaike 赤池 弘次 (1927-2009)
]

.pull-right[

Simple formula:

`$$AIC = -2 \log({\cal L})+ 2k$$`

(where `\(k\)` is the number of parameters)

- Better fit = higher `\(\cal L\)` = lower AIC.
- Too complicated = more k = higher AIC.

**Lowest AIC is "best" model**
]

---

## Compute AIC

| model | likelihood | log-likelihood | k | AIC |
|---|---|---|---|---|
| **M1:** coin flip | 0.0625 | -2.77 | 1 | 7.55 |
| **M2:** proportional odds | 0.1055 | -2.25 | 1 | **6.50** |
| **M3:** island specific | 0.25 | -1.39 | 2 | 6.77 |
| **M4:** individual specific | 1 | 0 | 4 | 8 |

.center[AIC2 < AIC3 < AIC1 < AIC4

.large[Most *parsimonious* model is M2!]

**Conclusion:** not enough evidence to identify a difference between islands.
]

---

## Let's add one more observation ...
.pull-left[

### Oakie Island

]

.pull-right[

### Orange Island

]

`$$X_{a,3} = dark$$`

---

## Updated squirrel models:

model | probs | Likelihood | | k | AIC|
---|---|---|---|---|---|
M1 | `\(p = {1\over2}\)` | `\({1\over2} \times {1\over2} \times {1\over2} \times {1\over2} \times {1\over2}\)` | = 0.03125 | 1 | 8.93
M2 | `\(p = {3\over 4}\)` | `\({3\over4} \times {3\over4} \times {3\over4} \times {1\over4} \times {1\over4}\)` | = 0.02637 | 1 | 9.27
M2b | `\(p = {3\over 5}\)` | `\({3\over5} \times {3\over5} \times {3\over5} \times {2\over5} \times {2\over5}\)` | = 0.0346 | 1 | 8.73
M3 | `\(p_a = 1; p_b = .5\)` | `\(1\times1\times{1\over2}\times{1\over2}\times0\)` | = 0 (!!) | 2 | `\(\infty\)`
M3b | `\(p_a = {2\over3}; p_b = {1\over2}\)` | `\({2\over3} \times {2\over3} \times {1\over2} \times {1\over2} \times {1\over3}\)` | = 0.037 | 2 | 10.6

---

## Updated (1 parameter) squirrel models:

.pull-left[

model | probs | | `\({\cal L}\)`
---|---|---|---|
M1 | `\(p = {1\over2}\)` | `\({1\over2} \times {1\over2} \times {1\over2} \times {1\over2} \times {1\over2}\)` | 0.03125
M2 | `\(p = {3\over 4}\)` | `\({3\over4} \times {3\over4} \times {3\over4} \times {1\over4} \times {1\over4}\)` | 0.02637
M2b | `\(p = {3\over 5}\)` | `\({3\over5} \times {3\over5} \times {3\over5} \times {2\over5} \times {2\over5}\)` | 0.0346

If you sweep through all possible values of `\(p\)`, you find that `\(\widehat p = 3/5\)` leads to the highest likelihood. This is the **maximum likelihood estimate** (MLE) of the probability that a squirrel is light morph.

You can also get (good) confidence intervals from the shape of the likelihood profile: the more sharply peaked the curve, the more precise the estimate.
]

.pull-right[

### Likelihood profile

<!-- -->
]

---

# **Step II:** More Models of Sea Lion Weights

.pull-left-70[

### Null (linear) model

<!-- -->

``` r
mean(pups$Weight)
```

```
## [1] 33.51004
```

``` r
sd(pups$Weight)
```

```
## [1] 5.661695
```
]

.pull-right-30[

This suggests a model!
`$$W \sim {\cal N}(\mu = 33.5\,kg, \sigma = 5.7\,kg)$$`

With no covariates.
]

---

.pull-left-70[

# Simple linear model

*Probably* there is a relationship between length and weight. The simplest relationship is linear.

`$$\large Y \sim { N (\text{mean} = \beta_0 + \beta_1 X,\,\, \text{sd} = \sigma)}$$`

<!-- -->
]

.pull-right-30[

Steller sea lion (*Eumetopias jubatus*) pups.
]

---

.pull-left[

## Deterministic model:

`$$Y_i = \beta_0 + \beta_1 X_i$$`

- `\(\beta_0\)` - intercept
- `\(\beta_1\)` - slope

This is the **functional form of the predictor**
]

.pull-right[

## Statistical model:

**Version 1:**

`$$Y_i = \beta_0 + \beta_1 X_i + \epsilon_i$$`

where `\(\epsilon_i \sim {\cal N}(0, \sigma)\)`

or **Version 2:**

`$$Y_i \sim {\cal N}(\beta_0 + \beta_1 X_i, \sigma)$$`
]

V2 is better because it is more transparent about the number of parameters!

- Two (intercept | slope) are part of the **functional form**
- One (residual standard deviation) is part of the **random component**.

---

## But other variables might influence pup size

.pull-left-30[

Lots of competing models with different **main** and **interaction** effects.
]

.pull-right-70[
<!-- -->
]

---

## Fitting and model selection

.pull-left-70[
<table class="table" style="color: black; margin-left: auto; margin-right: auto;">
<thead>
<tr> <th style="text-align:left;"> Model </th> <th style="text-align:right;"> k </th> <th style="text-align:right;"> R2 </th> <th style="text-align:right;"> logLik </th> <th style="text-align:right;"> AIC </th> <th style="text-align:right;"> dAIC </th> </tr>
</thead>
<tbody>
<tr> <td style="text-align:left;font-weight: bold;background-color: yellow !important;"> Weight ~ Length * Sex + Island </td> <td style="text-align:right;font-weight: bold;background-color: yellow !important;"> 8 </td> <td style="text-align:right;font-weight: bold;background-color: yellow !important;"> 0.818 </td> <td style="text-align:right;font-weight: bold;background-color: yellow !important;"> -1144.6 </td> <td style="text-align:right;font-weight: bold;background-color: yellow !important;"> 2307.3 </td> <td style="text-align:right;font-weight: bold;background-color: yellow !important;"> 0.0 </td> </tr>
<tr> <td style="text-align:left;"> Weight ~ Length * Sex * Island </td> <td style="text-align:right;"> 20 </td> <td style="text-align:right;"> 0.824 </td> <td style="text-align:right;"> -1137.1 </td> <td style="text-align:right;"> 2316.1 </td> <td style="text-align:right;"> 8.8 </td> </tr>
<tr> <td style="text-align:left;"> Weight ~ Length + Sex + Island </td> <td style="text-align:right;"> 7 </td> <td style="text-align:right;"> 0.811 </td> <td style="text-align:right;"> -1155.0 </td> <td style="text-align:right;"> 2325.9 </td> <td style="text-align:right;"> 18.6 </td> </tr>
<tr> <td style="text-align:left;"> Weight ~ Length * Sex </td> <td style="text-align:right;"> 4 </td> <td style="text-align:right;"> 0.803 </td> <td style="text-align:right;"> -1164.5 </td> <td style="text-align:right;"> 2339.0 </td> <td style="text-align:right;"> 31.7 </td> </tr>
<tr> <td style="text-align:left;"> Weight ~ Length + Sex </td> <td style="text-align:right;"> 3 </td> <td style="text-align:right;"> 0.795 </td> <td style="text-align:right;"> -1174.5 </td> <td style="text-align:right;"> 2357.0 </td> <td style="text-align:right;"> 49.7 </td> </tr>
<tr> <td style="text-align:left;"> Weight ~ Length </td> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 0.779 </td> <td style="text-align:right;"> -1193.4 </td> <td style="text-align:right;"> 2392.8 </td> <td style="text-align:right;"> 85.5 </td> </tr>
<tr> <td style="text-align:left;"> Weight ~ Sex </td> <td style="text-align:right;"> 2 </td> <td style="text-align:right;"> 0.293 </td> <td style="text-align:right;"> -1483.2 </td> <td style="text-align:right;"> 2972.4 </td> <td style="text-align:right;"> 665.1 </td> </tr>
<tr> <td style="text-align:left;"> Weight ~ Island </td> <td style="text-align:right;"> 5 </td> <td style="text-align:right;"> 0.028 </td> <td style="text-align:right;"> -1562.5 </td> <td style="text-align:right;"> 3137.0 </td> <td style="text-align:right;"> 829.7 </td> </tr>
<tr> <td style="text-align:left;"> Weight ~ 1 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0.000 </td> <td style="text-align:right;"> -1569.5 </td> <td style="text-align:right;"> 3143.1 </td> <td style="text-align:right;"> 835.8 </td> </tr>
</tbody>
</table>
]

.pull-right-30[

This is what we expect ... the interaction between **sex** and **length** is consistent across islands, but there are some main effect differences across islands (mainly because of the time we sampled).
]

---

## Model selection vs. parameter estimates

The best model:

`$$Y_{ijk} = \beta_{sex} + \beta_{island} \text{Island}_{ijk} + (\beta_{length} \times\text{Length}_{ijk}) + (\beta_{sex:length} \times\text{Length}_{ijk}) + \epsilon_{ijk}$$`

.pull-left[

What are the **parameter estimates** (effect sizes) of the selected model?
.small[

|term           | estimate| std.error| statistic| p.value|
|:--------------|--------:|---------:|---------:|-------:|
|SexFemale      |   -39.34|      3.69|    -10.67|    0.00|
|SexMale        |   -59.29|      3.06|    -19.37|    0.00|
|Length         |     0.66|      0.03|     19.11|    0.00|
|IslandChirpoev |    -2.00|      0.34|     -5.81|    0.00|
|IslandLovushki |    -0.47|      0.34|     -1.35|    0.18|
|IslandRaykoke  |    -0.45|      0.35|     -1.31|    0.19|
|IslandSrednova |    -0.35|      0.35|     -1.00|    0.32|
|SexMale:Length |     0.20|      0.04|      4.55|    0.00|

]]

.pull-right[
<!-- -->
]
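
---

## Sketch: fit, select, summarize

The workflow behind the last two slides (fit several candidate models, rank them by AIC, then read off the parameter estimates of the best one) can be sketched as below. This is a minimal sketch on **simulated** data, not the real `pups` measurements: the data frame `sim_pups`, the island labels, the sample size, and the effect sizes used in the simulation are all assumptions made for illustration, so the numbers it prints will not match the tables above.

```r
# Simulated stand-in for the pup data (hypothetical values, NOT the real `pups`)
set.seed(390)
n <- 400
sim_pups <- data.frame(
  Sex    = factor(sample(c("Female", "Male"), n, replace = TRUE)),
  Island = factor(sample(paste0("Island", 1:5), n, replace = TRUE)),  # made-up labels
  Length = runif(n, min = 95, max = 125)                              # cm
)

# Assumed "true" effects: sex-specific intercept and slope, plus island shifts
male          <- sim_pups$Sex == "Male"
island_effect <- c(0, -2, -0.5, -0.5, -0.4)[as.integer(sim_pups$Island)]
sim_pups$Weight <- ifelse(male, -59, -39) +
  ifelse(male, 0.86, 0.66) * sim_pups$Length +
  island_effect + rnorm(n, mean = 0, sd = 3)

# Candidate models, written as formulas
formulas <- c("Weight ~ 1", "Weight ~ Length", "Weight ~ Length + Sex",
              "Weight ~ Length * Sex", "Weight ~ Length * Sex + Island")
fits <- lapply(formulas, function(f) lm(as.formula(f), data = sim_pups))
names(fits) <- formulas

# Model selection table; here k counts regression coefficients only
aic_table <- data.frame(
  k      = sapply(fits, function(m) length(coef(m))),
  logLik = sapply(fits, function(m) as.numeric(logLik(m))),
  AIC    = sapply(fits, AIC)
)
aic_table$dAIC <- aic_table$AIC - min(aic_table$AIC)
aic_table <- aic_table[order(aic_table$AIC), ]
print(round(aic_table, 1))

# Parameter estimates (effect sizes) of the lowest-AIC model
best <- fits[[rownames(aic_table)[1]]]
print(round(summary(best)$coefficients, 2))
```

One detail worth knowing: for `lm` fits, R's `AIC()` counts the residual standard deviation as an estimated parameter, so `AIC = -2 * logLik + 2 * (k + 1)` when `k` counts regression coefficients only. The AIC values in the model selection table above appear to follow the same convention.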