Goals

  1. Load data
  2. Understand continuous and discrete data
  3. summary statistics and visualizations
  4. Fit and analyze a simple linear model

There are 4 Exercises in the lab that need to be submitted on Blackboard.

1 Loading and working with data frames

The following examples should explain how to import data frames and to work with the data contained within them.

1.1 Loading Data

We will use Steller sea lion (Eumotopias jubatus) data as an example. These are weights, lengths, and girths (basically, under the arm/flipper pits) of sea lion pups about two months after birth as part of a tagging mark-recapture study. These data were collected (in part by Dr. Gurarie) on five islands in the North Pacific.

This is what sea lion pups look like:

photo from NOAA fisheries

This dataset is available on Blackboard as SeaLions.csv, or at this link. Once you download it, you can use the File Explorer to determine its location and read it into R in a couple of ways:

  1. From the command line: you can download the dataset and modify the following line of code:
SeaLions <- read.csv("insert the directory instead of this sentence/SeaLions.csv")

A directory is another way to refer to a folder or, simply, a location of a data file on your computer. You can get the address of the directory if you open the folder where you saved the file through File Explorer, right-click on the navigation bar and select Copy address as text option. Note: If you copy and paste the file directory in, you have to change the direction of the slashes from \ to /!

Note that csv is a text based file type (Comma Separated Values) - it just means that commas between entries indicate separate columns. When a program “reads” the file, it “knows” that a comma means the end of one column and the start of another one. You can save any Excel file as a csv using the Save As function. CSVs are by far the the most common and convenient file type used for loading into R.

  1. Alternatively, you can import datasets into R using the RStudio point-and-click interface. To do this:
    1. Navigate to the Files tab in the bottom right corner of RStudio
    1. Click on SeaLions.csv
    1. RStudio will prompt you to either view the file or import the dataset. You want to import, so hit Import File
    1. A pop-up window will appear, showing you the preview of the data frame. Click Import and observe that your file is now loaded - it should have appeared in your Environment in the top right corner of RStudio.

This method does the same exact thing as the line of code above. It will automatically input the proper code into the console and save your file to the environment. Note that by default the file will have the same name rather than a name you designate for it.

1.2 Working with data frames

Look at some properties of this data file, with the following functions:

is(SeaLions) # tells what type of files we have
## [1] "data.frame" "list"       "oldClass"   "vector"
names(SeaLions) # tells us the names of all the columns
## [1] "Island" "Weight" "Length" "Girth"  "Sex"
head(SeaLions) # shows the first several rows of the dataframe

Use a $ to extract a given column:

Length <- SeaLions$Length
Weight <-SeaLions$Weight
Island <- SeaLions$Island
Sex <- SeaLions$Sex

2 Summary Statistics

Just about all data can be broken into categorical data or continuous data. Continuous data is numeric, and you can calculate a bunch of familiar statistics (means, ranges, medians, etc.)

range(Length) # range
## [1]  93 126
median(Length) # median
## [1] 110
mean(Length) # mean
## [1] 109.8434
var(Length) # variance
## [1] 34.82854
sd(Length) # standard deviation
## [1] 5.901571

Categorical data comes in discrete, countable sets of options, called factors or levels. The best statistic for categorical data is just to count them. That’s done with the super useful table() function:

table(Sex)
## Sex
##   F   M 
## 226 272
table(Island)
## Island
## Antsiferov   Chirpoev   Lovushki    Raykoke   Srednova 
##        100         99        100        100         99
table(Island, Sex)
##             Sex
## Island        F  M
##   Antsiferov 38 62
##   Chirpoev   43 56
##   Lovushki   50 50
##   Raykoke    49 51
##   Srednova   46 53

3 Graphical Summaries

Categorical and Continuous data are illustrated and analyzed in different ways.

3.1 Continuous data

Continuous data contains a lot of different values. The best way to see the spread (distribution) of those data is with a histogram - the hist() functino:

Exercise 1: Produce a histogram of the Weight of the sea lion pups. Also, report the minimum, maximum and mean weight of the sea lion puops. Copy and paste that figure into a Word or Google document

3.2 Categorical data

For categorical data, you want to represent the counts. The best choices are bar charts:

barplot(table(Sex))
barplot(table(Island))

Or pie charts (which are very rarely used in scientific analyses).

pie(table(Island))

> Exercise 2: Just enter the following code to make a barplot of both sex and island. Copy paste the figure into a Word / Google document.

barplot(table(Island, Sex))

3.3 Boxplot

A boxplot shows us relationships between a continuous variable (like Length/Weight/Girth) and a discrete variable (like Island/Sex). The box contains the median, and 25% and 75% quantiles and shows points that are outside of an “extreme” range of 1.5 x the resepective interquantile range:

Which sex do you think is generally longer!?

Exercise 3: Produce a boxplot of the weight of the sea lions against Islands. Do you think there is a significant difference against Islands?

3.4 Scatterplot

Finally, a scatterplot shows us relationships between two continuous variables. It is the easiest plot:

plot(Length, Weight)

4 Linear models

At a basic level, we want to know what is the relationship between Weight and Length of sea lion pups. Specifically, for each cm of length, how much more (on average ) does a sea lion weigh.

In math, that’s a straightforward equation:

\[Y_i = \alpha + \beta X_i + \epsilon_i\]

Where:

  • \(i\) is the individual (i.e. each row in the data frame)
  • \(Y\) is the response variable (here: weight)
  • \(X\) is the predictor variable (here: length)
  • \(\alpha\) is the intercept
  • \(\beta\) is the slope, more generally: the regression coefficient
  • \(\epsilon\) is the random bit that’s “unexplained”

4.1 Fitting a linear model

Fitting a linear model is done with the simple (and powerful) lm() function. All we have to do is state which is the response and which is the predictor using the R notation Y ~ X.

The output of this function is complicated, so weĺl save it as a new object:

Weight.lm <- lm(Weight ~ Length)

The output of just the linear model is:

Weight.lm
## 
## Call:
## lm(formula = Weight ~ Length)
## 
## Coefficients:
## (Intercept)       Length  
##    -59.5108       0.8468

Which gives us an intercept and a slope:

  • The intercept \(\alpha = -59\) but we almost never care about the intercept
  • The slope is \(\beta = 0.85\).

What are the units of this slope!?

4.2 Plotting a linear model

This is done with the abline() function:

plot(Length, Weight)
abline(Weight.lm)

You can make plots prettier in lots of different of ways!

plot(Length, Weight, pch = 19, cex =0.5)
abline(Weight.lm, col = "red", lwd = 2)

4.3 Significance of linear model

This is nice! But is the effect “significant”? That means - is there an actual relationship between the two? One way to test this is to look at confidence intervals, which are related to the standard deviation, and at p-values.

All of this information is in a higher level of summary information provided by the summary command:

summary(Weight.lm)
## 
## Call:
## lm(formula = Weight ~ Length)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.6832 -1.7238  0.0668  1.9294  7.7447 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -59.51078    2.22651  -26.73   <2e-16 ***
## Length        0.84685    0.02024   41.84   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.663 on 496 degrees of freedom
## Multiple R-squared:  0.7792, Adjusted R-squared:  0.7788 
## F-statistic:  1750 on 1 and 496 DF,  p-value: < 2.2e-16

There is lots to unpack here. But basically, we look straight at the Length row. The standard error is 0.02, so the confidence interval is \(0.85 \pm 2 \times 0.02 = \{0.81-0.89\}\). That is a very narrow confidence interval! Very far from “0”. The p-value column tells us what is the probability that there is NO relationship between length and weight?. That probability is VERY VERY VERY VERY low. So we can make a statement like: There is a significant relationship between length and weight of sea lion pups.*

5 Analyzing sex differences

Here’s an important question: Do male and female show the same length-weight relationship? We know that Steller sea lions are extremely sexually dimorphic (adult males are ~1000 kg, adult females closet to ~400 kg). Is that dimorphism expressed at an early stage as well?

5.1 Subsetting data

To answer these questions, we need to learn how to subset the data. This is done with square brackets [], and logical comparisons (like equal to: == or greater than >, etc.).

So, to pick out the length of the males and the females, you’d do

Weight.Male <- Weight[Sex == "M"]
Weight.Female <- Weight[Sex == "F"]

Now you can compare those means and standard deviations:

mean(Weight.Male)
## [1] 36.30074
mean(Weight.Female)
## [1] 30.15133

5.2 Statistical test of comparison

As a brief aside, if you want to statistically test whether there is a difference in the weight of male and female pups, you can do:

t.test(Weight.Male, Weight.Female)
## 
##  Welch Two Sample t-test
## 
## data:  Weight.Male and Weight.Female
## t = 14.76, df = 487.03, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  5.330777 6.968038
## sample estimates:
## mean of x mean of y 
##  36.30074  30.15133

OR, also fit a linear model:

summary(lm(Weight~Sex))
## 
## Call:
## lm(formula = Weight ~ Sex)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -20.8007  -3.1513   0.3487   3.1993  12.1993 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  30.1513     0.3170   95.12   <2e-16 ***
## SexM          6.1494     0.4289   14.34   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.765 on 496 degrees of freedom
## Multiple R-squared:  0.293,  Adjusted R-squared:  0.2916 
## F-statistic: 205.6 on 1 and 496 DF,  p-value: < 2.2e-16

Lots of good information here we can review in class.

5.3 Perform the analysis

You now have all the tools to answer the question!

Exercise 4: Answer the question Is the length-weight relationship of male and female pups different? by following these steps:

  1. Get the male and female Lengths and Weights
  2. Fit two linear models
  3. Plot the two sets of points and the fitted lines
  4. Report the estimated slope coefficient, the standard error.
  5. Compute the Confidence Intervals
  6. Do those confidence intervals overlap?

Sleepy family (my picture!)