Goals
- Load data
- Understand continuous and discrete data
- summary statistics and visualizations
- Fit and analyze a simple linear model
There are 4 Exercises in the lab that need to be submitted on Blackboard.
The following examples should explain how to import data frames and to work with the data contained within them.
We will use Steller sea lion (Eumotopias jubatus) data as an example. These are weights, lengths, and girths (basically, under the arm/flipper pits) of sea lion pups about two months after birth as part of a tagging mark-recapture study. These data were collected (in part by Dr. Gurarie) on five islands in the North Pacific.
This is what sea lion pups look like:
This dataset is available on Blackboard
as SeaLions.csv
, or at this
link. Once you download it, you can use the File Explorer to
determine its location and read it into R in a couple of ways:
SeaLions <- read.csv("insert the directory instead of this sentence/SeaLions.csv")
A directory is another way to refer to a folder or, simply, a
location of a data file on your computer. You can get the address of the
directory if you open the folder where you saved the file through File
Explorer, right-click on the navigation bar and select
Copy address as text
option. Note: If you copy and paste
the file directory in, you have to change the direction of the
slashes from \
to /
!
Note that csv
is a text based file type
(Comma Separated Values) - it just means that commas between entries
indicate separate columns. When a program “reads” the file, it “knows”
that a comma means the end of one column and the start of another one.
You can save any Excel file as a csv
using the Save
As function. CSVs are by far the the most common and convenient
file type used for loading into R.
R
using the
RStudio
point-and-click interface. To do this:Files
tab in the bottom right corner of
RStudioSeaLions.csv
Import File
Import
and observe that your file is now
loaded - it should have appeared in your Environment in the top right
corner of RStudio.This method does the same exact thing as the line of code above. It will automatically input the proper code into the console and save your file to the environment. Note that by default the file will have the same name rather than a name you designate for it.
Look at some properties of this data file, with the following functions:
is(SeaLions) # tells what type of files we have
## [1] "data.frame" "list" "oldClass" "vector"
names(SeaLions) # tells us the names of all the columns
## [1] "Island" "Weight" "Length" "Girth" "Sex"
head(SeaLions) # shows the first several rows of the dataframe
Use a $
to extract a given column:
Length <- SeaLions$Length
Weight <-SeaLions$Weight
Island <- SeaLions$Island
Sex <- SeaLions$Sex
Just about all data can be broken into categorical data or continuous data. Continuous data is numeric, and you can calculate a bunch of familiar statistics (means, ranges, medians, etc.)
range(Length) # range
## [1] 93 126
median(Length) # median
## [1] 110
mean(Length) # mean
## [1] 109.8434
var(Length) # variance
## [1] 34.82854
sd(Length) # standard deviation
## [1] 5.901571
Categorical data comes in discrete, countable sets of options, called
factors or levels. The best statistic
for categorical data is just to count them. That’s done with the super
useful table()
function:
table(Sex)
## Sex
## F M
## 226 272
table(Island)
## Island
## Antsiferov Chirpoev Lovushki Raykoke Srednova
## 100 99 100 100 99
table(Island, Sex)
## Sex
## Island F M
## Antsiferov 38 62
## Chirpoev 43 56
## Lovushki 50 50
## Raykoke 49 51
## Srednova 46 53
Categorical and Continuous data are illustrated and analyzed in different ways.
Continuous data contains a lot of different values.
The best way to see the spread (distribution) of those data is
with a histogram - the hist()
functino:
Exercise 1: Produce a histogram of the Weight of the sea lion pups. Also, report the minimum, maximum and mean weight of the sea lion puops. Copy and paste that figure into a Word or Google document
For categorical data, you want to represent the counts. The best choices are bar charts:
barplot(table(Sex))
barplot(table(Island))
Or pie charts (which are very rarely used in scientific analyses).
pie(table(Island))
> Exercise 2: Just enter the following code to make a barplot of both sex and island. Copy paste the figure into a Word / Google document.
barplot(table(Island, Sex))
A boxplot shows us relationships between a continuous variable (like Length/Weight/Girth) and a discrete variable (like Island/Sex). The box contains the median, and 25% and 75% quantiles and shows points that are outside of an “extreme” range of 1.5 x the resepective interquantile range:
Which sex do you think is generally longer!?
Exercise 3: Produce a boxplot of the weight of the sea lions against Islands. Do you think there is a significant difference against Islands?
Finally, a scatterplot shows us relationships between two continuous variables. It is the easiest plot:
plot(Length, Weight)
At a basic level, we want to know what is the relationship between Weight and Length of sea lion pups. Specifically, for each cm of length, how much more (on average ) does a sea lion weigh.
In math, that’s a straightforward equation:
\[Y_i = \alpha + \beta X_i + \epsilon_i\]
Where:
Fitting a linear model is done with the simple (and powerful)
lm()
function. All we have to do is state which is the
response and which is the predictor using the R
notation Y ~ X
.
The output of this function is complicated, so weĺl save it as a new object:
Weight.lm <- lm(Weight ~ Length)
The output of just the linear model is:
Weight.lm
##
## Call:
## lm(formula = Weight ~ Length)
##
## Coefficients:
## (Intercept) Length
## -59.5108 0.8468
Which gives us an intercept and a slope:
What are the units of this slope!?
This is done with the abline()
function:
plot(Length, Weight)
abline(Weight.lm)
You can make plots prettier in lots of different of ways!
plot(Length, Weight, pch = 19, cex =0.5)
abline(Weight.lm, col = "red", lwd = 2)
This is nice! But is the effect “significant”? That means - is there an actual relationship between the two? One way to test this is to look at confidence intervals, which are related to the standard deviation, and at p-values.
All of this information is in a higher level of summary information provided by the summary command:
summary(Weight.lm)
##
## Call:
## lm(formula = Weight ~ Length)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.6832 -1.7238 0.0668 1.9294 7.7447
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -59.51078 2.22651 -26.73 <2e-16 ***
## Length 0.84685 0.02024 41.84 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.663 on 496 degrees of freedom
## Multiple R-squared: 0.7792, Adjusted R-squared: 0.7788
## F-statistic: 1750 on 1 and 496 DF, p-value: < 2.2e-16
There is lots to unpack here. But basically, we look straight at the
Length
row. The standard error is 0.02, so the confidence
interval is \(0.85 \pm 2 \times 0.02 =
\{0.81-0.89\}\). That is a very narrow confidence interval! Very
far from “0”. The p-value column tells us what is the
probability that there is NO relationship between length and
weight?. That probability is VERY VERY VERY VERY low. So we can
make a statement like: There is a
significant relationship between length and
weight of sea lion pups.*
Here’s an important question: Do male and female show the same length-weight relationship? We know that Steller sea lions are extremely sexually dimorphic (adult males are ~1000 kg, adult females closet to ~400 kg). Is that dimorphism expressed at an early stage as well?
To answer these questions, we need to learn how to subset the data.
This is done with square brackets []
, and
logical comparisons (like equal to: ==
or
greater than >
, etc.).
So, to pick out the length of the males and the females, you’d do
Weight.Male <- Weight[Sex == "M"]
Weight.Female <- Weight[Sex == "F"]
Now you can compare those means and standard deviations:
mean(Weight.Male)
## [1] 36.30074
mean(Weight.Female)
## [1] 30.15133
As a brief aside, if you want to statistically test whether there is a difference in the weight of male and female pups, you can do:
t.test(Weight.Male, Weight.Female)
##
## Welch Two Sample t-test
##
## data: Weight.Male and Weight.Female
## t = 14.76, df = 487.03, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 5.330777 6.968038
## sample estimates:
## mean of x mean of y
## 36.30074 30.15133
OR, also fit a linear model:
summary(lm(Weight~Sex))
##
## Call:
## lm(formula = Weight ~ Sex)
##
## Residuals:
## Min 1Q Median 3Q Max
## -20.8007 -3.1513 0.3487 3.1993 12.1993
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 30.1513 0.3170 95.12 <2e-16 ***
## SexM 6.1494 0.4289 14.34 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.765 on 496 degrees of freedom
## Multiple R-squared: 0.293, Adjusted R-squared: 0.2916
## F-statistic: 205.6 on 1 and 496 DF, p-value: < 2.2e-16
Lots of good information here we can review in class.
You now have all the tools to answer the question!
Exercise 4: Answer the question Is the length-weight relationship of male and female pups different? by following these steps:
- Get the male and female Lengths and Weights
- Fit two linear models
- Plot the two sets of points and the fitted lines
- Report the estimated slope coefficient, the standard error.
- Compute the Confidence Intervals
- Do those confidence intervals overlap?