The following examples should give you a first look at what R does and how it works.
R is a command-line program, which means commands are entered
line-by-line at the prompt. Being a programming language it is very
finicky. Everything has to be entered exactly right, including upper
and lower case. So, a Plot entry is different from plot!
There are two ways of entering commands (telling R to do a certain
thing): either typing them out carefully into the “Console Window” (the
lower-left window in RStudio) and hitting Enter, or writing and editing
lines in the script window (upper-left window in RStudio) and “passing”
the code into the console by hitting Ctrl+Enter.
In general, it is better to do all of your coding in a script window,
and then save the raw code file as a text document, which you can
revisit and re-run at any point later. To create a new R script
document, go to the upper-left corner and choose File - New File - R Script,
or press Ctrl + Shift + N.
1+2
## [1] 3
3^6
## [1] 729
sqrt((20-19)^2 + (19-19)^2 + (19-18)^2)/2
## [1] 0.7071068
12345*54312
## [1] 670481640
The assignment operator is <-. It’s supposed to look like an arrow
pointing left (the shortcut for entering it in RStudio is Alt + -).
X <- 5 # sets X equal to 5
Using the assignment operator sets the value of X but doesn’t print
any output. To see what X is, you need to type:
X
## [1] 5
Note that X now appears in the upper-right panel of RStudio (the
“Environment”), letting you know that there is now an object in memory
called X.
Now, you can use X as if it were a number:
X*2
## [1] 10
X^X
## [1] 3125
Note that you can name a variable almost anything, as long as the name starts with a letter and contains only letters, numbers, periods, and underscores.
Fred <- 5
Nancy <- Fred*2
Fred + Nancy
## [1] 15
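As a small illustration of the naming rule and of case-sensitivity, here is a sketch using made-up names (fred and sea.lion_count2 are hypothetical, chosen only for this example):

```r
Fred <- 5
fred <- 10               # a different object from Fred: R is case-sensitive
total <- Fred + fred     # uses both objects
total

sea.lion_count2 <- 126   # letters, numbers, periods and underscores are all allowed
# 2fred <- 1             # would be an error: a name cannot start with a number
```

Both Fred and fred should now be listed separately in the Environment panel.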
Obviously, X can be many more things than just a single number. The
most important kind of object in R is a “vector”, which is a series of
values (and therefore resembles “data”).
c() is a function: a very useful function that combines its arguments
into a “vector”. In all functions, arguments are passed within
parentheses.
We can use the c() function as follows:
X <- c(3,4,5) # sets X equal to the vector (3,4,5)
X
## [1] 3 4 5
Now, let’s do some arithmetic with this vector:
X + 1
## [1] 4 5 6
X*2
## [1] 6 8 10
X^2
## [1] 9 16 25
((X+X^2/2)/X)^2
## [1] 6.25 9.00 12.25
Let’s say you had 5 people count all the sea lions on a rookery. Here are their counts:
counts <- c(150,125,105,110,140)
The average - or arithmetic mean - of these counts gives us a good point estimate of the number of sea lions at this location. This is defined (in math terms) as:
\[\widehat{N} = {1\over k} \sum_{i=1}^k N_i\]
In this notation, \(N_i\) is the \(i\)’th count, where \(i \in \{1, 2, \ldots, k\}\), i.e. there are \(k\) counts.
There are two ways to do this in R. You can decompose everything:
k <- length(counts)
sum(counts)/k
## [1] 126
Or just use the mean() function:
mean(counts)
## [1] 126
What we ultimately want is the standard error of this estimate. But first, we have to go through its close cousin, the standard deviation.
The difference between these two is confusing and important. Here are some bullet points:
- The standard deviation (SD) is a measure of the total spread of the data.
- The standard error (SE) is a measure of the precision of an estimate. You need it for constructing confidence intervals.
- The standard deviation is (basically) always estimated with the same formula: \[\text{SD}(N) = \sqrt{\frac{1}{k-1}\sum_{i= 1}^k \left(N_i - \widehat{N} \right)^2}\]
- The standard error is estimated in lots of different ways, depending entirely on the kind of estimate you’re computing it for. In this example (the standard error of an estimate of the mean), its formula would be:
\[\text{SE}(\widehat{N}) = \frac{1}{\sqrt{k}} \text{SD}(N)\]
There’s kind of a lot going on in the standard deviation formula above! But it is easy to pick apart the pieces in R.
counts - mean(counts)
## [1] 24 -1 -21 -16 14
Some are positive, some negative, and they add up to 0.
sum((counts - mean(counts))^2)
## [1] 1470
That’s a big number!
sqrt(sum((counts - mean(counts))^2) / (k-1))
## [1] 19.17029
Voilà! The standard deviation.
Of course all of that can be done in one quick command in R:
count.sd <- sd(counts)
count.sd
## [1] 19.17029
Super fast and convenient! But isn’t it nice to know how the sausage is made?
Notice also that this time we saved the output into an object called count.sd.
Once we have the SD, it’s easy to get the SE:
count.se <- count.sd / sqrt(k)
count.se
## [1] 8.573214
So now, if you want to report a confidence interval (\(\widehat{N} \pm 2 \times \text{SE}(\widehat{N})\), which is approximately a 95% interval), you can do that like this:
mean(counts) + c(-2,2)*count.se
## [1] 108.8536 143.1464
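If you expect to repeat these steps for several sets of counts, one option is to wrap them in a small function. This is only a sketch: count.summary is a hypothetical name, not a built-in R function, and it simply bundles the mean, SE and ± 2 SE interval computed above.

```r
# Hypothetical helper (not part of base R): bundles the steps shown above
count.summary <- function(counts) {
  k <- length(counts)                # number of counts
  estimate <- mean(counts)           # point estimate
  se <- sd(counts) / sqrt(k)         # standard error of the mean
  ci <- estimate + c(-2, 2) * se     # estimate plus or minus 2 standard errors
  list(estimate = estimate, se = se, ci = ci)
}

counts <- c(150, 125, 105, 110, 140)
result <- count.summary(counts)
result$estimate
result$se
result$ci
```

The values returned should match the ones obtained step by step above.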
Use the formulas and code above to compute the point estimate and standard error around the 3 flag counts from your group. Report: (1) the point estimate; (2) the standard error; (3) the confidence interval.
Data most often come as multiple vectors of the same length. If we
create a second vector Y, we can combine it with our first vector X
using the data.frame() command. Both vectors then become columns in a
new data frame!
Y <- c(1,2,3)
data.frame(X,Y)
Running that command as a single line just outputs the data and allows us to look at it. To perform operations with it, you should save it as another object:
mydata <- data.frame(X,Y)
A data frame has columns with names:
ncol(mydata) # ncol() gives us the number of columns that this data frame has
## [1] 2
names(mydata) # names() lists all column names that this data frame has
## [1] "X" "Y"
A column can be extracted (or called) from a data frame with a $:
mydata$X
## [1] 3 4 5
mydata$Y
## [1] 1 2 3
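Because extracted columns are ordinary vectors, they can be used in arithmetic, and the same $ notation can be used to add a column. A minimal sketch, where Z is a hypothetical column name introduced only for illustration:

```r
X <- c(3, 4, 5)
Y <- c(1, 2, 3)
mydata <- data.frame(X, Y)

nrow(mydata)                      # number of rows (the length of each vector)
mydata$Z <- mydata$X * mydata$Y   # new column: element-by-element product of X and Y
mydata                            # now has three columns: X, Y and Z
```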
The following examples should explain how to import data frames and to work with the data contained within them.
We will use Steller sea lion (Eumetopias jubatus) data as an example. These are weights, lengths, and girths (measured, basically, around the body just under the flipper pits) of sea lion pups, taken about two months after birth as part of a tagging mark-recapture study. These data were collected (in part by Dr. Gurarie) on five islands in the North Pacific.
This is what sea lion pups look like:
This dataset is available on Blackboard as SeaLions.csv, or at this
link. Once you download it, you can use the File Explorer to
determine its location and read it into R in a couple of ways:
SeaLions <- read.csv("insert the directory instead of this sentence/SeaLions.csv")
A directory is another way to refer to a folder or, simply, the
location of a data file on your computer. You can get the address of the
directory if you open the folder where you saved the file in File
Explorer, right-click on the navigation bar and select the
Copy address as text option. Note: If you copy and paste the file
directory in, you have to change the direction of the slashes
from \ to /!
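If you would rather not fix the slashes by hand, R can do the replacement for you. This is a minimal sketch: the folder shown is a made-up location, not where your file actually is. Inside an R string a single backslash has to be typed as \\, which is why a pasted Windows address does not work as-is.

```r
# Hypothetical Windows-style address; replace it with your own folder
win.path <- "C:\\Users\\me\\Documents\\SeaLions.csv"

# Swap every backslash for a forward slash
# (fixed = TRUE treats the pattern as plain text rather than a regular expression)
r.path <- gsub("\\", "/", win.path, fixed = TRUE)
r.path
```

The resulting string could then be passed to read.csv() in place of the placeholder path above.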
Note that csv is a text-based file type (Comma Separated Values) - it
just means that commas between entries indicate separate columns. When a
program “reads” the file, it “knows” that a comma means the end of one
column and the start of another one. You can save any Excel file as a
csv using the Save As function. CSVs are by far the most common and
convenient file type used for loading into R.
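To see how commas turn into columns, you can hand read.csv() a short piece of csv text directly instead of a file. This is only a sketch: the column names Observer and Count and the observer labels are made up for the example (the counts are the first three from earlier).

```r
# A tiny csv written out as text: the first line holds the column names,
# each following line is one row, and commas separate the columns
csv.text <- "Observer,Count
A,150
B,125
C,105"

demo <- read.csv(text = csv.text)   # the text argument reads from a string instead of a file
demo
names(demo)
```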
Alternatively, you can load the file into R using the RStudio
point-and-click interface. To do this:
- Navigate to the Files tab in the bottom right corner of RStudio
- Click on SeaLions.csv
- RStudio will prompt you to either view the file or import the dataset. You want to import, so hit Import File
- A pop-up window will appear, showing you a preview of the data frame. Click Import and observe that your file is now loaded - it should have appeared in your Environment in the top right corner of RStudio.
This method does the same thing as the line of code above. It will automatically enter the proper code into the console and save your data to the environment. Note that by default the new object will have the same name as the file, rather than a name you choose for it.
Look at some properties of this data file, with the following functions:
is(SeaLions) # tells us what type of object we have
## [1] "data.frame" "list" "oldClass" "vector"
names(SeaLions) # tells us the names of all the columns
## [1] "Island" "Weight" "Length" "Girth" "Sex"
head(SeaLions) # shows the first several rows of the dataframe
Use a $
to extract a given column:
Length <- SeaLions$Length
Weight <- SeaLions$Weight
Island <- SeaLions$Island
Sex <- SeaLions$Sex
Some basic summary statistics include:
range(Length) # range
## [1] 93 126
median(Length) # median
## [1] 110
mean(Length) # mean
## [1] 109.8434
var(Length) # variance
## [1] 34.82854
sd(Length) # standard deviation
## [1] 5.901571
A histogram (invoked by the hist() command) can show us the
distribution of a single continuous variable:
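With the data loaded as above, the call would simply be hist(Length). The sketch below runs without the file by using simulated stand-in values: sim.length is a hypothetical vector of random numbers, not the real measurements, so the shape of the plot will differ from the real one.

```r
# Simulated stand-in for the Length column (NOT the real SeaLions data)
set.seed(1)
sim.length <- rnorm(100, mean = 110, sd = 6)

# Draw the histogram; xlab and main set the axis label and the title
h <- hist(sim.length, xlab = "Length", main = "Histogram of pup length", col = "grey")
```

hist() also returns the bin breaks and counts it used, stored here in h, in case you want to inspect them.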
Produce a histogram of the Weight of the sea lion pups. Also, report the minimum, maximum and mean weight of the sea lion pups.
A boxplot shows us relationships between a continuous variable (like Length/Weight/Girth) and a discrete variable (like Island/Sex):
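With the real data, a call along the lines of boxplot(Length ~ Sex, data = SeaLions) would draw one box per sex. The sketch below uses a simulated stand-in data frame instead: sim.pups, its "F" and "M" codes and its values are all assumptions made for illustration, so it says nothing about the real pups.

```r
# Simulated stand-in data (NOT the real SeaLions data)
set.seed(2)
sim.pups <- data.frame(
  Sex = rep(c("F", "M"), each = 50),
  Length = c(rnorm(50, mean = 108, sd = 5), rnorm(50, mean = 112, sd = 5))
)

# The formula "Length ~ Sex" reads as "Length split up by Sex"
b <- boxplot(Length ~ Sex, data = sim.pups, ylab = "Length", col = c("orange", "lightblue"))
```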
Which sex is larger!?
Produce a boxplot of the weight of the sea lions against Island. Do you think there is a significant difference among islands?
Finally, a scatterplot shows us relationships between two continuous variables:
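With the columns extracted earlier, the call would be plot(Length, Weight). The sketch below again uses simulated stand-in vectors: sim.length and sim.weight are hypothetical, and the relationship between them is invented purely so there is something to plot.

```r
# Simulated stand-in vectors (NOT the real SeaLions data)
set.seed(3)
sim.length <- rnorm(100, mean = 110, sd = 6)
sim.weight <- -30 + 0.6 * sim.length + rnorm(100, mean = 0, sd = 3)

# First argument goes on the x-axis, second on the y-axis
plot(sim.length, sim.weight, xlab = "Length", ylab = "Weight")

r <- cor(sim.length, sim.weight)   # correlation: one-number summary of the linear relationship
r
```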