Part 1: Basics

The following examples should give you a first look at what R does and how it works.

Introduction

R is a command-line program, which means commands are entered line-by-line at the prompt. Being a programming language it is very finicky. Everything has to be entered exactly right - including case-sensitivity. So, a Plot entry is different from plot!

There are two ways of entering commands (telling R to do a certain thing): either typing them out carefully into the “Console Window” (the lower-left window in Rstudio) and hitting Enter or writing and editing lines in the script window (upper-left window in Rstudio), and “passing” the code into the console by hitting Ctrl+Enter.

In general, it is better to do all of your coding in a script window, and then save the raw code file as a text document, which you can revisit and re-run at any point later. To create a new R script document, go to the upper-left corner, press File - New file - R Script or press Ctrl + Shift + N

R is a calculator

1+2
## [1] 3
3^6
## [1] 729
sqrt((20-19)^2 + (19-19)^2 + (19-18)^2)/2
## [1] 0.7071068
12345*54312
## [1] 670481640

Assigning variable names

The assignment operator is <-. It’s supposed to look like an arrow pointing left (the shortcut for entering it is Alt + -).

X <- 5      # sets X equal to 5 

Using the assignment operator sets the value of X but doesn’t print any output. To see what X is, you need to type:

X
## [1] 5

Note that X now appears in the upper-right panel of Rstudio, letting you know that there is now an object in memory (also called the “Environment”) called X.

Now, you can use X as if it were a number

X*2
## [1] 10
X^X
## [1] 3125

Note that you can name a variable ANYTHING, as long as it starts with a letter.

Fred <- 5
Nancy <- Fred*2
Fred + Nancy
## [1] 15

Vectors

Obviously, X can be many things more than just a single number. The most important kind of object in R is a “vector”, which is a series of inputs (and therefore resembles “data”).

c() is a function - a very useful function that creates “vectors”. In all functions, arguments are passed within parentheses.

We can use the c() function as follows:

X <- c(3,4,5)   # sets X equal to the vector (3,4,5)
X
## [1] 3 4 5

Now, let’s do some arithmetic with this vector:

X + 1
## [1] 4 5 6
X*2
## [1]  6  8 10
X^2
## [1]  9 16 25
((X+X^2/2)/X)^2
## [1]  6.25  9.00 12.25

Some simple estimates of abundance with standard errors

Let’s say you had 5 people count all the sea lions on a rookery. Here are their counts:

counts <- c(150,125,105,110,140)

The average - or arithemtic mean - of these counts gives us a good point estimate of sea lions at this location. This is defined (in math terms) as:

\[\widehat{N} = {1\over k} \sum_{i=1}^k N_i\]

Two ways to do this in R. Decompose everything:

k <- length(counts)
sum(counts)/k
## [1] 126

Or just use the mean() function:

mean(counts)
## [1] 126

A little trickier is the standard error of this estimate. Remember, the standard error of an estimate is the standard deviation of all the estimates. The standard deviation is a measure of the spread around the mean. A formula for it is given by:

\[\text{SE}(\widehat{N}) = \sqrt{\frac{1}{k-1}\sum_{i= 1}^k \left(N_i - \widehat{N} \right)^2}\]

There’s kind of a lot going on here! But easy to pick apart in R.

  1. The difference between the observations and the means (the deviations) are:
counts - mean(counts)
## [1]  24  -1 -21 -16  14

Some are positive, some negative, and they add up to 0.

Next, square them and sum them:

sum((counts - mean(counts))^2)
## [1] 1470

That’s a big number!

Now, divide by \(k-1\) and take the square root of the whole thing:

sqrt(sum((counts - mean(counts))^2) / (k-1))
## [1] 19.17029

Voilá! The standard error.

Of course all of that can be done in one quick command in R:

sd(counts)
## [1] 19.17029

But isn’t it nice to know how the sausage is made!

If you want to report a confidence interval (\(\widehat{N} \pm 2 \times SE(\widehat{N})\)) yuo can do that like this:

mean(counts) + c(-2,2)*sd(counts)
## [1]  87.65942 164.34058

Exercise 1: Calculate point estimate and standard error of your Flag Counts

Use the formulas and code above to compute the point estimate and standard error around the 3 flag counts from your group. Report: > (1) the point estimate; (2) the standard error, (3) the confidence interval.

Multiple Vectors and Data Frames

Data is most often multiple vectors of the same length. If we create a second vector Y we can use it alongside our first vector X using the data.frame() command. Now, both vectors became columns in our new data frame!

Y <- c(1,2,3)
data.frame(X,Y)
##   X Y
## 1 3 1
## 2 4 2
## 3 5 3

Running that command as a single line just outputs the data and allows us to look at it. To perform operations with it, you should save it as another object:

mydata <- data.frame(X,Y)

A data frame has columns with names:

ncol(mydata) # ncol() gives us a number of columns that this data frame has
## [1] 2
names(mydata) # names() lists all column names that this data frame has 
## [1] "X" "Y"

A column can be extracted (or called) from a dataframe with a $:

mydata$X
## [1] 3 4 5
mydata$Y
## [1] 1 2 3

Part 2: Loading and Exploring Data

The following examples should explain how to import data frames and to work with the data contained within them.

Loading Data

We will use Steller sea lion (Eumotopias jubatus) data as an example. These are weights, lengths, and girths (basically, under the arm/flipper pits) of sea lion pups about two months after birth as part of a tagging mark-recapture study. These data were collected (in part by Dr. Gurarie) on five islands in the North Pacific.

This is what sea lion pups look like:

This dataset is available on Blackboard as SeaLions.csv, or at this link. Once you download it, you can use the File Explorer to determine its location and read it into R in a couple of ways:

  1. From the command line: you can download the dataset and modify the following line of code:
SeaLions <- read.csv("insert the directory instead of this sentence/SeaLions.csv")

A directory is another way to refer to a folder or, simply, a location of a data file on your computer. You can get the address of the directory if you open the folder where you saved the file through File Explorer, right-click on the navigation bar and select Copy address as text option. Note: If you copy and paste the file directory in, you have to change the direction of the slashes from \ to /!

Note that csv is a text based file type (Comma Separated Values) - it just means that commas between entries indicate separate columns. When a program “reads” the file, it “knows” that a comma means the end of one column and the start of another one. You can save any Excel file as a csv using the Save As function. CSVs are by far the the most common and convenient file type used for loading into R.

  1. Alternatively, you can import datasets into R using the RStudio point-and-click interface. To do this:
  1. Navigate to the Files tab in the bottom right corner of RStudio
  2. Click on SeaLions.csv
  3. RStudio will prompt you to either view the file or import the dataset. You want to import, so hit Import File
  4. A pop-up window will appear, showing you the preview of the data frame. Click Import and observe that your file is now loaded - it should have appeared in your Environment in the top right corner of RStudio.

This method does the same exact thing as the line of code above. It will automatically input the proper code into the console and save your file to the environment. Note that by default the file will have the same name rather than a name you designate for it.

Working with data frames

Look at some properties of this data file, with the following functions:

is(SeaLions) # tells what type of files we have
## [1] "data.frame" "list"       "oldClass"   "vector"
names(SeaLions) # tells us the names of all the columns
##  [1] "Island"      "TaggingDate" "ID"          "Weight"      "Length"     
##  [6] "Girth"       "BCI"         "Sex"         "Age"         "DOB"
head(SeaLions) # shows the first several rows of the dataframe
##     Island TaggingDate     ID Weight Length Girth       BCI Sex Age  DOB
## 1 Chirpoev   7/10/2005 Br800L   15.5     93  69.0 0.7419355   M  NA <NA>
## 2 Chirpoev   7/10/2005 Br801L   29.0    106  71.0 0.6698113   F  NA <NA>
## 3 Chirpoev   7/10/2005 Br802L   35.5    112  76.0 0.6785714   M  NA <NA>
## 4 Chirpoev   7/10/2005 Br803L   32.0    107  72.0 0.6728972   M  NA <NA>
## 5 Chirpoev   7/10/2005 Br804L   32.0    105  73.5 0.7000000   M  NA <NA>
## 6 Chirpoev   7/10/2005 Br805L   33.5    111  72.0 0.6486486   M  NA <NA>

Use a $ to extract a given column:

Length <- SeaLions$Length
Weight <-SeaLions$Weight
Island <- SeaLions$Island
Sex <- SeaLions$Sex

Summary Statistics

Some basic summary statistics include:

range(Length) # range
## [1]  93 126
median(Length) # median
## [1] 110
mean(Length) # mean
## [1] 109.8434
var(Length) # variance
## [1] 34.82854
sd(Length) # standard deviation
## [1] 5.901571

Graphical Summaries

Histogram

A histogram (invoked by hist() command) can show us the distribution of a single continuous variable:

Exercise 2:

Produce a histogram of the Weight of the sea lion pups. Also, report the minimum, maximum and mean weight of the sea lion pops.

Boxplot

A boxplot shows us relationships between a continuous variable (like Length/Weight/Girth) and a discrete variable (like Island/Sex):

Which sex is larger!?

Exercise 3:

Produce a boxplot of the weight of the sea lions against Islands. Do you think there is a significant difference against Islands?

Scatterplot

A scatterplot shows us relationships between two continuous variables: