Merging Data

Mar 2, 2026

Today

Why merge?
merge() and its arguments
The four types of merge
Merging by differently named columns
The suffixes argument
Many-to-one merges
Common pitfalls and debugging
Exercises

Why merge?

Real data almost never arrives in one tidy table.

Field measurements in one file, lab results in another
Species observations in one table, trait data in another
One sensor logs temperature, another logs precipitation

The problem: how do you combine two data frames that share some—but not all—rows and columns?

The three questions

Before you type merge(), answer three questions:

Which two data frames am I combining?
By which column(s) should they be matched?
What happens to rows that exist in one frame but not the other?

`merge()`

?merge

merge(x, y, by = intersect(names(x), names(y)),
      by.x = by, by.y = by, all = FALSE, all.x = all, all.y = all,
      sort = TRUE, suffixes = c(".x",".y"),
      incomparables = NULL, ...)

Argument	What it does
`x`, `y`	The two data frames
`by`	Column(s) to match on (name or index)
`by.x`, `by.y`	Use when column names differ between frames
`all`	Keep all rows from both frames?
`all.x`	Keep all rows from `x`?
`all.y`	Keep all rows from `y`?
`suffixes`	Disambiguate duplicate column names

Example data: water quality

wq <- data.frame(
  month = c("Aug", "Sept", "Oct", "Nov"),
  DO    = c(12.3, 17.2, 14.0, 8.9),
  pH    = c(7.2, 6.9, 7.8, 7.7)
)
ec <- data.frame(
  month = c("Sept", "Oct", "Nov", "Dec"),
  Ecoli = c(120, 2080, 830, 7900)
)

Both have month
wq has Aug but not Dec
ec has Dec but not Aug

Check names before merging

intersect(names(wq), names(ec))

## [1] "month"

This is what merge() uses by default for by. Always check this yourself first, especially with large data frames where columns may share names by accident.

Inner merge (the default)

merge(wq, ec)

##   month   DO  pH Ecoli
## 1   Nov  8.9 7.7   830
## 2   Oct 14.0 7.8  2080
## 3  Sept 17.2 6.9   120

Only rows found in both frames are kept
Aug dropped (not in ec); Dec dropped (not in wq)
Result sorted alphabetically by month

This is the database equivalent of an inner join.

Outer merge: `all = TRUE`

merge(wq, ec, all = TRUE)

##   month   DO  pH Ecoli
## 1   Aug 12.3 7.2    NA
## 2   Dec   NA  NA  7900
## 3   Nov  8.9 7.7   830
## 4   Oct 14.0 7.8  2080
## 5  Sept 17.2 6.9   120

All rows from both frames kept
Missing values filled with NA
This is a full outer join

Left merge: `all.x = TRUE`

merge(wq, ec, all.x = TRUE)

##   month   DO  pH Ecoli
## 1   Aug 12.3 7.2    NA
## 2   Nov  8.9 7.7   830
## 3   Oct 14.0 7.8  2080
## 4  Sept 17.2 6.9   120

All rows from x (wq) kept
Unmatched rows from y (ec) dropped
Aug stays; Dec goes
This is a left join

Right merge: `all.y = TRUE`

merge(wq, ec, all.y = TRUE)

##   month   DO  pH Ecoli
## 1   Dec   NA  NA  7900
## 2   Nov  8.9 7.7   830
## 3   Oct 14.0 7.8  2080
## 4  Sept 17.2 6.9   120

All rows from y (ec) kept
Unmatched rows from x (wq) dropped
Dec stays; Aug goes
This is a right join

The four merges, summarized

Type	R code	Keeps from `x`	Keeps from `y`	SQL equivalent
Inner	`merge(x, y)`	Matched only	Matched only	`INNER JOIN`
Full outer	`merge(x, y, all=T)`	All	All	`FULL OUTER JOIN`
Left	`merge(x, y, all.x=T)`	All	Matched only	`LEFT JOIN`
Right	`merge(x, y, all.y=T)`	Matched only	All	`RIGHT JOIN`

Rule of thumb: if you’re not sure, start with all = TRUE so you can see what didn’t match, then decide what to discard.

When column names don’t match

Suppose the ec data used sampling_month instead of month:

ec2 <- ec
names(ec2)[1] <- "sampling_month"
names(ec2)

## [1] "sampling_month" "Ecoli"

`by.x` and `by.y`

merge(wq, ec2, by.x = "month", by.y = "sampling_month")

##   month   DO  pH Ecoli
## 1   Nov  8.9 7.7   830
## 2   Oct 14.0 7.8  2080
## 3  Sept 17.2 6.9   120

by.x says which column in x to match on
by.y says which column in y to match on
The result uses the name from x

Merging by multiple columns

Sometimes a single column isn’t enough to uniquely identify a row. Common in ecological data where you have site + year, or plot + date.

sites_a <- data.frame(
  site = c("A", "A", "B", "B"),
  year = c(2020, 2021, 2020, 2021),
  temp = c(15.2, 16.1, 12.8, 13.5)
)
sites_b <- data.frame(
  site = c("A", "A", "B", "B"),
  year = c(2020, 2021, 2020, 2021),
  precip = c(820, 790, 1100, 1050)
)

Merging by multiple columns

merge(sites_a, sites_b)

##   site year temp precip
## 1    A 2020 15.2    820
## 2    A 2021 16.1    790
## 3    B 2020 12.8   1100
## 4    B 2021 13.5   1050

intersect(names(sites_a), names(sites_b)) returns c("site", "year")
Both columns used automatically
Each site-year combination matched correctly

The `suffixes` argument

What happens when both data frames have a column with the same name that is not the merge key?

df1 <- data.frame(id = 1:3, value = c(10, 20, 30))
df2 <- data.frame(id = 1:3, value = c(100, 200, 300))
merge(df1, df2)

## [1] id    value
## <0 rows> (or 0-length row.names)

Customizing suffixes

The defaults .x and .y aren’t very informative. Use suffixes to make them meaningful:

merge(df1, df2, suffixes = c("_field", "_lab"))

## [1] id    value
## <0 rows> (or 0-length row.names)

This is much more readable when you come back to your code six months later.

Many-to-one merges

Not all merges are one-to-one. Sometimes you want to attach a lookup table where one frame has fewer rows:

observations <- data.frame(
  species = c("PIST", "PIST", "ACRU", "ACRU", "QURU"),
  dbh_cm  = c(25, 32, 18, 22, 45)
)
lookup <- data.frame(
  species   = c("PIST", "ACRU", "QURU"),
  common    = c("White Pine", "Red Maple", "Red Oak")
)

Many-to-one merges

merge(observations, lookup)

##   species dbh_cm     common
## 1    ACRU     18  Red Maple
## 2    ACRU     22  Red Maple
## 3    PIST     25 White Pine
## 4    PIST     32 White Pine
## 5    QURU     45    Red Oak

lookup has 3 rows, observations has 5
Each species code in observations matched to its common name
Rows in observations are not lost—they expand to include the lookup info
This is one of the most useful merge patterns

Checking your merge with `dim()`

Always check dimensions before and after merging.

dim(wq)

## [1] 4 3

dim(ec)

## [1] 4 2

dim(merge(wq, ec))           # inner

## [1] 3 4

dim(merge(wq, ec, all = TRUE)) # full outer

## [1] 5 4

If the merged frame is unexpectedly large, you likely have duplicate keys causing a Cartesian product (every row in x paired with every matching row in y).

Common pitfall: accidental Cartesian product

d1 <- data.frame(id = c(1, 1, 2), val1 = c("a", "b", "c"))
d2 <- data.frame(id = c(1, 1, 2), val2 = c("x", "y", "z"))
merge(d1, d2)

##   id val1 val2
## 1  1    a    x
## 2  1    a    y
## 3  1    b    x
## 4  1    b    y
## 5  2    c    z

id = 1 appears twice in both frames
Result: 2 x 2 = 4 rows for id = 1
This blows up fast with real data

Debugging checklist

When merge() gives unexpected results:

Check intersect(names(x), names(y)) — are the right columns being matched?
Check dim() before and after — did the row count change the way you expected?
Check for duplicate keys — table(table(x[, by_cols])) should show all 1s if keys are unique
Check spelling — "Sept" vs "September" will not match
Check class — merging a factor column with a character column can produce surprises

Quick reference: `identical()`

identical() tests whether two objects are exactly the same:

wq2 <- merge(wq, ec, by = "month")
wq3 <- merge(wq, ec, by = 1)
wq4 <- merge(wq, ec)
identical(wq2, wq3)

## [1] TRUE

identical(wq3, wq4)

## [1] TRUE

Useful for confirming that different merge() calls produce the same result.

Tidyverse joins: `dplyr`

The dplyr package provides a family of _join() functions that do the same thing as merge() but with more explicit naming.

library(dplyr)

`dplyr` join functions

Each join type gets its own function—no toggling all, all.x, all.y:

`dplyr` function	Equivalent `merge()`	Keeps
`inner_join(x, y)`	`merge(x, y)`	Rows in both
`full_join(x, y)`	`merge(x, y, all = TRUE)`	All rows
`left_join(x, y)`	`merge(x, y, all.x = TRUE)`	All rows from `x`
`right_join(x, y)`	`merge(x, y, all.y = TRUE)`	All rows from `y`

The by argument works similarly, and column name mismatches use by = c("name_in_x" = "name_in_y") syntax.

`dplyr` join examples

inner_join(wq, ec, by = "month")

##   month   DO  pH Ecoli
## 1  Sept 17.2 6.9   120
## 2   Oct 14.0 7.8  2080
## 3   Nov  8.9 7.7   830

left_join(wq, ec, by = "month")

##   month   DO  pH Ecoli
## 1   Aug 12.3 7.2    NA
## 2  Sept 17.2 6.9   120
## 3   Oct 14.0 7.8  2080
## 4   Nov  8.9 7.7   830

Handling mismatched names in `dplyr`

Recall ec2 where the column was named sampling_month:

full_join(wq, ec2, by = c("month" = "sampling_month"))

##   month   DO  pH Ecoli
## 1   Aug 12.3 7.2    NA
## 2  Sept 17.2 6.9   120
## 3   Oct 14.0 7.8  2080
## 4   Nov  8.9 7.7   830
## 5   Dec   NA  NA  7900

This is the dplyr equivalent of merge(wq, ec2, by.x = "month", by.y = "sampling_month", all = TRUE).

Filtering joins in `dplyr`

dplyr also has two joins that don’t add columns—they just filter rows:

semi_join(wq, ec, by = "month")  # rows in wq that HAVE a match in ec

##   month   DO  pH
## 1  Sept 17.2 6.9
## 2   Oct 14.0 7.8
## 3   Nov  8.9 7.7

anti_join(wq, ec, by = "month")  # rows in wq that have NO match in ec

##   month   DO  pH
## 1   Aug 12.3 7.2

These have no direct merge() equivalent and are genuinely useful for data QC—e.g., “which of my field sites are missing from the lab results?”

`merge()` vs `dplyr` joins

Advantages of merge():

No package dependency (base R)
One function to learn
Fine for most datasets

Advantages of dplyr joins:

Function names make the join type explicit
Preserves row order (base merge() sorts by default)
semi_join() and anti_join() are handy extras
Somewhat faster on large data

Other alternatives: `data.table`

data.table::merge() uses the same arguments as base merge(), so switching is nearly free
Speed advantage is dramatic: merges on millions of rows that take minutes in base R finish in seconds
Memory-efficient—modifies data in place rather than copying
The package also has its own [ syntax for joins (X[Y, on = ...]) which is extremely concise once learned
Worth adopting if your data routinely exceeds ~1 million rows or if you’re doing repeated merges in a loop

Other alternatives: `sqldf`

sqldf::sqldf() lets you write actual SQL queries directly on R data frames
Useful if you already know SQL from database work or GIS (e.g., PostGIS, Access, SQLite)
Handles complex joins, subqueries, and aggregations in a single statement
A good bridge if you’re moving between R and database workflows

Exercise 1

Create these two data frames and perform the four types of merge (inner, full outer, left, right). For each, check the dimensions and identify which rows were kept or dropped:

growth <- data.frame(
  plot = c("P1", "P2", "P3", "P4"),
  height_cm = c(120, 95, 140, 110)
)
soil <- data.frame(
  plot = c("P2", "P3", "P4", "P5"),
  organic_pct = c(3.2, 5.1, 2.8, 4.5)
)

Summary

merge() combines two data frames by shared columns
Four types: inner (default), full outer, left, right
Always check intersect(names(x), names(y)) before merging
Always check dim() before and after
Use by.x and by.y when column names differ
Use suffixes to make duplicate column names informative
Watch out for Cartesian products from duplicate keys
Start with all = TRUE when exploring, then restrict as needed

Today

Why merge?

The three questions

merge()

Example data: water quality

Check names before merging

Inner merge (the default)

Outer merge: all = TRUE

Left merge: all.x = TRUE

Right merge: all.y = TRUE

The four merges, summarized

When column names don’t match

by.x and by.y

Merging by multiple columns

Merging by multiple columns

The suffixes argument

Customizing suffixes

Many-to-one merges

Many-to-one merges

Checking your merge with dim()

Common pitfall: accidental Cartesian product

Debugging checklist

Quick reference: identical()

Tidyverse joins: dplyr

dplyr join functions

dplyr join examples

Handling mismatched names in dplyr

Filtering joins in dplyr

merge() vs dplyr joins

Other alternatives: data.table

Other alternatives: sqldf

Exercise 1

Summary

`merge()`

Outer merge: `all = TRUE`

Left merge: `all.x = TRUE`

Right merge: `all.y = TRUE`

`by.x` and `by.y`

The `suffixes` argument

Checking your merge with `dim()`

Quick reference: `identical()`

Tidyverse joins: `dplyr`

`dplyr` join functions

`dplyr` join examples

Handling mismatched names in `dplyr`

Filtering joins in `dplyr`

`merge()` vs `dplyr` joins

Other alternatives: `data.table`

Other alternatives: `sqldf`