- Why merge?
merge()and its arguments- The four types of merge
- Merging by differently named columns
- The
suffixesargument - Many-to-one merges
- Common pitfalls and debugging
- Exercises
Mar 2, 2026
merge() and its argumentssuffixes argumentReal data almost never arrives in one tidy table.
The problem: how do you combine two data frames that share some—but not all—rows and columns?
Before you type merge(), answer three questions:
merge()?merge
merge(x, y, by = intersect(names(x), names(y)),
by.x = by, by.y = by, all = FALSE, all.x = all, all.y = all,
sort = TRUE, suffixes = c(".x",".y"),
incomparables = NULL, ...)
| Argument | What it does |
|---|---|
x, y |
The two data frames |
by |
Column(s) to match on (name or index) |
by.x, by.y |
Use when column names differ between frames |
all |
Keep all rows from both frames? |
all.x |
Keep all rows from x? |
all.y |
Keep all rows from y? |
suffixes |
Disambiguate duplicate column names |
wq <- data.frame(
month = c("Aug", "Sept", "Oct", "Nov"),
DO = c(12.3, 17.2, 14.0, 8.9),
pH = c(7.2, 6.9, 7.8, 7.7)
)
ec <- data.frame(
month = c("Sept", "Oct", "Nov", "Dec"),
Ecoli = c(120, 2080, 830, 7900)
)
monthwq has Aug but not Decec has Dec but not Augintersect(names(wq), names(ec))
## [1] "month"
This is what merge() uses by default for by. Always check this yourself first, especially with large data frames where columns may share names by accident.
merge(wq, ec)
## month DO pH Ecoli ## 1 Nov 8.9 7.7 830 ## 2 Oct 14.0 7.8 2080 ## 3 Sept 17.2 6.9 120
Aug dropped (not in ec); Dec dropped (not in wq)monthThis is the database equivalent of an inner join.
all = TRUEmerge(wq, ec, all = TRUE)
## month DO pH Ecoli ## 1 Aug 12.3 7.2 NA ## 2 Dec NA NA 7900 ## 3 Nov 8.9 7.7 830 ## 4 Oct 14.0 7.8 2080 ## 5 Sept 17.2 6.9 120
NAall.x = TRUEmerge(wq, ec, all.x = TRUE)
## month DO pH Ecoli ## 1 Aug 12.3 7.2 NA ## 2 Nov 8.9 7.7 830 ## 3 Oct 14.0 7.8 2080 ## 4 Sept 17.2 6.9 120
x (wq) kepty (ec) droppedAug stays; Dec goesall.y = TRUEmerge(wq, ec, all.y = TRUE)
## month DO pH Ecoli ## 1 Dec NA NA 7900 ## 2 Nov 8.9 7.7 830 ## 3 Oct 14.0 7.8 2080 ## 4 Sept 17.2 6.9 120
y (ec) keptx (wq) droppedDec stays; Aug goes| Type | R code | Keeps from x |
Keeps from y |
SQL equivalent |
|---|---|---|---|---|
| Inner | merge(x, y) |
Matched only | Matched only | INNER JOIN |
| Full outer | merge(x, y, all=T) |
All | All | FULL OUTER JOIN |
| Left | merge(x, y, all.x=T) |
All | Matched only | LEFT JOIN |
| Right | merge(x, y, all.y=T) |
Matched only | All | RIGHT JOIN |
Rule of thumb: if you’re not sure, start with all = TRUE so you can see what didn’t match, then decide what to discard.
Suppose the ec data used sampling_month instead of month:
ec2 <- ec names(ec2)[1] <- "sampling_month" names(ec2)
## [1] "sampling_month" "Ecoli"
by.x and by.ymerge(wq, ec2, by.x = "month", by.y = "sampling_month")
## month DO pH Ecoli ## 1 Nov 8.9 7.7 830 ## 2 Oct 14.0 7.8 2080 ## 3 Sept 17.2 6.9 120
by.x says which column in x to match onby.y says which column in y to match onxSometimes a single column isn’t enough to uniquely identify a row. Common in ecological data where you have site + year, or plot + date.
sites_a <- data.frame(
site = c("A", "A", "B", "B"),
year = c(2020, 2021, 2020, 2021),
temp = c(15.2, 16.1, 12.8, 13.5)
)
sites_b <- data.frame(
site = c("A", "A", "B", "B"),
year = c(2020, 2021, 2020, 2021),
precip = c(820, 790, 1100, 1050)
)
merge(sites_a, sites_b)
## site year temp precip ## 1 A 2020 15.2 820 ## 2 A 2021 16.1 790 ## 3 B 2020 12.8 1100 ## 4 B 2021 13.5 1050
intersect(names(sites_a), names(sites_b)) returns c("site", "year")suffixes argumentWhat happens when both data frames have a column with the same name that is not the merge key?
df1 <- data.frame(id = 1:3, value = c(10, 20, 30)) df2 <- data.frame(id = 1:3, value = c(100, 200, 300)) merge(df1, df2)
## [1] id value ## <0 rows> (or 0-length row.names)
The defaults .x and .y aren’t very informative. Use suffixes to make them meaningful:
merge(df1, df2, suffixes = c("_field", "_lab"))
## [1] id value ## <0 rows> (or 0-length row.names)
This is much more readable when you come back to your code six months later.
Not all merges are one-to-one. Sometimes you want to attach a lookup table where one frame has fewer rows:
observations <- data.frame(
species = c("PIST", "PIST", "ACRU", "ACRU", "QURU"),
dbh_cm = c(25, 32, 18, 22, 45)
)
lookup <- data.frame(
species = c("PIST", "ACRU", "QURU"),
common = c("White Pine", "Red Maple", "Red Oak")
)
merge(observations, lookup)
## species dbh_cm common ## 1 ACRU 18 Red Maple ## 2 ACRU 22 Red Maple ## 3 PIST 25 White Pine ## 4 PIST 32 White Pine ## 5 QURU 45 Red Oak
lookup has 3 rows, observations has 5observations matched to its common nameobservations are not lost—they expand to include the lookup infodim()Always check dimensions before and after merging.
dim(wq)
## [1] 4 3
dim(ec)
## [1] 4 2
dim(merge(wq, ec)) # inner
## [1] 3 4
dim(merge(wq, ec, all = TRUE)) # full outer
## [1] 5 4
If the merged frame is unexpectedly large, you likely have duplicate keys causing a Cartesian product (every row in x paired with every matching row in y).
d1 <- data.frame(id = c(1, 1, 2), val1 = c("a", "b", "c"))
d2 <- data.frame(id = c(1, 1, 2), val2 = c("x", "y", "z"))
merge(d1, d2)
## id val1 val2 ## 1 1 a x ## 2 1 a y ## 3 1 b x ## 4 1 b y ## 5 2 c z
id = 1 appears twice in both framesid = 1When merge() gives unexpected results:
intersect(names(x), names(y)) — are the right columns being matched?dim() before and after — did the row count change the way you expected?table(table(x[, by_cols])) should show all 1s if keys are unique"Sept" vs "September" will not matchidentical()identical() tests whether two objects are exactly the same:
wq2 <- merge(wq, ec, by = "month") wq3 <- merge(wq, ec, by = 1) wq4 <- merge(wq, ec) identical(wq2, wq3)
## [1] TRUE
identical(wq3, wq4)
## [1] TRUE
Useful for confirming that different merge() calls produce the same result.
dplyrThe dplyr package provides a family of _join() functions that do the same thing as merge() but with more explicit naming.
library(dplyr)
dplyr join functionsEach join type gets its own function—no toggling all, all.x, all.y:
dplyr function |
Equivalent merge() |
Keeps |
|---|---|---|
inner_join(x, y) |
merge(x, y) |
Rows in both |
full_join(x, y) |
merge(x, y, all = TRUE) |
All rows |
left_join(x, y) |
merge(x, y, all.x = TRUE) |
All rows from x |
right_join(x, y) |
merge(x, y, all.y = TRUE) |
All rows from y |
The by argument works similarly, and column name mismatches use by = c("name_in_x" = "name_in_y") syntax.
dplyr join examplesinner_join(wq, ec, by = "month")
## month DO pH Ecoli ## 1 Sept 17.2 6.9 120 ## 2 Oct 14.0 7.8 2080 ## 3 Nov 8.9 7.7 830
left_join(wq, ec, by = "month")
## month DO pH Ecoli ## 1 Aug 12.3 7.2 NA ## 2 Sept 17.2 6.9 120 ## 3 Oct 14.0 7.8 2080 ## 4 Nov 8.9 7.7 830
dplyrRecall ec2 where the column was named sampling_month:
full_join(wq, ec2, by = c("month" = "sampling_month"))
## month DO pH Ecoli ## 1 Aug 12.3 7.2 NA ## 2 Sept 17.2 6.9 120 ## 3 Oct 14.0 7.8 2080 ## 4 Nov 8.9 7.7 830 ## 5 Dec NA NA 7900
This is the dplyr equivalent of merge(wq, ec2, by.x = "month", by.y = "sampling_month", all = TRUE).
dplyrdplyr also has two joins that don’t add columns—they just filter rows:
semi_join(wq, ec, by = "month") # rows in wq that HAVE a match in ec
## month DO pH ## 1 Sept 17.2 6.9 ## 2 Oct 14.0 7.8 ## 3 Nov 8.9 7.7
anti_join(wq, ec, by = "month") # rows in wq that have NO match in ec
## month DO pH ## 1 Aug 12.3 7.2
These have no direct merge() equivalent and are genuinely useful for data QC—e.g., “which of my field sites are missing from the lab results?”
merge() vs dplyr joinsAdvantages of merge():
Advantages of dplyr joins:
merge() sorts by default)semi_join() and anti_join() are handy extrasdata.tabledata.table::merge() uses the same arguments as base merge(), so switching is nearly free[ syntax for joins (X[Y, on = ...]) which is extremely concise once learnedsqldfsqldf::sqldf() lets you write actual SQL queries directly on R data framesCreate these two data frames and perform the four types of merge (inner, full outer, left, right). For each, check the dimensions and identify which rows were kept or dropped:
growth <- data.frame(
plot = c("P1", "P2", "P3", "P4"),
height_cm = c(120, 95, 140, 110)
)
soil <- data.frame(
plot = c("P2", "P3", "P4", "P5"),
organic_pct = c(3.2, 5.1, 2.8, 4.5)
)
merge() combines two data frames by shared columnsintersect(names(x), names(y)) before mergingdim() before and afterby.x and by.y when column names differsuffixes to make duplicate column names informativeall = TRUE when exploring, then restrict as needed