ddply() and Dates

March 25, 2026

Today

plyr::summarize() and ddply()
class Date and date arithmetic
Combining the two: splitting and summarizing data by date-derived groups
Practice with Archer Creek Watershed data

The limitation of `aggregate()`

You already know aggregate():

data(airquality)
aggregate(Temp ~ Month, airquality, FUN = mean)

##   Month     Temp
## 1     5 65.54839
## 2     6 79.10000
## 3     7 83.90323
## 4     8 83.96774
## 5     9 76.90000

But aggregate() takes a single function in FUN; you cannot hand it a vector of functions.

aggregate(Temp ~ Month, airquality, FUN = c(mean, sd))

## Error in `get()`:
## ! object 'FUN' of mode 'function' was not found
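A common workaround, sketched here as an aside, is to give aggregate() one function that returns a named vector. The anonymous function and the names `res` and `flat` are illustrative. Note that both statistics come back packed into a single matrix column, which is part of why summarize() is more convenient.

```r
data(airquality)

# One function that returns two named values per group
res <- aggregate(Temp ~ Month, airquality,
  FUN = function(x) c(mean = mean(x), sd = sd(x)))

ncol(res)            # 2: Month plus a single matrix column named Temp
is.matrix(res$Temp)  # TRUE
colnames(res$Temp)   # "mean" "sd"

# Flatten the matrix column into ordinary columns if needed
flat <- do.call(data.frame, res)
names(flat)          # "Month" "Temp.mean" "Temp.sd"
```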

`plyr::summarize()`

library(plyr)

summarize() creates a new data frame from summary calculations. Unlike aggregate(), you can apply multiple functions at once.

summarize(airquality, 
  Avg = mean(Temp), 
  StdDev = sd(Temp),
  Med = median(Temp))

##        Avg  StdDev Med
## 1 77.88235 9.46527  79

You can name the output columns whatever you want

`plyr::summarize()`

Applied functions can reference objects created earlier in the same call:

summarize(airquality, 
  Avg = mean(Temp), 
  Med = median(Temp), 
  Dif = abs(Avg - Med))

##        Avg Med      Dif
## 1 77.88235  79 1.117647

What capability are we missing?

`ddply()`

Also in the plyr package
“Split data frame, apply function, and return results in a data frame.” — ?ddply
Hadley Wickham (2011). “The Split-Apply-Combine Strategy for Data Analysis”, Journal of Statistical Software, 40(1), 1-29
The name: d (data frame in) + d (data frame out) + ply (apply)

`ddply()` arguments

.data = data frame to be processed
.variables = variables to split data frame by
- a character vector
- as .()-quoted variables
- a formula
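.fun = function to apply to each piece
- in our case .fun is summarize()
... = arguments passed to .fun (e.g., na.rm, if .fun accepts it)
- with summarize(), these are the name = expression pairs that become output columns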
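.fun does not have to be summarize(). As a minimal sketch, any function that takes one piece (a data frame) and returns a data frame should work, and anything supplied through ... is forwarded to it. The helper name `piece_mean` is hypothetical.

```r
library(plyr)
data(airquality)

# Hypothetical helper: receives one piece of the split data frame,
# plus any extra arguments forwarded through ddply()'s `...`
piece_mean <- function(piece, ...) {
  data.frame(Avg.Ozone = mean(piece$Ozone, ...))
}

# na.rm = TRUE travels through `...` to piece_mean(), then on to mean()
oz <- ddply(airquality, "Month", piece_mean, na.rm = TRUE)
oz
```

Without na.rm = TRUE every month would come back as NA, since each month has at least one missing Ozone value.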

`ddply()` in action

ddply(airquality, "Month", summarize,
  Avg.Temp = mean(Temp),
  Max.Ozone = max(Ozone, na.rm = TRUE))

##   Month Avg.Temp Max.Ozone
## 1     5 65.54839       115
## 2     6 79.10000        71
## 3     7 83.90323       135
## 4     8 83.96774       168
## 5     9 76.90000        96

Splits airquality by Month, applies summarize() to each piece, returns one data frame
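To make the three steps concrete, here is a rough base-R sketch of the same computation done by hand. It is not how plyr is implemented internally; the names `pieces`, `rows`, and `manual` are illustrative.

```r
data(airquality)

# Split: one data frame per month
pieces <- split(airquality, airquality$Month)

# Apply: summarize each piece into a one-row data frame
rows <- lapply(pieces, function(piece) {
  data.frame(
    Month     = piece$Month[1],
    Avg.Temp  = mean(piece$Temp),
    Max.Ozone = max(piece$Ozone, na.rm = TRUE))
})

# Combine: stack the one-row results into a single data frame
manual <- do.call(rbind, rows)
rownames(manual) <- NULL
manual
```

The values match the ddply() output above; ddply() simply does the bookkeeping for you.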

`ddply()`: three ways to specify groups

ddply(airquality, "Month", ...)          # character
ddply(airquality, .(Month), ...)         # .() notation
ddply(airquality, ~Month, ...)           # formula

All three produce identical results.

. is a function that quotes variable names without evaluating them (see ?.)
The formula interface uses ~ just like aggregate()
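A quick way to convince yourself that the three notations agree is to run all of them and compare. This is a small check, with illustrative object names; both comparisons are expected to return TRUE.

```r
library(plyr)
data(airquality)

by_chr <- ddply(airquality, "Month",  summarize, Avg = mean(Temp))
by_quo <- ddply(airquality, .(Month), summarize, Avg = mean(Temp))
by_frm <- ddply(airquality, ~Month,   summarize, Avg = mean(Temp))

all.equal(by_chr, by_quo)
all.equal(by_chr, by_frm)
```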

`ddply()`: multiple grouping variables

data(warpbreaks)
ddply(warpbreaks, .(wool, tension), summarize,
  Avg = mean(breaks),
  Med = median(breaks),
  Dif = abs(Avg - Med))

##   wool tension      Avg Med       Dif
## 1    A       L 44.55556  51 6.4444444
## 2    A       M 24.00000  21 3.0000000
## 3    A       H 24.55556  24 0.5555556
## 4    B       L 28.22222  29 0.7777778
## 5    B       M 28.77778  28 0.7777778
## 6    B       H 18.77778  17 1.7777778

class `Date`

Looks like a character; is not just a character
Dates are represented internally as the number of days since 1970-01-01
- Earlier dates are assigned negative values
Date methods make arithmetic, sorting, and plot axes behave chronologically; character strings do not
No times in class Date (see Date-Time classes for that)
- Highest level of resolution is a day
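The internal representation can be inspected directly. A short sketch:

```r
d <- as.Date("2026-03-25")
as.numeric(d)                          # 20537 days after 1970-01-01
as.numeric(as.Date("1969-12-31"))      # -1: one day before the origin
as.Date(20537, origin = "1970-01-01")  # back to "2026-03-25"

# A character string has no such number behind it
x <- "2026-03-25"
is.numeric(unclass(d))   # TRUE
is.numeric(unclass(x))   # FALSE
```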

Conversion to class `Date`

x <- "03/25/2026"
class(x)

## [1] "character"

d <- as.Date(x, format = "%m/%d/%Y")
class(d)

## [1] "Date"

d

## [1] "2026-03-25"

?strptime for the full list of format codes

Common date format codes

?strptime

%m = month as decimal (01–12)
%d = day of month (01–31)
%Y = 4-digit year
%y = 2-digit year
%j = day of year (001–366)
%A = full weekday name
%B = full month name
%b = abbreviated month name
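Two of these codes deserve a quick illustration. With %y, R follows the convention that 00 to 68 map to 20xx and 69 to 99 map to 19xx, and %j can be used to build a date from a year and a day of year. (Names produced or parsed by %A, %B, and %b depend on the locale, so they are left out of this sketch.)

```r
# %y: two-digit years 00-68 become 20xx, 69-99 become 19xx
as.Date("03/25/26", format = "%m/%d/%y")   # "2026-03-25"
as.Date("03/25/73", format = "%m/%d/%y")   # "1973-03-25"

# %j: build a date from year + day of year
as.Date("2026-084", format = "%Y-%j")      # "2026-03-25"
```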

Alternate date notation

The separator and order don’t matter — just match the format argument to the input:

as.Date("03/25/2026", format = "%m/%d/%Y")

## [1] "2026-03-25"

as.Date("25-03-2026", format = "%d-%m-%Y")

## [1] "2026-03-25"

as.Date("2026|03|25", format = "%Y|%m|%d")

## [1] "2026-03-25"

The only super special character is %
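When the format does not match the input, as.Date() does not raise an error; it quietly returns NA. A short sketch of what that looks like and how to check for it after converting a column (the vectors here are made up for illustration):

```r
good <- as.Date("25-03-2026", format = "%d-%m-%Y")
bad  <- as.Date("25-03-2026", format = "%m/%d/%Y")  # wrong separators and order
good         # "2026-03-25"
is.na(bad)   # TRUE: no error, just NA

# Check a whole vector after conversion
x <- c("03/25/2026", "25/03/2026")
parsed <- as.Date(x, format = "%m/%d/%Y")
sum(is.na(parsed))   # 1: the second entry has no month 25
```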

Long-hand example

Echar <- "Wednesday March 25, 2026"
Edate <- as.Date(Echar, format = "%A %B %d, %Y")
Edate

## [1] "2026-03-25"

Enter separators (commas, spaces) verbatim in the format string

Conversion back to characters: `format()`

Edate <- as.Date("2026-03-25")
format(Edate, "%d-%m-%Y")

## [1] "25-03-2026"

format(Edate, "%B %Y")

## [1] "March 2026"

format(Edate, "This is day %j of %Y.")

## [1] "This is day 084 of 2026."

format() always returns a character, not a Date

Date arithmetic

Dates support arithmetic directly.

d1 <- as.Date("2014-10-01")
d2 <- as.Date("2026-03-25")
d2 - d1

## Time difference of 4193 days

d1 + 30

## [1] "2014-10-31"

Subtraction returns a difftime object
Addition/subtraction of integers adds/removes days
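A few related operations, sketched with the same two dates: stripping the difftime down to a plain number, generating a regular sequence with seq(), and comparing dates.

```r
d1 <- as.Date("2014-10-01")
d2 <- as.Date("2026-03-25")

as.numeric(d2 - d1)                    # 4193, a plain number of days
seq(d1, by = "month", length.out = 3)  # "2014-10-01" "2014-11-01" "2014-12-01"
d2 > d1                                # TRUE: comparisons work too
```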

`difftime()`

For more control over units, use difftime():

ss  <- as.Date("1962-09-27")  # Silent Spring published
ddt <- as.Date("1972-12-31")  # US DDT ban took effect
difftime(ddt, ss, units = "weeks")

## Time difference of 535.4286 weeks

units accepts "auto" (the default), "secs", "mins", "hours", "days", "weeks"
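There is no "months" or "years" unit, because those are not fixed lengths. One rough approach, shown here as a sketch with illustrative dates, is to take the difference in days and divide by an average year length.

```r
start <- as.Date("2018-01-18")
end   <- as.Date("2020-12-31")

days <- as.numeric(difftime(end, start, units = "days"))
days            # 1078
days / 365.25   # approximate years
```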

Extracting info from Dates

d <- as.Date("2026-03-25")
weekdays(d)

## [1] "Wednesday"

months(d)

## [1] "March"

quarters(d)

## [1] "Q1"

format(d, "%j")  # day of year (often loosely called "Julian day")

## [1] "084"

These functions return characters, not Dates

Extracting year and month with `format()`

format(d, "%Y")       # year as character

## [1] "2026"

as.numeric(format(d, "%m"))  # month as number

## [1] 3
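An alternative to format() plus as.numeric(), sketched below, is to convert to POSIXlt and read the numeric components directly. Watch the offsets: years count from 1900, and months and day of year count from 0.

```r
d  <- as.Date("2026-03-25")
lt <- as.POSIXlt(d)

lt$year + 1900   # 2026 (stored as years since 1900)
lt$mon + 1       # 3    (months are stored 0-11)
lt$mday          # 25
lt$yday + 1      # 84   (day of year is stored 0-365)
```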

Building a date column

The airquality dataset has Month and Day columns but no actual Date column. Let’s build one.

# All observations are from 1973
airquality$Date <- as.Date(paste("1973", 
  airquality$Month, airquality$Day, sep = "-"))
head(airquality[, c("Month", "Day", "Date")])

##   Month Day       Date
## 1     5   1 1973-05-01
## 2     5   2 1973-05-02
## 3     5   3 1973-05-03
## 4     5   4 1973-05-04
## 5     5   5 1973-05-05
## 6     5   6 1973-05-06

paste() builds the date string; as.Date() parses it
Default format is "%Y-%m-%d" so no format argument needed
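After building a date column it is worth a quick sanity check. The sketch below works on a copy (the name `aq` is illustrative) and confirms that every row parsed and that the days are consecutive.

```r
data(airquality)
aq <- airquality   # work on a copy
aq$Date <- as.Date(paste("1973", aq$Month, aq$Day, sep = "-"))

anyNA(aq$Date)            # FALSE: every row parsed
range(aq$Date)            # "1973-05-01" "1973-09-30"
all(diff(aq$Date) == 1)   # TRUE: consecutive days, no gaps
```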

Why bother with a Date column?

Once you have a real Date, you can derive grouping variables you never had:

airquality$Weekday <- weekdays(airquality$Date)
airquality$Quarter <- quarters(airquality$Date)
airquality$Week    <- as.numeric(format(airquality$Date, "%U"))

Now we can split by weekday, quarter, or week number — none of which existed in the raw data.

`ddply()` by weekday

Do ozone levels differ by day of the week? (Hint: industrial emissions tend to drop on weekends.)

ddply(airquality, "Weekday", summarize,
  N = sum(!is.na(Ozone)),
  Avg.Ozone = mean(Ozone, na.rm = TRUE),
  Avg.Temp  = mean(Temp))

##     Weekday  N Avg.Ozone Avg.Temp
## 1    Friday 15  31.60000 77.18182
## 2    Monday 16  37.43750 77.90476
## 3  Saturday 15  50.33333 77.63636
## 4    Sunday 18  44.38889 77.50000
## 5  Thursday 16  38.62500 78.59091
## 6   Tuesday 19  46.47368 79.13636
## 7 Wednesday 17  44.64706 77.22727

`ddply()` by weekday

The output is alphabetical. To order by day of week:

airquality$Weekday <- factor(airquality$Weekday,
  levels = c("Sunday","Monday","Tuesday","Wednesday",
             "Thursday","Friday","Saturday"),
  ordered = TRUE)

ozone_by_day <- ddply(airquality, "Weekday", summarize,
  N = sum(!is.na(Ozone)),
  Avg.Ozone = round(mean(Ozone, na.rm = TRUE), 1))
ozone_by_day

##     Weekday  N Avg.Ozone
## 1    Sunday 18      44.4
## 2    Monday 16      37.4
## 3   Tuesday 19      46.5
## 4 Wednesday 17      44.6
## 5  Thursday 16      38.6
## 6    Friday 15      31.6
## 7  Saturday 15      50.3

`ddply()` by week number

Summarize temperature by week of the year:

temp_by_week <- ddply(airquality, "Week", summarize,
  N = length(Temp),
  Avg.Temp = mean(Temp),
  Range = max(Temp) - min(Temp))
head(temp_by_week, 8)

##   Week N Avg.Temp Range
## 1   17 5 66.20000    18
## 2   18 7 66.14286    15
## 3   19 7 63.85714    11
## 4   20 7 61.57143    16
## 5   21 7 73.14286    24
## 6   22 7 82.00000    23
## 7   23 7 84.28571    16
## 8   24 7 73.57143    12
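Week 17 has only 5 days because "%U" starts weeks on Sunday and the data begin on a Tuesday. Week numbers also depend on which code you choose; a brief comparison of "%U" with "%W" (weeks starting on Monday):

```r
d <- as.Date("1973-05-01")   # a Tuesday; the first day in airquality
format(d, "%U")   # "17": weeks start on Sunday
format(d, "%W")   # "18": weeks start on Monday
```

Whichever code you pick, state it when reporting weekly summaries.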

Plotting weekly summaries

plot(Avg.Temp ~ Week, data = temp_by_week, type = "b",
  pch = 19, col = "darkred",
  ylab = expression("Mean Temperature ("*degree*"F)"),
  xlab = "Week of Year",
  main = "NYC Weekly Temperature, 1973")

ddply() produces a clean data frame ready for plotting

Multiple grouping variables

Split by both month and whether it is a weekend:

airquality$Weekend <- airquality$Weekday %in% 
  c("Saturday", "Sunday")

ddply(airquality, .(Month, Weekend), summarize,
  N = sum(!is.na(Ozone)),
  Avg.Ozone = round(mean(Ozone, na.rm = TRUE), 1))

##    Month Weekend  N Avg.Ozone
## 1      5   FALSE 21      24.7
## 2      5    TRUE  5      19.2
## 3      6   FALSE  5      19.4
## 4      6    TRUE  4      42.0
## 5      7   FALSE 19      56.5
## 6      7    TRUE  7      66.3
## 7      8   FALSE 19      58.3
## 8      8    TRUE  7      64.6
## 9      9   FALSE 19      28.3
## 10     9    TRUE 10      37.4
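One caveat: weekdays() returns names in the current locale, so matching on "Saturday" and "Sunday" assumes an English locale. A locale-independent sketch uses the "%u" code, which gives the weekday as a number with Monday as 1.

```r
d <- as.Date(c("1973-05-01", "1973-05-05", "1973-05-06"))  # Tue, Sat, Sun
format(d, "%u")                    # "2" "6" "7"
format(d, "%u") %in% c("6", "7")   # FALSE TRUE TRUE
```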

Using `ifelse()` to create custom groups

What if we want to group by the first vs. second half of each month?

airquality$Half <- ifelse(airquality$Day <= 15, 
  "1st Half", "2nd Half")

ddply(airquality, .(Month, Half), summarize,
  Avg.Wind = round(mean(Wind), 1),
  Avg.Temp = round(mean(Temp), 1))

##    Month     Half Avg.Wind Avg.Temp
## 1      5 1st Half     11.3     65.7
## 2      5 2nd Half     11.9     65.4
## 3      6 1st Half     10.8     82.6
## 4      6 2nd Half      9.7     75.6
## 5      7 1st Half      9.4     84.5
## 6      7 2nd Half      8.6     83.3
## 7      8 1st Half      8.9     85.1
## 8      8 2nd Half      8.7     82.9
## 9      9 1st Half      9.5     81.5
## 10     9 2nd Half     10.8     72.3

ifelse() creates the grouping variable; ddply() does the work
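ifelse() handles two groups cleanly; for three or more, cut() is usually tidier than nesting ifelse() calls. A minimal sketch that splits each month into thirds (the break points and labels are arbitrary choices for illustration):

```r
data(airquality)

# Intervals are (0,10], (10,20], (20,31]
third <- cut(airquality$Day,
  breaks = c(0, 10, 20, 31),
  labels = c("Early", "Mid", "Late"))
table(third)

# Boundary behaviour: day 10 is "Early", day 11 is "Mid"
edge <- as.character(cut(c(1, 10, 11, 31),
  breaks = c(0, 10, 20, 31),
  labels = c("Early", "Mid", "Late")))
edge   # "Early" "Early" "Mid" "Late"
```

The result is a factor, so it can be assigned to a new column and passed to ddply() as a grouping variable like any other.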

Practical example: Ackerman Clearing weather station

The data

Ackerman Clearing is a meteorological station in the Huntington Wildlife Forest, Adirondack Park. Daily observations include air temperature, precipitation, snow depth, and wind speed.

acw <- read.csv("ACW-met.csv")
str(acw)

## 'data.frame':    1058 obs. of  15 variables:
##  $ site           : chr  "Ackerman Clearing" "Ackerman Clearing" "Ackerman Clearing" "Ackerman Clearing" ...
##  $ site_abbr      : chr  "ackerman" "ackerman" "ackerman" "ackerman" ...
##  $ site_id        : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ data_interval  : chr  "24 hour" "24 hour" "24 hour" "24 hour" ...
##  $ timestamp      : chr  "2018-01-18 00:00:00" "2018-01-19 00:00:00" "2018-01-20 00:00:00" "2018-01-21 00:00:00" ...
##  $ recnum         : int  497 498 499 500 501 502 503 504 505 506 ...
##  $ rain           : num  0 0 0 0.254 4.064 ...
##  $ snow_depth_mean: num  0.329 0.367 0.373 0.358 0.335 0.319 0.305 0.314 0.314 0.314 ...
##  $ snow_depth_min : num  0 0.343 0.361 0.349 0.317 0.317 0.292 0 0.293 0.307 ...
##  $ snow_depth_max : num  0.381 0.393 0.38 0.372 0.35 0.321 0.319 1.61 0.333 0.32 ...
##  $ air_temp_avg   : num  -9.62 -8.97 -7.077 -0.171 0.281 ...
##  $ air_temp_min   : num  -12.99 -12.1 -8.95 -3.21 -2.62 ...
##  $ air_temp_max   : num  -6.58 -6.26 -3.17 2.42 5.04 ...
##  $ windspeed_avg  : num  0.506 1.174 0.716 1.857 0.641 ...
##  $ windspeed_max  : num  2.24 3.54 3.42 5.38 3.25 ...

First look

head(acw[, c("timestamp","air_temp_avg","rain",
  "snow_depth_mean","windspeed_avg")])

##             timestamp air_temp_avg  rain snow_depth_mean windspeed_avg
## 1 2018-01-18 00:00:00       -9.620 0.000           0.329         0.506
## 2 2018-01-19 00:00:00       -8.970 0.000           0.367         1.174
## 3 2018-01-20 00:00:00       -7.077 0.000           0.373         0.716
## 4 2018-01-21 00:00:00       -0.171 0.254           0.358         1.857
## 5 2018-01-22 00:00:00        0.281 4.064           0.335         0.641
## 6 2018-01-23 00:00:00       -1.079 0.762           0.319         2.941

timestamp is a character; we need to convert it to a Date (the data are daily, so the time component is not needed)

You try…

head(acw)
?strptime
?DateTimeClasses

Converting the timestamp

class(acw$timestamp)

## [1] "character"

acw$Date <- as.Date(acw$timestamp, 
  format = "%Y-%m-%d %H:%M:%S")
class(acw$Date)

## [1] "Date"

range(acw$Date)

## [1] "2018-01-18" "2020-12-31"

About 3 years of daily data (2018–2020)
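That range covers 1079 calendar days, but the data frame has 1058 rows, so some days are absent. One way to find gaps is sketched below on a toy vector in the same layout as acw$timestamp; the values are made up, and the names `stamps`, `all_days`, and `missing` are illustrative.

```r
# Toy timestamps in the same layout as acw$timestamp (made-up values)
stamps <- c("2018-01-18 00:00:00", "2018-01-19 00:00:00",
            "2018-01-21 00:00:00")
dates  <- as.Date(stamps, format = "%Y-%m-%d %H:%M:%S")

# Every calendar day between the first and last observation
all_days <- seq(min(dates), max(dates), by = "day")
missing  <- all_days[!(all_days %in% dates)]

missing            # "2018-01-20"
length(all_days)   # 4 calendar days vs. 3 observed
```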

Deriving grouping variables from the Date

acw$Year    <- as.numeric(format(acw$Date, "%Y"))
acw$Month   <- as.numeric(format(acw$Date, "%m"))
acw$DOY     <- as.numeric(format(acw$Date, "%j"))
acw$Weekday <- weekdays(acw$Date)

head(acw[, c("Date","Year","Month","DOY")], 4)

##         Date Year Month DOY
## 1 2018-01-18 2018     1  18
## 2 2018-01-19 2018     1  19
## 3 2018-01-20 2018     1  20
## 4 2018-01-21 2018     1  21

Exercise

Use ddply() to compute the mean, minimum, and maximum temperature for each month of each year.

Monthly temperature by year

ddply(acw, .(Year, Month), plyr::summarize,
  Mean.Temp = round(mean(air_temp_avg), 1),
  Min.Temp  = round(min(air_temp_min), 1),
  Max.Temp  = round(max(air_temp_max), 1))

##    Year Month Mean.Temp Min.Temp Max.Temp
## 1  2018     1      -5.2    -19.5      6.1
## 2  2018     2      -4.5    -22.7     18.4
## 3  2018     3      -3.3    -21.0     12.1
## 4  2018     4       0.2    -14.0     18.6
## 5  2018     5      13.6     -1.4     28.2
## 6  2018     6      15.1      4.5     29.2
## 7  2018     7      20.1      6.6     33.1
## 8  2018     8      19.3      8.1     30.5
## 9  2018     9      15.3      2.4     30.9
## 10 2018    10       6.1     -7.5     25.8
## 11 2018    11      -2.9    -24.0     12.8
## 12 2018    12      -5.2    -16.4      9.9
## 13 2019     1     -10.2    -28.6      6.8
## 14 2019     2      -7.9    -24.0      8.1
## 15 2019     3      -4.2    -21.2     12.8
## 16 2019     4       4.1     -9.9     23.3
## 17 2019     5       9.7      0.2     28.7
## 18 2019     6      15.1      3.2     28.4
## 19 2019     7      19.4      8.7     30.2
## 20 2019     8      17.0      7.1     27.2
## 21 2019     9      13.3      1.4     26.4
## 22 2019    10       7.7     -1.9     23.5
## 23 2019    11      -2.0    -17.3     17.2
## 24 2019    12      -5.4    -25.0      9.1
## 25 2020     1      -5.5    -22.5     14.5
## 26 2020     2      -5.9    -25.4     11.3
## 27 2020     3      -0.5    -16.0     15.6
## 28 2020     4       2.7     -8.4     16.1
## 29 2020     5      10.7     -6.1     32.5
## 30 2020     6      15.8     -0.2     30.9
## 31 2020     7      20.3     11.8     32.1
## 32 2020     8      17.4      6.9     30.1
## 33 2020     9      13.2     -1.9     28.2
## 34 2020    10       7.0     -6.7     23.7
## 35 2020    11       3.0    -11.4     21.7
## 36 2020    12      -4.8    -18.2     12.8

Plotting monthly means

monthly <- ddply(acw, .(Year, Month), plyr::summarize,
  Mean.Temp = mean(air_temp_avg))

plot(Mean.Temp ~ Month, data = monthly, 
  col = as.factor(monthly$Year), pch = 19,
  ylab = expression("Mean Temperature ("*degree*"C)"),
  main = "Ackerman Clearing Monthly Temps")
legend("topleft", legend = unique(monthly$Year),
  col = 1:3, pch = 19, bty = "n")

`ddply()` vs. `dplyr`

plyr popularized the split-apply-combine strategy in R. dplyr (also by Wickham) is its successor, built for speed and pipelines.

The core translation:

# plyr
ddply(df, .(var1, var2), summarize, ...)

# dplyr
df |> group_by(var1, var2) |> summarise(...)

group_by() replaces the .variables argument
The pipe (|>) replaces nesting
summarise() replaces both .fun and ...
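Here is the translation as a small runnable sketch on airquality. Both plyr and dplyr export summarise() and summarize(), and whichever package is attached last masks the other, so this example qualifies the calls with dplyr:: to stay unambiguous. The name `by_month` is illustrative.

```r
library(dplyr)
data(airquality)

# Namespace-qualified calls sidestep masking between plyr and dplyr
by_month <- airquality |>
  dplyr::group_by(Month) |>
  dplyr::summarise(
    Avg.Temp  = mean(Temp),
    Max.Ozone = max(Ozone, na.rm = TRUE))
by_month
```

If a grouped pipeline unexpectedly returns a single row, a masked summarise() is a likely cause; that is also why the plyr calls in these slides are written as plyr::summarize.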

Side-by-side: simple grouping

library(dplyr)

# plyr
ddply(acw, .(Year, Month), plyr::summarize,
  Mean.Temp = round(mean(air_temp_avg), 1),
  Min.Temp  = round(min(air_temp_min), 1),
  Max.Temp  = round(max(air_temp_max), 1))

##    Year Month Mean.Temp Min.Temp Max.Temp
## 1  2018     1      -5.2    -19.5      6.1
## 2  2018     2      -4.5    -22.7     18.4
## 3  2018     3      -3.3    -21.0     12.1
## 4  2018     4       0.2    -14.0     18.6
## 5  2018     5      13.6     -1.4     28.2
## 6  2018     6      15.1      4.5     29.2
## 7  2018     7      20.1      6.6     33.1
## 8  2018     8      19.3      8.1     30.5
## 9  2018     9      15.3      2.4     30.9
## 10 2018    10       6.1     -7.5     25.8
## 11 2018    11      -2.9    -24.0     12.8
## 12 2018    12      -5.2    -16.4      9.9
## 13 2019     1     -10.2    -28.6      6.8
## 14 2019     2      -7.9    -24.0      8.1
## 15 2019     3      -4.2    -21.2     12.8
## 16 2019     4       4.1     -9.9     23.3
## 17 2019     5       9.7      0.2     28.7
## 18 2019     6      15.1      3.2     28.4
## 19 2019     7      19.4      8.7     30.2
## 20 2019     8      17.0      7.1     27.2
## 21 2019     9      13.3      1.4     26.4
## 22 2019    10       7.7     -1.9     23.5
## 23 2019    11      -2.0    -17.3     17.2
## 24 2019    12      -5.4    -25.0      9.1
## 25 2020     1      -5.5    -22.5     14.5
## 26 2020     2      -5.9    -25.4     11.3
## 27 2020     3      -0.5    -16.0     15.6
## 28 2020     4       2.7     -8.4     16.1
## 29 2020     5      10.7     -6.1     32.5
## 30 2020     6      15.8     -0.2     30.9
## 31 2020     7      20.3     11.8     32.1
## 32 2020     8      17.4      6.9     30.1
## 33 2020     9      13.2     -1.9     28.2
## 34 2020    10       7.0     -6.7     23.7
## 35 2020    11       3.0    -11.4     21.7
## 36 2020    12      -4.8    -18.2     12.8

Side-by-side: simple grouping

# dplyr

acw |>
  group_by(Year, Month) |>
  summarise(
    Mean.Temp = round(mean(air_temp_avg), 1),
    Min.Temp  = round(min(air_temp_min), 1),
    Max.Temp  = round(max(air_temp_max), 1),
    .groups = "drop") #Drops persistent grouping. See `?dplyr_by`.

## # A tibble: 36 × 5
##     Year Month Mean.Temp Min.Temp Max.Temp
##    <dbl> <dbl>     <dbl>    <dbl>    <dbl>
##  1  2018     1      -5.2    -19.5      6.1
##  2  2018     2      -4.5    -22.7     18.4
##  3  2018     3      -3.3    -21       12.1
##  4  2018     4       0.2    -14       18.6
##  5  2018     5      13.6     -1.4     28.2
##  6  2018     6      15.1      4.5     29.2
##  7  2018     7      20.1      6.6     33.1
##  8  2018     8      19.3      8.1     30.5
##  9  2018     9      15.3      2.4     30.9
## 10  2018    10       6.1     -7.5     25.8
## # ℹ 26 more rows

Identical results; dplyr returns a tibble (prints slightly differently)

References

?ddply, ?summarize (package plyr)
?as.Date, ?strptime, ?difftime, ?seq.Date
Wickham, H. (2011). The Split-Apply-Combine Strategy for Data Analysis. Journal of Statistical Software, 40(1), 1–29.
Ackerman Clearing meteorological data, Huntington Wildlife Forest, Adirondack Park (adk-ltm.org)
?airquality
