Goal: plotting many dimensions at once

A high-quality data visualization is to pack as much information as possible into an image, while still telling a coherent and legible story. Aesthetics are also important, though those are at some level subjective, there are some design principles that really facilitate visual communication.

In this lab, we will incrementally contruct the following figure in R:

Note that the information is contained in many forms: location on the plot, colors, sizes, and judicious labelling.

The data

The data are in the WDI.csv file, downloaded from the World Bank’s World Development Indicators via the WDI R package. It contains one row per country with variables including life expectancy, GDP per capita, population, land area, geographic coordinates, and World Bank region. Load it and inspect its structure before plotting.

wdi <- read.csv("WDI.csv")
str(wdi)
## 'data.frame':    208 obs. of  16 variables:
##  $ country    : chr  "India" "China" "United States" "Indonesia" ...
##  $ iso2c      : chr  "IN" "CN" "US" "ID" ...
##  $ iso3c      : chr  "IND" "CHN" "USA" "IDN" ...
##  $ year       : int  2022 2022 2022 2022 2022 2022 2022 2022 2022 2022 ...
##  $ status     : logi  NA NA NA NA NA NA ...
##  $ lastupdated: chr  "2026-01-28" "2026-01-28" "2026-01-28" "2026-01-28" ...
##  $ LifeExp    : num  71.7 78.2 77.4 70.9 67.4 ...
##  $ GDP        : num  2347 12971 76657 4731 1538 ...
##  $ Area       : num  2973190 9388210 9147420 1892555 770880 ...
##  $ Population : int  1425423212 1412175000 334017321 278830529 243700667 223150896 210306415 169384897 144236933 128613117 ...
##  $ region     : chr  "South Asia" "East Asia & Pacific" "North America" "East Asia & Pacific" ...
##  $ capital    : chr  "New Delhi" "Beijing" "Washington D.C." "Jakarta" ...
##  $ longitude  : num  77.2 116.3 -77 106.8 72.8 ...
##  $ latitude   : num  28.6 40 38.9 -6.2 30.5 ...
##  $ income     : chr  "Lower middle income" "Upper middle income" "High income" "Upper middle income" ...
##  $ lending    : chr  "IBRD" "IBRD" "Not classified" "IBRD" ...

Step 1 — a plain scatter plot

Start with the simplest possible plot: GDP per capita (divided by 1000 for readability) on the x-axis and life expectancy on the y-axis. This gives us the baseline to build from.

plot(wdi$GDP/1e3, wdi$LifeExp)

Very basic.

Step 1b — log scale on x

How to deal with clustering at various ends of this figure? GDP per capita is highly right-skewed — a handful of wealthy countries pull the scale so far that most of the variation among poorer countries is compressed into a narrow strip. A log x-axis spreads the data out and reveals the relationship much more clearly. While we’re at it, we polish the axes with par().

par(cex.lab = 1.2, las = 1,
    mgp = c(2, .25, 0), tck = 0.01)

plot(wdi$GDP/1e3, wdi$LifeExp,
     log  = "x",
     xlab = "GDP per capita (1000 USD, log scale)",
     ylab = "Life expectancy (years)")

log = "x" applies a log axis
(log = "y", log = "xy" also work)

las = 1 — horizontal tick labels
tck = 0.01 — inward ticks
mgp — controls axis label / tick label / tick line positions

par() reference

par() is R’s global graphics parameter function — it controls everything from margins to font family to tick direction. Changes made with par() persist for the rest of the plot (or until reset), so it’s usually called once at the top of a plotting block. Here are the parameters most relevant today.

parameter what it does
cex.lab axis label size
cex.axis tick label size
mar margins (bottom, left, top, right)
mgp axis title, tick label, tick line positions
las tick label orientation
tck tick length (+inward, −outward)
family font family ("serif", "sans", "mono")
bty box type ("o", "l", "u", "c", "n")
par("mar")   # default margins
## [1] 5.1 4.1 4.1 2.1
par("mgp")   # default mgp
## [1] 3 1 0

Step 2 — size points by population

A third quantitative variable — population — can be encoded as point size using the cex argument. We scale each country’s circle so that its area is proportional to population.

par(cex.lab = 1.2, las = 1,
    mgp = c(2, .25, 0), tck = 0.01)

plot(wdi$GDP/1e3, wdi$LifeExp,
     log  = "x",
     xlab = "GDP per capita ($1000 USD, log scale)",
     ylab = "Life expectancy (years)",
     cex  = sqrt(wdi$Population /
                 max(wdi$Population,
                     na.rm = TRUE)) * 15)

cex controls point size
sqrt() makes area proportional to population

Why sqrt()?

A circle’s area = π r²

If we set cex proportional to population directly, the radius scales linearly — but our eye reads area.

To make area proportional to population:

\[r \propto \sqrt{\text{population}}\]

So we use:

cex = sqrt(Population / max_pop) * 15

The * 15 just scales up to a visible range.

Step 2b — add transparency

With variable-size points, overlap becomes a problem — large circles hide smaller ones beneath them. Transparency helps: points that overlap produce a visually darker region, preserving information about density. Use pch = 21 (filled circle) so border and fill can be controlled independently, and set the fill with rgb() including an alpha value.

par(cex.lab = 1.2, las = 1,
    mgp = c(2, .25, 0), tck = 0.01)

plot(wdi$GDP/1e3, wdi$LifeExp,
     log  = "x",
     xlab = "GDP per capita (1000 USD, log scale)",
     ylab = "Life expectancy (years)",
     cex  = sqrt(wdi$Population /
                 max(wdi$Population,
                     na.rm = TRUE)) * 15,
     pch  = 21,
     col  = "grey40",
     bg   = rgb(0, 0, 0, 0.25))

pch = 21 — filled circles with separate border/fill
rgb(r, g, b, alpha) — alpha: 0 = transparent, 1 = opaque

Transparency reveals overlapping points.

rgb() and color in R

rgb() creates colors by specifying red, green, and blue channels (0–1 scale), plus an optional alpha (transparency) channel. The adjustcolor() base R function and scales::alpha() offer convenient shortcuts for adding transparency to named colors.

# Pure red, 50% transparent
rgb(1, 0, 0, 0.5)

# Any named color with alpha:
adjustcolor("steelblue", alpha.f = 0.4)

# scales package convenience function:
scales::alpha("steelblue", 0.4)
rgb(1, 0, 0, 0.5)
## [1] "#FF000080"
adjustcolor("steelblue", alpha.f = 0.4)
## [1] "#4682B466"

All values 0–1 (or use maxColorValue = 255)

Step 3 — color by region

Now we encode a fourth variable — World Bank region — as point color. We define a named color vector cols where each name is a region label, then use match() to look up the right color for every row in the data frame.

wdi$region <- factor(wdi$region)

cols <- c(
  "East Asia & Pacific"        = "steelblue",
  "Europe & Central Asia"      = "forestgreen",
  "Latin America & Caribbean"  = "goldenrod",
  "Middle East & North Africa" = "darkorange",
  "North America"              = "mediumpurple",
  "South Asia"                 = "firebrick",
  "Sub-Saharan Africa"         = "sienna")

cont_col <- cols[match(wdi$region,
                       levels(wdi$region))]

plot(wdi$GDP/1e3, wdi$LifeExp, log = "x",
     xlab = "GDP per capita (1000 USD, log scale)",
     ylab = "Life expectancy (years)",
     cex  = sqrt(wdi$Population /
                 max(wdi$Population, na.rm=TRUE))*15,
     pch = 21, col = cont_col,
     bg  = scales::alpha(cont_col, .7))

How match() maps colors to categories

The trick is straightforward: factor() stores categories as integers (1, 2, 3, …) mapped to an ordered list of levels(). match() retrieves that integer for each observation, which we use as an index into our color vector. This pattern works for any named vector — colors, point characters, line types — not just cols.

levels() gives factor levels in order

levels(wdi$region)
## [1] "East Asia & Pacific"        "Europe & Central Asia"     
## [3] "Latin America & Caribbean"  "Middle East & North Africa"
## [5] "North America"              "South Asia"                
## [7] "Sub-Saharan Africa"

match() returns each row’s position in that ordered list

wdi$region[1:20]
##  [1] South Asia                 East Asia & Pacific       
##  [3] North America              East Asia & Pacific       
##  [5] South Asia                 Sub-Saharan Africa        
##  [7] Latin America & Caribbean  South Asia                
##  [9] Europe & Central Asia      Latin America & Caribbean 
## [11] Sub-Saharan Africa         East Asia & Pacific       
## [13] East Asia & Pacific        Middle East & North Africa
## [15] Sub-Saharan Africa         East Asia & Pacific       
## [17] Middle East & North Africa Europe & Central Asia     
## [19] Europe & Central Asia      East Asia & Pacific       
## 7 Levels: East Asia & Pacific ... Sub-Saharan Africa
match(wdi$region, levels(wdi$region))[1:20]
##  [1] 6 1 5 1 6 7 3 6 2 3 7 1 1 4 7 1 4 2 2 1

cols[match(...)] picks the right color by index

cols[match(wdi$region, levels(wdi$region))][1:20]
##                 South Asia        East Asia & Pacific 
##                "firebrick"                "steelblue" 
##              North America        East Asia & Pacific 
##             "mediumpurple"                "steelblue" 
##                 South Asia         Sub-Saharan Africa 
##                "firebrick"                   "sienna" 
##  Latin America & Caribbean                 South Asia 
##                "goldenrod"                "firebrick" 
##      Europe & Central Asia  Latin America & Caribbean 
##              "forestgreen"                "goldenrod" 
##         Sub-Saharan Africa        East Asia & Pacific 
##                   "sienna"                "steelblue" 
##        East Asia & Pacific Middle East & North Africa 
##                "steelblue"               "darkorange" 
##         Sub-Saharan Africa        East Asia & Pacific 
##                   "sienna"                "steelblue" 
## Middle East & North Africa      Europe & Central Asia 
##               "darkorange"              "forestgreen" 
##      Europe & Central Asia        East Asia & Pacific 
##              "forestgreen"                "steelblue"

Step 4 — add a legend

A color-coded plot is incomplete without a legend. The legend() function is flexible but has many arguments — the key ones for categorical point data are shown below.

par(cex.lab = 1.2, las = 1,
    mar = c(4, 4, 2, 2),
    mgp = c(2, .25, 0), tck = 0.01)
plot(wdi$GDP/1e3, wdi$LifeExp, log = "x",
     xlab = "GDP per capita (1000 USD, log scale)",
     ylab = "Life expectancy (years)",
     cex = sqrt(wdi$Population /
                max(wdi$Population, na.rm=TRUE))*15,
     pch = 21, col = cont_col,
     bg  = scales::alpha(cont_col, .7))

legend("bottomright", title = "Region",
       legend = levels(wdi$region),
       pt.bg  = cols,
       col    = "grey40",
       pch    = 21,
       pt.cex = 1.8,
       bty    = "n")

legend() key arguments

legend(
  x      = "bottomright", # position or coordinates
  legend = ...,           # text labels
  col    = ...,           # point/line colors
  pch    = ...,           # point character
  lty    = ...,           # line types (if lines)
  pt.cex = 2,             # symbol size in legend
  cex    = 0.9,           # text size
  bty    = "n",           # no box around legend
  title  = "Region"       # optional title
)

Position keywords:
"topleft", "topright", "bottomleft", "bottomright", "top", "right", …

Or give exact coordinates: legend(x = 0.5, y = 60, ...)

For filled circles (pch = 21), use both col (border) and pt.bg (fill):

legend("bottomright",
       legend = levels(wdi$region),
       pt.bg  = cols,      # fill
       col    = "grey40",  # border
       pch    = 21,
       pt.cex = 1.8,
       bty    = "n")

pch 21–25 are filled shapes — border color (col) and fill (bg or pt.bg) are independent.

Step 5 — reorder by population

R draws points in row order, so later rows are drawn on top of earlier ones. If we plot small countries last, they appear on top of and remain visible beneath the large-population circles. Sorting by decreasing population before plotting solves the occlusion problem.

# Sort descending so large (populous) countries
# are drawn first — smaller ones on top
wdi <- wdi[order(wdi$Population,
                 decreasing = TRUE), ]

cont_col <- cols[match(wdi$region,
                       levels(wdi$region))]
cex_pop  <- sqrt(wdi$Population /
                 max(wdi$Population,
                     na.rm = TRUE)) * 15

par(cex.lab = 1.2, las = 1,
    mar = c(4, 4, 2, 2),
    mgp = c(2, .25, 0), tck = 0.01)
plot(wdi$GDP/1e3, wdi$LifeExp, log = "x",
     xlab = "GDP per capita (1000 USD, log scale)",
     ylab = "Life expectancy (years)",
     cex = cex_pop, pch = 21,
     col = cont_col,
     bg  = scales::alpha(cont_col, .7))
legend("bottomright", title = "Region",
       legend = levels(wdi$region),
       pt.bg = cols, col = "grey40",
       pch = 21, pt.cex = 1.8, bty = "n")

Large countries drawn first → small countries on top and visible.

Step 6 — label large countries

For the most populous countries, it’s worth printing the country name directly on the plot. We use subset() to filter to countries above a population threshold, then text() to draw labels at their coordinates. Scaling label size with cex_pop keeps large-country labels proportionally bigger.

par(cex.lab = 1.2, las = 1,
    mar = c(4, 4, 2, 2),
    mgp = c(2, .25, 0), tck = 0.01)
plot(wdi$GDP/1e3, wdi$LifeExp, log = "x",
     xlab = "GDP per capita (1000 USD, log scale)",
     ylab = "Life expectancy (years)",
     cex = cex_pop, pch = 21,
     col = cont_col,
     bg  = scales::alpha(cont_col, .7))
legend("bottomright", title = "Region",
       legend = levels(wdi$region),
       pt.bg = cols, col = "grey40",
       pch = 21, pt.cex = 1.8, bty = "n")

wdi_big <- subset(wdi, Population > 1e8)

text(labels = wdi_big$country,
     x      = wdi_big$GDP/1e3,
     y      = wdi_big$LifeExp,
     cex    = cex_pop[wdi$Population > 1e8] / 5,
     adj    = 0.5)

adj = 0.5 — centered on the point
(adj: 0=left-justified, 1=right-justified)
font = 2bold
font = 3italic
pos = 4 — offset to the right of coordinates

Step 7 — custom grid lines with abline()

Reference grid lines help readers read off values. The key technique here is type = "n" — this draws the axes and sets up the plot region without plotting any points, letting us layer the grid first so points appear on top of it. We then use abline() to draw lines at specific values rather than grid(), which gives us full control over placement on a log axis.

par(cex.lab = 1.2, las = 1,
    mar = c(4, 4, 2, 2),
    mgp = c(2, .25, 0), tck = 0.01)
plot(wdi$GDP/1e3, wdi$LifeExp, log = "x",
     xlab = "GDP per capita (1000 USD, log scale)",
     ylab = "Life expectancy (years)",
     type = "n")
abline(h = seq(40, 90, 10), col = "grey80", lty = 3)
abline(v = c(.5,1,2,5,10,20,50,100), col = "grey80", lty = 3)
points(wdi$GDP/1e3, wdi$LifeExp,
       cex = cex_pop, pch = 21,
       col = cont_col, bg = scales::alpha(cont_col, .7))
legend("bottomright", title = "Region",
       legend = levels(wdi$region),
       pt.bg = cols, col = "grey40",
       pch = 21, pt.cex = 1.5, bty = "n", cex = 0.7)
wdi_big <- subset(wdi, Population > 1e8)
text(labels = wdi_big$country,
     x = wdi_big$GDP/1e3, y = wdi_big$LifeExp,
     cex = cex_pop[wdi$Population > 1e8]/5, adj = 0.5)

abline(h = ...) — horizontal lines at given y values
abline(v = ...) — vertical lines at given x values

Preferred over grid() here because the x-axis is log-scale — grid() spaces lines at tick positions, abline() lets you choose exact values.

Step 8 — a second legend for point size

The region legend explains colors, but readers still need a reference for what the circle sizes mean. A second legend() call encodes population by using the exact same pt.cex formula as the plot, applied to a set of round reference values. This ensures perfect visual consistency between the legend symbols and the data points.

par(cex.lab = 1.2, las = 1,
    mar = c(4, 4, 2, 2),
    mgp = c(2, .25, 0), tck = 0.01)
plot(wdi$GDP/1e3, wdi$LifeExp, log = "x",
     xlab = "GDP per capita (1000 USD, log scale)",
     ylab = "Life expectancy (years)", type = "n")
abline(h = seq(40, 90, 10), col = "grey80", lty = 3)
abline(v = c(.5,1,2,5,10,20,50,100), col = "grey80", lty = 3)
points(wdi$GDP/1e3, wdi$LifeExp,
       cex = cex_pop, pch = 21,
       col = cont_col, bg = scales::alpha(cont_col, .7))
legend("bottomright", title = "Region",
       legend = levels(wdi$region),
       pt.bg = cols, col = "grey40",
       pch = 21, pt.cex = 1.5, bty = "n", cex = 0.7)
legend("right", title = "Population (millions)",
       legend = c(10, 50, 100, NA, 500),
       pt.cex = sqrt(c(10, 50, 100, NA, 500)*1e6 / max(wdi$Population, na.rm = TRUE))*15,
       col = "darkgrey", pch = 21, pt.bg = "grey", bty = "n", ncol = 1)
wdi_big <- subset(wdi, Population > 1e8)
text(labels = wdi_big$country,
     x = wdi_big$GDP/1e3, y = wdi_big$LifeExp,
     cex = cex_pop[wdi$Population > 1e8]/5, adj = 0.5)

Key trick: pt.cex uses the same formula as the plot, applied to round reference population values.

NA entries create blank rows — useful for spacing.

The full code

Here is everything assembled into a single self-contained block — from loading the data to the finished plot. This is the version to save, adapt, and build on.

wdi <- read.csv("WDI.csv")
wdi <- wdi[order(wdi$Population, decreasing = TRUE), ]
wdi$region <- factor(wdi$region)

cols <- c("East Asia & Pacific"        = "steelblue",
          "Europe & Central Asia"      = "forestgreen",
          "Latin America & Caribbean"  = "goldenrod",
          "Middle East & North Africa" = "darkorange",
          "North America"              = "mediumpurple",
          "South Asia"                 = "firebrick",
          "Sub-Saharan Africa"         = "sienna")

cont_col <- cols[match(wdi$region, levels(wdi$region))]
cex_pop  <- sqrt(wdi$Population / max(wdi$Population, na.rm = TRUE)) * 15

par(cex.lab = 1.2, mar = c(4, 4, 4, 2),
    mgp = c(2, .25, 0), tck = 0.01,
    family = "serif", las = 1)

plot(wdi$GDP/1e3, wdi$LifeExp, log = "x",
     ylab = "Life expectancy (years)",
     xlab = "GDP per capita (1000 USD, log scale)", type = "n")
title("Health vs. Wealth by Nations (2020)", cex.main = 2, font.main = 1)
title(sub = "World Development Indicators (World Bank) - accessed via the WDI package in R)")
abline(h = seq(40, 90, 5), col = "grey80", lty = 3)
abline(v = c(.5,1,2,5,10,20,50,100), col = "grey80", lty = 3)
points(wdi$GDP/1e3, wdi$LifeExp,
       cex = cex_pop + .5, pch = 21, col = cont_col,
       bg  = scales::alpha(cont_col, .7))
legend("bottomright", title = "Region",
       legend = levels(wdi$region),
       pt.bg = cols, col = "grey40",
       pch = 21, pt.cex = 1.8, bty = "n")
legend("right", title = "Population (millions)",
       legend = c(10, 50, 100, NA, 500),
       pt.cex = sqrt(c(10, 50, 100, NA, 500)*1e6 /
                     max(wdi$Population, na.rm = TRUE))*15,
       col = "darkgrey", pch = 21, pt.bg = "grey", bty = "n")
wdi_big <- subset(wdi, Population > 1e8)
text(labels = wdi_big$country,
     x = wdi_big$GDP/1e3, y = wdi_big$LifeExp,
     cex = cex_pop[wdi$Population > 1e8]/5, adj = 0.5)

Summary: dimensions encoded in one plot

Visual channel Variable R tool
x position (log) GDP per capita log = "x"
y position Life expectancy plot(x, y)
Point size Population cex = sqrt(pop/max_pop) * 15
Point color Region named cols + match()
Text labels Country name text(labels = ...)

Key functions:

par() · plot() · points() · abline() · rgb() · scales::alpha() · match() · legend() · order() · subset() · text() · title()

Exercise

The wdi data includes geographic coordinates (longitude, latitude), land area, and income bracket — enough to make a completely different kind of multi-dimensional plot: a world map made entirely in base R. Apply everything from today to build it.

Using the wdi data make a scatter plot of latitude vs. longitude

  • Size points by Area
  • Color the points by region
  • Make the color transparency proportional to the density: i.e. Population/Area
  • Make the shape of the point (e.g. circle, square, triangle … options in ?points) reflect the income bracket.
  • Label the capitals of the 20 largest countries
  • Create a legend labeling the point types / income brackets and region colors
  • Make it PRETTY!