A high-quality data visualization is to pack as much information as possible into an image, while still telling a coherent and legible story. Aesthetics are also important, though those are at some level subjective, there are some design principles that really facilitate visual communication.
In this lab, we will incrementally contruct the following figure in R:
Note that the information is contained in many forms: location on the plot, colors, sizes, and judicious labelling.
The data are in the WDI.csv file,
downloaded from the World Bank’s World Development Indicators via the
WDI R package. It contains one row per country with
variables including life expectancy, GDP per capita, population, land
area, geographic coordinates, and World Bank region. Load it and inspect
its structure before plotting.
wdi <- read.csv("WDI.csv")
str(wdi)
## 'data.frame': 208 obs. of 16 variables:
## $ country : chr "India" "China" "United States" "Indonesia" ...
## $ iso2c : chr "IN" "CN" "US" "ID" ...
## $ iso3c : chr "IND" "CHN" "USA" "IDN" ...
## $ year : int 2022 2022 2022 2022 2022 2022 2022 2022 2022 2022 ...
## $ status : logi NA NA NA NA NA NA ...
## $ lastupdated: chr "2026-01-28" "2026-01-28" "2026-01-28" "2026-01-28" ...
## $ LifeExp : num 71.7 78.2 77.4 70.9 67.4 ...
## $ GDP : num 2347 12971 76657 4731 1538 ...
## $ Area : num 2973190 9388210 9147420 1892555 770880 ...
## $ Population : int 1425423212 1412175000 334017321 278830529 243700667 223150896 210306415 169384897 144236933 128613117 ...
## $ region : chr "South Asia" "East Asia & Pacific" "North America" "East Asia & Pacific" ...
## $ capital : chr "New Delhi" "Beijing" "Washington D.C." "Jakarta" ...
## $ longitude : num 77.2 116.3 -77 106.8 72.8 ...
## $ latitude : num 28.6 40 38.9 -6.2 30.5 ...
## $ income : chr "Lower middle income" "Upper middle income" "High income" "Upper middle income" ...
## $ lending : chr "IBRD" "IBRD" "Not classified" "IBRD" ...
Start with the simplest possible plot: GDP per capita (divided by 1000 for readability) on the x-axis and life expectancy on the y-axis. This gives us the baseline to build from.
plot(wdi$GDP/1e3, wdi$LifeExp)
Very basic.
How to deal with clustering at various ends of this figure? GDP per
capita is highly right-skewed — a handful of wealthy countries pull the
scale so far that most of the variation among poorer countries is
compressed into a narrow strip. A log x-axis spreads the data out and
reveals the relationship much more clearly. While we’re at it, we polish
the axes with par().
par(cex.lab = 1.2, las = 1,
mgp = c(2, .25, 0), tck = 0.01)
plot(wdi$GDP/1e3, wdi$LifeExp,
log = "x",
xlab = "GDP per capita (1000 USD, log scale)",
ylab = "Life expectancy (years)")
log = "x" applies a log axis
(log = "y", log = "xy" also work)
las = 1 — horizontal tick labels
tck = 0.01 — inward ticks
mgp — controls axis label / tick label / tick line
positions
par() referencepar() is R’s global graphics parameter function — it
controls everything from margins to font family to tick direction.
Changes made with par() persist for the rest of the plot
(or until reset), so it’s usually called once at the top of a plotting
block. Here are the parameters most relevant today.
| parameter | what it does |
|---|---|
cex.lab |
axis label size |
cex.axis |
tick label size |
mar |
margins (bottom, left, top, right) |
mgp |
axis title, tick label, tick line positions |
las |
tick label orientation |
tck |
tick length (+inward, −outward) |
family |
font family ("serif", "sans",
"mono") |
bty |
box type ("o", "l", "u",
"c", "n") |
par("mar") # default margins
## [1] 5.1 4.1 4.1 2.1
par("mgp") # default mgp
## [1] 3 1 0
A third quantitative variable — population — can be encoded as point
size using the cex argument. We scale each country’s circle
so that its area is proportional to population.
par(cex.lab = 1.2, las = 1,
mgp = c(2, .25, 0), tck = 0.01)
plot(wdi$GDP/1e3, wdi$LifeExp,
log = "x",
xlab = "GDP per capita ($1000 USD, log scale)",
ylab = "Life expectancy (years)",
cex = sqrt(wdi$Population /
max(wdi$Population,
na.rm = TRUE)) * 15)
cex controls point size
sqrt() makes area proportional to
population
sqrt()?A circle’s area = π r²
If we set cex proportional to population directly, the
radius scales linearly — but our eye reads
area.
To make area proportional to population:
\[r \propto \sqrt{\text{population}}\]
So we use:
cex = sqrt(Population / max_pop) * 15
The * 15 just scales up to a visible range.
With variable-size points, overlap becomes a problem — large circles
hide smaller ones beneath them. Transparency helps: points that overlap
produce a visually darker region, preserving information about density.
Use pch = 21 (filled circle) so border and fill can be
controlled independently, and set the fill with rgb()
including an alpha value.
par(cex.lab = 1.2, las = 1,
mgp = c(2, .25, 0), tck = 0.01)
plot(wdi$GDP/1e3, wdi$LifeExp,
log = "x",
xlab = "GDP per capita (1000 USD, log scale)",
ylab = "Life expectancy (years)",
cex = sqrt(wdi$Population /
max(wdi$Population,
na.rm = TRUE)) * 15,
pch = 21,
col = "grey40",
bg = rgb(0, 0, 0, 0.25))
pch = 21 — filled circles with separate
border/fill
rgb(r, g, b, alpha) — alpha: 0 = transparent, 1 =
opaque
Transparency reveals overlapping points.
rgb() and color in Rrgb() creates colors by specifying red, green, and blue
channels (0–1 scale), plus an optional alpha (transparency) channel. The
adjustcolor() base R function and
scales::alpha() offer convenient shortcuts for adding
transparency to named colors.
# Pure red, 50% transparent
rgb(1, 0, 0, 0.5)
# Any named color with alpha:
adjustcolor("steelblue", alpha.f = 0.4)
# scales package convenience function:
scales::alpha("steelblue", 0.4)
rgb(1, 0, 0, 0.5)
## [1] "#FF000080"
adjustcolor("steelblue", alpha.f = 0.4)
## [1] "#4682B466"
All values 0–1 (or use maxColorValue = 255)
Now we encode a fourth variable — World Bank region — as point color.
We define a named color vector cols where each name is a
region label, then use match() to look up the right color
for every row in the data frame.
wdi$region <- factor(wdi$region)
cols <- c(
"East Asia & Pacific" = "steelblue",
"Europe & Central Asia" = "forestgreen",
"Latin America & Caribbean" = "goldenrod",
"Middle East & North Africa" = "darkorange",
"North America" = "mediumpurple",
"South Asia" = "firebrick",
"Sub-Saharan Africa" = "sienna")
cont_col <- cols[match(wdi$region,
levels(wdi$region))]
plot(wdi$GDP/1e3, wdi$LifeExp, log = "x",
xlab = "GDP per capita (1000 USD, log scale)",
ylab = "Life expectancy (years)",
cex = sqrt(wdi$Population /
max(wdi$Population, na.rm=TRUE))*15,
pch = 21, col = cont_col,
bg = scales::alpha(cont_col, .7))
match() maps colors to categoriesThe trick is straightforward: factor() stores categories
as integers (1, 2, 3, …) mapped to an ordered list of
levels(). match() retrieves that integer for
each observation, which we use as an index into our color vector. This
pattern works for any named vector — colors, point characters, line
types — not just cols.
levels() gives factor levels in order
levels(wdi$region)
## [1] "East Asia & Pacific" "Europe & Central Asia"
## [3] "Latin America & Caribbean" "Middle East & North Africa"
## [5] "North America" "South Asia"
## [7] "Sub-Saharan Africa"
match() returns each row’s position in that ordered
list
wdi$region[1:20]
## [1] South Asia East Asia & Pacific
## [3] North America East Asia & Pacific
## [5] South Asia Sub-Saharan Africa
## [7] Latin America & Caribbean South Asia
## [9] Europe & Central Asia Latin America & Caribbean
## [11] Sub-Saharan Africa East Asia & Pacific
## [13] East Asia & Pacific Middle East & North Africa
## [15] Sub-Saharan Africa East Asia & Pacific
## [17] Middle East & North Africa Europe & Central Asia
## [19] Europe & Central Asia East Asia & Pacific
## 7 Levels: East Asia & Pacific ... Sub-Saharan Africa
match(wdi$region, levels(wdi$region))[1:20]
## [1] 6 1 5 1 6 7 3 6 2 3 7 1 1 4 7 1 4 2 2 1
cols[match(...)] picks the right color by index
cols[match(wdi$region, levels(wdi$region))][1:20]
## South Asia East Asia & Pacific
## "firebrick" "steelblue"
## North America East Asia & Pacific
## "mediumpurple" "steelblue"
## South Asia Sub-Saharan Africa
## "firebrick" "sienna"
## Latin America & Caribbean South Asia
## "goldenrod" "firebrick"
## Europe & Central Asia Latin America & Caribbean
## "forestgreen" "goldenrod"
## Sub-Saharan Africa East Asia & Pacific
## "sienna" "steelblue"
## East Asia & Pacific Middle East & North Africa
## "steelblue" "darkorange"
## Sub-Saharan Africa East Asia & Pacific
## "sienna" "steelblue"
## Middle East & North Africa Europe & Central Asia
## "darkorange" "forestgreen"
## Europe & Central Asia East Asia & Pacific
## "forestgreen" "steelblue"
A color-coded plot is incomplete without a legend. The
legend() function is flexible but has many arguments — the
key ones for categorical point data are shown below.
par(cex.lab = 1.2, las = 1,
mar = c(4, 4, 2, 2),
mgp = c(2, .25, 0), tck = 0.01)
plot(wdi$GDP/1e3, wdi$LifeExp, log = "x",
xlab = "GDP per capita (1000 USD, log scale)",
ylab = "Life expectancy (years)",
cex = sqrt(wdi$Population /
max(wdi$Population, na.rm=TRUE))*15,
pch = 21, col = cont_col,
bg = scales::alpha(cont_col, .7))
legend("bottomright", title = "Region",
legend = levels(wdi$region),
pt.bg = cols,
col = "grey40",
pch = 21,
pt.cex = 1.8,
bty = "n")
legend() key argumentslegend(
x = "bottomright", # position or coordinates
legend = ..., # text labels
col = ..., # point/line colors
pch = ..., # point character
lty = ..., # line types (if lines)
pt.cex = 2, # symbol size in legend
cex = 0.9, # text size
bty = "n", # no box around legend
title = "Region" # optional title
)
Position keywords:
"topleft", "topright",
"bottomleft", "bottomright",
"top", "right", …
Or give exact coordinates:
legend(x = 0.5, y = 60, ...)
For filled circles (pch = 21), use both
col (border) and pt.bg (fill):
legend("bottomright",
legend = levels(wdi$region),
pt.bg = cols, # fill
col = "grey40", # border
pch = 21,
pt.cex = 1.8,
bty = "n")
pch 21–25 are filled shapes — border
color (col) and fill (bg or
pt.bg) are independent.
R draws points in row order, so later rows are drawn on top of earlier ones. If we plot small countries last, they appear on top of and remain visible beneath the large-population circles. Sorting by decreasing population before plotting solves the occlusion problem.
# Sort descending so large (populous) countries
# are drawn first — smaller ones on top
wdi <- wdi[order(wdi$Population,
decreasing = TRUE), ]
cont_col <- cols[match(wdi$region,
levels(wdi$region))]
cex_pop <- sqrt(wdi$Population /
max(wdi$Population,
na.rm = TRUE)) * 15
par(cex.lab = 1.2, las = 1,
mar = c(4, 4, 2, 2),
mgp = c(2, .25, 0), tck = 0.01)
plot(wdi$GDP/1e3, wdi$LifeExp, log = "x",
xlab = "GDP per capita (1000 USD, log scale)",
ylab = "Life expectancy (years)",
cex = cex_pop, pch = 21,
col = cont_col,
bg = scales::alpha(cont_col, .7))
legend("bottomright", title = "Region",
legend = levels(wdi$region),
pt.bg = cols, col = "grey40",
pch = 21, pt.cex = 1.8, bty = "n")
Large countries drawn first → small countries on top and visible.
For the most populous countries, it’s worth printing the country name
directly on the plot. We use subset() to filter to
countries above a population threshold, then text() to draw
labels at their coordinates. Scaling label size with
cex_pop keeps large-country labels proportionally
bigger.
par(cex.lab = 1.2, las = 1,
mar = c(4, 4, 2, 2),
mgp = c(2, .25, 0), tck = 0.01)
plot(wdi$GDP/1e3, wdi$LifeExp, log = "x",
xlab = "GDP per capita (1000 USD, log scale)",
ylab = "Life expectancy (years)",
cex = cex_pop, pch = 21,
col = cont_col,
bg = scales::alpha(cont_col, .7))
legend("bottomright", title = "Region",
legend = levels(wdi$region),
pt.bg = cols, col = "grey40",
pch = 21, pt.cex = 1.8, bty = "n")
wdi_big <- subset(wdi, Population > 1e8)
text(labels = wdi_big$country,
x = wdi_big$GDP/1e3,
y = wdi_big$LifeExp,
cex = cex_pop[wdi$Population > 1e8] / 5,
adj = 0.5)
adj = 0.5 — centered on the point
(adj: 0=left-justified, 1=right-justified)
font = 2 — bold
font = 3 — italic
pos = 4 — offset to the right of coordinates
abline()Reference grid lines help readers read off values. The key technique
here is type = "n" — this draws the axes and sets up the
plot region without plotting any points, letting us layer the grid
first so points appear on top of it. We then use
abline() to draw lines at specific values rather than
grid(), which gives us full control over placement on a log
axis.
par(cex.lab = 1.2, las = 1,
mar = c(4, 4, 2, 2),
mgp = c(2, .25, 0), tck = 0.01)
plot(wdi$GDP/1e3, wdi$LifeExp, log = "x",
xlab = "GDP per capita (1000 USD, log scale)",
ylab = "Life expectancy (years)",
type = "n")
abline(h = seq(40, 90, 10), col = "grey80", lty = 3)
abline(v = c(.5,1,2,5,10,20,50,100), col = "grey80", lty = 3)
points(wdi$GDP/1e3, wdi$LifeExp,
cex = cex_pop, pch = 21,
col = cont_col, bg = scales::alpha(cont_col, .7))
legend("bottomright", title = "Region",
legend = levels(wdi$region),
pt.bg = cols, col = "grey40",
pch = 21, pt.cex = 1.5, bty = "n", cex = 0.7)
wdi_big <- subset(wdi, Population > 1e8)
text(labels = wdi_big$country,
x = wdi_big$GDP/1e3, y = wdi_big$LifeExp,
cex = cex_pop[wdi$Population > 1e8]/5, adj = 0.5)
abline(h = ...) — horizontal lines at given y
values
abline(v = ...) — vertical lines at given x values
Preferred over grid() here because the x-axis is
log-scale — grid() spaces lines at tick positions,
abline() lets you choose exact values.
The region legend explains colors, but readers still need a reference
for what the circle sizes mean. A second legend() call
encodes population by using the exact same pt.cex formula
as the plot, applied to a set of round reference values. This ensures
perfect visual consistency between the legend symbols and the data
points.
par(cex.lab = 1.2, las = 1,
mar = c(4, 4, 2, 2),
mgp = c(2, .25, 0), tck = 0.01)
plot(wdi$GDP/1e3, wdi$LifeExp, log = "x",
xlab = "GDP per capita (1000 USD, log scale)",
ylab = "Life expectancy (years)", type = "n")
abline(h = seq(40, 90, 10), col = "grey80", lty = 3)
abline(v = c(.5,1,2,5,10,20,50,100), col = "grey80", lty = 3)
points(wdi$GDP/1e3, wdi$LifeExp,
cex = cex_pop, pch = 21,
col = cont_col, bg = scales::alpha(cont_col, .7))
legend("bottomright", title = "Region",
legend = levels(wdi$region),
pt.bg = cols, col = "grey40",
pch = 21, pt.cex = 1.5, bty = "n", cex = 0.7)
legend("right", title = "Population (millions)",
legend = c(10, 50, 100, NA, 500),
pt.cex = sqrt(c(10, 50, 100, NA, 500)*1e6 / max(wdi$Population, na.rm = TRUE))*15,
col = "darkgrey", pch = 21, pt.bg = "grey", bty = "n", ncol = 1)
wdi_big <- subset(wdi, Population > 1e8)
text(labels = wdi_big$country,
x = wdi_big$GDP/1e3, y = wdi_big$LifeExp,
cex = cex_pop[wdi$Population > 1e8]/5, adj = 0.5)
Key trick: pt.cex uses the same formula
as the plot, applied to round reference population values.
NA entries create blank rows — useful for spacing.
Here is everything assembled into a single self-contained block — from loading the data to the finished plot. This is the version to save, adapt, and build on.
wdi <- read.csv("WDI.csv")
wdi <- wdi[order(wdi$Population, decreasing = TRUE), ]
wdi$region <- factor(wdi$region)
cols <- c("East Asia & Pacific" = "steelblue",
"Europe & Central Asia" = "forestgreen",
"Latin America & Caribbean" = "goldenrod",
"Middle East & North Africa" = "darkorange",
"North America" = "mediumpurple",
"South Asia" = "firebrick",
"Sub-Saharan Africa" = "sienna")
cont_col <- cols[match(wdi$region, levels(wdi$region))]
cex_pop <- sqrt(wdi$Population / max(wdi$Population, na.rm = TRUE)) * 15
par(cex.lab = 1.2, mar = c(4, 4, 4, 2),
mgp = c(2, .25, 0), tck = 0.01,
family = "serif", las = 1)
plot(wdi$GDP/1e3, wdi$LifeExp, log = "x",
ylab = "Life expectancy (years)",
xlab = "GDP per capita (1000 USD, log scale)", type = "n")
title("Health vs. Wealth by Nations (2020)", cex.main = 2, font.main = 1)
title(sub = "World Development Indicators (World Bank) - accessed via the WDI package in R)")
abline(h = seq(40, 90, 5), col = "grey80", lty = 3)
abline(v = c(.5,1,2,5,10,20,50,100), col = "grey80", lty = 3)
points(wdi$GDP/1e3, wdi$LifeExp,
cex = cex_pop + .5, pch = 21, col = cont_col,
bg = scales::alpha(cont_col, .7))
legend("bottomright", title = "Region",
legend = levels(wdi$region),
pt.bg = cols, col = "grey40",
pch = 21, pt.cex = 1.8, bty = "n")
legend("right", title = "Population (millions)",
legend = c(10, 50, 100, NA, 500),
pt.cex = sqrt(c(10, 50, 100, NA, 500)*1e6 /
max(wdi$Population, na.rm = TRUE))*15,
col = "darkgrey", pch = 21, pt.bg = "grey", bty = "n")
wdi_big <- subset(wdi, Population > 1e8)
text(labels = wdi_big$country,
x = wdi_big$GDP/1e3, y = wdi_big$LifeExp,
cex = cex_pop[wdi$Population > 1e8]/5, adj = 0.5)
| Visual channel | Variable | R tool |
|---|---|---|
| x position (log) | GDP per capita | log = "x" |
| y position | Life expectancy | plot(x, y) |
| Point size | Population | cex = sqrt(pop/max_pop) * 15 |
| Point color | Region | named cols + match() |
| Text labels | Country name | text(labels = ...) |
Key functions:
par() · plot() · points() ·
abline() · rgb() ·
scales::alpha() · match() ·
legend() · order() · subset() ·
text() · title()
The wdi data includes geographic coordinates
(longitude, latitude), land area, and income
bracket — enough to make a completely different kind of
multi-dimensional plot: a world map made entirely in base R. Apply
everything from today to build it.
Using the wdi data make a scatter plot of
latitude vs. longitude
?points) reflect the income
bracket.