Regular Expressions in R

April 6, 2026

Today

Regular expressions: pattern matching in R
Extracting fields from messy text
Matching while excluding: negative lookaheads
Using Claude responsibly along the way

Regex Basics

What is a regular expression?

A pattern that describes a set of strings.

Functions in base R that work with regular expressions:

grep() – find which elements match (returns indices or values)
grepl() – TRUE/FALSE for each element
regexpr() – position and length of first match
regmatches() – extract matched text (pair with regexpr() or regexec())
regexec() – like regexpr() but also returns capture group positions
sub() / gsub() – find and replace (first / all)
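
As a quick orientation, the sketch below runs each of these functions on a two-element toy vector. The vector `x` and the name `hits` are made up for illustration and are not part of the course data.

```r
# Toy vector for illustration only (not from the course data)
x <- c("plot_12", "site_A")

idx   <- grep("\\d+", x)          # indices of elements containing digits
flags <- grepl("\\d+", x)         # one TRUE/FALSE per element
hits  <- regexpr("\\d+", x)       # start of first match; -1 where there is none
found <- regmatches(x, hits)      # matched text; non-matching elements are dropped
first <- sub("_", "-", x)         # replace the first match in each element
all_  <- gsub("[a-z]", "", x)     # replace every match in each element

idx; flags; as.integer(hits); found; first; all_
```

Note that `regmatches()` returns one value here even though `x` has two elements, because only the first element matched.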

Regex vocabulary

Token	Meaning	Example
`\\d`	any digit	`\\d+` matches “123”
`\\w`	word character (letter, digit, _)	`\\w+` matches “Plot_01”
`\\s`	whitespace	`\\s+` matches spaces, tabs
`.`	any character	`a.b` matches “a1b”, “axb”
`+`	one or more	`\\d+` matches “1” or “999”
`*`	zero or more	`\\d*` matches “” or “42”
`?`	zero or one (optional)	`\\d?` matches “” or “7”
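
To see how the quantifiers differ, here is a small sketch on made-up strings. The anchors `^` and `$` (covered on a later vocabulary slide) pin each pattern to the whole string so the difference between `?`, `*`, and `+` is visible.

```r
# Toy strings for illustration only
x <- c("Plot_01", "a1b", "axb", "ab")

dot      <- grepl("a.b", x)        # "." needs exactly one character between a and b
optional <- grepl("^a\\d?b$", x)   # "?" allows zero or one digit
star     <- grepl("^a\\d*b$", x)   # "*" allows zero or more digits
plus     <- grepl("^a\\d+b$", x)   # "+" needs at least one digit
word     <- regmatches(x[1], regexpr("\\w+", x[1]))  # underscore counts as a word character

rbind(dot, optional, star, plus)
word
```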

Regex vocabulary (continued)

Token	Meaning	Negation	Meaning
`\\d`	any digit (0-9)	`\\D`	any non-digit
`\\w`	word character (letter, digit, _)	`\\W`	any non-word character
`\\s`	whitespace (space, tab, newline)	`\\S`	any non-whitespace
`\\b`	word boundary	`\\B`	non-boundary position

Regex vocabulary (continued)

Token	Meaning	Example
`{n}`	exactly n times	`\\d{4}` matches “2025”
`[A-Z]`	character class (range)	one uppercase letter
`[^abc]`	negated class	anything except a, b, c
`^` / `$`	start / end of string	`^>` matches “>” at start
`\\b`	word boundary	`\\bcoli\\b` won’t match “colicin”
`\\.`	literal dot	escaped because `.` = “any char”
`(...)`	capture group	extract what’s inside
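
A minimal sketch of the tokens in this table. The first string is the start of a header from the course data; the second is a made-up sentence used only to contrast “coli” with “colicin”.

```r
x <- c(">NR_074769.1 Escherichia coli",   # start of a real header
       "colicin sampled in 2025")          # made-up string for contrast

starts_gt  <- grepl("^>", x)               # TRUE only where ">" starts the string
whole_coli <- grepl("\\bcoli\\b", x)       # the whole word "coli", not "colicin"
four_digit <- grepl("\\d{4}", x)           # four digits in a row, even inside a longer number
accession  <- sub("^>(NR_\\d+)\\.\\d+.*", "\\1", x[1])  # keep only capture group 1
digits     <- gsub("[^0-9]", "", x[2])     # negated class: delete every non-digit

starts_gt; whole_coli; four_digit; accession; digits
```

`\\d{4}` matches both strings because “074769” contains a run of four digits. Add `\\b` on each side if you want exactly four.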

Regex vocabulary (continued)

Token	Meaning	Example
`(?!...)`	negative lookahead: “not followed by”	`(?!alba)` excludes alba
`(?=...)`	positive lookahead: “followed by”	`\\d+(?= cm)` matches digits before “ cm”
`(?<=...)`	positive lookbehind: “preceded by”	`(?<=USGS )\\d+` matches digits after “USGS”
`(?<!...)`	negative lookbehind: “not preceded by”	`(?<!Dr\\.)\\s[A-Z]` skips titles
`.+?`	lazy quantifier: match as few as possible	`.+?\\s` stops at first space
`|`	alternation (OR)	`partial|complete` matches either

perl = TRUE is required for lookaheads and lookbehinds in base R; the default regex engine does not support them.
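
The lookarounds and the lazy quantifier from this table can be tried on a few toy strings. This is a sketch only: the strings in `x` are invented for illustration.

```r
# Toy strings for illustration only
x <- c("stem height 42 cm", "USGS 04240105", "strain K-12 16S rRNA 16S")

# Positive lookahead: digits only when " cm" follows (the " cm" is not part of the match)
before_cm <- regmatches(x[1], regexpr("\\d+(?= cm)", x[1], perl = TRUE))

# Positive lookbehind: digits only when "USGS " precedes
after_usgs <- regmatches(x[2], regexpr("(?<=USGS )\\d+", x[2], perl = TRUE))

# Greedy vs lazy: ".+" runs to the last " 16S", ".+?" stops at the first
greedy <- sub("strain (.+) 16S.*",  "\\1", x[3], perl = TRUE)
lazy   <- sub("strain (.+?) 16S.*", "\\1", x[3], perl = TRUE)

before_cm; after_usgs; greedy; lazy
```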

Extracting from NCBI Headers

The data

Download ncbi_16S_headers.txt from the course website:

seqs <- readLines("ncbi_16S_headers.txt")
seqs[1:3]

## [1] ">NR_074769.1 Escherichia coli strain U 5/41 16S ribosomal RNA, partial sequence"            
## [2] ">NR_028687.1 Pseudomonas fluorescens strain ATCC 13525 16S ribosomal RNA, complete sequence"
## [3] ">NR_112116.1 Bacillus cereus strain ATCC 14579 16S ribosomal RNA, partial sequence"

Starting simple: `grep()` and `grepl()`

Which sequences are Pseudomonas?

grep("Pseudomonas", seqs)

## [1] 2 5 9

grep("Pseudomonas", seqs, value = TRUE)

## [1] ">NR_028687.1 Pseudomonas fluorescens strain ATCC 13525 16S ribosomal RNA, complete sequence"
## [2] ">NR_044946.1 Pseudomonas putida strain NBRC 14164 16S ribosomal RNA, partial sequence"      
## [3] ">NR_025530.1 Pseudomonas aeruginosa strain DSM 50071 16S ribosomal RNA, partial sequence"

grep() with value = TRUE returns the matching strings themselves.

`grepl()` returns a logical vector

grepl("partial", seqs)

##  [1]  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE

Useful for subsetting: seqs[grepl("partial", seqs)]
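
Because `grepl()` returns a plain logical vector, you can count matches or combine conditions with `&`, `|`, and `!`. A small sketch using the first three headers shown earlier (`seqs_demo`, `is_partial`, and `is_atcc` are illustrative names):

```r
# First three headers from the course file, typed in so the example is self-contained
seqs_demo <- c(
  ">NR_074769.1 Escherichia coli strain U 5/41 16S ribosomal RNA, partial sequence",
  ">NR_028687.1 Pseudomonas fluorescens strain ATCC 13525 16S ribosomal RNA, complete sequence",
  ">NR_112116.1 Bacillus cereus strain ATCC 14579 16S ribosomal RNA, partial sequence"
)

is_partial <- grepl("partial", seqs_demo)
is_atcc    <- grepl("ATCC", seqs_demo)

n_partial    <- sum(is_partial)                  # TRUE counts as 1
partial_atcc <- seqs_demo[is_partial & is_atcc]  # partial AND from ATCC
not_partial  <- seqs_demo[!is_partial]           # everything that is not partial

n_partial; partial_atcc; not_partial
```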

Extracting text: `regexpr()` + `regmatches()`

grep() tells you which strings match. To extract the matched text, pair regexpr() with regmatches():

regmatches(seqs, regexpr("NR_\\d+\\.\\d+", seqs))

##  [1] "NR_074769.1" "NR_028687.1" "NR_112116.1" "NR_041263.1" "NR_044946.1"
##  [6] "NR_113266.1" "NR_036861.1" "NR_074828.1" "NR_025530.1" "NR_116594.1"

regexpr() finds the position and length of the first match
regmatches() uses that to pull out the matched substring
NR_ is literal, \\d+ is digits, \\. is a literal dot

`regexpr()` Only

regexpr("NR_\\d+\\.\\d+", seqs)

##  [1] 2 2 2 2 2 2 2 2 2 2
## attr(,"match.length")
##  [1] 11 11 11 11 11 11 11 11 11 11
## attr(,"index.type")
## [1] "chars"
## attr(,"useBytes")
## [1] TRUE

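On its own, regexpr() returns the start position of the first match in each string, with the match lengths stored as an attribute. A position of -1 means no match.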
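
One consequence worth checking: `regmatches()` silently drops elements with no match, so its result can be shorter than the input. The sketch below shows the problem and one way to keep results aligned. The middle string is made up, and `acc` and `has_match` are illustrative names.

```r
x <- c(">NR_074769.1 Escherichia coli",
       "no accession here",                 # made-up line with no match
       ">NR_028687.1 Pseudomonas fluorescens")

hit <- regexpr("NR_\\d+\\.\\d+", x)
as.integer(hit)                 # -1 marks the element with no match
dropped <- regmatches(x, hit)   # only two values come back for three inputs

# Keep one slot per input by filling matches into a vector of NAs
has_match <- as.integer(hit) != -1L
acc <- rep(NA_character_, length(x))
acc[has_match] <- regmatches(x, hit)
acc
```

This matters as soon as you put extracted values into a data frame column alongside the original strings.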

Extract genus and species

regmatches(seqs, regexpr("[A-Z][a-z]+\\s[a-z]+", seqs))

##  [1] "Escherichia coli"        "Pseudomonas fluorescens"
##  [3] "Bacillus cereus"         "Clostridium botulinum"  
##  [5] "Pseudomonas putida"      "Bacillus subtilis"      
##  [7] "Streptomyces griseus"    "Escherichia coli"       
##  [9] "Pseudomonas aeruginosa"  "Clostridium difficile"

[A-Z][a-z]+ – capitalized genus
\\s – space
[a-z]+ – lowercase epithet

A taxonomic naming convention encoded as a pattern.

Your turn

Return Pseudomonas spp., but exclude P. aeruginosa. This one is challenging: it calls for a negative lookahead.

grep("______", seqs, value = TRUE, perl = TRUE)

Matching While Excluding

Negative lookahead: `(?!...)`

grep("Pseudomonas\\s+(?!aeruginosa\\b)\\w+", seqs, 
     value = TRUE, perl = TRUE)

## [1] ">NR_028687.1 Pseudomonas fluorescens strain ATCC 13525 16S ribosomal RNA, complete sequence"
## [2] ">NR_044946.1 Pseudomonas putida strain NBRC 14164 16S ribosomal RNA, partial sequence"

(?!aeruginosa\\b) – “not followed by aeruginosa”
\\b – word boundary, so only the whole word “aeruginosa” is excluded, not longer words that merely start with it
perl = TRUE – required for lookaheads in base R

Exclude multiple species

grep("Pseudomonas\\s+(?!aeruginosa\\b|putida\\b)\\w+", seqs, 
     value = TRUE, perl = TRUE)

## [1] ">NR_028687.1 Pseudomonas fluorescens strain ATCC 13525 16S ribosomal RNA, complete sequence"

| inside the lookahead means OR
Stack as many exclusions as you need
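
If the exclusion list gets long, one option is to build the lookahead from a character vector instead of typing it by hand. This is a sketch, not the only way to do it: `demo`, `exclude`, and `lookahead` are illustrative names, and the non-capturing group `(?:...)` is a Perl-style construct that also needs perl = TRUE.

```r
# Short stand-ins for the headers, for illustration only
demo <- c("Pseudomonas fluorescens", "Pseudomonas putida",
          "Pseudomonas aeruginosa", "Bacillus cereus")

exclude <- c("aeruginosa", "putida")

# (?:a|b) groups the alternatives so a single \\b applies to all of them
lookahead <- paste0("(?!(?:", paste(exclude, collapse = "|"), ")\\b)")
pattern   <- paste0("Pseudomonas\\s+", lookahead, "\\w+")

pattern
kept <- grep(pattern, demo, value = TRUE, perl = TRUE)
kept
```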

Exclude entire lines

grep("^(?!.*Clostridium)", seqs, value = TRUE, perl = TRUE)

## [1] ">NR_074769.1 Escherichia coli strain U 5/41 16S ribosomal RNA, partial sequence"              
## [2] ">NR_028687.1 Pseudomonas fluorescens strain ATCC 13525 16S ribosomal RNA, complete sequence"  
## [3] ">NR_112116.1 Bacillus cereus strain ATCC 14579 16S ribosomal RNA, partial sequence"           
## [4] ">NR_044946.1 Pseudomonas putida strain NBRC 14164 16S ribosomal RNA, partial sequence"        
## [5] ">NR_113266.1 Bacillus subtilis strain JCM 1465 16S ribosomal RNA, complete sequence"          
## [6] ">NR_036861.1 Streptomyces griseus strain NBRC 13350 16S ribosomal RNA, partial sequence"      
## [7] ">NR_074828.1 Escherichia coli strain K-12 substr. MG1655 16S ribosomal RNA, complete sequence"
## [8] ">NR_025530.1 Pseudomonas aeruginosa strain DSM 50071 16S ribosomal RNA, partial sequence"

^(?!.*Clostridium) – “from the start, never encounter ‘Clostridium’ anywhere on this line”

Exclude entire lines

Another option: positively match Clostridium, then set invert = TRUE to return the lines that do not match.

grep("Clostridium", seqs, value = TRUE, perl = TRUE, invert = TRUE)

## [1] ">NR_074769.1 Escherichia coli strain U 5/41 16S ribosomal RNA, partial sequence"              
## [2] ">NR_028687.1 Pseudomonas fluorescens strain ATCC 13525 16S ribosomal RNA, complete sequence"  
## [3] ">NR_112116.1 Bacillus cereus strain ATCC 14579 16S ribosomal RNA, partial sequence"           
## [4] ">NR_044946.1 Pseudomonas putida strain NBRC 14164 16S ribosomal RNA, partial sequence"        
## [5] ">NR_113266.1 Bacillus subtilis strain JCM 1465 16S ribosomal RNA, complete sequence"          
## [6] ">NR_036861.1 Streptomyces griseus strain NBRC 13350 16S ribosomal RNA, partial sequence"      
## [7] ">NR_074828.1 Escherichia coli strain K-12 substr. MG1655 16S ribosomal RNA, complete sequence"
## [8] ">NR_025530.1 Pseudomonas aeruginosa strain DSM 50071 16S ribosomal RNA, partial sequence"

Exclude entire lines

A more restrictive but simpler option: fixed = TRUE treats the pattern as a literal string.

grep("Clostridium", seqs, value = TRUE, fixed = TRUE, invert = TRUE)

## [1] ">NR_074769.1 Escherichia coli strain U 5/41 16S ribosomal RNA, partial sequence"              
## [2] ">NR_028687.1 Pseudomonas fluorescens strain ATCC 13525 16S ribosomal RNA, complete sequence"  
## [3] ">NR_112116.1 Bacillus cereus strain ATCC 14579 16S ribosomal RNA, partial sequence"           
## [4] ">NR_044946.1 Pseudomonas putida strain NBRC 14164 16S ribosomal RNA, partial sequence"        
## [5] ">NR_113266.1 Bacillus subtilis strain JCM 1465 16S ribosomal RNA, complete sequence"          
## [6] ">NR_036861.1 Streptomyces griseus strain NBRC 13350 16S ribosomal RNA, partial sequence"      
## [7] ">NR_074828.1 Escherichia coli strain K-12 substr. MG1655 16S ribosomal RNA, complete sequence"
## [8] ">NR_025530.1 Pseudomonas aeruginosa strain DSM 50071 16S ribosomal RNA, partial sequence"

With fixed = TRUE the pattern is no longer a regular expression. It is matched verbatim, so metacharacters such as `.` lose their special meaning.
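
A quick way to convince yourself that the three approaches agree is to compare them with `identical()`. The sketch below uses short stand-in strings rather than the full headers, and the object names are illustrative. The last two lines show where fixed = TRUE does behave differently.

```r
# Short stand-ins for the headers, for illustration only
demo <- c("Escherichia coli", "Clostridium botulinum",
          "Bacillus cereus", "Clostridium difficile")

via_lookahead <- grep("^(?!.*Clostridium)", demo, value = TRUE, perl = TRUE)
via_invert    <- grep("Clostridium", demo, value = TRUE, invert = TRUE)
via_fixed     <- grep("Clostridium", demo, value = TRUE, fixed = TRUE, invert = TRUE)

same <- identical(via_lookahead, via_invert) && identical(via_invert, via_fixed)
same

# Where fixed = TRUE matters: "." is a wildcard in a regex but a literal dot when fixed
as_regex <- grepl("U.5", c("U 5/41", "U.5"))
as_fixed <- grepl("U.5", c("U 5/41", "U.5"), fixed = TRUE)
as_regex; as_fixed
```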

Using Claude to Build a Harder Regex

A Harder Task

We want five fields from each header in a data frame: accession, genus, species, strain, and coverage.

That’s harder than anything we’ve written so far. This is where Claude can help – if you use it well.

Major Issues with AI (at least)

Cognitive offloading, leading to skill atrophy
Validity of output
Environmental footprint

Problem	Finding	Source
Cognitive offloading	LLM users underperformed no-tool users across neural and behavioral measures over 4 months	Kosmyna et al., 2025
	Frequent AI use negatively correlated with critical thinking (n=666)	Gerlich, 2025
Invalid output	20% of LLM-recommended packages were fabricated across 576k code samples	UT/VT/OU joint study
	29-45% of AI-generated code contains security vulnerabilities	diffray.ai, 2025
Energy footprint	Typical query: ~0.3 Wh; reasoning models: 7-40 Wh (up to 100x more)	Epoch AI; Google; IEEE Spectrum, 2025
	2.5B queries/day = ~850 MWh/day, equivalent to ~29,000 US homes	IEEE Spectrum, 2025

Using AI well

Problem	Best practice
Cognitive offloading	Write your attempt before prompting
	Ask Claude to explain, not just write
Invalid output	Always run the code and check edge cases
	Include example data in your prompt
Energy footprint	One specific prompt beats five vague ones
	Use the smallest model that fits the task (e.g., Sonnet vs Opus)

The prompt

To Claude (Sonnet 4.6):

I have NCBI 16S rRNA FASTA headers like:

>NR_074769.1 Escherichia coli strain U 5/41 16S ribosomal RNA, partial sequence
>NR_028687.1 Pseudomonas fluorescens strain ATCC 13525 16S ribosomal RNA, complete sequence
>NR_074828.1 Escherichia coli strain K-12 substr. MG1655 16S ribosomal RNA, complete sequence

Write a regex with regexec() in base R that extracts: accession, genus, species, strain, and partial/complete. Explain each capture group. Show your energy usage for the chat in laptop run time (minutes).

Claude’s response

Here’s a regex with five capture groups:

">(NR_\\d+\\.\\d+)\\s+([A-Z][a-z]+)\\s([a-z]+)\\s+strain\\s+(.+?)\\s+16S.*?(partial|complete)"

Group 1 (NR_\\d+\\.\\d+) – accession with version
Group 2 ([A-Z][a-z]+) – genus (capitalized)
Group 3 ([a-z]+) – species epithet (lowercase)
Group 4 (.+?) – strain, lazy match to stop at first “16S”
Group 5 (partial|complete) – sequence coverage

Use regexec() with perl = TRUE, then regmatches() to extract.

Energy: ~0.3 Wh for this query. At 7W for a laptop screen, that’s about 2.6 minutes of screen time.

`regexec()` + `regmatches()`

regexec() does what regexpr() does but also returns capture group positions. Paired with regmatches(), you get the full match and each group:

pattern <- paste0(
  ">(NR_\\d+\\.\\d+)",        # group 1: accession
  "\\s+([A-Z][a-z]+)",        # group 2: genus
  "\\s([a-z]+)",              # group 3: species
  "\\s+strain\\s+(.+?)",     # group 4: strain (lazy)
  "\\s+16S.*?",              # skip to coverage
  "(partial|complete)"        # group 5: coverage
)
m <- regmatches(seqs, regexec(pattern, seqs, perl = TRUE))
m[[1]]

## [1] ">NR_074769.1 Escherichia coli strain U 5/41 16S ribosomal RNA, partial"
## [2] "NR_074769.1"                                                           
## [3] "Escherichia"                                                           
## [4] "coli"                                                                  
## [5] "U 5/41"                                                                
## [6] "partial"

`regexec()` Only

regexec(pattern, seqs, perl = TRUE)

## [[1]]
## [1]  1  2 14 26 38 64
## attr(,"match.length")
## [1] 70 11 11  4  6  7
## attr(,"useBytes")
## [1] TRUE
## attr(,"index.type")
## [1] "chars"
## 
## [[2]]
## [1]  1  2 14 26 45 75
## attr(,"match.length")
## [1] 82 11 11 11 10  8
## attr(,"useBytes")
## [1] TRUE
## attr(,"index.type")
## [1] "chars"
## 
## [[3]]
## [1]  1  2 14 23 37 67
## attr(,"match.length")
## [1] 73 11  8  6 10  7
## attr(,"useBytes")
## [1] TRUE
## attr(,"index.type")
## [1] "chars"
## 
## [[4]]
## [1]  1  2 14 26 43 73
## attr(,"match.length")
## [1] 80 11 11  9 10  8
## attr(,"useBytes")
## [1] TRUE
## attr(,"index.type")
## [1] "chars"
## 
## [[5]]
## [1]  1  2 14 26 40 70
## attr(,"match.length")
## [1] 76 11 11  6 10  7
## attr(,"useBytes")
## [1] TRUE
## attr(,"index.type")
## [1] "chars"
## 
## [[6]]
## [1]  1  2 14 23 39 67
## attr(,"match.length")
## [1] 74 11  8  8  8  8
## attr(,"useBytes")
## [1] TRUE
## attr(,"index.type")
## [1] "chars"
## 
## [[7]]
## [1]  1  2 14 27 42 72
## attr(,"match.length")
## [1] 78 11 12  7 10  7
## attr(,"useBytes")
## [1] TRUE
## attr(,"index.type")
## [1] "chars"
## 
## [[8]]
## [1]  1  2 14 26 38 77
## attr(,"match.length")
## [1] 84 11 11  4 19  8
## attr(,"useBytes")
## [1] TRUE
## attr(,"index.type")
## [1] "chars"
## 
## [[9]]
## [1]  1  2 14 26 44 73
## attr(,"match.length")
## [1] 79 11 11 10  9  7
## attr(,"useBytes")
## [1] TRUE
## attr(,"index.type")
## [1] "chars"
## 
## [[10]]
## [1]  1  2 14 26 43 72
## attr(,"match.length")
## [1] 79 11 11  9  9  8
## attr(,"useBytes")
## [1] TRUE
## attr(,"index.type")
## [1] "chars"

Building a data frame from the matches

m is a list of character vectors. Each vector has the full match at [1] and capture groups at [2:6]. To get a data frame:

result <- do.call(rbind, lapply(m, function(x) x[2:6]))
result <- as.data.frame(result, stringsAsFactors = FALSE)
names(result) <- c("accession", "genus", "species", 
                    "strain", "coverage")

Unpacking that line

do.call(rbind, lapply(m, function(x) x[2:6]))

lapply(m, function(x) x[2:6]) – loop over the list, pull out positions 2–6 (the five capture groups) from each element
do.call(rbind, ...) – take that list and stack each element as a row in a matrix
Same as rbind(m[[1]][2:6], m[[2]][2:6], ...) but without all that typing
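
The same idea on a tiny hand-built list may make the stacking easier to see. `m_demo` is a made-up stand-in for `m` with only two capture groups per element.

```r
# Made-up stand-in for m: full match first, then two capture groups
m_demo <- list(
  c("full match 1", "NR_074769.1", "Escherichia"),
  c("full match 2", "NR_028687.1", "Pseudomonas")
)

rows <- lapply(m_demo, function(x) x[2:3])   # drop the full match, keep the groups
mat  <- do.call(rbind, rows)                 # one row per list element

dim(mat)   # 2 rows (one per header), 2 columns (one per group)
mat
```

`do.call()` passes every element of the list to `rbind()` as a separate argument, which is why it works for any number of headers.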

Result

result

##      accession        genus     species              strain coverage
## 1  NR_074769.1  Escherichia        coli              U 5/41  partial
## 2  NR_028687.1  Pseudomonas fluorescens          ATCC 13525 complete
## 3  NR_112116.1     Bacillus      cereus          ATCC 14579  partial
## 4  NR_041263.1  Clostridium   botulinum          ATCC 25763 complete
## 5  NR_044946.1  Pseudomonas      putida          NBRC 14164  partial
## 6  NR_113266.1     Bacillus    subtilis            JCM 1465 complete
## 7  NR_036861.1 Streptomyces     griseus          NBRC 13350  partial
## 8  NR_074828.1  Escherichia        coli K-12 substr. MG1655 complete
## 9  NR_025530.1  Pseudomonas  aeruginosa           DSM 50071  partial
## 10 NR_116594.1  Clostridium   difficile           ATCC 9689 complete

Closing

Considerations for Regular Expressions

Everything depends on how well you know your data
Formatting standards are crucial
Testing your regex is crucial
If possible, store data in data frames with one variable per column
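
One simple test is to look for inputs the pattern failed to match before building a data frame. In this sketch, `pattern_demo` is a shortened version of the pattern used earlier and the second header is hypothetical, written to break the “lowercase species” assumption.

```r
# Shortened pattern: accession, genus, species only
pattern_demo <- ">(NR_\\d+\\.\\d+)\\s+([A-Z][a-z]+)\\s([a-z]+)"

x <- c(">NR_074769.1 Escherichia coli strain U 5/41",
       ">NR_000000.1 Candidatus Example sp. strain X")   # hypothetical header

m_demo <- regmatches(x, regexec(pattern_demo, x, perl = TRUE))

n_parts <- lengths(m_demo)   # 4 = full match + 3 groups; 0 = no match
n_parts
unmatched <- x[n_parts == 0] # headers that need a closer look
unmatched
```

An element of length 0 would otherwise turn into a row of NA values in the `do.call(rbind, ...)` step.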

AI Energy Usage (approximate)

A typical Claude query: ~0.3 Wh (Epoch AI; Google, 2025)
Complex reasoning queries: 7–40 Wh
Five sloppy prompts = 5x the energy of one good one
Asking Claude to develop the regex took the equivalent of roughly 5 minutes of laptop power, including generating a diagram showing me how it works.
- Is that worth it? It would have taken me maybe 2 hours on my laptop.
- What about water usage for cooling? Other unaccounted-for environmental impacts?
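
The arithmetic behind these figures can be reproduced in a few lines. All inputs are the estimates quoted in these slides (0.3 Wh per query, a 7 W laptop screen, 2.5 billion queries and 850 MWh per day, 29,000 homes); nothing here is measured.

```r
query_wh <- 0.3    # Wh per typical query, as quoted above
screen_w <- 7      # W, laptop screen draw assumed in Claude's response

screen_minutes <- query_wh / screen_w * 60   # minutes of screen time per query
five_prompts   <- 5 * query_wh               # Wh for five prompts instead of one

# Figures from the IEEE Spectrum row: 850 MWh/day over 2.5 billion queries/day
wh_per_query <- 850e6 / 2.5e9                # Wh per query implied by that total
kwh_per_home <- 850e3 / 29000                # kWh per home per day implied by the comparison

round(c(screen_minutes = screen_minutes, five_prompts = five_prompts,
        wh_per_query = wh_per_query, kwh_per_home = kwh_per_home), 2)
```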

Take-home Exercise

USGS streamgage data

Download usgs_gauges.txt. Each line looks like:

USGS 04240105 Onondaga Creek at Dorwin Ave., Syracuse NY lat:43.0281 long:-76.1522 drain_area_sqmi:82.2

Extract two features into a data frame:

station number (e.g., “04240105”)
drainage area (e.g., 82.2)

Hints

readLines() to load
Station number: pattern right after “USGS”
Drainage area: pattern after “drain_area_sqmi:”
regmatches() + regexpr() for each

Apply the three rules: try it yourself first, run and check, craft one good prompt if you ask Claude.

Solution (for next class)

(?<=USGS ) – positive lookbehind: match digits after “USGS”
Lookbehinds check what comes before the match
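
A minimal sketch of how the lookbehind approach could look on the single example line from the exercise. It assumes every line in the real file has both fields; `gauges` and the character class `[0-9.]+` for the drainage area are one possible choice, not the only one.

```r
# Example line from the exercise slide
line <- "USGS 04240105 Onondaga Creek at Dorwin Ave., Syracuse NY lat:43.0281 long:-76.1522 drain_area_sqmi:82.2"

# Digits immediately after "USGS "
station <- regmatches(line, regexpr("(?<=USGS )\\d+", line, perl = TRUE))

# Digits and dots immediately after "drain_area_sqmi:"
area <- regmatches(line, regexpr("(?<=drain_area_sqmi:)[0-9.]+", line, perl = TRUE))

gauges <- data.frame(
  station = station,                    # keep as character to preserve the leading zero
  drain_area_sqmi = as.numeric(area),
  stringsAsFactors = FALSE
)
gauges
```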
