- Regular expressions: pattern matching in R
- Extracting fields from messy text
- Matching while excluding: negative lookaheads
- Using Claude responsibly along the way
April 6, 2026
A regular expression is a pattern that describes a set of strings.
Functions in base R that work with regular expressions:
- grep() – find which elements match (returns indices or values)
- grepl() – TRUE/FALSE for each element
- regexpr() – position and length of first match
- regmatches() – extract matched text (pair with regexpr() or regexec())
- regexec() – like regexpr() but also returns capture group positions
- sub() / gsub() – find and replace (first / all)

| Token | Meaning | Example |
|---|---|---|
| \\d | any digit | \\d+ matches “123” |
| \\w | word character (letter, digit, _) | \\w+ matches “Plot_01” |
| \\s | whitespace | \\s+ matches spaces, tabs |
| . | any character | a.b matches “a1b”, “axb” |
| + | one or more | \\d+ matches “1” or “999” |
| * | zero or more | \\d* matches “” or “42” |
| ? | zero or one (optional) | \\d? matches “” or “7” |
| Token | Meaning | Negation | Meaning |
|---|---|---|---|
| \\d | any digit (0-9) | \\D | any non-digit |
| \\w | word character (letter, digit, _) | \\W | any non-word character |
| \\s | whitespace (space, tab, newline) | \\S | any non-whitespace |
| \\b | word boundary | \\B | non-boundary position |
| Token | Meaning | Example |
|---|---|---|
| {n} | exactly n times | \\d{4} matches “2025” |
| [A-Z] | character class (range) | one uppercase letter |
| [^abc] | negated class | anything except a, b, c |
| ^ / $ | start / end of string | ^> matches “>” at start |
| \\b | word boundary | \\bcoli\\b won’t match “colicin” |
| \\. | literal dot | escaped because . = “any char” |
| (...) | capture group | extract what’s inside |
| Token | Meaning | Example |
|---|---|---|
| (?!...) | negative lookahead: “not followed by” | (?!alba) excludes alba |
| (?=...) | positive lookahead: “followed by” | \\d+(?= cm) matches digits before “ cm” |
| (?<=...) | positive lookbehind: “preceded by” | (?<=USGS )\\d+ matches digits after “USGS” |
| (?<!...) | negative lookbehind: “not preceded by” | (?<!Dr\\.)\\s[A-Z] skips titles |
| .+? | lazy quantifier: match as few as possible | .+?\\s stops at first space |
| \| | alternation (OR) | partial\|complete matches either |
perl = TRUE is required for lookaheads and lookbehinds in base R.
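A minimal sketch of what perl = TRUE changes, using made-up toy strings rather than the course data. The variable names (x, la, lb, res) are illustrative only.

```r
# Toy strings (hypothetical, not from the course files)
x <- c("depth 12 cm", "depth 30 m", "USGS 04240105")

# Positive lookahead: digits only when followed by " cm"
la <- regmatches(x, regexpr("\\d+(?= cm)", x, perl = TRUE))
la

# Positive lookbehind: digits only when preceded by "USGS "
lb <- regmatches(x, regexpr("(?<=USGS )\\d+", x, perl = TRUE))
lb

# The default (TRE) engine does not support lookarounds, so the same
# pattern without perl = TRUE is expected to raise an error, caught here
res <- tryCatch(regexpr("\\d+(?= cm)", x),
                error = function(e) "error: pattern rejected")
res
```

Only the first string yields a lookahead match and only the third yields a lookbehind match; the lookaround text itself (" cm", "USGS ") is checked but not included in the extracted value.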
Download ncbi_16S_headers.txt from the course website:
seqs <- readLines("ncbi_16S_headers.txt")
seqs[1:3]
## [1] ">NR_074769.1 Escherichia coli strain U 5/41 16S ribosomal RNA, partial sequence"
## [2] ">NR_028687.1 Pseudomonas fluorescens strain ATCC 13525 16S ribosomal RNA, complete sequence"
## [3] ">NR_112116.1 Bacillus cereus strain ATCC 14579 16S ribosomal RNA, partial sequence"

grep() and grepl()

Which sequences are Pseudomonas?
grep("Pseudomonas", seqs)
## [1] 2 5 9
grep("Pseudomonas", seqs, value = TRUE)
## [1] ">NR_028687.1 Pseudomonas fluorescens strain ATCC 13525 16S ribosomal RNA, complete sequence"
## [2] ">NR_044946.1 Pseudomonas putida strain NBRC 14164 16S ribosomal RNA, partial sequence"
## [3] ">NR_025530.1 Pseudomonas aeruginosa strain DSM 50071 16S ribosomal RNA, partial sequence"
grep() with value = TRUE returns the matching strings themselves.
grepl() returns a logical vector:
grepl("partial", seqs)
## [1] TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE
Useful for subsetting: seqs[grepl("partial", seqs)]
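A short sketch of why the logical vector is handy, using three headers copied from the output above (hdr and is_partial are illustrative names). Logical vectors can be counted with sum() and combined with & and ! before subsetting.

```r
# Three headers copied from the first lines of the course file
hdr <- c(
  ">NR_074769.1 Escherichia coli strain U 5/41 16S ribosomal RNA, partial sequence",
  ">NR_028687.1 Pseudomonas fluorescens strain ATCC 13525 16S ribosomal RNA, complete sequence",
  ">NR_112116.1 Bacillus cereus strain ATCC 14579 16S ribosomal RNA, partial sequence"
)

is_partial <- grepl("partial", hdr)

# TRUE counts as 1, so sum() gives the number of partial sequences
n_partial <- sum(is_partial)
n_partial

# Combine conditions: partial sequences that are not Bacillus
partial_not_bacillus <- hdr[is_partial & !grepl("Bacillus", hdr)]
partial_not_bacillus
```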
regexpr() + regmatches()

grep() tells you which strings match. To extract the matched text, pair regexpr() with regmatches():
regmatches(seqs, regexpr("NR_\\d+\\.\\d+", seqs))
## [1] "NR_074769.1" "NR_028687.1" "NR_112116.1" "NR_041263.1" "NR_044946.1"
## [6] "NR_113266.1" "NR_036861.1" "NR_074828.1" "NR_025530.1" "NR_116594.1"

- regexpr() finds the position and length of the first match
- regmatches() uses that to pull out the matched substring
- NR_ is literal, \\d+ is digits, \\. is a literal dot

regexpr() Only

regexpr("NR_\\d+\\.\\d+", seqs)
## [1] 2 2 2 2 2 2 2 2 2 2
## attr(,"match.length")
## [1] 11 11 11 11 11 11 11 11 11 11
## attr(,"index.type")
## [1] "chars"
## attr(,"useBytes")
## [1] TRUE
regexpr() finds the position and length of the first match

regmatches(seqs, regexpr("[A-Z][a-z]+\\s[a-z]+", seqs))
## [1] "Escherichia coli" "Pseudomonas fluorescens"
## [3] "Bacillus cereus" "Clostridium botulinum"
## [5] "Pseudomonas putida" "Bacillus subtilis"
## [7] "Streptomyces griseus" "Escherichia coli"
## [9] "Pseudomonas aeruginosa" "Clostridium difficile"

- [A-Z][a-z]+ – capitalized genus
- \\s – space
- [a-z]+ – lowercase epithet

A taxonomic naming convention encoded as a pattern.
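One caveat worth sketching: regmatches() paired with regexpr() returns only the matches, so the result can be shorter than the input when some strings do not match. The toy vector below is made up, and binomial and hit are illustrative names.

```r
# Made-up input: the middle element has no genus/species pair
x <- c(">NR_1.1 Escherichia coli strain K-12",
       "N/A",
       ">NR_2.1 Bacillus subtilis strain 168")

m <- regexpr("[A-Z][a-z]+\\s[a-z]+", x)

# Non-matching elements are silently dropped
only_hits <- regmatches(x, m)
length(only_hits)

# To keep the result aligned with the input, fill in only the hits
hit <- as.vector(m) != -1L          # regexpr() returns -1 for no match
binomial <- rep(NA_character_, length(x))
binomial[hit] <- regmatches(x, m)
binomial
```

Keeping the NA placeholder matters when the extracted values go into a data frame column next to the original strings.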
Return Pseudomonas spp., but exclude P. aeruginosa. This is a challenging one: it calls for a negative lookahead.
grep("______", seqs, value = TRUE, perl = TRUE)
(?!...)

grep("Pseudomonas\\s+(?!aeruginosa\\b)\\w+", seqs,
value = TRUE, perl = TRUE)
## [1] ">NR_028687.1 Pseudomonas fluorescens strain ATCC 13525 16S ribosomal RNA, complete sequence"
## [2] ">NR_044946.1 Pseudomonas putida strain NBRC 14164 16S ribosomal RNA, partial sequence"
- (?!aeruginosa\\b) – “not followed by aeruginosa”
- \\b – word boundary prevents partial matches
- perl = TRUE – required for lookaheads in base R

grep("Pseudomonas\\s+(?!aeruginosa\\b|putida\\b)\\w+", seqs,
value = TRUE, perl = TRUE)
## [1] ">NR_028687.1 Pseudomonas fluorescens strain ATCC 13525 16S ribosomal RNA, complete sequence"
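A small sketch of what the \\b inside the lookahead buys you. The epithet "aeruginosalis" is made up purely to create a name that starts with the excluded word; no_b and with_b are illustrative names.

```r
# Toy names; "aeruginosalis" is a hypothetical epithet for illustration
x <- c("Pseudomonas aeruginosa",
       "Pseudomonas aeruginosalis",
       "Pseudomonas putida")

# Without \\b the lookahead rejects anything that merely starts with "aeruginosa"
no_b <- grepl("Pseudomonas\\s+(?!aeruginosa)\\w+", x, perl = TRUE)
no_b

# With \\b only the exact word "aeruginosa" is rejected
with_b <- grepl("Pseudomonas\\s+(?!aeruginosa\\b)\\w+", x, perl = TRUE)
with_b
```

The two versions differ only on the second string, which is kept when the word boundary is present.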
- | inside the lookahead means OR

grep("^(?!.*Clostridium)", seqs, value = TRUE, perl = TRUE)
## [1] ">NR_074769.1 Escherichia coli strain U 5/41 16S ribosomal RNA, partial sequence"
## [2] ">NR_028687.1 Pseudomonas fluorescens strain ATCC 13525 16S ribosomal RNA, complete sequence"
## [3] ">NR_112116.1 Bacillus cereus strain ATCC 14579 16S ribosomal RNA, partial sequence"
## [4] ">NR_044946.1 Pseudomonas putida strain NBRC 14164 16S ribosomal RNA, partial sequence"
## [5] ">NR_113266.1 Bacillus subtilis strain JCM 1465 16S ribosomal RNA, complete sequence"
## [6] ">NR_036861.1 Streptomyces griseus strain NBRC 13350 16S ribosomal RNA, partial sequence"
## [7] ">NR_074828.1 Escherichia coli strain K-12 substr. MG1655 16S ribosomal RNA, complete sequence"
## [8] ">NR_025530.1 Pseudomonas aeruginosa strain DSM 50071 16S ribosomal RNA, partial sequence"
- ^(?!.*Clostridium) – “from the start, never encounter ‘Clostridium’ anywhere on this line”

Another option: positively match Clostridium, then use invert = TRUE to invert the selection.
grep("Clostridium", seqs, value = TRUE, perl = TRUE, invert = TRUE)
## [1] ">NR_074769.1 Escherichia coli strain U 5/41 16S ribosomal RNA, partial sequence"
## [2] ">NR_028687.1 Pseudomonas fluorescens strain ATCC 13525 16S ribosomal RNA, complete sequence"
## [3] ">NR_112116.1 Bacillus cereus strain ATCC 14579 16S ribosomal RNA, partial sequence"
## [4] ">NR_044946.1 Pseudomonas putida strain NBRC 14164 16S ribosomal RNA, partial sequence"
## [5] ">NR_113266.1 Bacillus subtilis strain JCM 1465 16S ribosomal RNA, complete sequence"
## [6] ">NR_036861.1 Streptomyces griseus strain NBRC 13350 16S ribosomal RNA, partial sequence"
## [7] ">NR_074828.1 Escherichia coli strain K-12 substr. MG1655 16S ribosomal RNA, complete sequence"
## [8] ">NR_025530.1 Pseudomonas aeruginosa strain DSM 50071 16S ribosomal RNA, partial sequence"
A simpler but more restrictive option: fixed = TRUE treats the pattern as a literal string, so no regex tokens are interpreted.
grep("Clostridium", seqs, value = TRUE, fixed = TRUE, invert = TRUE)
## [1] ">NR_074769.1 Escherichia coli strain U 5/41 16S ribosomal RNA, partial sequence"
## [2] ">NR_028687.1 Pseudomonas fluorescens strain ATCC 13525 16S ribosomal RNA, complete sequence"
## [3] ">NR_112116.1 Bacillus cereus strain ATCC 14579 16S ribosomal RNA, partial sequence"
## [4] ">NR_044946.1 Pseudomonas putida strain NBRC 14164 16S ribosomal RNA, partial sequence"
## [5] ">NR_113266.1 Bacillus subtilis strain JCM 1465 16S ribosomal RNA, complete sequence"
## [6] ">NR_036861.1 Streptomyces griseus strain NBRC 13350 16S ribosomal RNA, partial sequence"
## [7] ">NR_074828.1 Escherichia coli strain K-12 substr. MG1655 16S ribosomal RNA, complete sequence"
## [8] ">NR_025530.1 Pseudomonas aeruginosa strain DSM 50071 16S ribosomal RNA, partial sequence"
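The three exclusion approaches above can be checked against each other on a toy vector. This is a sketch with illustrative names (x, by_lookahead, by_invert, by_fixed, y), plus a second made-up vector showing how fixed = TRUE differs from a regex match.

```r
# Toy vector of names taken from the headers above
x <- c("Clostridium botulinum", "Bacillus cereus",
       "Escherichia coli", "Clostridium difficile")

by_lookahead <- grep("^(?!.*Clostridium)", x, value = TRUE, perl = TRUE)
by_invert    <- grep("Clostridium", x, value = TRUE, invert = TRUE)
by_fixed     <- grep("Clostridium", x, value = TRUE, fixed = TRUE, invert = TRUE)

# All three keep the same elements
identical(by_lookahead, by_invert)
identical(by_invert, by_fixed)

# Where fixed = TRUE matters: "." is a wildcard in a regex, literal when fixed
y <- c("E. coli", "Ex coli")
as_regex <- grepl("E. coli", y)
as_fixed <- grepl("E. coli", y, fixed = TRUE)
as_regex
as_fixed
```

For a plain word like "Clostridium" the choice is mostly about readability; the difference only shows up once the pattern contains characters that a regex would interpret.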
We want five fields from each header in a data frame: accession, genus, species, strain, and coverage.
That’s harder than anything we’ve written so far. This is where Claude can help – if you use it well.
| Problem | Finding | Source |
|---|---|---|
| Cognitive offloading | LLM users underperformed no-tool users across neural and behavioral measures over 4 months | Kosmyna et al., 2025 |
| | Frequent AI use negatively correlated with critical thinking (n=666) | Gerlich, 2025 |
| Invalid output | 20% of LLM-recommended packages were fabricated across 576k code samples | UT/VT/OU joint study |
| | 29-45% of AI-generated code contains security vulnerabilities | diffray.ai, 2025 |
| Energy footprint | Typical query: ~0.3 Wh; reasoning models: 7-40 Wh (up to 100x more) | Epoch AI; Google; IEEE Spectrum, 2025 |
| | 2.5B queries/day = ~850 MWh/day, equivalent to ~29,000 US homes | IEEE Spectrum, 2025 |
| Problem | Best practice |
|---|---|
| Cognitive offloading | Write your attempt before prompting |
| | Ask Claude to explain, not just write |
| Invalid output | Always run the code and check edge cases |
| | Include example data in your prompt |
| Energy footprint | One specific prompt beats five vague ones |
| | Use the smallest model that fits the task (e.g., Sonnet vs Opus) |
To Claude (Sonnet 4.6):
I have NCBI 16S rRNA FASTA headers like:
>NR_074769.1 Escherichia coli strain U 5/41 16S ribosomal RNA, partial sequence
>NR_028687.1 Pseudomonas fluorescens strain ATCC 13525 16S ribosomal RNA, complete sequence
>NR_074828.1 Escherichia coli strain K-12 substr. MG1655 16S ribosomal RNA, complete sequence
Write a regex with regexec() in base R that extracts: accession, genus, species, strain, and partial/complete. Explain each capture group. Show your energy usage for the chat in laptop run time (minutes).
Here’s a regex with five capture groups:
">(NR_\\d+\\.\\d+)\\s+([A-Z][a-z]+)\\s([a-z]+)\\s+strain\\s+(.+?)\\s+16S.*?(partial|complete)"
- (NR_\\d+\\.\\d+) – accession with version
- ([A-Z][a-z]+) – genus (capitalized)
- ([a-z]+) – species epithet (lowercase)
- (.+?) – strain, lazy match to stop at first “16S”
- (partial|complete) – sequence coverage

Use regexec() with perl = TRUE, then regmatches() to extract.
Energy: ~0.3 Wh for this query. At 7W for a laptop screen, that’s about 2.6 minutes of screen time.
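The laptop-minutes figure can be checked with one line of arithmetic. This sketch simply reuses the two numbers quoted above (0.3 Wh per query, 7 W for the screen); the variable names are illustrative.

```r
query_wh <- 0.3   # energy per query quoted above, in watt-hours
screen_w <- 7     # laptop screen power quoted above, in watts

# Wh / W gives hours; multiply by 60 for minutes
minutes <- query_wh / screen_w * 60
round(minutes, 1)
```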
regexec() + regmatches()

regexec() does what regexpr() does but also returns capture group positions. Paired with regmatches(), you get the full match and each group:
pattern <- paste0(
  ">(NR_\\d+\\.\\d+)",       # group 1: accession
  "\\s+([A-Z][a-z]+)",       # group 2: genus
  "\\s([a-z]+)",             # group 3: species
  "\\s+strain\\s+(.+?)",     # group 4: strain (lazy)
  "\\s+16S.*?",              # skip to coverage
  "(partial|complete)"       # group 5: coverage
)
m <- regmatches(seqs, regexec(pattern, seqs, perl = TRUE))
m[[1]]
## [1] ">NR_074769.1 Escherichia coli strain U 5/41 16S ribosomal RNA, partial"
## [2] "NR_074769.1"
## [3] "Escherichia"
## [4] "coli"
## [5] "U 5/41"
## [6] "partial"
regexec() Only

regexec(pattern, seqs, perl = TRUE)
## [[1]]
## [1] 1 2 14 26 38 64
## attr(,"match.length")
## [1] 70 11 11 4 6 7
## attr(,"useBytes")
## [1] TRUE
## attr(,"index.type")
## [1] "chars"
##
## [[2]]
## [1] 1 2 14 26 45 75
## attr(,"match.length")
## [1] 82 11 11 11 10 8
## attr(,"useBytes")
## [1] TRUE
## attr(,"index.type")
## [1] "chars"
##
## [[3]]
## [1] 1 2 14 23 37 67
## attr(,"match.length")
## [1] 73 11 8 6 10 7
## attr(,"useBytes")
## [1] TRUE
## attr(,"index.type")
## [1] "chars"
##
## [[4]]
## [1] 1 2 14 26 43 73
## attr(,"match.length")
## [1] 80 11 11 9 10 8
## attr(,"useBytes")
## [1] TRUE
## attr(,"index.type")
## [1] "chars"
##
## [[5]]
## [1] 1 2 14 26 40 70
## attr(,"match.length")
## [1] 76 11 11 6 10 7
## attr(,"useBytes")
## [1] TRUE
## attr(,"index.type")
## [1] "chars"
##
## [[6]]
## [1] 1 2 14 23 39 67
## attr(,"match.length")
## [1] 74 11 8 8 8 8
## attr(,"useBytes")
## [1] TRUE
## attr(,"index.type")
## [1] "chars"
##
## [[7]]
## [1] 1 2 14 27 42 72
## attr(,"match.length")
## [1] 78 11 12 7 10 7
## attr(,"useBytes")
## [1] TRUE
## attr(,"index.type")
## [1] "chars"
##
## [[8]]
## [1] 1 2 14 26 38 77
## attr(,"match.length")
## [1] 84 11 11 4 19 8
## attr(,"useBytes")
## [1] TRUE
## attr(,"index.type")
## [1] "chars"
##
## [[9]]
## [1] 1 2 14 26 44 73
## attr(,"match.length")
## [1] 79 11 11 10 9 7
## attr(,"useBytes")
## [1] TRUE
## attr(,"index.type")
## [1] "chars"
##
## [[10]]
## [1] 1 2 14 26 43 72
## attr(,"match.length")
## [1] 79 11 11 9 9 8
## attr(,"useBytes")
## [1] TRUE
## attr(,"index.type")
## [1] "chars"
m is a list of character vectors. Each vector has the full match at [1] and capture groups at [2:6]. To get a data frame:
result <- do.call(rbind, lapply(m, function(x) x[2:6]))
result <- as.data.frame(result, stringsAsFactors = FALSE)
names(result) <- c("accession", "genus", "species",
"strain", "coverage")
do.call(rbind, lapply(m, function(x) x[2:6]))
- lapply(m, function(x) x[2:6]) – loop over the list, pull out positions 2–6 (the five capture groups) from each element
- do.call(rbind, ...) – take that list and stack each element as a row in a matrix
- Equivalent to rbind(m[[1]][2:6], m[[2]][2:6], ...) but without all that typing

result
##      accession        genus     species              strain coverage
## 1  NR_074769.1  Escherichia        coli              U 5/41  partial
## 2  NR_028687.1  Pseudomonas fluorescens          ATCC 13525 complete
## 3  NR_112116.1     Bacillus      cereus          ATCC 14579  partial
## 4  NR_041263.1  Clostridium   botulinum          ATCC 25763 complete
## 5  NR_044946.1  Pseudomonas      putida          NBRC 14164  partial
## 6  NR_113266.1     Bacillus    subtilis            JCM 1465 complete
## 7  NR_036861.1 Streptomyces     griseus          NBRC 13350  partial
## 8  NR_074828.1  Escherichia        coli K-12 substr. MG1655 complete
## 9  NR_025530.1  Pseudomonas  aeruginosa           DSM 50071  partial
## 10 NR_116594.1  Clostridium   difficile           ATCC 9689 complete
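An edge case worth running before trusting the table: what happens when a header does not fit the pattern. In this sketch the second header is made up (it has no capitalized genus and no "strain" keyword), and hdr, failed, and tab are illustrative names.

```r
hdr <- c(
  ">NR_074769.1 Escherichia coli strain U 5/41 16S ribosomal RNA, partial sequence",
  # Hypothetical header that does not follow the genus/species/strain layout
  ">NR_000000.1 uncultured bacterium clone X1 16S ribosomal RNA, partial sequence"
)

pattern <- paste0(
  ">(NR_\\d+\\.\\d+)\\s+([A-Z][a-z]+)\\s([a-z]+)",
  "\\s+strain\\s+(.+?)\\s+16S.*?(partial|complete)"
)
m <- regmatches(hdr, regexec(pattern, hdr, perl = TRUE))

# A non-matching header gives a zero-length vector instead of six strings
lengths(m)
failed <- which(lengths(m) == 0)
failed

# Indexing x[2:6] on a zero-length vector yields NAs, so the row is all NA
tab <- do.call(rbind, lapply(m, function(x) x[2:6]))
tab
```

Checking lengths(m) first makes it easy to see which headers need a different pattern, rather than discovering a row of NA values later.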
Download usgs_gauges.txt. Each line looks like:
USGS 04240105 Onondaga Creek at Dorwin Ave., Syracuse NY lat:43.0281 long:-76.1522 drain_area_sqmi:82.2
Extract two features into a data frame:
- readLines() to load
- regmatches() + regexpr() for each

Apply the three rules: try it yourself first, run and check, craft one good prompt if you ask Claude.
(?<=USGS ) – positive lookbehind: match digits after “USGS”
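A minimal sketch of the lookbehind hint, applied only to the single example line shown above rather than the full file. The names gauge and station_id are illustrative.

```r
# The example line from above, typed in directly instead of read from the file
gauge <- "USGS 04240105 Onondaga Creek at Dorwin Ave., Syracuse NY lat:43.0281 long:-76.1522 drain_area_sqmi:82.2"

# Lookbehind: the digits must be preceded by "USGS ", which is not captured
station_id <- regmatches(gauge,
                         regexpr("(?<=USGS )\\d+", gauge, perl = TRUE))
station_id
```

The result is kept as a character string; converting it with as.numeric() would drop the leading zero.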