Basic String Functions

We’ve been dealing with strings for a while now, but before we proceed there are two basic functions for working with strings that you should be aware of. The first is paste(), which pastes strings together. A few examples of using paste() are below:

paste("It was the best of times", "it was the worst of times")
## [1] "It was the best of times it was the worst of times"
paste("It was the", c("best", "worst"))
## [1] "It was the best"  "It was the worst"
paste("It was the", c("best", "worst"), "of times")
## [1] "It was the best of times"  "It was the worst of times"
paste("It was the best of times", "it was the worst of times", sep = ", ")
## [1] "It was the best of times, it was the worst of times"
paste(c("best", "worst"), collapse = "")
## [1] "bestworst"
paste("It was the", c("best", "worst"), collapse = "")
## [1] "It was the bestIt was the worst"

Another generally useful string function is nchar(), which counts the number of characters in a string.

nchar("Hello, it's me")
## [1] 14
nchar(c("I", "was", "wondering", "if", "after", "all", "these", "years", "you'd", "like", "to", "meet"))
##  [1] 1 3 9 2 5 3 5 5 5 4 2 4

Regular Expressions

[Figure: xkcd comic, "Perl Problems". For context, Perl is a programming language.]

In extraordinarily broad terms, data is stored on computers either as binary data or as text. The practical difference between these two methods of storing data is that text is comprehensible to the human eye. Although a dataset might be comprehensible to your eye, it may not be comprehensible to a computer. To make text data comprehensible to a computer, the data often needs to be cleaned.

Data can often be cleaned programmatically; in other words, you can often write a computer program (an R function) to clean data systematically. You can get text data into R using readLines(), which returns a character vector in which each element is one line of the text file you provided as an argument. You might want to search these strings for patterns so that you can manipulate the matching text.
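As a rough sketch of that workflow, the example below writes a few made-up lines to a temporary file, reads them back with readLines(), and applies one simple cleaning step. The file contents and the variable names (tmp, raw_lines, cleaned) are assumptions chosen for illustration only.

```r
# Write a small text file to a temporary location so the example is
# self-contained. The contents are invented for illustration.
tmp <- tempfile(fileext = ".txt")
writeLines(c("id, name", "1, Ada ", "2,  Grace"), tmp)

# readLines() returns a character vector with one element per line.
raw_lines <- readLines(tmp)
length(raw_lines)
## [1] 3

# One simple cleaning step: strip leading and trailing whitespace.
cleaned <- trimws(raw_lines)
cleaned[2]
## [1] "1, Ada"

# Remove the temporary file again.
unlink(tmp)
```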

Regular expressions are a feature built into many programming languages that can be used for searching strings. You can create a regular expression by combining ordinary characters with metacharacters, some of which are listed below.

Metacharacters

Character   Explanation
\\          Escape: starts a shorthand such as \\d, or makes the next metacharacter literal (typed as two backslashes inside an R string)
.           Any single character
*           Preceding expression (PE) zero or more times
?           PE zero or one time
+           PE one or more times
{x}         PE exactly x times
{x,}        PE x or more times
{x,y}       PE at least x and at most y times
[abc]       Any one of the characters within the brackets
[^abc]      Any one character except those within the brackets
[a-z]       Any one character between a and z
[^a-z]      Any one character except those between a and z
a|b         a or b
\\d         Any digit
\\D         Any non-digit
\\s         Any whitespace character
\\S         Any non-whitespace character
\\w         Any ‘word’ character (letter, digit, or underscore)
\\W         Any ‘non-word’ character
^           Start of string
$           End of string
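To make a few of these metacharacters concrete, here is a minimal sketch that tests them with grepl(), which returns TRUE for each string that contains a match. The input vector x and the result names are made up for this example.

```r
# Hypothetical input, invented for this example.
x <- c("Alabama", "Ohio", "New York", "Utah", "Texas 77001")

# ^ anchors to the start of the string: does it begin with a capital vowel?
starts_with_vowel <- grepl("^[AEIOU]", x)
starts_with_vowel
## [1]  TRUE  TRUE FALSE  TRUE FALSE

# $ anchors to the end of the string: does it end in "a"?
ends_with_a <- grepl("a$", x)
ends_with_a
## [1]  TRUE FALSE FALSE FALSE FALSE

# \\d{5} looks for a run of five digits.
has_five_digits <- grepl("\\d{5}", x)
has_five_digits
## [1] FALSE FALSE FALSE FALSE  TRUE

# \\s looks for any whitespace character.
has_space <- grepl("\\s", x)
has_space
## [1] FALSE FALSE  TRUE FALSE  TRUE

# | means "or": match either "io" or "ah".
has_io_or_ah <- grepl("io|ah", x)
has_io_or_ah
## [1] FALSE  TRUE FALSE  TRUE FALSE
```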

Using Regular Expressions in R

The grep() function takes two main arguments: a pattern and the text to search.

grep
grepl
sub
gsub
grepl("\\B", c("aa", "b", "c", "."))
## [1]  TRUE FALSE FALSE FALSE
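
The four base R functions listed above differ mainly in what they return: grep() gives the positions of the matching elements (or the elements themselves with value = TRUE), grepl() gives a logical vector, sub() replaces the first match in each string, and gsub() replaces every match. A small sketch, using a made-up vector called desserts:

```r
# Hypothetical input, invented for this example.
desserts <- c("apple pie", "banana split", "cherry tart")

# grep() returns the indices of the elements that match the pattern.
match_idx <- grep("an", desserts)
match_idx
## [1] 2

# With value = TRUE, grep() returns the matching elements instead.
match_val <- grep("an", desserts, value = TRUE)
match_val
## [1] "banana split"

# grepl() returns TRUE or FALSE for every element.
has_e <- grepl("e", desserts)
has_e
## [1]  TRUE FALSE  TRUE

# sub() replaces only the first match in each string.
first_replaced <- sub("a", "A", desserts)
first_replaced
## [1] "Apple pie"    "bAnana split" "cherry tArt"

# gsub() replaces every match in each string.
all_replaced <- gsub("a", "A", desserts)
all_replaced
## [1] "Apple pie"    "bAnAnA split" "cherry tArt"
```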

stringr

str_count()
str_detect()
str_dup()
str_extract()
str_match()
str_replace()
str_split()
str_to_lower()
str_to_title()
str_to_upper()
str_trim()
# str_c(), word(), fixed() and str_wrap() come from the stringr package
library(stringr)
thanks_path <- file.path(R.home("doc"), "THANKS")
thanks <- str_c(readLines(thanks_path), collapse = "\n")
thanks <- word(thanks, 1, 3, fixed("\n\n"))
cat(str_wrap(thanks), "\n")
## R would not be what it is today without the invaluable help of these people,
## who contributed by donating code, bug fixes and documentation: Valerio Aimale,
## Thomas Baier, Henrik Bengtsson, Roger Bivand, Ben Bolker, David Brahm, G"oran
## Brostr"om, Patrick Burns, Vince Carey, Saikat DebRoy, Brian D'Urso, Lyndon
## Drake, Dirk Eddelbuettel, Claus Ekstrom, Sebastian Fischmeister, John Fox,
## Paul Gilbert, Yu Gong, Gabor Grothendieck, Frank E Harrell Jr, Torsten Hothorn,
## Robert King, Kjetil Kjernsmo, Roger Koenker, Philippe Lambert, Jan de Leeuw,
## Jim Lindsey, Patrick Lindsey, Catherine Loader, Gordon Maclean, John Maindonald,
## David Meyer, Ei-ji Nakama, Jens Oehlschaegel, Steve Oncley, Richard O'Keefe,
## Hubert Palme, Roger D. Peng, Jose' C. Pinheiro, Tony Plate, Anthony Rossini,
## Jonathan Rougier, Petr Savicky, Guenther Sawitzki, Marc Schwartz, Detlef Steuer,
## Bill Simpson, Gordon Smyth, Adrian Trapletti, Terry Therneau, Rolf Turner,
## Bill Venables, Gregory R. Warnes, Andreas Weingessel, Morten Welinder, James
## Wettenhall, Simon Wood, and Achim Zeileis. Others have written code that has
## been adopted by R and is acknowledged in the code files, including
cat(str_wrap(thanks, width = 40), "\n")
## R would not be what it is today without
## the invaluable help of these people,
## who contributed by donating code, bug
## fixes and documentation: Valerio Aimale,
## Thomas Baier, Henrik Bengtsson, Roger
## Bivand, Ben Bolker, David Brahm, G"oran
## Brostr"om, Patrick Burns, Vince Carey,
## Saikat DebRoy, Brian D'Urso, Lyndon
## Drake, Dirk Eddelbuettel, Claus Ekstrom,
## Sebastian Fischmeister, John Fox, Paul
## Gilbert, Yu Gong, Gabor Grothendieck,
## Frank E Harrell Jr, Torsten Hothorn,
## Robert King, Kjetil Kjernsmo, Roger
## Koenker, Philippe Lambert, Jan de
## Leeuw, Jim Lindsey, Patrick Lindsey,
## Catherine Loader, Gordon Maclean, John
## Maindonald, David Meyer, Ei-ji Nakama,
## Jens Oehlschaegel, Steve Oncley,
## Richard O'Keefe, Hubert Palme, Roger
## D. Peng, Jose' C. Pinheiro, Tony Plate,
## Anthony Rossini, Jonathan Rougier,
## Petr Savicky, Guenther Sawitzki, Marc
## Schwartz, Detlef Steuer, Bill Simpson,
## Gordon Smyth, Adrian Trapletti, Terry
## Therneau, Rolf Turner, Bill Venables,
## Gregory R. Warnes, Andreas Weingessel,
## Morten Welinder, James Wettenhall, Simon
## Wood, and Achim Zeileis. Others have
## written code that has been adopted by R
## and is acknowledged in the code files,
## including
cat(str_wrap(thanks, width = 60, indent = 2), "\n")
##   R would not be what it is today without the invaluable help
## of these people, who contributed by donating code, bug fixes
## and documentation: Valerio Aimale, Thomas Baier, Henrik
## Bengtsson, Roger Bivand, Ben Bolker, David Brahm, G"oran
## Brostr"om, Patrick Burns, Vince Carey, Saikat DebRoy, Brian
## D'Urso, Lyndon Drake, Dirk Eddelbuettel, Claus Ekstrom,
## Sebastian Fischmeister, John Fox, Paul Gilbert, Yu Gong,
## Gabor Grothendieck, Frank E Harrell Jr, Torsten Hothorn,
## Robert King, Kjetil Kjernsmo, Roger Koenker, Philippe
## Lambert, Jan de Leeuw, Jim Lindsey, Patrick Lindsey,
## Catherine Loader, Gordon Maclean, John Maindonald, David
## Meyer, Ei-ji Nakama, Jens Oehlschaegel, Steve Oncley,
## Richard O'Keefe, Hubert Palme, Roger D. Peng, Jose' C.
## Pinheiro, Tony Plate, Anthony Rossini, Jonathan Rougier,
## Petr Savicky, Guenther Sawitzki, Marc Schwartz, Detlef
## Steuer, Bill Simpson, Gordon Smyth, Adrian Trapletti, Terry
## Therneau, Rolf Turner, Bill Venables, Gregory R. Warnes,
## Andreas Weingessel, Morten Welinder, James Wettenhall, Simon
## Wood, and Achim Zeileis. Others have written code that has
## been adopted by R and is acknowledged in the code files,
## including
cat(str_wrap(thanks, width = 60, exdent = 2), "\n")
## R would not be what it is today without the invaluable help
##   of these people, who contributed by donating code, bug fixes
##   and documentation: Valerio Aimale, Thomas Baier, Henrik
##   Bengtsson, Roger Bivand, Ben Bolker, David Brahm, G"oran
##   Brostr"om, Patrick Burns, Vince Carey, Saikat DebRoy, Brian
##   D'Urso, Lyndon Drake, Dirk Eddelbuettel, Claus Ekstrom,
##   Sebastian Fischmeister, John Fox, Paul Gilbert, Yu Gong,
##   Gabor Grothendieck, Frank E Harrell Jr, Torsten Hothorn,
##   Robert King, Kjetil Kjernsmo, Roger Koenker, Philippe
##   Lambert, Jan de Leeuw, Jim Lindsey, Patrick Lindsey,
##   Catherine Loader, Gordon Maclean, John Maindonald, David
##   Meyer, Ei-ji Nakama, Jens Oehlschaegel, Steve Oncley,
##   Richard O'Keefe, Hubert Palme, Roger D. Peng, Jose' C.
##   Pinheiro, Tony Plate, Anthony Rossini, Jonathan Rougier,
##   Petr Savicky, Guenther Sawitzki, Marc Schwartz, Detlef
##   Steuer, Bill Simpson, Gordon Smyth, Adrian Trapletti, Terry
##   Therneau, Rolf Turner, Bill Venables, Gregory R. Warnes,
##   Andreas Weingessel, Morten Welinder, James Wettenhall, Simon
##   Wood, and Achim Zeileis. Others have written code that has
##   been adopted by R and is acknowledged in the code files,
##   including
cat(str_wrap(thanks, width = 0, exdent = 2), "\n")
## R
##   would
##   not
##   be
##   what
##   it
##   is
##   today
##   without
##   the
##   invaluable
##   help
##   of
##   these
##   people,
##   who
##   contributed
##   by
##   donating
##   code,
##   bug
##   fixes
##   and
##   documentation:
##   Valerio
##   Aimale,
##   Thomas
##   Baier,
##   Henrik
##   Bengtsson,
##   Roger
##   Bivand,
##   Ben
##   Bolker,
##   David
##   Brahm,
##   G"oran
##   Brostr"om,
##   Patrick
##   Burns,
##   Vince
##   Carey,
##   Saikat
##   DebRoy,
##   Brian
##   D'Urso,
##   Lyndon
##   Drake,
##   Dirk
##   Eddelbuettel,
##   Claus
##   Ekstrom,
##   Sebastian
##   Fischmeister,
##   John
##   Fox,
##   Paul
##   Gilbert,
##   Yu
##   Gong,
##   Gabor
##   Grothendieck,
##   Frank
##   E
##   Harrell
##   Jr,
##   Torsten
##   Hothorn,
##   Robert
##   King,
##   Kjetil
##   Kjernsmo,
##   Roger
##   Koenker,
##   Philippe
##   Lambert,
##   Jan
##   de
##   Leeuw,
##   Jim
##   Lindsey,
##   Patrick
##   Lindsey,
##   Catherine
##   Loader,
##   Gordon
##   Maclean,
##   John
##   Maindonald,
##   David
##   Meyer,
##   Ei-
##   ji
##   Nakama,
##   Jens
##   Oehlschaegel,
##   Steve
##   Oncley,
##   Richard
##   O'Keefe,
##   Hubert
##   Palme,
##   Roger
##   D.
##   Peng,
##   Jose'
##   C.
##   Pinheiro,
##   Tony
##   Plate,
##   Anthony
##   Rossini,
##   Jonathan
##   Rougier,
##   Petr
##   Savicky,
##   Guenther
##   Sawitzki,
##   Marc
##   Schwartz,
##   Detlef
##   Steuer,
##   Bill
##   Simpson,
##   Gordon
##   Smyth,
##   Adrian
##   Trapletti,
##   Terry
##   Therneau,
##   Rolf
##   Turner,
##   Bill
##   Venables,
##   Gregory
##   R.
##   Warnes,
##   Andreas
##   Weingessel,
##   Morten
##   Welinder,
##   James
##   Wettenhall,
##   Simon
##   Wood,
##   and
##   Achim
##   Zeileis.
##   Others
##   have
##   written
##   code
##   that
##   has
##   been
##   adopted
##   by
##   R
##   and
##   is
##   acknowledged
##   in
##   the
##   code
##   files,
##   including
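
Several of the stringr functions listed at the start of this section can be tried on a small vector. The sketch below assumes the stringr package is installed; the vector fruit and the result names are made up for illustration.

```r
library(stringr)

# Hypothetical input with stray whitespace, invented for this example.
fruit <- c("  apple", "banana  ", "kiwi fruit")

# str_trim() removes leading and trailing whitespace.
trimmed <- str_trim(fruit)
trimmed
## [1] "apple"      "banana"     "kiwi fruit"

# str_detect() is the stringr counterpart of grepl().
has_an <- str_detect(fruit, "an")
has_an
## [1] FALSE  TRUE FALSE

# str_count() counts the matches in each string.
n_a <- str_count(fruit, "a")
n_a
## [1] 1 3 0

# str_extract() returns the first match in each string.
first_vowel <- str_extract(fruit, "[aeiou]")
first_vowel
## [1] "a" "a" "i"

# str_replace() replaces the first match, much like sub().
replaced <- str_replace(trimmed, "a", "A")
replaced
## [1] "Apple"      "bAnana"     "kiwi fruit"

# Case conversion and duplication.
titled <- str_to_title("kiwi fruit")
titled
## [1] "Kiwi Fruit"
dupped <- str_dup("ab", 3)
dupped
## [1] "ababab"

# str_split() returns a list with one character vector per input string.
parts <- str_split("a,b,c", ",")
parts[[1]]
## [1] "a" "b" "c"
```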

Practice with regular expressions online: http://regexr.com/

