Version: 4.1.0.2

3.7 Regular Expressions

Regular Expressions in Guide: PLT Scheme introduces regular expressions.

    3.7.1 Regexp Syntax

    3.7.2 Additional Syntactic Constraints

    3.7.3 Regexp Constructors

    3.7.4 Regexp Matching

    3.7.5 Regexp Splitting

    3.7.6 Regexp Substitution

Regular expressions are specified as strings or byte strings, using the same pattern language as the Unix utility egrep or Perl. A string-specified pattern produces a character regexp matcher, and a byte-string pattern produces a byte regexp matcher. If a character regexp is used with a byte string or input port, it matches UTF-8 encodings (see Encodings and Locales) of matching character streams; if a byte regexp is used with a character string, it matches bytes in the UTF-8 encoding of the string.
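As a rough illustration of the character/byte distinction described above, the following sketch matches the one-character pattern . against input that begins with λ (U+03BB, two bytes in UTF-8). The input is our own choice, and the comments give the resulting values.

```scheme
#lang scheme

;; "\u03BBx" is the two-character string λx; its UTF-8 encoding is the
;; three bytes #"\316\273x".

;; Character regexp, string input: . matches one character.
(regexp-match #rx"." "\u03BBx")        ; => ("λ")

;; Byte regexp, string input: . matches one byte of the UTF-8 encoding.
(regexp-match #rx#"." "\u03BBx")       ; => (#"\316")

;; Character regexp, byte-string input: . matches the UTF-8 encoding of
;; one character, and the result is a byte string.
(regexp-match #rx"." #"\316\273x")     ; => (#"\316\273")
```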

Regular expressions can be compiled into a regexp value for repeated matches. The regexp and byte-regexp procedures convert a string or byte string (respectively) into a regexp value using the syntax of regular expressions that is most compatible with egrep. The pregexp and byte-pregexp procedures produce a regexp value using a slightly different syntax that is more compatible with Perl. In addition, Scheme constants written with #rx or #px (see The Reader) produce compiled regexp values.
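A minimal sketch of how the two syntaxes can differ, using the pattern \d as an example of our own choosing: under the regexp syntax the backslash just escapes a literal d, while under the pregexp syntax \d is the digit class.

```scheme
#lang scheme

;; Under the regexp syntax, \d is an escaped literal d.
(regexp-match (regexp "\\d") "d1")           ; => ("d")
(regexp-match #rx"\\d" "d1")                 ; => ("d")

;; Under the pregexp syntax, \d is the digit class.
(regexp-match (pregexp "\\d") "d1")          ; => ("1")
(regexp-match #px"\\d" "d1")                 ; => ("1")

;; The byte-string constructors produce byte regexps.
(regexp-match (byte-regexp #"\\d") #"d1")    ; => (#"d")
(regexp-match (byte-pregexp #"\\d") #"d1")   ; => (#"1")
```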

The internal size of a regexp value is limited to 32 kilobytes; this limit roughly corresponds to a source string with 32,000 literal characters or 5,000 operators.

3.7.1 Regexp Syntax

The following syntax specifications describe the content of a string that represents a regular expression. The syntax of the corresponding string may involve extra escape characters. For example, the regular expression (.*)\1 can be represented with the string "(.*)\\1" or the regexp constant #rx"(.*)\\1"; the \ in the regular expression must be escaped to include it in a string or regexp constant.
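The escaping rule can be checked with a small sketch. The name src is ours, and pregexp is used on the assumption that backreferences belong to the pregexp syntax, which is where the grammar below lists them.

```scheme
#lang scheme

;; The string literal "(.*)\\1" denotes the 6-character pattern (.*)\1;
;; the doubled backslash is consumed when the string is read.
(define src "(.*)\\1")
(string-length src)                    ; => 6

;; (.*) captures "ab", and \1 then matches the second "ab".
(regexp-match (pregexp src) "abab")    ; => ("abab" "ab")
(regexp-match #px"(.*)\\1" "abab")     ; => ("abab" "ab")
```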

The regexp and pregexp syntaxes share a common core:

 

‹regexp›   ::= ‹pces›                  Match ‹pces›
            |  ‹regexp›|‹regexp›       Match either ‹regexp›, try left first
‹pces›     ::= ‹pce›                   Match ‹pce›
            |  ‹pce›‹pces›             Match ‹pce› followed by ‹pces›
‹pce›      ::= ‹repeat›                Match ‹repeat›, longest possible
            |  ‹repeat›?               Match ‹repeat›, shortest possible
            |  ‹atom›                  Match ‹atom› exactly once
‹repeat›   ::= ‹atom›*                 Match ‹atom› 0 or more times
            |  ‹atom›+                 Match ‹atom› 1 or more times
            |  ‹atom›?                 Match ‹atom› 0 or 1 times
‹atom›     ::= (‹regexp›)              Match sub-expression ‹regexp› and report
            |  [‹rng›]                 Match any character in ‹rng›
            |  [^‹rng›]                Match any character not in ‹rng›
            |  .                       Match any (except newline in multi mode)
            |  ^                       Match start (or after newline in multi mode)
            |  $                       Match end (or before newline in multi mode)
            |  ‹literal›               Match a single literal character
            |  (?‹mode›:‹regexp›)      Match ‹regexp› using ‹mode›
            |  (?>‹regexp›)            Match ‹regexp›, only first possible
            |  ‹look›                  Match empty if ‹look› matches
            |  (?‹tst›‹pces›|‹pces›)   Match 1st ‹pces› if ‹tst›, else 2nd ‹pces›
            |  (?‹tst›‹pces›)          Match ‹pces› if ‹tst›, empty if not ‹tst›
‹rng›      ::= ]                       ‹rng› contains ] only
            |  -                       ‹rng› contains - only
            |  ‹mrng›                  ‹rng› contains everything in ‹mrng›
            |  ‹mrng›-                 ‹rng› contains - and everything in ‹mrng›
‹mrng›     ::= ]‹lrng›                 ‹mrng› contains ] and everything in ‹lrng›
            |  -‹lrng›                 ‹mrng› contains - and everything in ‹lrng›
            |  ‹lrng›                  ‹mrng› contains everything in ‹lrng›
‹lrng›     ::= ‹rliteral›              ‹lrng› contains a literal character
            |  ‹rliteral›-‹rliteral›   ‹lrng› contains Unicode range inclusive
            |  ‹lrng›‹lrng›            ‹lrng› contains everything in both

 

‹look›     ::= (?=‹regexp›)            Match if ‹regexp› matches
            |  (?!‹regexp›)            Match if ‹regexp› doesn't match
            |  (?<=‹regexp›)           Match if ‹regexp› matches preceding
            |  (?<!‹regexp›)           Match if ‹regexp› doesn't match preceding
‹tst›      ::= (‹n›)                   True if ‹n›th ( has a match
            |  ‹look›                  True if ‹look› matches

 

‹mode›     ::=                         Like the enclosing mode
            |  ‹mode›i                 Like ‹mode›, but case-insensitive
            |  ‹mode›-i                Like ‹mode›, but case-sensitive
            |  ‹mode›s                 Like ‹mode›, but not in multi mode
            |  ‹mode›-s                Like ‹mode›, but in multi mode
            |  ‹mode›m                 Like ‹mode›, but in multi mode
            |  ‹mode›-m                Like ‹mode›, but not in multi mode
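The following sketch exercises a few productions of the common core. The patterns and inputs are illustrative choices of ours, written with #rx; the same patterns should behave the same way under #px, since the core is shared.

```scheme
#lang scheme

;; ‹regexp›|‹regexp›: the left alternative is tried first.
(regexp-match #rx"a|ab" "ab")                  ; => ("a")

;; ‹repeat› takes the longest match; ‹repeat›? takes the shortest.
(regexp-match #rx"a*" "aaa")                   ; => ("aaa")
(regexp-match #rx"a*?" "aaa")                  ; => ("")

;; (‹regexp›) reports its sub-match after the overall match.
(regexp-match #rx"([a-z]+)-([0-9]+)" "ab-12")  ; => ("ab-12" "ab" "12")

;; (?‹mode›:‹regexp›) with mode i matches case-insensitively.
(regexp-match #rx"(?i:hello)" "HeLLo")         ; => ("HeLLo")

;; In multi mode (m), ^ also matches just after a newline.
(regexp-match #rx"^b" "a\nb")                  ; => #f
(regexp-match #rx"(?m:^b)" "a\nb")             ; => ("b")

;; ‹look› patterns consume no input themselves.
(regexp-match #rx"foo(?=bar)" "foobar")        ; => ("foo")
(regexp-match #rx"(?<=foo)bar" "foobar")       ; => ("bar")
```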

The following completes the grammar for regexp, which treats { and } as literals, treats \ as a literal within ranges, and outside of ranges uses \ to turn the following character into a literal.

 

‹literal›  ::= Any character except (, ), *, +, ?, [, ., ^, \, or |
            |  \‹aliteral›             Match ‹aliteral›
‹aliteral› ::= Any character
‹rliteral› ::= Any character except ] or -
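A short sketch of the regexp-specific rules, with inputs chosen by us for illustration: braces are plain characters, a backslash outside a range escapes the next character, and a backslash inside a range stands for itself.

```scheme
#lang scheme

;; With the regexp syntax, { and } are ordinary literal characters.
(regexp-match #rx"a{2}" "a{2}")   ; => ("a{2}")
(regexp-match #rx"a{2}" "aa")     ; => #f

;; Outside a range, \ makes the next character a literal.
(regexp-match #rx"\\*" "2*3")     ; => ("*")

;; Inside a range, \ is itself a literal; the input here is a\b.
(regexp-match #rx"[\\]" "a\\b")   ; => ("\\")
```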

The following completes the grammar for pregexp, which uses { and } for bounded repetition and uses \ for meta-characters both inside and outside of ranges.

 

‹repeat›   ::= ...                     ...
            |  ‹atom›{‹n›}             Match ‹atom› exactly ‹n› times
            |  ‹atom›{‹n›,}            Match ‹atom› ‹n› or more times
            |  ‹atom›{,‹m›}            Match ‹atom› between 0 and ‹m› times
            |  ‹atom›{‹n›,‹m›}         Match ‹atom› between ‹n› and ‹m› times
‹atom›     ::= ...                     ...
            |  \‹n›                    Match latest reported match for ‹n›th (
            |  ‹class›                 Match any character in ‹class›
            |  \b                      Match \w* boundary
            |  \B                      Match where \b does not
            |  \p{‹property›}          Match (UTF-8 encoded) in ‹property›
            |  \P{‹property›}          Match (UTF-8 encoded) not in ‹property›
‹literal›  ::= Any character except (, ), *, +, ?, [, ], {, }, ., ^, \, or |
            |  \‹aliteral›             Match ‹aliteral›
‹aliteral› ::= Any character except a-z, A-Z, 0-9
‹lrng›     ::= ...                     ...
            |  ‹class›                 ‹lrng› contains all characters in ‹class›
            |  ‹posix›                 ‹lrng› contains all characters in ‹posix›
            |  \‹eliteral›             ‹lrng› contains ‹eliteral›
‹rliteral› ::= Any character except ], \, or -
‹eliteral› ::= Any character except a-z, A-Z
‹class›    ::= \d                      Contains 0-9
            |  \D                      Contains ASCII other than those in \d
            |  \w                      Contains a-z, A-Z, 0-9, _
            |  \W                      Contains ASCII other than those in \w
            |  \s                      Contains space, tab, newline, formfeed, return
            |  \S                      Contains ASCII other than those in \s

 

‹posix›    ::= [:alpha:]               Contains a-z, A-Z
            |  [:alnum:]               Contains a-z, A-Z, 0-9
            |  [:ascii:]               Contains all ASCII characters
            |  [:blank:]               Contains space and tab
            |  [:cntrl:]               Contains all characters with scalar value < 32
            |  [:digit:]               Contains 0-9
            |  [:graph:]               Contains all ASCII characters that use ink
            |  [:lower:]               Contains a-z
            |  [:print:]               Contains space, tab, and ASCII ink users
            |  [:space:]               Contains space, tab, newline, formfeed, return
            |  [:upper:]               Contains A-Z
            |  [:word:]                Contains a-z, A-Z, 0-9, _
            |  [:xdigit:]              Contains 0-9, a-f, A-F

 

‹property› ::= ‹category›              Includes all characters in ‹category›
            |  ^‹category›             Includes all characters not in ‹category›
‹category› ::= Ll | Lu | Lt | Lm       Unicode general category
            |  L&                      Union of Ll, Lu, Lt, and Lm
            |  Lo                      Unicode general category
            |  L                       Union of L& and Lo
            |  Nd | Nl | No            Unicode general category
            |  N                       Union of Nd, Nl, and No
            |  Ps | Pe | Pi | Pf       Unicode general category
            |  Pc | Pd | Po            Unicode general category
            |  P                       Union of Ps, Pe, Pi, Pf, Pc, Pd, and Po
            |  Mn | Mc | Me            Unicode general category
            |  M                       Union of Mn, Mc, and Me
            |  Sc | Sk | Sm | So       Unicode general category
            |  S                       Union of Sc, Sk, Sm, and So
            |  Zl | Zp | Zs            Unicode general category
            |  Z                       Union of Zl, Zp, and Zs
            |  .                       Union of all general categories
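The pregexp extensions can be tried out with a sketch like the one below. The patterns and inputs are our own examples; each comment names the production being exercised.

```scheme
#lang scheme

;; ‹atom›{‹n›,‹m›}: bounded repetition, longest match first.
(regexp-match #px"\\d{2,3}" "a1234")          ; => ("123")

;; \b matches only at a word boundary.
(regexp-match #px"\\bcat\\b" "a cat!")        ; => ("cat")
(regexp-match #px"\\bcat\\b" "concat")        ; => #f

;; \‹n›: backreference to the first parenthesized group.
(regexp-match #px"(\\w)\\1" "hello")          ; => ("ll" "l")

;; ‹posix› classes are written inside a range.
(regexp-match #px"[[:alpha:]]+" "abc123")     ; => ("abc")

;; \p{‹property›}: Unicode general category; ^ negates the category.
(regexp-match #px"\\p{Lu}" "abC")             ; => ("C")
(regexp-match #px"\\p{^Lu}+" "abC")           ; => ("ab")
```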

3.7.2 Additional Syntactic Constraints

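In addition to matching a grammar, regular expressions must meet two syntactic restrictions:

    In a ‹repeat› other than ‹atom›?, the ‹atom› must not match an empty sequence.

    In a (?<=‹regexp›) or (?<!‹regexp›), the ‹regexp› must match a bounded sequence only.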

These constraints are checked syntactically by the following type system. A type [n, m] corresponds to an expression that matches between n and m characters. In the rule for (‹regexp›), N means the number such that the opening parenthesis is the Nth opening parenthesis for collecting match reports. Non-emptiness is inferred for a backreference pattern, \‹N›, so that a backreference can be used for repetition patterns; in the case of mutual dependencies among backreferences, the inference chooses the fixpoint that maximizes non-emptiness. Finiteness is not inferred for backreferences (i.e., a backreference is assumed to match an arbitrarily large sequence).

 

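 ‹regexp1› : [n1, m1]   ‹regexp2› : [n2, m2]
--------------------------------------------------
 ‹regexp1›|‹regexp2› : [min(n1, n2), max(m1, m2)]

 ‹pce› : [n1, m1]   ‹pces› : [n2, m2]
--------------------------------------
 ‹pce›‹pces› : [n1+n2, m1+m2]

 ‹repeat› : [n, m]
--------------------
 ‹repeat›? : [0, m]

 ‹atom› : [n, m]   n > 0
-------------------------
 ‹atom›* : [0, ∞]

 ‹atom› : [n, m]   n > 0
-------------------------
 ‹atom›+ : [1, ∞]

 ‹atom› : [n, m]
------------------
 ‹atom›? : [0, m]

 ‹atom› : [n, m]   n > 0
------------------------------
 ‹atom›{‹n›} : [n*‹n›, m*‹n›]

 ‹atom› : [n, m]   n > 0
----------------------------
 ‹atom›{‹n›,} : [n*‹n›, ∞]

 ‹atom› : [n, m]   n > 0
----------------------------
 ‹atom›{,‹m›} : [0, m*‹m›]

 ‹atom› : [n, m]   n > 0
----------------------------------
 ‹atom›{‹n›,‹m›} : [n*‹n›, m*‹m›]

 ‹regexp› : [n, m]
-----------------------------
 (‹regexp›) : [n, m]   αN=n

 ‹regexp› : [n, m]
-----------------------------
 (?‹mode›:‹regexp›) : [n, m]

 ‹regexp› : [n, m]
-----------------------
 (?=‹regexp›) : [0, 0]

 ‹regexp› : [n, m]
-----------------------
 (?!‹regexp›) : [0, 0]

 ‹regexp› : [n, m]   m < ∞
----------------------------
 (?<=‹regexp›) : [0, 0]

 ‹regexp› : [n, m]   m < ∞
----------------------------
 (?<!‹regexp›) : [0, 0]

 ‹regexp› : [n, m]
-----------------------
 (?>‹regexp›) : [n, m]

 ‹tst› : [n0, m0]   ‹pces1› : [n1, m1]   ‹pces2› : [n2, m2]
-------------------------------------------------------------
 (?‹tst›‹pces1›|‹pces2›) : [min(n1, n2), max(m1, m2)]

 ‹tst› : [n0, m0]   ‹pces› : [n1, m1]
---------------------------------------
 (?‹tst›‹pces›) : [0, m1]

 (‹n›) : [αN, ∞]
 [‹rng›] : [1, 1]
 [^‹rng›] : [1, 1]
 . : [1, 1]
 ^ : [0, 0]
 $ : [0, 0]
 ‹literal› : [1, 1]
 \‹n› : [αN, ∞]
 ‹class› : [1, 1]
 \b : [0, 0]
 \B : [0, 0]
 \p{‹property›} : [1, 6]
 \P{‹property›} : [1, 6]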
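The two restrictions can be observed at compile time. The sketch below assumes that a pattern violating them causes pregexp to raise an exn:fail exception; compiles? is a hypothetical helper of ours, not a library procedure.

```scheme
#lang scheme

;; Hypothetical helper: #t if str compiles as a pregexp, #f if
;; compilation raises an exn:fail exception.
(define (compiles? str)
  (with-handlers ([exn:fail? (lambda (e) #f)])
    (pregexp? (pregexp str))))

;; a+ : [1, ∞], so (a+) is non-empty and may be repeated with *.
(compiles? "(a+)*")          ; => #t
;; a* : [0, ∞], so (a*) could be empty and may not be repeated.
(compiles? "(a*)*")          ; => #f

;; a{1,3} : [1, 3] is bounded, so it may appear in a lookbehind.
(compiles? "(?<=a{1,3})b")   ; => #t
;; a+ : [1, ∞] is unbounded, so it may not.
(compiles? "(?<=a+)b")       ; => #f
```

Patterns are passed as strings here rather than #px constants, since a constant that fails to compile would presumably be rejected when the program is read.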

3.7.3 Regexp Constructors

(regexp? v) → boolean?

  v : any/c

Returns #t if v is a regexp value created by regexp or pregexp, #f otherwise.

(pregexp? v) → boolean?

  v : any/c

Returns #t if v is a regexp value created by pregexp (not regexp), #f otherwise.

(byte-regexp? v) → boolean?

  v : any/c

Returns #t if v is a regexp value created by byte-regexp or byte-pregexp, #f otherwise.

(byte-pregexp? v) → boolean?

  v : any/c

Returns #t if v is a regexp value created by byte-pregexp (not byte-regexp), #f otherwise.
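As a quick sketch of how the four predicates relate, the following applies them to regexp constants; the constants themselves are arbitrary examples.

```scheme
#lang scheme

;; Character regexps: pregexp values also satisfy regexp?.
(regexp? #rx"a")          ; => #t
(regexp? #px"a")          ; => #t
(pregexp? #rx"a")         ; => #f
(pregexp? #px"a")         ; => #t

;; Byte regexps are a separate family.
(byte-regexp? #rx#"a")    ; => #t
(byte-regexp? #px#"a")    ; => #t
(byte-pregexp? #rx#"a")   ; => #f
(byte-pregexp? #px#"a")   ; => #t

;; The families do not overlap, and a plain string is not a regexp value.
(regexp? #rx#"a")         ; => #f
(byte-regexp? #rx"a")     ; => #f
(regexp? "a")             ; => #f
```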

(regexp str) → regexp?

  str : string?

Takes a string representation of a regular expression (using the syntax in Regexp Syntax) and compiles it into a regexp value. Other regular expression procedures accept either a string or a regexp value as the matching pattern. If a regular expression string is used multiple times, it is faster to compile the string once to a regexp value and use it for repeated matches instead of using the string each time.

The object-name procedure returns the source string for a regexp value.

Examples:

  > (regexp "ap*le")

  #rx"ap*le"

  > (object-name #rx"ap*le")

  "ap*le"
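The advice about compiling once can be sketched as follows. The name apple-rx and the word list are ours, for illustration only.

```scheme
#lang scheme

;; Compile the pattern once ...
(define apple-rx (regexp "ap*le"))

;; ... and reuse the compiled value for every match.
(filter (lambda (s) (regexp-match? apple-rx s))
        '("apple" "pear" "ale" "maple"))   ; => ("apple" "ale" "maple")

;; object-name recovers the source string of the compiled value.
(object-name apple-rx)                     ; => "ap*le"
```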

(pregexp string) → pregexp?

  string : string?

Like regexp, except that it uses a slightly different syntax (see Regexp Syntax). The result can be used with regexp-match, etc., just like the result from regexp.

Examples:

  > (pregexp "ap*le")

  #px"ap*le"

  > (