3.7 Regular Expressions
See Regular Expressions in Guide: PLT Scheme for an introduction to regular expressions.
Regular expressions are specified as strings or byte strings, using the same pattern language as the Unix utility egrep or Perl. A string-specified pattern produces a character regexp matcher, and a byte-string pattern produces a byte regexp matcher. If a character regexp is used with a byte string or input port, it matches UTF-8 encodings (see Encodings and Locales) of matching character streams; if a byte regexp is used with a character string, it matches bytes in the UTF-8 encoding of the string.
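The interaction between character and byte matchers can be sketched as follows. This is a minimal illustration, assuming a current Racket installation (the successor to PLT Scheme) and the standard regexp-match procedure; the variable names are chosen here for illustration only.

```racket
#lang racket/base

;; Character regexp on a character string: the match is a string.
(define char-on-string (regexp-match #rx"a." "xaby"))

;; Character regexp on a byte string: the match is a byte string.
(define char-on-bytes (regexp-match #rx"a." #"xaby"))

;; Byte regexp on a character string: the match is a byte string.
(define byte-on-string (regexp-match #rx#"a." "xaby"))

;; "\u03BB" (Greek lambda) occupies two bytes in UTF-8: #"\316\273".
;; In a character regexp, `.` consumes the whole encoded character ...
(define char-dot (regexp-match #rx"." #"\316\273"))

;; ... while in a byte regexp, `.` consumes a single byte of the encoding.
(define byte-dot (regexp-match #rx#"." "\u03BB"))

(list char-on-string char-on-bytes byte-on-string char-dot byte-dot)
```

Only the first combination (character regexp, character string) yields strings; every other combination reports byte strings.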
Regular expressions can be compiled into a regexp value for repeated matches. The regexp and byte-regexp procedures convert a string or byte string (respectively) into a regexp value using the syntax of regular expressions that is most compatible with egrep. The pregexp and byte-pregexp procedures produce a regexp value using a slightly different syntax of regular expressions that is more compatible with Perl. In addition, Scheme constants written with #rx or #px (see The Reader) produce compiled regexp values.
The internal size of a regexp value is limited to 32 kilobytes; this limit roughly corresponds to a source string with 32,000 literal characters or 5,000 operators.
3.7.1 Regexp Syntax
The following syntax specifications describe the content of a string that represents a regular expression. The syntax of the corresponding string may involve extra escape characters. For example, the regular expression (.*)\1 can be represented with the string "(.*)\\1" or the regexp constant #rx"(.*)\\1"; the \ in the regular expression must be escaped to include it in a string or regexp constant.
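The extra layer of escaping can be checked directly. The sketch below assumes a current Racket installation; the names source and dotted are illustrative only. It uses an escaped dot rather than a backreference so that it depends only on the grammar shared by both syntaxes.

```racket
#lang racket/base

;; The regular expression (.*)\1 has six characters. Its string form is
;; written with a doubled backslash, but the string itself still holds six.
(define source "(.*)\\1")
(define source-length (string-length source))   ; 6

;; A literal dot is \. in the regular expression, so "\\." in a constant.
(define dotted #rx"a\\.b")
(define literal-hit  (regexp-match dotted "a.b"))   ; '("a.b")
(define literal-miss (regexp-match dotted "axb"))   ; #f

(list source-length literal-hit literal-miss)
```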
The regexp and pregexp syntaxes share a common core:
| ‹regexp› | ::= | ‹pces› |
| Match ‹pces› |
|
| | | ‹regexp›|‹regexp› |
| Match either ‹regexp›, try left first |
| ‹pces› | ::= | ‹pce› |
| Match ‹pce› |
|
| | | ‹pce›‹pces› |
| Match ‹pce› followed by ‹pces› |
| ‹pce› | ::= | ‹repeat› |
| Match ‹repeat›, longest possible |
|
| | | ‹repeat›? |
| Match ‹repeat›, shortest possible |
|
| | | ‹atom› |
| Match ‹atom› exactly once |
| ‹repeat› | ::= | ‹atom›* |
| Match ‹atom› 0 or more times |
|
| | | ‹atom›+ |
| Match ‹atom› 1 or more times |
|
| | | ‹atom›? |
| Match ‹atom› 0 or 1 times |
| ‹atom› | ::= | (‹regexp›) |
| Match sub-expression ‹regexp› and report |
|
| | | [‹rng›] |
| Match any character in ‹rng› |
|
| | | [^‹rng›] |
| Match any character not in ‹rng› |
|
| | | . |
| Match any character (except newline in multi mode) |
|
| | | ^ |
| Match start (or after newline in multi mode) |
|
| | | $ |
| Match end (or before newline in multi mode) |
|
| | | ‹literal› |
| Match a single literal character |
|
| | | (?‹mode›:‹regexp›) |
| Match ‹regexp› using ‹mode› |
|
| | | (?>‹regexp›) |
| Match ‹regexp›, only first possible |
|
| | | ‹look› |
| Match empty if ‹look› matches |
|
| | | (?‹tst›‹pces›|‹pces›) |
| Match 1st ‹pces› if ‹tst›, else 2nd ‹pces› |
|
| | | (?‹tst›‹pces›) |
| Match ‹pces› if ‹tst›, empty if not ‹tst› |
| ‹rng› | ::= | ] |
| ‹rng› contains ] only |
|
| | | - |
| ‹rng› contains - only |
|
| | | ‹mrng› |
| ‹rng› contains everything in ‹mrng› |
|
| | | ‹mrng›- |
| ‹rng› contains - and everything in ‹mrng› |
| ‹mrng› | ::= | ]‹lrng› |
| ‹mrng› contains ] and everything in ‹lrng› |
|
| | | -‹lrng› |
| ‹mrng› contains - and everything in ‹lrng› |
|
| | | ‹lrng› |
| ‹mrng› contains everything in ‹lrng› |
| ‹lrng› | ::= | ‹rliteral› |
| ‹lrng› contains a literal character |
|
| | | ‹rliteral›-‹rliteral› |
| ‹lrng› contains Unicode range inclusive |
|
| | | ‹lrng›‹lrng› |
| ‹lrng› contains everything in both |
| ‹look› | ::= | (?=‹regexp›) |
| Match if ‹regexp› matches |
|
| | | (?!‹regexp›) |
| Match if ‹regexp› doesn't match |
|
| | | (?<=‹regexp›) |
| Match if ‹regexp› matches preceding |
|
| | | (?<!‹regexp›) |
| Match if ‹regexp› doesn't match preceding |
| ‹tst› | ::= | (‹n›) |
| True if ‹n›th ( has a match |
|
| | | ‹look› |
| True if ‹look› matches |
| ‹mode› | ::= |
| Like the enclosing mode | |
|
| | | ‹mode›i |
| Like ‹mode›, but case-insensitive |
|
| | | ‹mode›-i |
| Like ‹mode›, but case-sensitive |
|
| | | ‹mode›s |
| Like ‹mode›, but not in multi mode |
|
| | | ‹mode›-s |
| Like ‹mode›, but in multi mode |
|
| | | ‹mode›m |
| Like ‹mode›, but in multi mode |
|
| | | ‹mode›-m |
| Like ‹mode›, but not in multi mode |
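A few productions of the common core can be exercised as follows. This is a small sketch, assuming a current Racket installation and the standard regexp-match and regexp-match-positions procedures; the variable names are illustrative only.

```racket
#lang racket/base

;; Alternation tries the left branch first.
(define alt (regexp-match #rx"cat|category" "category"))

;; Parenthesized sub-expressions are reported after the overall match.
(define groups (regexp-match #rx"(a+)(b+)" "xaabbb"))

;; A repeat prefers the longest match; a repeat followed by ? the shortest.
(define greedy (regexp-match #rx"<.*>" "<a><b>"))
(define lazy   (regexp-match #rx"<.*?>" "<a><b>"))

;; (?i:...) matches its body case-insensitively.
(define nocase (regexp-match #rx"(?i:hello)" "HeLLo"))

;; ^ matches after a newline only in multi mode.
(define single (regexp-match #rx"^b" "a\nb"))
(define multi  (regexp-match #rx"(?m:^b)" "a\nb"))

;; Lookahead and lookbehind constrain the context but consume nothing,
;; so only the position of the `a` itself is reported.
(define ahead  (regexp-match-positions #rx"a(?=b)" "acab"))
(define behind (regexp-match-positions #rx"(?<=x)a" "aaxa"))

(list alt groups greedy lazy nocase single multi ahead behind)
```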
The following completes the grammar for regexp, which treats { and } as literals, \ as a literal within ranges, and \ as a literal producer outside of ranges.
| ‹literal› | ::= | Any character except (, ), *, +, ?, [, ., ^, \, or | | ||
|
| | | \‹aliteral› |
| Match ‹aliteral› |
| ‹aliteral› | ::= | Any character | ||
| ‹rliteral› | ::= | Any character except ] or - | ||
The following completes the grammar for pregexp, which uses { and } bounded repetition and uses \ for meta-characters both inside and outside of ranges.
| ‹repeat› | ::= | ... |
| ... |
|
| | | ‹atom›{‹n›} |
| Match ‹atom› exactly ‹n› times |
|
| | | ‹atom›{‹n›,} |
| Match ‹atom› ‹n› or more times |
|
| | | ‹atom›{,‹m›} |
| Match ‹atom› between 0 and ‹m› times |
|
| | | ‹atom›{‹n›,‹m›} |
| Match ‹atom› between ‹n› and ‹m› times |
| ‹atom› | ::= | ... |
| ... |
|
| | | \‹n› |
| Match latest reported match for ‹n›th ( |
|
| | | ‹class› |
| Match any character in ‹class› |
|
| | | \b |
| Match \w* boundary |
|
| | | \B |
| Match where \b does not |
|
| | | \p{‹property›} |
| Match (UTF-8 encoded) in ‹property› |
|
| | | \P{‹property›} |
| Match (UTF-8 encoded) not in ‹property› |
| ‹literal› | ::= | Any character except (, ), *, +, ?, [, ], {, }, ., ^, \, or | | ||
|
| | | \‹aliteral› |
| Match ‹aliteral› |
| ‹aliteral› | ::= | Any character except a-z, A-Z, 0-9 | ||
| ‹lrng› | ::= | ... |
| ... |
|
| | | ‹class› |
| ‹lrng› contains all characters in ‹class› |
|
| | | ‹posix› |
| ‹lrng› contains all characters in ‹posix› |
|
| | | \‹eliteral› |
| ‹lrng› contains ‹eliteral› |
| ‹rliteral› | ::= | Any character except ], \, or - | ||
| ‹eliteral› | ::= | Any character except a-z, A-Z | ||
| ‹class› | ::= | \d |
| Contains 0-9 |
|
| | | \D |
| Contains ASCII other than those in \d |
|
| | | \w |
| Contains a-z, A-Z, 0-9, _ |
|
| | | \W |
| Contains ASCII other than those in \w |
|
| | | \s |
| Contains space, tab, newline, formfeed, return |
|
| | | \S |
| Contains ASCII other than those in \s |
| ‹posix› | ::= | [:alpha:] |
| Contains a-z, A-Z |
|
| | | [:alnum:] |
| Contains a-z, A-Z, 0-9 |
|
| | | [:ascii:] |
| Contains all ASCII characters |
|
| | | [:blank:] |
| Contains space and tab |
|
| | | [:cntrl:] |
| Contains all characters with scalar value < 32 |
|
| | | [:digit:] |
| Contains 0-9 |
|
| | | [:graph:] |
| Contains all ASCII characters that use ink |
|
| | | [:lower:] |
| Contains a-z |
|
| | | [:print:] |
| Contains space, tab, and ASCII ink users |
|
| | | [:space:] |
| Contains space, tab, newline, formfeed, return |
|
| | | [:upper:] |
| Contains A-Z |
|
| | | [:word:] |
| Contains a-z, A-Z, 0-9, _ |
|
| | | [:xdigit:] |
| Contains 0-9, a-f, A-F |
| ‹property› | ::= | ‹category› |
| Includes all characters in ‹category› |
|
| | | ^‹category› |
| Includes all characters not in ‹category› |
| ‹category› | ::= | Ll | Lu | Lt | Lm |
| Unicode general category |
|
| | | L& |
| Union of Ll, Lu, Lt, and Lm |
|
| | | Lo |
| Unicode general category |
|
| | | L |
| Union of L& and Lo |
|
| | | Nd | Nl | No |
| Unicode general category |
|
| | | N |
| Union of Nd, Nl, and No |
|
| | | Ps | Pe | Pi | Pf |
| Unicode general category |
|
| | | Pc | Pd | Po |
| Unicode general category |
|
| | | P |
| Union of Ps, Pe, Pi, Pf, Pc, Pd, and Po |
|
| | | Mn | Mc | Me |
| Unicode general category |
|
| | | M |
| Union of Mn, Mc, and Me |
|
| | | Sc | Sk | Sm | So |
| Unicode general category |
|
| | | S |
| Union of Sc, Sk, Sm, and So |
|
| | | Zl | Zp | Zs |
| Unicode general category |
|
| | | Z |
| Union of Zl, Zp, and Zs |
|
| | | . |
| Union of all general categories |
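The pregexp-only productions can be illustrated with #px constants. The following is a sketch under the assumption of a current Racket installation; the variable names are illustrative only.

```racket
#lang racket/base

;; Bounded repetition with { and }.
(define exactly  (regexp-match #px"a{2}" "aaaa"))
(define at-least (regexp-match #px"a{2,}" "aaaa"))
(define up-to    (regexp-match #px"a{,3}" "aaaa"))

;; \1 matches whatever the first parenthesized group last matched.
(define doubled (regexp-match #px"(\\w+) \\1" "hello hello world"))

;; \d is a class; \b matches only at a word boundary.
(define digits  (regexp-match #px"\\d+" "abc123def"))
(define word    (regexp-match #px"\\bcat\\b" "a cat sat"))
(define no-word (regexp-match #px"\\bcat\\b" "concatenate"))

;; A POSIX class is written inside a range.
(define letters (regexp-match #px"[[:alpha:]]+" "123abc456"))

;; Unicode general categories, plain and negated with ^.
(define upper     (regexp-match #px"\\p{Lu}" "abcDef"))
(define non-upper (regexp-match #px"\\p{^Lu}+" "abcDef"))

(list exactly at-least up-to doubled digits word no-word
      letters upper non-upper)
```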
3.7.2 Additional Syntactic Constraints
In addition to matching one of the grammars above, a regular expression must meet two syntactic restrictions:
In a ‹repeat› other than ‹atom›?, the ‹atom› must not match an empty sequence.
In a (?<=‹regexp›) or (?<!‹regexp›), the ‹regexp› must match only a bounded sequence.
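Both restrictions can be observed by attempting to compile patterns. The sketch below assumes a current Racket installation and relies on the constructor raising an exception for a rejected pattern; compiles? is a hypothetical helper introduced here, not part of the library.

```racket
#lang racket/base

;; Hypothetical helper: #t if `src` compiles as a pregexp, #f if the
;; constructor rejects it by raising an exception.
(define (compiles? src)
  (with-handlers ([exn:fail? (lambda (e) #f)])
    (pregexp src)
    #t))

(define results
  (list (compiles? "(a+)*")       ; #t: (a+) always consumes a character
        (compiles? "(a*)*")       ; #f: the repeated atom (a*) may be empty
        (compiles? "(a*)?")       ; #t: the atom? form is exempt
        (compiles? "(?<=ab)c")    ; #t: lookbehind matches exactly 2 characters
        (compiles? "(?<=a*)c")))  ; #f: lookbehind length is unbounded

results
```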
These constraints are checked syntactically by the following type system. A type [n, m] corresponds to an expression that matches between n and m characters. In the rule for (‹Regexp›), N means the number such that the opening parenthesis is the Nth opening parenthesis for collecting match reports. Non-emptiness is inferred for a backreference pattern, \‹N›, so that a backreference can be used for repetition patterns; in the case of mutual dependencies among backreferences, the inference chooses the fixpoint that maximizes non-emptiness. Finiteness is not inferred for backreferences (i.e., a backreference is assumed to match an arbitrarily large sequence).
3.7.3 Regexp Constructors
(regexp? v) → boolean? |
v : any/c |
Returns #t if v is a regexp value created by regexp or pregexp, #f otherwise.
(pregexp? v) → boolean? |
v : any/c |
Returns #t if v is a regexp value created by pregexp (not regexp), #f otherwise.
(byte-regexp? v) → boolean? |
v : any/c |
Returns #t if v is a regexp value created by byte-regexp or byte-pregexp, #f otherwise.
(byte-pregexp? v) → boolean? |
v : any/c |
Returns #t if v is a regexp value created by byte-pregexp (not byte-regexp), #f otherwise.
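The four predicates can be compared side by side. This is a small sketch assuming a current Racket installation; kinds is a hypothetical helper that applies regexp?, pregexp?, byte-regexp?, and byte-pregexp? in that order.

```racket
#lang racket/base

;; Hypothetical helper: the answers of all four predicates for `v`, in the
;; order regexp?, pregexp?, byte-regexp?, byte-pregexp?.
(define (kinds v)
  (list (regexp? v) (pregexp? v) (byte-regexp? v) (byte-pregexp? v)))

(define table
  (list (kinds (regexp "a"))          ; '(#t #f #f #f)
        (kinds (pregexp "a"))         ; '(#t #t #f #f)
        (kinds (byte-regexp #"a"))    ; '(#f #f #t #f)
        (kinds (byte-pregexp #"a"))   ; '(#f #f #t #t)
        (kinds "a")))                 ; '(#f #f #f #f): a string is not compiled

table
```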
(regexp str) → regexp? |
str : string? |
Takes a string representation of a regular expression (using the syntax in Regexp Syntax) and compiles it into a regexp value. Other regular expression procedures accept either a string or a regexp value as the matching pattern. If a regular expression string is used multiple times, it is faster to compile the string once to a regexp value and use it for repeated matches instead of using the string each time.
The object-name procedure returns the source string for a regexp value.
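Compiling once and reusing the result might look like the following sketch, which assumes a current Racket installation; apple-rx, apple-like?, and words are illustrative names only.

```racket
#lang racket/base

;; Compile the pattern once, outside the loop over inputs.
(define apple-rx (regexp "ap*le"))

;; Reuse the compiled value for every candidate string.
(define (apple-like? s)
  (and (regexp-match apple-rx s) #t))

(define words '("ale" "aple" "apple" "ample"))
(define hits (filter apple-like? words))

;; The source string remains available from the compiled value.
(define source (object-name apple-rx))

(list hits source)
```

Passing the string "ap*le" directly to regexp-match in apple-like? would give the same answers, at the cost of handling the pattern string on each call.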
Examples: |
> (regexp "ap*le") |
#rx"ap*le" |
> (object-name #rx"ap*le") |
"ap*le" |
(pregexp string) → pregexp? |
string : string? |
Like regexp, except that it uses a slightly different syntax (see Regexp Syntax). The result can be used with regexp-match, etc., just like the result from regexp.
Examples: |
> (pregexp "ap*le") |
#px"ap*le" |
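Because pregexp takes an ordinary string, a pattern can be assembled at run time. The sketch below assumes a current Racket installation; digits-of-length is a hypothetical helper that builds a bounded-repetition pattern, which is available only in the pregexp syntax.

```racket
#lang racket/base

;; Hypothetical helper: a pregexp matching a string of exactly `n` digits.
(define (digits-of-length n)
  (pregexp (string-append "^\\d{" (number->string n) "}$")))

(define three (digits-of-length 3))

(define hit  (regexp-match three "123"))    ; '("123")
(define miss (regexp-match three "1234"))   ; #f
(define px?  (pregexp? three))              ; #t

(list hit miss px?)
```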