IrRegular Expressions
| 
 | 
| \title{IrRegular Expressions} | 
| 
 | 
| \hyperlink[http://synthcode.com/]{Alex Shinn} | 
| \hyperlink[http://synthcode.com/scheme/irregex/irregex-0.9.9.tar.gz]{Download Version 0.9.9} | 
| 
 | 
| \centered{\italic{\pre{ | 
| At this moment there was a loud ring at the bell, and I could | 
| hear Mrs. Hudson, our landlady, raising her voice in a wail of | 
| expostulation and dismay. | 
| 
 | 
| "By heaven, Holmes," I said, half rising, "I believe that | 
| they are really after us." | 
| 
 | 
| "No, it's not quite so bad as that. It is the unofficial | 
| force, -- the Baker Street irregulars."}}} | 
| 
 | 
| A fully portable and efficient R[4567]RS implementation of regular | 
| expressions, supporting both POSIX syntax with various (irregular) | 
| PCRE extensions, as well as SCSH's SRE syntax, with various aliases | 
| for commonly used patterns. DFA matching is used when possible, | 
| otherwise a closure-compiled NFA approach is used. The library makes | 
| no assumptions about the encoding of strings or range of characters | 
| and can thus be used in Unicode-aware Scheme implementations. | 
| Matching may be performed over standard Scheme strings, or over | 
| arbitrarily chunked streams of strings. | 
| 
 | 
| \section{Installation} | 
| 
 | 
| Just | 
| 
 | 
| \scheme{ | 
| (load "irregex.scm") | 
| } | 
| 
 | 
| in your favorite Scheme implementation and you're good to go! | 
| 
 | 
| There is a global variable \scheme{*all-chars*} which is used for | 
| generating character set complements. This defaults to the full | 
| Unicode range 0..#x10FFFF, but if your implementation can't handle | 
| characters that large you'll need to adjust it (a suitable ASCII | 
| definition is commented out in the source). | 
| 
 | 
| If using an R7RS Schem you can use irregex.sld, or install | 
| \scheme{(chibi irregex)} from \url{http://snow-fort.org/}. | 
| 
 | 
| If you are using an R6RS Scheme, you can instead | 
| 
 | 
| \scheme{ | 
| (load "irregex-r6rs.scm") | 
| } | 
| 
 | 
| There are also a handful of utility procedures described below you may | 
| wish to use in irregex-utils.scm. | 
| 
 | 
| If you are using Chicken Scheme IrRegex is built in as a core unit, so | 
| no need to install it. To use it, you just need to \scheme{(use irregex)}. | 
| 
 | 
| \section{Specification} | 
| 
 | 
| \subsection{Procedures} | 
| 
 | 
| \subsubsection{(irregex <posix-string-or-sre> [<options> ...])} | 
| \subsubsection{(string->irregex <posix-string> [<options> ...])} | 
| \subsubsection{(sre->irregex <sre> [<options> ...])} | 
| 
 | 
| Compiles a regular expression from either a POSIX-style regular | 
| expression string (with most PCRE extensions) or an SCSH-style SRE. | 
| There is no \scheme{(rx ...)} syntax - just use normal Scheme lists, with | 
| \scheme{quasiquote} if you like. | 
| 
 | 
| Technically a string by itself could be considered a valid (though | 
| rather silly) SRE, so if you want to just match a literal string you | 
| should use something like \scheme{(irregex `(: ,str))}, or use the explicit | 
| \scheme{(sre->irregex str)}. | 
| 
 | 
| The options are a list of any of the following symbols: | 
| 
 | 
| \scheme{'i}, \scheme{'case-insensitive} - match case-insensitively | 
| 
 | 
| \scheme{'m}, \scheme{'multi-line} - treat string as multiple lines (effects ^ and $) | 
| 
 | 
| \scheme{'s}, \scheme{'single-line} - treat string as a single line (. can match newline) | 
| 
 | 
| \scheme{'utf8} - utf8-mode (assumes strings are byte-strings) | 
| 
 | 
| \scheme{'fast} - try to optimize the regular expression | 
| 
 | 
| \scheme{'small} - try to compile a smaller regular expression | 
| 
 | 
| \scheme{'backtrack} - enforce a backtracking implementation | 
| 
 | 
| The \scheme{'fast} and \scheme{'small} options are heuristic guidelines and will | 
| not necessarily make the compiled expression faster or smaller. | 
| 
 | 
| \subsubsection{(string->sre <str>)} | 
| \subsubsection{(maybe-string->sre <obj>)} | 
| 
 | 
| For backwards compatibility, procedures to convert a POSIX string into | 
| an SRE. | 
| 
 | 
| \scheme{maybe-string->sre} does the same thing, but only if the argument is | 
| a string, otherwise it assumes \scheme{<obj>} is an SRE and returns it | 
| as-is. This is useful when you want to provide an API that allows | 
| either a POSIX string or SRE (like \scheme{irregex} or \scheme{irregex-search} | 
| below) - it ensures the result is an SRE. | 
| 
 | 
| \subsubsection{(irregex? <obj>)} | 
| 
 | 
| Returns \scheme{#t} iff the object is a regular expression. | 
| 
 | 
| \subsubsection{(irregex-search <irx> <str> [<start> <end>])} | 
| 
 | 
| Searches for any instances of the pattern <irx> (a POSIX string, SRE | 
| sexp, or pre-compiled regular expression) in <str>, optionally between | 
| the given range. If a match is found, returns a match object, | 
| otherwise returns \scheme{#f}. | 
| 
 | 
| Match objects can be used to query the original range of the string or | 
| its submatches using the \scheme{irregex-match-*} procedures below. | 
| 
 | 
| Examples: | 
| 
 | 
| \scheme{(irregex-search "foobar" "abcFOOBARdef") => #f} | 
| 
 | 
| \scheme{(irregex-search (irregex "foobar" 'i) "abcFOOBARdef") => #<match>} | 
| 
 | 
| \scheme{(irregex-search '(w/nocase "foobar") "abcFOOBARdef") => #<match>} | 
| 
 | 
| Note, the actual match result is represented by a vector in the | 
| default implementation. Throughout this document, we'll just write | 
| \scheme{<match>} to show that a successful match was returned when the | 
| details are not important. | 
| 
 | 
| Matching follows the POSIX leftmost, longest semantics, when | 
| searching. That is, of all possible matches in the string, | 
| \scheme{irregex-search} will return the match at the first position | 
| (leftmost). If multiple matches are possible from that same first | 
| position, the longest match is returned. | 
| 
 | 
| \subsubsection{(irregex-match <irx> <str> [<start> <end>])} | 
| 
 | 
| Like \scheme{irregex-search}, but performs an anchored match against the | 
| beginning and end of the substring specified by <start> and <end>, | 
| without searching. | 
| 
 | 
| Examples: | 
| 
 | 
| \scheme{(irregex-match '(w/nocase "foobar") "abcFOOBARdef") => #f} | 
| 
 | 
| \scheme{(irregex-match '(w/nocase "foobar") "FOOBAR") => #<match>} | 
| 
 | 
| \subsubsection{(irregex-match-data? <obj>)} | 
| 
 | 
| Returns \scheme{#t} iff the object is a successful match result from | 
| \scheme{irregex-search} or \scheme{irregex-match}. | 
| 
 | 
| \subsubsection{(irregex-num-submatches <irx>)} | 
| \subsubsection{(irregex-match-num-submatches <match>)} | 
| 
 | 
| Returns the number of numbered submatches that are defined in the | 
| irregex or match object. | 
| 
 | 
| \subsubsection{(irregex-names <irx>)} | 
| \subsubsection{(irregex-match-names <match>)} | 
| 
 | 
| Returns an association list of named submatches that are defined in | 
| the irregex or match object. The \scheme{car} of each item in this list is | 
| the name of a submatch, the \scheme{cdr} of each item is the numerical | 
| submatch corresponding to this name. If a named submatch occurs | 
| multiple times in the irregex, it will also occur multiple times in | 
| this list. | 
| 
 | 
| \subsubsection{(irregex-match-valid-index? <match> <index-or-name>)} | 
| 
 | 
| Returns \scheme{#t} iff the \scheme{index-or-name} named submatch or index is | 
| defined in the \scheme{match} object. | 
| 
 | 
| \subsubsection{(irregex-match-substring <match> [<index-or-name>])} | 
| \subsubsection{(irregex-match-start-index <match> [<index-or-name>])} | 
| \subsubsection{(irregex-match-end-index <match> [<index-or-name>])} | 
| 
 | 
| Fetches the matched substring (or its start or end offset) at the | 
| given submatch index, or named submatch. The entire match is index 0, | 
| the first 1, etc. The default is index 0. | 
| 
 | 
| \subsubsection{(irregex-match-subchunk <match> [<index-or-name>])} | 
| 
 | 
| Generates a chunked data-type for the given match item, of the same | 
| type as the underlying chunk type (see Chunked String Matching below). | 
| This is only available if the chunk type specifies the get-subchunk | 
| API, otherwise an error is raised. | 
| 
 | 
| \subsubsection{(irregex-replace <irx> <str> [<replacements> ...])} | 
| \subsubsection{(irregex-replace/all <irx> <str> [<replacements> ...])} | 
| 
 | 
| Matches a pattern in a string, and replaces it with a (possibly empty) | 
| list of substitutions. Each \scheme{<replacement>} can be either a string | 
| literal, a numeric index, a symbol (as a named submatch), or a | 
| procedure which takes one argument (the match object) and returns a | 
| string. | 
| 
 | 
| Examples: | 
| 
 | 
| \scheme{(irregex-replace "[aeiou]" "hello world" "*") => "h*llo world"} | 
| 
 | 
| \scheme{(irregex-replace/all "[aeiou]" "hello world" "*") => "h*ll* w*rld"} | 
| 
 | 
| \scheme{(irregex-replace/all '(* "poo ") "poo poo platter" "*") => "**p*l*a*t*t*e*r"} | 
| 
 | 
| \subsubsection{(irregex-split <irx> <str> [<start> <end>])} | 
| \subsubsection{(irregex-extract <irx> <str> [<start> <end>])} | 
| 
 | 
| \scheme{irregex-split} splits the string \scheme{<str>} into substrings divided | 
| by the pattern in \scheme{<irx>}. \scheme{irregex-extract} does the opposite, | 
| returning a list of each instance of the pattern matched disregarding | 
| the substrings in between. | 
| 
 | 
| Empty matches will result in subsequent single character string in | 
| \scheme{irregex-split}, or empty strings in \scheme{irregex-extract}. | 
| 
 | 
| \scheme{(irregex-split "[aeiou]*" "foobarbaz") => '("f" "b" "r" "b" "z")} | 
| 
 | 
| \scheme{(irregex-extract "[aeiou]*" "foobarbaz") => '("" "oo" "" "a" "" "" "a" "")} | 
| 
 | 
| \subsubsection{(irregex-fold <irx> <kons> <knil> <str> [<finish> <start> <end>])} | 
| 
 | 
| This performs a fold operation over every non-overlapping place | 
| \scheme{<irx>} occurs in the string \scheme{str}. | 
| 
 | 
| The \scheme{<kons>} procedure takes the following signature: | 
| 
 | 
| \scheme{(<kons> <from-index> <match> <seed>)} | 
| 
 | 
| where \scheme{<from-index>} is the index from where we started searching | 
| (initially \scheme{<start>} and thereafter the end index of the last | 
| match), \scheme{<match>} is the resulting match-data object, and \scheme{<seed>} | 
| is the accumulated fold result starting with \scheme{<knil>}. | 
| 
 | 
| The rationale for providing the \scheme{<from-index>} (which is not | 
| provided in the SCSH \scheme{regexp-fold} utility), is because this | 
| information is useful (e.g. for extracting the unmatched portion of | 
| the string before the current match, as needed in | 
| \scheme{irregex-replace/all}), and not otherwise directly accessible. | 
| 
 | 
| Note when the pattern matches an empty string, to avoid an infinite | 
| loop we continue from one char after the end of the match (as opposed | 
| to the end in the normal case). The \scheme{<from-index>} passed to | 
| the subsequent \scheme{<kons>} or \scheme{<finish>} still refers to | 
| the original previous match end, however, so \scheme{irregex-split} | 
| and \scheme{irregex-replace/all}, etc. do the right thing. | 
| 
 | 
| The optional \scheme{<finish>} takes two arguments: | 
| 
 | 
| \scheme{(<finish> <from-index> <seed>)} | 
| 
 | 
| which similarly allows you to pick up the unmatched tail of the string, | 
| and defaults to just returning the \scheme{<seed>}. | 
| 
 | 
| \scheme{<start>} and \scheme{<end>} are numeric indices letting you specify the | 
| boundaries of the string on which you want to fold. | 
| 
 | 
| To extract all instances of a match out of a string, you can use | 
| 
 | 
| \schemeblock{ | 
| (map irregex-match-substring | 
| (irregex-fold <irx> | 
| (lambda (i m s) (cons m s)) | 
| '() | 
| <str> | 
| (lambda (i s) (reverse s))))} | 
| 
 | 
| Note if an empty match is found \scheme{<kons>} will be called on that | 
| empty string, and to avoid an infinite loop matching will resume at | 
| the next char. It is up to the programmer to do something sensible | 
| with the skipped char in this case. | 
| 
 | 
| \subsection{Extended SRE Syntax} | 
| 
 | 
| Irregex provides the first native implementation of SREs (Scheme | 
| Regular Expressions), and includes many extensions necessary both for | 
| minimal POSIX compatibility, as well as for modern extensions found in | 
| libraries such as PCRE. | 
| 
 | 
| The following table summarizes the SRE syntax, with detailed | 
| explanations following. | 
| 
 | 
| \pre{\scheme{ | 
| ;; basic patterns | 
| <string> ; literal string | 
| (seq <sre> ...) ; sequence | 
| (: <sre> ...) | 
| (or <sre> ...) ; alternation | 
| 
 | 
| ;; optional/multiple patterns | 
| (? <sre> ...) ; 0 or 1 matches | 
| (* <sre> ...) ; 0 or more matches | 
| (+ <sre> ...) ; 1 or more matches | 
| (= <n> <sre> ...) ; exactly <n> matches | 
| (>= <n> <sre> ...) ; <n> or more matches | 
| (** <from> <to> <sre> ...) ; <n> to <m> matches | 
| (?? <sre> ...) ; non-greedy (non-greedy) pattern: (0 or 1) | 
| (*? <sre> ...) ; non-greedy kleene star | 
| (**? <from> <to> <sre> ...) ; non-greedy range | 
| 
 | 
| ;; submatch patterns | 
| (submatch <sre> ...) ; numbered submatch | 
| ($ <sre> ...) | 
| (submatch-named <name> <sre> ...) ; named submatch | 
| (=> <name> <sre> ...) | 
| (backref <n-or-name>) ; match a previous submatch | 
| 
 | 
| ;; toggling case-sensitivity | 
| (w/case <sre> ...) ; enclosed <sre>s are case-sensitive | 
| (w/nocase <sre> ...) ; enclosed <sre>s are case-insensitive | 
| 
 | 
| ;; character sets | 
| <char> ; singleton char set | 
| (<string>) ; set of chars | 
| (or <cset-sre> ...) ; set union | 
| (~ <cset-sre> ...) ; set complement (i.e. [^...]) | 
| (- <cset-sre> ...) ; set difference | 
| (& <cset-sre> ...) ; set intersection | 
| (/ <range-spec> ...) ; pairs of chars as ranges | 
| 
 | 
| ;; named character sets | 
| any | 
| nonl | 
| ascii | 
| lower-case lower | 
| upper-case upper | 
| alphabetic alpha | 
| numeric num | 
| alphanumeric alphanum alnum | 
| punctuation punct | 
| graphic graph | 
| whitespace white space | 
| printing print | 
| control cntrl | 
| hex-digit xdigit | 
| 
 | 
| ;; assertions and conditionals | 
| bos eos ; beginning/end of string | 
| bol eol ; beginning/end of line | 
| bow eow ; beginning/end of word | 
| nwb ; non-word-boundary | 
| (look-ahead <sre> ...) ; zero-width look-ahead assertion | 
| (look-behind <sre> ...) ; zero-width look-behind assertion | 
| (neg-look-ahead <sre> ...) ; zero-width negative look-ahead assertion | 
| (neg-look-behind <sre> ...) ; zero-width negative look-behind assertion | 
| (atomic <sre> ...) ; for (?>...) independent patterns | 
| (if <test> <pass> [<fail>]) ; conditional patterns | 
| commit ; don't backtrack beyond this (i.e. cut) | 
| 
 | 
| ;; backwards compatibility | 
| (posix-string <string>) ; embed a POSIX string literal | 
| }} | 
| 
 | 
| \subsubsection{Basic SRE Patterns} | 
| 
 | 
| The simplest SRE is a literal string, which matches that string | 
| exactly. | 
| 
 | 
| \scheme{(irregex-search "needle" "hayneedlehay") => #<match>} | 
| 
 | 
| By default the match is case-sensitive, though you can control this | 
| either with the compiler flags or local overrides: | 
| 
 | 
| \scheme{(irregex-search "needle" "haynEEdlehay") => #f} | 
| 
 | 
| \scheme{(irregex-search (irregex "needle" 'i) "haynEEdlehay") => #<match>} | 
| 
 | 
| \scheme{(irregex-search '(w/nocase "needle") "haynEEdlehay") => #<match>} | 
| 
 | 
| You can use \scheme{w/case} to switch back to case-sensitivity inside a | 
| \scheme{w/nocase} or when the SRE was compiled with \scheme{'i}: | 
| 
 | 
| \scheme{(irregex-search '(w/nocase "SMALL" (w/case "BIG")) "smallBIGsmall") => #<match>} | 
| 
 | 
| \scheme{(irregex-search '(w/nocase "small" (w/case "big")) "smallBIGsmall") => #f} | 
| 
 | 
| \b{Important:} characters outside the ASCII range are only matched | 
| case insensitively if the host Scheme system natively supports UTF8 in | 
| strings. | 
| 
 | 
| Of course, literal strings by themselves aren't very interesting | 
| regular expressions, so we want to be able to compose them. The most | 
| basic way to do this is with the \scheme{seq} operator (or its abbreviation | 
| \scheme{:}), which matches one or more patterns consecutively: | 
| 
 | 
| \scheme{(irregex-search '(: "one" space "two" space "three") "one two three") => #<match>} | 
| 
 | 
| As you may have noticed above, the \scheme{w/case} and \scheme{w/nocase} | 
| operators allowed multiple SREs in a sequence - other operators that | 
| take any number of arguments (e.g. the repetition operators below) | 
| allow such implicit sequences. | 
| 
 | 
| To match any one of a set of patterns use the \scheme{or} alternation | 
| operator: | 
| 
 | 
| \scheme{(irregex-search '(or "eeney" "meeney" "miney") "meeney") => #<match>} | 
| 
 | 
| \scheme{(irregex-search '(or "eeney" "meeney" "miney") "moe") => #f} | 
| 
 | 
| \subsubsection{SRE Repetition Patterns} | 
| 
 | 
| There are also several ways to control the number of times a pattern | 
| is matched. The simplest of these is \scheme{?} which just optionally | 
| matches the pattern: | 
| 
 | 
| \scheme{(irregex-search '(: "match" (? "es") "!") "matches!") => #<match>} | 
| 
 | 
| \scheme{(irregex-search '(: "match" (? "es") "!") "match!") => #<match>} | 
| 
 | 
| \scheme{(irregex-search '(: "match" (? "es") "!") "matche!") => #f} | 
| 
 | 
| To optionally match any number of times, use \scheme{*}, the Kleene star: | 
| 
 | 
| \scheme{(irregex-search '(: "<" (* (~ #\\>)) ">") "<html>") => #<match>} | 
| 
 | 
| \scheme{(irregex-search '(: "<" (* (~ #\\>)) ">") "<>") => #<match>} | 
| 
 | 
| \scheme{(irregex-search '(: "<" (* (~ #\\>)) ">") "<html") => #f} | 
| 
 | 
| Often you want to match any number of times, but at least one time is | 
| required, and for that you use \scheme{+}: | 
| 
 | 
| \scheme{(irregex-search '(: "<" (+ (~ #\\>)) ">") "<html>") => #<match>} | 
| 
 | 
| \scheme{(irregex-search '(: "<" (+ (~ #\\>)) ">") "<a>") => #<match>} | 
| 
 | 
| \scheme{(irregex-search '(: "<" (+ (~ #\\>)) ">") "<>") => #f} | 
| 
 | 
| More generally, to match at least a given number of times, use \scheme{>=}: | 
| 
 | 
| \scheme{(irregex-search '(: "<" (>= 3 (~ #\\>)) ">") "<table>") => #<match>} | 
| 
 | 
| \scheme{(irregex-search '(: "<" (>= 3 (~ #\\>)) ">") "<pre>") => #<match>} | 
| 
 | 
| \scheme{(irregex-search '(: "<" (>= 3 (~ #\\>)) ">") "<tr>") => #f} | 
| 
 | 
| To match a specific number of times exactly, use \scheme{=}: | 
| 
 | 
| \scheme{(irregex-search '(: "<" (= 4 (~ #\\>)) ">") "<html>") => #<match>} | 
| 
 | 
| \scheme{(irregex-search '(: "<" (= 4 (~ #\\>)) ">") "<table>") => #f} | 
| 
 | 
| And finally, the most general form is \scheme{**} which specifies a range | 
| of times to match. All of the earlier forms are special cases of this. | 
| 
 | 
| \scheme{(irregex-search '(: (= 3 (** 1 3 numeric) ".") (** 1 3 numeric)) "192.168.1.10") => #<match>} | 
| 
 | 
| \scheme{(irregex-search '(: (= 3 (** 1 3 numeric) ".") (** 1 3 numeric)) "192.0168.1.10") => #f} | 
| 
 | 
| There are also so-called "non-greedy" variants of these repetition | 
| operators, by convention suffixed with an additional \scheme{?}. Since the | 
| normal repetition patterns can match any of the allotted repetition | 
| range, these operators will match a string if and only if the normal | 
| versions matched. However, when the endpoints of which submatch | 
| matched where are taken into account (specifically, all matches when | 
| using irregex-search since the endpoints of the match itself matter), | 
| the use of a non-greedy repetition can change the result. | 
| 
 | 
| So, whereas \scheme{?} can be thought to mean "match or don't match," | 
| \scheme{??} means "don't match or match." \scheme{*} typically consumes as much | 
| as possible, but \scheme{*?} tries first to match zero times, and only | 
| consumes one at a time if that fails. If you have a greedy operator | 
| followed by a non-greedy operator in the same pattern, they can | 
| produce surprisins results as they compete to make the match longer or | 
| shorter. If this seems confusing, that's because it is. Non-greedy | 
| repetitions are defined only in terms of the specific backtracking | 
| algorithm used to implement them, which for compatibility purposes | 
| always means the Perl algorithm. Thus, when using these patterns you | 
| force IrRegex to use a backtracking engine, and can't rely on | 
| efficient execution. | 
| 
 | 
| \subsubsection{SRE Character Sets} | 
| 
 | 
| Perhaps more common than matching specific strings is matching any of | 
| a set of characters. You can use the \scheme{or} alternation pattern on a | 
| list of single-character strings to simulate a character set, but this | 
| is too clumsy for everyday use so SRE syntax allows a number of | 
| shortcuts. | 
| 
 | 
| A single character matches that character literally, a trivial | 
| character class. More conveniently, a list holding a single element | 
| which is a string refers to the character set composed of every | 
| character in the string. | 
| 
 | 
| \scheme{(irregex-match '(* #\\-) "---") => #<match>} | 
| 
 | 
| \scheme{(irregex-match '(* #\\-) "-_-") => #f} | 
| 
 | 
| \scheme{(irregex-match '(* ("aeiou")) "oui") => #<match>} | 
| 
 | 
| \scheme{(irregex-match '(* ("aeiou")) "ouais") => #f} | 
| 
 | 
| Ranges are introduced with the \scheme{/} operator. Any strings or | 
| characters in the \scheme{/} are flattened and then taken in pairs to | 
| represent the start and end points, inclusive, of character ranges. | 
| 
 | 
| \scheme{(irregex-match '(* (/ "AZ09")) "R2D2") => #<match>} | 
| 
 | 
| \scheme{(irregex-match '(* (/ "AZ09")) "C-3PO") => #f} | 
| 
 | 
| In addition, a number of set algebra operations are provided. \scheme{or}, | 
| of course, has the same meaning, but when all the options are | 
| character sets it can be thought of as the set union operator. This | 
| is further extended by the \scheme{&} set intersection, \scheme{-} set | 
| difference, and \scheme{~} set complement operators. | 
| 
 | 
| \scheme{(irregex-match '(* (& (/ "az") (~ ("aeiou")))) "xyzzy") => #<match>} | 
| 
 | 
| \scheme{(irregex-match '(* (& (/ "az") (~ ("aeiou")))) "vowels") => #f} | 
| 
 | 
| \scheme{(irregex-match '(* (- (/ "az") ("aeiou"))) "xyzzy") => #<match>} | 
| 
 | 
| \scheme{(irregex-match '(* (- (/ "az") ("aeiou"))) "vowels") => #f} | 
| 
 | 
| \subsubsection{SRE Assertion Patterns} | 
| 
 | 
| There are a number of times it can be useful to assert something about | 
| the area around a pattern without explicitly making it part of the | 
| pattern. The most common cases are specifically anchoring some | 
| pattern to the beginning or end of a word or line or even the whole | 
| string. For example, to match on the end of a word: | 
| 
 | 
| \scheme{(irregex-search '(: "foo" eow) "foo") => #<match>} | 
| 
 | 
| \scheme{(irregex-search '(: "foo" eow) "foo!") => #<match>} | 
| 
 | 
| \scheme{(irregex-search '(: "foo" eow) "foof") => #f} | 
| 
 | 
| The \scheme{bow}, \scheme{bol}, \scheme{eol}, \scheme{bos} and \scheme{eos} work similarly. | 
| \scheme{nwb} asserts that you are not in a word-boundary - if replaced for | 
| \scheme{eow} in the above examples it would reverse all the results. | 
| 
 | 
| There is no \scheme{wb}, since you tend to know from context whether it | 
| would be the beginning or end of a word, but if you need it you can | 
| always use \scheme{(or bow eow)}. | 
| 
 | 
| Somewhat more generally, Perl introduced positive and negative | 
| look-ahead and look-behind patterns. Perl look-behind patterns are | 
| limited to a fixed length, however the IrRegex versions have no such | 
| limit. | 
| 
 | 
| \scheme{(irregex-search '(: "regular" (look-ahead " expression")) | 
| "regular expression") | 
| => #<match>} | 
| 
 | 
| The most general case, of course, would be an \scheme{and} pattern to | 
| complement the \scheme{or} pattern - all the patterns must match or the | 
| whole pattern fails. This may be provided in a future release, | 
| although it (and look-ahead and look-behind assertions) are unlikely | 
| to be compiled efficiently. | 
| 
 | 
| \subsubsection{SRE Utility Patterns} | 
| 
 | 
| The following utility regular expressions are also provided for common | 
| patterns that people are eternally reinventing. They are not | 
| necessarily the official patterns matching the RFC definitions of the | 
| given data, because of the way that such patterns tend to be used. | 
| There are three general usages for regexps: | 
| 
 | 
| \item*{searching} - search for a pattern matching a desired object in a larger text | 
| 
 | 
| \item*{validation} - determine whether an entire string matches a pattern | 
| 
 | 
| \item*{extraction} - given a string already known to be valid, extract certain fields from it as submatches | 
| 
 | 
| In some cases, but not always, these will overlap. When they are | 
| different, \scheme{irregex-search} will naturally always want the searching | 
| version, so IrRegex provides that version. | 
| 
 | 
| As an example where these might be different, consider a URL. If you | 
| want to match all the URLs in some arbitrary text, you probably want | 
| to exclude a period or comma at the tail end of a URL, since it's more | 
| likely being used as punctuation rather than part of the URL, despite | 
| the fact that it would be valid URL syntax. | 
| 
 | 
| Another problem with the RFC definitions is the standard itself may | 
| have become irrelevant. For example, the pattern IrRegex provides for | 
| email addresses doesn't match quoted local parts (e.g. "first | 
| last"@domain.com) because these are increasingly rare, and unsupported | 
| by enough software that it's better to discourage their use. | 
| Conversely, technically consecutive periods | 
| (e.g. first..last@domain.com) are not allowed in email addresses, but | 
| most email software does allow this, and in fact such addresses are | 
| quite common in Japan. | 
| 
 | 
| The current patterns provided are: | 
| 
 | 
| \pre{\scheme{ | 
| newline ; general newline pattern (crlf, cr, lf) | 
| integer ; an integer | 
| real ; a real number (including scientific) | 
| string ; a "quoted" string | 
| symbol ; an R5RS Scheme symbol | 
| ipv4-address ; a numeric decimal ipv4 address | 
| ipv6-address ; a numeric hexadecimal ipv6 address | 
| domain ; a domain name | 
| domain/common ; a domain ending in a common TLD like .com | 
| email ; an email address | 
| http-url ; a URL beginning with https?:// | 
| }} | 
| 
 | 
| Because of these issues the exact definitions of these patterns are | 
| subject to be changed, but will be documented clearly when they are | 
| finalized. More common patterns are also planned, but as what you | 
| want increases in complexity it's probably better to use a real | 
| parser. | 
| 
 | 
| \subsection{Supported PCRE Syntax} | 
| 
 | 
| Since the PCRE syntax is so overwhelming complex, it's easier to just | 
| list what we *don't* support for now. Refer to the | 
| \hyperlink[http://pcre.org/pcre.txt]{PCRE documentation} for details. You | 
| should be using the SRE syntax anyway! | 
| 
 | 
| Unicode character classes (\\P) are not supported, but will be | 
| in an upcoming release. \\C named characters are not supported. | 
| 
 | 
| Callbacks, subroutine patterns and recursive patterns are not | 
| supported. (*FOO) patterns are not supported and may never be. | 
| 
 | 
| \\G and \\K are not supported. | 
| 
 | 
| Octal character escapes are not supported because they are ambiguous | 
| with back-references - just use hex character escapes. | 
| 
 | 
| Other than that everything should work, including named submatches, | 
| zero-width assertions, conditional patterns, etc. | 
| 
 | 
| In addition, \\< and \\> act as beginning-of-word and end-of-word marks, | 
| respectively, as in Emacs regular expressions. | 
| 
 | 
| Also, two escapes are provided to embed SRE patterns inside PCRE | 
| strings, "\\'<sre>" and "(*'<sre>)". For example, to match a | 
| comma-delimited list of integers you could use | 
| 
 | 
| "\\\\'integer(,\\\\'integer)*" | 
| 
 | 
| and to match a URL in angle brackets you could use | 
| 
 | 
| "<('*http-url)>" | 
| 
 | 
| Note in the second example the enclosing "('*...)" syntax is needed | 
| because the Scheme reader would consider the closing ">" as part of | 
| the SRE symbol. | 
| 
 | 
| The following chart gives a quick reference from PCRE form to the SRE | 
| equivalent: | 
| 
 | 
| \schemeblock{ | 
| ;; basic syntax | 
| "^" ;; bos (or eos inside (?m: ...)) | 
| "$" ;; eos (or eos inside (?m: ...)) | 
| "." ;; nonl | 
| "a?" ;; (? a) | 
| "a*" ;; (* a) | 
| "a+" ;; (+ a) | 
| "a??" ;; (?? a) | 
| "a*?" ;; (*? a) | 
| "a+?" ;; (+? a) | 
| "a{n,m}" ;; (** n m a) | 
| 
 | 
| ;; grouping | 
| "(...)" ;; (submatch ...) | 
| "(?:...)" ;; (: ...) | 
| "(?i:...)" ;; (w/nocase ...) | 
| "(?-i:...)" ;; (w/case ...) | 
| "(?<name>...)" ;; (=> <name>...) | 
| 
 | 
| ;; character classes | 
| "[aeiou]" ;; ("aeiou") | 
| "[^aeiou]" ;; (~ "aeiou") | 
| "[a-z]" ;; (/ "az") or (/ "a" "z") | 
| "[[:alpha:]]" ;; alpha | 
| 
 | 
| ;; assertions | 
| "(?=...)" ;; (look-ahead ...) | 
| "(?!...)" ;; (neg-look-ahead ...) | 
| "(?<=...)" ;; (look-behind ...) | 
| "(?<!...)" ;; (neg-look-behind ...) | 
| "(?(test)pass|fail)" ;; (if test pass fail) | 
| "(*COMMIT)" ;; commit | 
| } | 
| 
 | 
| \subsection{Chunked String Matching} | 
| 
 | 
| It's often desirable to perform regular expression matching over | 
| sequences of characters not represented as a single string. The most | 
| obvious example is a text-buffer data structure, but you may also want | 
| to match over lists or trees of strings (i.e. ropes), over only | 
| certain ranges within a string, over an input port, etc. With | 
| existing regular expression libraries, the only way to accomplish this | 
| is by converting the abstract sequence into a freshly allocated | 
| string. This can be expensive, or even impossible if the object is a | 
| text-buffer opened onto a 500MB file. | 
| 
 | 
| IrRegex provides a chunked string API specifically for this purpose. | 
| You define a chunking API with | 
| 
 | 
| \subsubsection{(make-irregex-chunker <get-next> <get-string> [<get-start> <get-end> <get-substring> <get-subchunk>])} | 
| 
 | 
| where | 
| 
 | 
| \scheme{(<get-next> chunk) => } returns the next chunk, or \scheme{#f} if there are no more chunks | 
| 
 | 
| \scheme{(<get-string> chunk) => } a string source for the chunk | 
| 
 | 
| \scheme{(<get-start> chunk) => } the start index of the result of \scheme{<get-string>} (defaults to always 0) | 
| 
 | 
| \scheme{(<get-end> chunk) => } the end (exclusive) of the string (defaults to \scheme{string-length} of the source string) | 
| 
 | 
| \scheme{(<get-substring> cnk1 i cnk2 j) => } a substring for the range between the chunk \scheme{cnk1} starting at index \scheme{i} and ending at \scheme{cnk2} at index \scheme{j} | 
| 
 | 
| \scheme{(<get-subchunk> cnk1 i cnk2 j) => } as above but returns a new chunked data type instead of a string (optional) | 
| 
 | 
| There are two important constraints on the \scheme{<get-next>} procedure. | 
| It must return an \scheme{eq?} identical object when called multiple times | 
| on the same chunk, and it must not return a chunk with an empty string | 
| (start == end). This second constraint is for performance reasons - | 
| we push the work of possibly filtering empty chunks to the chunker | 
| since there are many chunk types for which empty strings aren't | 
| possible, and this work is thus not needed. Note that the initial | 
| chunk passed to match on is allowed to be empty. | 
| 
 | 
| \scheme{<get-substring>} is provided for possible performance improvements | 
| - without it a default is used. \scheme{<get-subchunk>} is optional - | 
| without it you may not use \scheme{irregex-match-subchunk} described above. | 
| 
 | 
| You can then match chunks of these types with the following | 
| procedures: | 
| 
 | 
| \subsubsection{(irregex-search/chunked <irx> <chunker> <chunk> [<start>])} | 
| \subsubsection{(irregex-match/chunked <irx> <chunker> <chunk> [<start>])} | 
| 
 | 
| These return normal match-data objects. | 
| 
 | 
| Example: | 
| 
 | 
| To match against a simple, flat list of strings use: | 
| 
 | 
| \schemeblock{ | 
| (define (rope->string rope1 start rope2 end) | 
| (if (eq? rope1 rope2) | 
| (substring (car rope1) start end) | 
| (let loop ((rope (cdr rope1)) | 
| (res (list (substring (car rope1) start)))) | 
| (if (eq? rope rope2) | 
| (string-concatenate-reverse ; from SRFI-13 | 
| (cons (substring (car rope) 0 end) res)) | 
| (loop (cdr rope) (cons (car rope) res)))))) | 
| 
 | 
| (define rope-chunker | 
| (make-irregex-chunker (lambda (x) (and (pair? (cdr x)) (cdr x))) | 
| car | 
| (lambda (x) 0) | 
| (lambda (x) (string-length (car x))) | 
| rope->string)) | 
| 
 | 
| (irregex-search/chunked <pat> rope-chunker <list-of-strings>) | 
| } | 
| 
 | 
| Here we are just using the default start, end and substring behaviors, | 
| so the above chunker could simply be defined as: | 
| 
 | 
| \schemeblock{ | 
| (define rope-chunker | 
| (make-irregex-chunker (lambda (x) (and (pair? (cdr x)) (cdr x))) car)) | 
| } | 
| 
 | 
| \subsubsection{(irregex-fold/chunked <irx> <kons> <knil> <chunker> <chunk> [<finish> [<start-index>]])} | 
| 
 | 
| Chunked version of \scheme{irregex-fold}. | 
| 
 | 
| \subsection{Utilities} | 
| 
 | 
| The following procedures are available in irregex-utils.scm. | 
| 
 | 
| \subsubsection{(irregex-quote <str>)} | 
| 
 | 
| Returns a new string with any special regular expression characters | 
| escaped, to match the original string literally in POSIX regular | 
| expressions. | 
| 
 | 
| \subsubsection{(irregex-opt <list-of-strings>)} | 
| 
 | 
| Returns an optimized SRE matching any of the literal strings | 
| in the list, like Emacs' \scheme{regexp-opt}. Note this optimization | 
| doesn't help when irregex is able to build a DFA. | 
| 
 | 
| \subsubsection{(sre->string <sre>)} | 
| 
 | 
| Convert an SRE to a PCRE-style regular expression string, if | 
| possible. | 
| 
 | 
| \section{Roadmap} | 
| 
 | 
| 0.6 - full PCRE support (DONE) | 
| 
 | 
| 0.7 - chunked string API (DONE) | 
| 
 | 
| 0.8 - utilities and API finalization (DONE) | 
| 
 | 
| 0.9 - refactoring, implementation-specific performance enhancements (DONE) | 
| 
 | 
| 1.0 - cleanup and better documentation | 
| 
 | 
| \section{License} | 
| 
 | 
| Copyright (c) 2005-2021 Alex Shinn | 
| All rights reserved. | 
| 
 | 
| Redistribution and use in source and binary forms, with or without | 
| modification, are permitted provided that the following conditions | 
| are met: | 
| 
 | 
| 1. Redistributions of source code must retain the above copyright | 
| notice, this list of conditions and the following disclaimer. | 
| 2. Redistributions in binary form must reproduce the above copyright | 
| notice, this list of conditions and the following disclaimer in the | 
| documentation and/or other materials provided with the distribution. | 
| 3. The name of the author may not be used to endorse or promote products | 
| derived from this software without specific prior written permission. | 
| 
 | 
| THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR | 
| IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES | 
| OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. | 
| IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY DIRECT, INDIRECT, | 
| INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT | 
| NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, | 
| DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY | 
| THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT | 
| (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF | 
| THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. | 
| 
 | 
| \section{References} | 
| 
 | 
| \bibitem{R5RS} R. Kelsey, W. Clinger, J. Rees (eds.) | 
| \hyperlink[http://www.schemers.org/Documents/Standards/R5RS/]{Revised^5 Report on the Algorithmic Language Scheme} | 
| 
 | 
| \bibitem{ImplementingRegexps} Russ Cox | 
| \hyperlink[http://swtch.com/~rsc/regexp/]{Implementing Regular Expressions} | 
| 
 | 
| \bibitem{Tcl} Russ Cox | 
| \hyperlink[http://compilers.iecc.com/comparch/article/07-10-026]{Henry Spencer's Tcl Regex Library} | 
| 
 | 
| \bibitem{SRE} Olin Shivers | 
| \hyperlink[http://www.scsh.net/docu/post/sre.html]{Proposed SRE regular-expression notation} | 
| 
 | 
| \bibitem{SCSH} Olin Shivers | 
| \hyperlink[http://www.scsh.net/docu/html/man-Z-H-7.html]{Pattern-matching strings with regular expressions} | 
| 
 | 
| \bibitem{Gauche} Shiro Kawai | 
| \hyperlink[http://practical-scheme.net/gauche/man/gauche-refe_49.html]{Gauche Scheme - Regular Expressions} | 
| 
 | 
| \bibitem{Perl6} Damian Conway | 
| \hyperlink[http://www.perl.com/pub/a/2002/08/22/exegesis5.html]{Perl6 Exegesis 5 - Regular Expressions} | 
| 
 | 
| \bibitem{PCRE} Philip Hazel | 
| \hyperlink[http://www.pcre.org/]{PCRE - Perl Compatible Regular Expressions} | 
| 
 | 
| \bibitem{tNFAs} Ville Laurikari | 
| \hyperlink[http://laurikari.net/ville/spire2000-tnfa.pdf]{NFAs with Tagged Transitions, their Conversion to Deterministic Automata and Application to Regular Expressions} | 
| 
 | 
| \bibitem{RegexpSubmatches} Ville Laurikari | 
| \hyperlink[http://laurikari.net/ville/regex-submatch.pdf]{Efficient submatch addressing for regular expressions} | 
| 
 |