1 Lexers for Token Parsing
import: parser/lex | package: rhombus-parser-lib
A lexer tokenizes an input stream using regular expressions. The lexer form creates a lexer based on a set of regular-expression lexer patterns and an action for each pattern to produce its result.
The lexer pattern language of lexer resembles the pattern language of rx, but it differs in some ways because the matching engine implementing lexer is tuned for tokenization: it finds the longest match among a set of regular expressions, and it uses a different algorithm than rx internally to find that match. Still, the same syntax is used as much as possible, and the rx_charset space is used directly to express character sets.
If reading from input_port immediately produces Port.eof, then the body whose trigger is ~eof is evaluated. If there is no such trigger among the lexer clauses, Port.eof is returned. The port’s Port.eof is consumed. At most one trigger can be ~eof.
If reading some number of characters from input_port matches a trigger as pat, and if it is either the longest such match or the first among pats that match the same input, then the corresponding body result is produced. If reading from input_port produces an immediate end-of-file, no empty-string pat matches are attempted, and a ~eof clause (if any) is used instead. The matched characters from input_port are consumed, and only those characters are consumed.
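The matching rules above can be sketched with a small lexer. This is an illustrative example, not taken from the library's documentation: the clause results (#'keyword, #'word, #'space) are made-up names, and the import line assumes that parser/lex is opened so that lexer can be used without a prefix.

```rhombus
import:
  parser/lex open  // assumption: opened so `lexer` is unprefixed

def lex:
  lexer
  | "if": #'keyword                // listed first, so it wins a tie
  | ["a"-"z"]+: [#'word, lexeme]   // wins when it matches more input
  | " "+: #'space
  | ~eof: #'done

def i = Port.Input.open_string("if iffy")
lex(i)  // "if" matches both clauses with equal length; expected #'keyword
lex(i)  // expected #'space
lex(i)  // "iffy" is longer than "if"; expected [#'word, "iffy"]
lex(i)  // end of input; expected #'done
```

Listing the literal "if" before the general word pattern is what lets it win for the input "if", while the longest-match rule keeps "iffy" from being split into a keyword followed by a word.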
If reading some number of characters from input_port matches no trigger, then an exception is thrown after an unspecified number of characters are consumed.
If any pat as a trigger could match an empty string, a syntax error is reported, unless the ~allow_empty option is present.
A body sequence can use lexeme to refer to the matched string (except in an ~eof clause), and it can use line, column, position, span, srcloc, line_span, and/or end_column for location information relative to input_port for the read lexeme; note that line and column counting needs to have been enabled for the port via Port.locations_enabled. The input_port form can be used to refer to the input_port provided to the lexer, which is useful for recursive parsing using another lexer or some other parser.
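As a rough sketch of these body forms, the following lexer reports lexeme, position, and span for each word and uses input_port to hand the rest of the input to a second lexer after skipping spaces. The helper lexer word is a hypothetical name introduced for illustration, and the import line assumes that parser/lex is opened.

```rhombus
import:
  parser/lex open  // assumption: opened so `lexer` is unprefixed

// Hypothetical helper lexer: reads one lowercase word.
def word:
  lexer
  | ["a"-"z"]+: [lexeme, position, span]
  | ~eof: #'done

def lex:
  lexer
  | " "+: word(input_port)   // skip spaces, then continue on the same port
  | ["a"-"z"]+: [lexeme, position, span]
  | ~eof: #'done

def i = Port.Input.open_string("ab cd")
lex(i)  // expected ["ab", 1, 2]
lex(i)  // spaces are skipped by the first clause; expected ["cd", 4, 2]
lex(i)  // expected #'done
```

Only position and span are used here because, as described above, line and column information depends on location counting being enabled for the port.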
A pat is a lexer pattern whose operators are bound in the lex_pattern space. A literal string matches the same string as input (via #%literal), operators like * and + support repetitions, character sets are supported in [] (via #%brackets), adjacent patterns are treated as concatenation (via #%juxtapose), and so on. Reusable shorthands and new pattern forms can be defined with lex_pattern.macro.
> def lex:
| ~eof: #'done
> def i = Port.Input.open_string("heLlo")
> lex(i)
["heLlo", 1, 5]
> lex(i)
#'done
lexeme :: String: the matched input string, not allowed in a ~eof clause.
line :: maybe(PosInt): the line number within the input for the start of lexeme, assuming that the input port has location counting enabled.
column :: maybe(NonnegInt): the column within the input for the start of lexeme, assuming that the input port has location counting enabled.
position :: maybe(PosInt): the position within the input for the start of lexeme, which is normally available even without location counting enabled (in which case it’s a byte count, instead of a character count).
span :: maybe(NonnegInt): the length of lexeme in characters or bytes, depending on whether location counting is enabled.
line_span :: maybe(NonnegInt): difference between the ending and starting lines for lexeme if location counting is enabled.
end_column :: maybe(NonnegInt): the column for the end of lexeme if location counting is enabled.
input_port :: Port.Input: the input port provided to the lexer for the current call.
> def lex:
    lexer
    | "hello": #'hi
    | "bye": #'bye
> def i = Port.Input.open_string("hellobye")
> lex(i)
#'hi
> lex(i)
#'bye
> def lex:
| "hel" "lo": #'hi
> def i = Port.Input.open_string("hellobye")
> lex(i)
#'hi
> lex(i)
#'bye
> def lex:
> def i = Port.Input.open_string("hellobye")
> lex(i)
"hello"
> lex(i)
"bye"
See Regexp Character Sets for character set forms that can be used in charset.
> def lex:
> def i = Port.Input.open_string("amB")
> lex(i)
#'alpha
> lex(i)
#'alpha
> lex(i)
lexer: No match found in input starting with: B
*: 0 or more
+: 1 or more
?: 0 or 1
{count}: exactly count
{min ..}: min or more
{min ..= max}: between min and max (inclusive)
> def lex:
| "z" "c"{3}: ["3", lexeme]
> def i = Port.Input.open_string("xaaxybyzcccccczccczc")
> lex(i)
["+", "xaa"]
> lex(i)
["*", "x"]
> lex(i)
["?", "yb"]
> lex(i)
["?", "y"]
> lex(i)
["4+", "zcccccc"]
> lex(i)
["3", "zccc"]
> lex(i)
["1-2", "zc"]
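The counted repetition forms can also be sketched in isolation. This is a minimal illustration that assumes parser/lex is opened by the import; the result symbols are made-up names, and the expected results follow from the longest-match rule rather than from a recorded run.

```rhombus
import:
  parser/lex open  // assumption: opened so `lexer` is unprefixed

def lex:
  lexer
  | "a"{3}: [#'exactly_three, lexeme]
  | "b"{2 ..}: [#'two_or_more, lexeme]
  | "c"{1 ..= 2}: [#'one_or_two, lexeme]
  | ~eof: #'done

def i = Port.Input.open_string("aaabbbbccc")
lex(i)  // expected [#'exactly_three, "aaa"]
lex(i)  // expected [#'two_or_more, "bbbb"]
lex(i)  // at most two "c"s per match; expected [#'one_or_two, "cc"]
lex(i)  // expected [#'one_or_two, "c"]
lex(i)  // expected #'done
```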
> def lex:
> def i = Port.Input.open_string("aBC\n0_z!")
> lex(i)
"aBC\n0_z"
> lex_pattern.macro 'octal':
'["0"-"7"]'
> def lex:
    lexer
    | "0" octal+: [lexeme, String.to_int(lexeme, ~radix: 8)]
> def i = Port.Input.open_string("04448")
> lex(i)
["0444", 292]
A compile-time value that identifies the same space as lex_pattern. See also SpaceMeta.
Analogous to expr_meta.Parsed, etc., but for lexer patterns.