8.18

1 Lexers for Token Parsing

 import: parser/lex package: rhombus-parser-lib

A lexer tokenizes an input stream using regular expressions. The lexer form creates a lexer based on a set of regular-expression lexer patterns and an action for each pattern to produce its result.

The pattern language of lexer resembles the pattern language of rx, but it differs in some ways, because the matching engine implementing lexer is tuned for tokenization: it finds the longest match among a set of regular expressions, and it uses a different algorithm than rx internally to find that match. Still, the same syntax is used as much as possible, and the rx_charset space is used directly to express character sets.

expression

lexer maybe_option
| trigger:
    body
    ...
| ...

trigger = pat
        | ~eof

maybe_option = ϵ
             | : ~allow_empty

Returns a lexer as a function Port.Input -> Any that takes an input_port and returns the result of one of the body sequences:

  • If reading from input_port immediately produces Port.eof, then the body whose trigger is ~eof is evaluated. If there is no such trigger among the lexer clauses, Port.eof is returned. The port’s Port.eof is consumed. At most one trigger can be ~eof.

  • If reading some number of characters from input_port matches a trigger as pat, and if it is either the longest such match or the first among pats that match the same input, then the corresponding body result is produced. If reading from input_port produces an immediate end-of-file, no empty-string pat matches are attempted, and a ~eof clause (if any) is used, instead. The matched characters from input_port are consumed, and only those characters are consumed.

  • If reading some number of characters from input_port matches no trigger, then an exception is thrown after an unspecified number of characters are consumed.

If any pat as a trigger could match an empty string, a syntax error is reported, unless the ~allow_empty option is present.

A body sequence can use lexeme to refer to the matched string (except in an ~eof clause), and it can use line, column, position, span, srcloc, line_span, and/or end_column for location information relative to input_port for the read lexeme; note that line and column counting need to have been enabled for the port via Port.locations_enabled. The input_port form can be used to refer to the input_port provided to the lexer, which is useful for recursive parsing using another lexer or some other parser.

A pat is a lexer pattern whose operators are bound in the lex_pattern space. A literal string matches the same string as input (via #%literal), operators like * and + support repetitions, character sets are supported in [] (via #%brackets), adjacent patterns are treated as concatenation (via #%juxtapose), and so on. Reusable shorthands and new pattern forms can be defined with lex_pattern.macro.

> def lex:

    lexer

    | "he" ["l" "L"]+ "o": [lexeme, position, span]

    | ~eof: #'done

> def i = Port.Input.open_string("heLlo")

> lex(i)

["heLlo", 1, 5]

> lex(i)

#'done
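The matching rules above (longest match first, then clause order to break ties, then ~eof at end of input) can be sketched with a small keyword-versus-identifier lexer. The token names #'keyword, #'ident, and #'space are hypothetical, and the import line assumes the module is opened so that lexer is available unprefixed, as in the examples on this page.

```rhombus
import:
  parser/lex open

// Hypothetical token names; clause order matters only when matches tie in length.
def lex:
  lexer
  | "if": #'keyword
  | ["a"-"z"]+: [#'ident, lexeme]
  | " "+: #'space
  | ~eof: #'done

def i = Port.Input.open_string("if iffy")

lex(i)  // #'keyword: both patterns match "if", so the earlier clause wins the tie
lex(i)  // #'space
lex(i)  // [#'ident, "iffy"]: the longer match beats the "if" clause
lex(i)  // #'done
```

Placing keyword clauses before the general identifier clause is a common way to rely on the tie-breaking rule without changing the identifier pattern.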

expression

lexeme

 

expression

line

 

expression

column

 

expression

position

 

expression

span

 

expression

srcloc

 

expression

line_span

 

expression

end_column

 

expression

input_port

These identifiers are for use within a lexer clause’s body sequence, and using them elsewhere is a syntax error. They provide information about the match that reached the body sequence of a lexer clause.
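As a minimal sketch of recursive lexing with input_port, one lexer can hand the same port to a second lexer once it has seen an opening delimiter. The names string_rest and lex and the token names are hypothetical, and the sketch assumes that the matched opening quote has already been consumed when the body runs, as the description of lexer suggests.

```rhombus
import:
  parser/lex open

// Hypothetical helper: reads lowercase letters up to and including a closing quote.
def string_rest:
  lexer
  | ["a"-"z"]* "\"": lexeme

// On an opening quote, continue on the same port with the helper lexer.
def lex:
  lexer
  | "\"": [#'string, string_rest(input_port)]
  | ["a"-"z"]+: [#'word, lexeme]

def i = Port.Input.open_string("\"abc\"def")

lex(i)  // delegates to string_rest for the rest of the quoted text
lex(i)  // resumes with whatever string_rest left unread
```

The intent is that the outer lexer stays small while each nested syntax gets its own set of clauses.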

The lex_pattern space is the space for lexer pattern operators that can be used within lexer forms.

lex pattern operator

#%literal string

 

~stronger_than: ~other

A literal string as a pattern matches the string’s characters literally.

> def lex:

    lexer

    | "hello": #'hi

    | "bye": #'bye

> def i = Port.Input.open_string("hellobye")

> lex(i)

#'hi

> lex(i)

#'bye

lex pattern operator

pat #%juxtapose pat

 

lex pattern operator

pat ++ pat

 

lex pattern operator

pat #%call (pat)

 

~order: rx_concatenation

Patterns that are adjacent in a larger pattern match in sequence. The ++ operator can be used to make sequencing explicit. An implicit #%call form is treated like #%juxtapose, consistent with implicit uses of parentheses for grouping as handled by #%parens.

> def lex:

    lexer

    | "hel" "lo": #'hi

    | "b" ++ "y" ("e"): #'bye

> def i = Port.Input.open_string("hellobye")

> lex(i)

#'hi

> lex(i)

#'bye

lex pattern operator

pat || pat

 

~order: rx_disjunction

Matches the union of character sequences matched by the first pat and second pat.

> def lex:

    lexer

    | "hello" || "bye": lexeme

> def i = Port.Input.open_string("hellobye")

> lex(i)

"hello"

> lex(i)

"bye"

lex pattern operator

pat && pat

 

~order: rx_conjunction

Matches the intersection of character sequences matched by the first pat and second pat.

lex pattern operator

pat - pat

 

~order: rx_subtraction

Matches the character sequences matched by the first pat that are not also matched by the second pat.

lex pattern operator

! pat

Matches the character sequences that are not matched by pat.
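A short sketch of pattern subtraction, assuming the set-difference reading given above: the identifier clause excludes exactly the string "if", so the keyword clause can come second. The token names are hypothetical, and the explicit parentheses avoid relying on the relative precedence of + and -.

```rhombus
import:
  parser/lex open

// The first clause matches lowercase words other than exactly "if".
def lex:
  lexer
  | (["a"-"z"]+) - "if": [#'ident, lexeme]
  | "if": #'keyword

lex(Port.Input.open_string("iffy"))  // [#'ident, "iffy"]
lex(Port.Input.open_string("if"))    // #'keyword
```

For the input "if", the first clause can still match the shorter prefix "i", but the longest-match rule selects the two-character keyword match.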

lex pattern operator

#%parens (pat)

 

~order: rx_concatenation

A parenthesized pattern is equivalent to the pat inside the parentheses.

lex pattern operator

#%brackets [charset]

 

lex pattern operator

pat #%index [charset]

 

~order: rx_concatenation

A [] pattern, which is an implicit use of #%brackets, matches a single character, where charset determines the matching characters or bytes. An implicit #%index form is treated as a sequence of a pat and #%brackets.

See Regexp Character Sets for character set forms that can be used in charset.

> def lex:

    lexer

    | ["a"-"z"]: #'alpha

> def i = Port.Input.open_string("amB")

> lex(i)

#'alpha

> lex(i)

#'alpha

> lex(i)

lexer: No match found in input starting with: B

lex pattern operator

pat *

 

lex pattern operator

pat +

 

lex pattern operator

pat ?

 

lex pattern operator

pat #%comp {count}

 

lex pattern operator

pat #%comp {min ..}

 

lex pattern operator

pat #%comp {min ..= max}

 

~order: rx_repetition

Matches a sequence of matches to pat:

  • *: 0 or more

  • +: 1 or more

  • ?: 0 or 1

  • {count}: exactly count

  • {min ..}: min or more

  • {min ..= max}: between min and max (inclusive)

> def lex:

    lexer

    | "x" "a"+: ["+", lexeme]

    | "x" "a"*: ["*", lexeme]

    | "y" "b"?: ["?", lexeme]

    | "z" "c"{3}: ["3", lexeme]

    | "z" "c"{1 ..= 2}: ["1-2", lexeme]

    | "z" "c"{4 ..}: ["4+", lexeme]

> def i = Port.Input.open_string("xaaxybyzcccccczccczc")

> lex(i)

["+", "xaa"]

> lex(i)

["*", "x"]

> lex(i)

["?", "yb"]

> lex(i)

["?", "y"]

> lex(i)

["4+", "zcccccc"]

> lex(i)

["3", "zccc"]

> lex(i)

["1-2", "zc"]

lex pattern operator

..

 

lex pattern operator

..=

Only allowed within a {} repetition form.

lex pattern operator

any

Matches a single character.

> def lex:

    lexer

    | "a" any* "z": lexeme

> def i = Port.Input.open_string("aBC\n0_z!")

> lex(i)

"aBC\n0_z"

lex pattern operator

alpha

 

lex pattern operator

upper

 

lex pattern operator

lower

 

lex pattern operator

digit

 

lex pattern operator

xdigit

 

lex pattern operator

alnum

 

lex pattern operator

word

 

lex pattern operator

blank

 

lex pattern operator

newline

 

lex pattern operator

space

 

lex pattern operator

graph

 

lex pattern operator

print

 

lex pattern operator

cntrl

 

lex pattern operator

ascii

 

lex pattern operator

latin1

 

lex pattern operator

unicode.Ll

 

lex pattern operator

unicode.Lu

 

lex pattern operator

unicode.Lt

 

lex pattern operator

unicode.Lm

 

lex pattern operator

unicode.Lx

 

lex pattern operator

unicode.Lo

 

lex pattern operator

unicode.L

 

lex pattern operator

unicode.Nd

 

lex pattern operator

unicode.Nl

 

lex pattern operator

unicode.No

 

lex pattern operator

unicode.N

 

lex pattern operator

unicode.Ps

 

lex pattern operator

unicode.Pe

 

lex pattern operator

unicode.Pi

 

lex pattern operator

unicode.Pf

 

lex pattern operator

unicode.Pc

 

lex pattern operator

unicode.Pd

 

lex pattern operator

unicode.Po

 

lex pattern operator

unicode.P

 

lex pattern operator

unicode.Mn

 

lex pattern operator

unicode.Mc

 

lex pattern operator

unicode.Me

 

lex pattern operator

unicode.M

 

lex pattern operator

unicode.Sc

 

lex pattern operator

unicode.Sk

 

lex pattern operator

unicode.Sm

 

lex pattern operator

unicode.So

 

lex pattern operator

unicode.S

 

lex pattern operator

unicode.Zl

 

lex pattern operator

unicode.Zp

 

lex pattern operator

unicode.Zs

 

lex pattern operator

unicode.Z

 

lex pattern operator

unicode.Cc

 

lex pattern operator

unicode.Cf

 

lex pattern operator

unicode.Cs

 

lex pattern operator

unicode.Cn

 

lex pattern operator

unicode.Co

 

lex pattern operator

unicode.C

Each of these names is bound both as a character set and as a pattern that can be used directly, instead of wrapping in []. See the alpha, etc., character set for more information.
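As a sketch of using these names directly as patterns, the following lexer splits input into identifiers, numbers, and whitespace without any [] wrappers. The token names are hypothetical, and the import line assumes the module is opened as in the other examples.

```rhombus
import:
  parser/lex open

// Named character classes used directly as patterns.
def lex:
  lexer
  | alpha alnum*: [#'id, lexeme]
  | digit+: [#'num, lexeme]
  | space+: #'ws
  | ~eof: #'done

def i = Port.Input.open_string("x1 42")

lex(i)  // [#'id, "x1"]
lex(i)  // #'ws
lex(i)  // [#'num, "42"]
lex(i)  // #'done
```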

> lex_pattern.macro 'octal':

    '["0"-"7"]'

> def lex:

    lexer

    | "0" octal+: [lexeme, String.to_int(lexeme, ~radix: 8)]

> def i = Port.Input.open_string("04448")

> lex(i)

["0444", 292]

lex_pattern_meta.space is provided as meta.

It is a compile-time value that identifies the same space as lex_pattern. See also SpaceMeta.

lex_pattern.macro is like expr.macro, but defines a new lexer pattern operator.

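lex_pattern_meta.Parsed, lex_pattern_meta.AfterPrefixParsed, and lex_pattern_meta.AfterInfixParsed are provided as meta.

They are analogous to expr_meta.Parsed, etc., but for lexer patterns.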