36 Versatile parser and template library

9.2

Racket

36 Versatile parser and template library🔗ℹ

Keywords: grammar, parser, template.

syntax
(parse expr [pattern body …] …)

Analogous to syntax-parse, except it isn’t specialized for syntax, but rather works for arbitrary s-expressions, including syntax ones (denoted by #'(…) in the pattern).

syntax
(tmpl template)

template = variable
| [variable : type]
| (template . template)
| (template :: template)
| (template**)
| template**-dotted
| #(template**)
| (vector . template**-dotted)
| (hash . template**-dotted)
| (hasheq . template**-dotted)
| (hasheqv . template**-dotted)
| #hash((template . template))
| #hasheq((template . template))
| #hasheqv((template . template))
| (struct-id template …)
| (struct struct-id template …)
| #s(prefab-id template …)
| #s(template template …)
| #&template
| (~identifier args …)
| (~ expr args …)
| (unquoteexpr)
| (unquote-splicing(list expr))
| (unquote-splicing(list* expr))
| template-expander-id
| (template-expander-id args …)
| (?? alt-template …)
| (?@ . template**-dotted)
| (??@ . template**-dotted)
| (?if condition template template)
| (@if condition template template)
| (if@ condition template template)
| (@cond [condition template] …)
| (@cond [condition template] … [else template])
| (cond@ condition template template)
|
(~let ([meta-var+args template])
. template**-dotted)
| (~sort key template ooo)
| (~loc stxloc . template)
| (~each template template)
| (ddd escaped)
| #t
| #f
| string
| bytes
| number
| char
| keyword
| regexp
| pregexp

meta-var+args = meta-var
| (meta-var meta-arg …)

tail-template = template

variable = identifier

template**-dotted = (template* … . template)
| template**

template** = template* …
| template* … :: template
| template* … (~rest . template)

template* = template
| template ooo
| special-cased-template

special-cased-template = template vardd
| template ddvar

vardd = var..
| var...

ddvar = ..var
| ...var

ooo = ...
| ___
| ..k
| __k
| .. expr
| __ expr

ddd = ...

TODO: implement the versatile template library. ...

TODO: support for typed/racket.

TODO: optimization feature: would it be useful if the expanded code could be optimized? For example, when looking at the output of syntax-parse, the code is far from being concise.

The patterns for parse should all have a way to create a symmetric counterpart for tmpl, which produces the original value. This symmetry is important because it allows lens-like macros, which operate on only a part of the data structure, leaving everything else intact.

?? works like ?? from syntax/parse/experimental/template, except it allows any number of alternatives (including 0, to avoid special-casing in macros). It is more or less equivalent to (?? a (?? b (?? c …))), following syntax/parse’s semantics.

?@ has the same meaning as in syntax/parse.

(??@ t* …) is a shortcut for (?? (?@ t* …))

For better compatibility with at-exp, @if can be written if@, and the same goes for @cond etc.

TODO: what’s the difference between ~, template-expander and unquote? template-expander runs at compile-time and should treat its arguments as syntax.

Concerning unquoting, unlike racket’s default behaviour in #'([x #,(y …)] …), unquoting should not break the nesting of ellipses. How should we express voluntary variation of the level of nesting? ~let already allows expanding part of the template at some level and inserting it verbatim somewhere below, but it’s not a silver bullet. One case which comes to mind is when some of the nested data should be mixed with less-nested data, for example going from ([10 1 2 3] [100 4 5] [1000 6]) to ([10 20 30] [400 500] [6000]) should be relatively easy to express. Maybe ~let with parameters can be a suitable generalized solution: ({~let ([(addx v) #,(+ x v)]) [(addx y) …]} …)

The special-cased template syntax should allow special treatment of the i-th iteration in a doubly-nested loop: matching x on (1 2 3 4 5), and using the template (0 x.. (unquote (* x x)) ..x 1) … will produce (1 1 1 1 1) (0 4 1 1 1) (0 0 9 1 1) (0 0 0 16 1) (0 0 0 0 24). The pattern before x.. and the pattern after ..x can expand to multiple items which will be spliced in by wrapping it with ?@.

36.1 Ideas for implementation🔗ℹ

36.1.1 Extensibility (expanders)🔗ℹ

Allow normal, inline-prefix, inline-postfix and inline-infix expanders, which can bind using regular expressions. This allows implementing exotic syntax like var.. (postfix, operates on the pattern preceeding it), ..var (postfix, operates on the pattern after it), (… escaped-pattern) (normal, operates on the containing s-exp)

36.1.2 Customization🔗ℹ

For things that are likely to be customized by the user in the whole file scope, define a grammar/custom module, used as follows:

(require grammar/custom)
(grammar/custom option …)

The grammar/custom macro expands to (require grammar/core) followed by a bunch of define-syntax which wrap the core macros, providing them the custom options:

(require grammar/core)
(define-syntax-rule (parse . rest)
(parse/core #:global-options (option …) . rest))
(define-syntax-rule (tmpl . rest)
(parse/core #:global-options (option …) . rest))

This can also be used to rename the parse and tmpl macros, if desired (for example, tmpl could be renamed to quasisyntax, or something similar).

Otherwise, grammar/custom could just set! some for-syntax variable which stores the options. A second boolean for-syntax variable could be used to check if grammar/custom was called twice, and throw an error in that case.

Or maybe we should just use units? Can they be customized in a similar way?

The idea is to avoid having to wrap the whole file in a (parameterize …), and be able to easily provide a customized variation of this library:

(provide (customized-out grammar/custom))

36.1.3 Unsorted ideas🔗ℹ

36.1.3.1 Global pattern constraints🔗ℹ

For patterns, have global constraints: (~global-or id) binds id to true if the enclosing pattern was matched at least once, and false otherwise. Multiple occurrences of the same (~global-or id) make the id true if any of the containing clauses was matched at least once.

Inside a {~no-order}, it should be possible to impose some partial order constraints, so that we can say:

{~no-order
{~optional pat-a}
{~optional pat-b}
pat-c
{~optional {~constrain pat-d {~after pat-a}}}}

The above code means that pat-a, pat-b and pat-d are optional (but not pat-c), the four patterns can appear in any order, but if pat-a and pat-d are both present, then pat-d must appear after pat-a.

Scopes: the global constraints apply within a scope. By default, there is an implicit top-level scope, and some forms might implicitly introduce a catch-all scope unless otherwise specified, like the implicit ~demimit-cut for define-syntax-class from syntax/parse. There could be two kinds of scopes: unhygienic catch-all scopes which scope all "global" constraints within, and naming scopes, which explicitly say which identifiers they scope.

{~scope {a}
{~vector
{~scope {b} {~no-order {~once a} {~optional b}}}
{~scope {b} {~no-order {~once a} {~optional b}}}}}

The code above matches against a vector of two ~no-order lists. The a pattern must appear exactly once, either in the first list or in the second, but not in both. On the other hand, the b pattern may appear zero or one time in the first list, zero or one time in the second list, and may appear in both since its constraint is scoped for each list. Although it is less clear, the following code is semantically identical:

{~scope {a b}
{~vector
{~no-order {~once a} {~optional b}}
{~scope {b} {~no-order {~once a} {~optional b}}}}}

Since the b in the "~no-order" is bound to the enclosing {~scope {b} …}, it does not interact in any way with the outer scope. The ~optional constraint on the b in the first ~no-order therefore does not interact withe the ~optional constraint in the second ~no-order.

36.1.3.2 Generalization of pattern/template kinds🔗ℹ

Nearly all patterns and templates should work equally well for regular lists and syntax objects. It should be possible and easy enough to create new "kinds" of data, which modify how patterns and templates work all the way through the pattern or template tree, until it switches to a new kind. As an example, the following pattern starts as a normal s-expr pattern, and switches to syntax in two nodes:

{~s-expr 1 2 (buckle {~optional my} shoe)
3 4 {~syntax (knock {~optional at the} door)}
5 6 (pick {~optional-wrap (up _) (sticks)})
7 8 {~syntax (lay {~optional-wrap (them _) (straight)})}}

That pattern should match the following value:

`(1 2 (buckle shoe)
    3 4 ,#'(knock door)
    5 6 (pick (up (sticks)))
    7 8 ,#'(lay (them (straight))))

The ~syntax indicates that the whole subtree should start matching (or producing) syntax objects, instead of regular s-expressions. It is worht noting that syntax objects have extra information (source location, syntax properties) that regular s-expressions lack. One way of implementing this would be to make the pattern directives operate on "enhanced" s-expressions. Enhanced s-expressions are s-expressions with arbitrary kind-specific data attached to them. The ~s-expr simply translates s-expressions into enhanced s-expressions with an empty data attached, while ~syntax is a sort of pre-processor which turns syntax objects into enhanced s-expressions with source location and syntax properties attached. These "kind" pre-processors run before the normal pattern directives are applied. Some kind-specific pattern directives can access those properties (if they are used in within the scope of the appropriate ~kind), so that a (~loc srcloc . pattern) matches pattern and saves its source location into the variable srcloc.

Kinds should also be able to alter how the pattern variables are bound: ~s-expr simply binds (in patterns) and uses (in templates) normal Racket variables. On the other hand, ~syntax binds and uses syntax pattern variables, so that the bound variables are used as #'var instead of var.

Different pattern and template forms can specify a default kind (possibly by simply wrapping their pattern or tempalte with the appropriate ~kind). For example, a define/match form would use ~s-expr by default, whereas a define-syntax/match would use ~syntax. The same would apply for re-implementations of Racket’s match and syntax-parse.

Do the "kinds" form some sort of monad? TODO: Think about this, and try to see if there are some monads which can be translated to pattern/template kinds usefully.

36.1.3.3 Lenses🔗ℹ

It should be possible to describe lenses using the patterns: you can work on the focused part of the match, possibly access (read-only) other parts, and return a new value. What should happen when the focused part is under an ellipsis and has more than one match ? Implicitly execute the code n times, like a sort of for/list?

36.1.3.4 Backtracking🔗ℹ

Since the parser may need to backtrack, we need to expose the backtracking mechanism to the user in some way, so that the user can:

Cut the current branch
Perform some side-effects and undo them when backtracking (dangerous)
Record a side-effectful lambda which is executed when the match succeeds or when the current branch is ~committed.
Querry information about the previously failed branches
Maybe affect the order in which non-deterministic branches are taken. This feature would mainly be used by optimizers.
As a toy "just because we can" example, the backtracking mechanism should be configurable enough that some CSP algorithm like AC2003 can be expressed by the user, turning the pattern library into a CSP solver (where the CSP problem is expressed as a pattern over an empty object). Another toy "just because we can" example would be a datalog implementation built upon this library, where the deduction rules are expressed as patterns.
The goal is that the parser’s backtracking mechanism should be modular enough to allow us to implement a dead-simple unoptimized backtracker, and allow optimizers to be written as plug-ins. For example, an optimiazer could statically detect branches that can be cut due to a prior failure (e.g. if the two-element-list pattern (foo:id bar:number) failed because the first element was not an identifier?, there’s no point in trying (baz:id quux:string fuzz:number) on the same term.
Extensive configurability of the backtracking mechanism and optimization features may interact badly with partial application and partial compilation, see below. Think it through before giving too much or too little expressivity to the user.

36.1.3.5 Partial application🔗ℹ

It should be possible to give a partial input with holes to a pattern or template form, and, for optimization purposes, request that the pattern or template processes the input as much as it can (for the parser, it would potentially open a bounded number of backtracking branches, ready to switch to the next one if one fails), leaving an efficient "continuation".

36.1.3.6 Partial compilation🔗ℹ

One of the drawbacks of syntax/parse is that compiling a syntax-parse form takes some non-negligible time. This means that if a macro generates another macro, and the generated macro code uses syntax-parse, each call to the "generator" macro will be expensive. A complex macro generating syntax which contains hundreds of uses of syntax-case will be reasonnably fast. The same code using syntax-parse will be much slower. Since the generated uses of syntax-parse will all have the same "shape" with a few identifiers etc. changing, it would be nice to be able to partially pre-expand a use of syntax-parse, leaving only the "holes" to be expanded. With a bottom-up expansion mechanism there’s not much to do, so we have to try hard to make the pattern / template expander top-down as much as possible, and/or use a lazy language (for which most things can be evaluated, leaving a continuation for the few things that actually depend on the holes).

Although partial compilation sounds like a very interesting academic project, it might be too difficult to get something useful out of it in practice. An alternative, which would procude the sought performance benefits for macros generating code which uses the pattern/template library, would be to make as many of the concepts first-class, so that they can easily be supplied as a parameter. Note that firs-class in this case does not necessarily mean "run-time first-class", but possibly "compile-time first-class": we only need to be able to pre-declare parametric templates, then use them in the code generated by a macro. As long as the parametric templates support a form of "separate compilation" and optimization, filling in the parameters can be handled by a fast macro.

Some of the optimization plug-ins may however rely on a closed-world assumption (i.e. they want to have the whole, final pattern or template, in order to optimize it). If such an optimization plug-in is used, we may have to fall back to the idea of using partial compilation, or simply accept that macros which generate such code will take a while to expand.

36.1.3.7 QuickCheck test generation🔗ℹ

It should be possible to generate random data that matches (and does not match, too, that’s a distinct problem) a pattern (unless there’s a user-provided predicate that is opaque to the library, in which case we can just ignore it and generate instances at random, hoping that some will match and some won’t).

Combined with the fact that pattern directives should be reversible into template directives, and vica versa, it means that each directive should also express its set of accepted values in terms of its contents. Of course, we don’t expect to be able to uniformly sample random instances, nor do we expect to be able to support in a useful way complex patterns with lots of opaque predicates.

36.1.3.8 Error messages🔗ℹ

syntax/parse generates good error messages, but it does not work as well when the patterns become complex. Think this through, so that the annotation burden is minimal, and so that users don’t have to think too hard about where to put a ~describe (I frequently had the problem with syntax/parse where I wrote a ~describe, but it wasn’t taken into account.

36.1.4 Things to look at🔗ℹ

math/arry, for :: and array broadcasting.
Quasipatterns in match.
The lens library
https://github.com/racket/racket/issues/1304 non-linear matching (with repeated binding variables, for example, that should be eq? or equal?)

1	Aliases for other racket identifiers
2	backtrace
3	Compatibility wrappers
4	cond-let
5	contract
6	eval-get-values
7	Fixnum operations (fxxor …)
8	generate-indices
9	Generating identifiers
10	in
11	list-lang
12	list
13	logn-id
14	Miscellaneous utilities
15	multiassoc-syntax
16	not-implemented-yet
17	let-in binding and destructuring form
18	repeat-stx
19	require-provide
20	sequence
21	set
22	Syntax object manipulation utilities
23	syntax-parse helpers
24	test-framework
25	threading
26	tmpl
27	Utilities and patches for typed/ rackunit
28	Extensions for typed/ rackunit
29	typed-untyped
30	type-inference-helpers
31	values
32	Untyped versions of the modules
33	for/ list
34	meta operations on structs
35	Formatting identifiers so that Dr Racket still shows arrows
36	Versatile parser and template library