Lexer and yacc examples
1 Description
2 Reverse Polish Notation Calculator
2.1 BNF
2.2 Lexer
2.2.1 Two types of tokens
2.2.2 lexer-src-pos
2.2.3 Testing the lexer
2.3 Parser
2.3.1 Explanation of exp grammar
2.3.2 The other components of parser
2.3.3 parse function
2.3.4 Testing the parser
2.4 Language
2.4.1 Reader
2.4.2 lex-yacc-example/  rpcalc
2.4.3 Main
2.4.4 Running rpcalc
3 Infix Notation Calculator
3.1 BNF
3.2 Lexer
3.3 Parser
3.3.1 Testing the parser
3.4 Language
3.4.1 Reader
3.4.2 Main
3.4.3 Running calc
4 Multi-Function Calculator
4.1 BNF
4.2 Lexer
4.3 funs.rkt
4.4 Parser
4.5 Language
4.5.1 Expander
4.5.2 Testing s-exp
4.5.3 Reader
4.5.4 Main
4.5.5 Running mfcalc
5 Conclusion
8.2

Lexer and yacc examples

by Petter Olav Pripp <petter.pripp@yahoo.com>

Copyright (C) 2021 - Petter Olav Pripp

The source code is distributed under the GNU General Public License.

The source code is at

https://github.com/petterpripp/lex-yacc-example

Any suggestion or corrections are welcome.

1 Description

The examples show use of lexer and yacc-style parser. Useful for making syntax in a domain-specific language.

There will be three languages:
  • lex-yacc-example/rpcalc

  • lex-yacc-example/calc

  • lex-yacc-example/mfcalc

2 Reverse Polish Notation Calculator

Based on GNU Bison RPN Calc example.

https://www.gnu.org/software/bison/manual/bison.html#RPN-Calc

The example is that of a simple double-precision Reverse Polish Notation calculator (a calculator using postfix operators). This example provides a good starting point, since operator precedence is not an issue.

2.1 BNF

 

exp

 ::= 

number

 

  |  

exp exp +

 

  |  

exp exp -

 

  |  

exp exp *

 

  |  

exp exp /

 

  |  

exp exp ^

 

  |  

exp -

2.2 Lexer

The lexer translate code to tokens. This will be input to the parser. Below is the full code for the lexer. In the next sections we will look into the code.

#lang racket
 
(require parser-tools/lex
         (prefix-in : parser-tools/lex-sre))
 
(define-tokens value-tokens (NUMBER))
(define-empty-tokens op-tokens (EOF ADD SUBTRACT PRODUCT DIVISION POWER NEG))
 
(define next-token
  (lexer-src-pos
   [(eof) (token-EOF)]
   [(:+ whitespace) (return-without-pos (next-token input-port))]
   [#\+ (token-ADD)]
   [#\- (token-SUBTRACT)]
   [#\* (token-PRODUCT)]
   [#\/ (token-DIVISION)]      
   [#\^ (token-POWER)]
   [#\n (token-NEG)]
   [(:: (:+ numeric) (:* (:: #\. (:+ numeric) ))) (token-NUMBER (string->number lexeme))]))
 
 
(provide value-tokens op-tokens next-token)
 
2.2.1 Two types of tokens
(define-tokens value-tokens (NUMBER))
(define-empty-tokens op-tokens (EOF ADD SUBTRACT PRODUCT DIVISION POWER NEG))
2.2.2 lexer-src-pos

The lexer uses regular expressions from (require parser-tools/lex (prefix-in : parser-tools/lex-sre)). When multiple patterns match, a lexer will choose the longest match, breaking ties in favor of the rule appearing first. The lexer will return tokens with source information. Below is explanation of some of the rules.

[#\+ (token-ADD)]

When lexer finds ’+’ it will return token ADD.

[(:+ whitespace) (return-without-pos (next-token input-port))]

Recursively call the lexer on the remaining input after a tab or space. Returning the result of that operation. This effectively skips all whitespace.

[(:: (:+ numeric) (:* (:: #\. (:+ numeric))))  (token-NUMBER (string->number lexeme))]

The lexer return both token-id NUMBER and the number combined to one value-token.

2.2.3 Testing the lexer
#lang racket
 
(require "lexer.rkt" parser-tools/lex)
 
(define (lex-test ip)
  (port-count-lines! ip)
  (letrec ([one-line
            (lambda ()
              (let ([result (next-token ip)])
                (unless (equal? (position-token-token result) 'EOF)
                  (printf "~a\n" result)
                  (one-line)
                  )))])
    (one-line)))
 
(define (my-lex-test str)
    (lex-test (open-input-string str)))
    
(provide my-lex-test)
 

Examples:
> (my-lex-test "1 + * -")

#(struct:position-token #(struct:token NUMBER 1) #(struct:position 1 1 0) #(struct:position 2 1 1))

#(struct:position-token ADD #(struct:position 3 1 2) #(struct:position 4 1 3))

#(struct:position-token PRODUCT #(struct:position 5 1 4) #(struct:position 6 1 5))

#(struct:position-token SUBTRACT #(struct:position 7 1 6) #(struct:position 8 1 7))

> (my-lex-test "3.14 +\n ^ # -")

#(struct:position-token #(struct:token NUMBER 3.14) #(struct:position 1 1 0) #(struct:position 5 1 4))

#(struct:position-token ADD #(struct:position 6 1 5) #(struct:position 7 1 6))

#(struct:position-token POWER #(struct:position 9 2 1) #(struct:position 10 2 2))

lexer: No match found in input starting with: #

2.3 Parser

This is the full code for the parser. In the next sections we will look into the code.

#lang racket
 
(require parser-tools/yacc  "lexer.rkt")
 
(define myparser
  (parser
 
   (start exp)
   (end EOF)
   (tokens value-tokens op-tokens )
   (src-pos)
   (error (lambda (a b c d e) (begin (printf "a = ~a\nb = ~a\nc = ~a\nd = ~a\ne = ~a\n" a b c d e) (void))))   
   
   (grammar
 
    (exp  [(NUMBER) $1]
          [(exp exp ADD) (+ $1 $2)]
          [(exp exp SUBTRACT) (- $1 $2)]
          [(exp exp PRODUCT) (* $1 $2)]
          [(exp exp DIVISION) (/ $1 $2)]
          [(exp exp POWER) (expt $1 $2)]
          [(exp NEG) (- $1)]))))
             
(define (parse ip)
  (port-count-lines! ip)  
  (myparser (lambda () (next-token ip))))   
 
(provide parse )
 
2.3.1 Explanation of exp grammar

Grammar can have many grouping. In this example it has only exp. The exp grouping has several rules, one for each kind of expression. The first rule handles the simplest expressions: those that are just numbers. The second handles an addition-expression, which looks like two expressions followed by a plus-sign. The third handles subtraction, and so on.

(grammar
 (exp  [(NUMBER) $1]
       [(exp exp ADD) (+ $1 $2)]
       [(exp exp SUBTRACT) (- $1 $2)]
       [(exp exp PRODUCT) (* $1 $2)]
       [(exp exp DIVISION) (/ $1 $2)]
       [(exp exp POWER) (expt $1 $2)]
       [(exp NEG) (- $1)]))

The rules have actions that compute the value of the expression in terms of the value of its parts. For example, in the rule for addition, $1 refers to the first component exp and $2 refers to the second one.

2.3.2 The other components of parser

(start exp)

Starting point for the parser. In our example there is only one grouping exp, therefore exp will be the starting grouping.

(end EOF)

End point for the parser. The parser will stop when it reads token EOF.

(tokens value-tokens op-tokens)

Declares all of the tokens that can be used by the parser in the grammar declaration. Tokens is defined in the lexer.

(src-pos)

Causes the generated parser to expect input in the form of token with source information, instead of only tokens.

(error (lambda (a b c d e) (begin (printf "a = ~a\nb = ~a\nc = ~a\nd = ~a\ne = ~a\n" a b c d e) (void))))

The function which will be executed for its side-effect whenever the parser encounters an error.

2.3.3 parse function
(define (parse ip)
  (port-count-lines! ip)
  (myparser (lambda () (next-token ip))))

Wrapper around the parser. It handles the call to the lexer.

Necessary for having source code information in error message.

2.3.4 Testing the parser

Examples:
> (parse (open-input-string "20 3 5 * 7 + + "))

42

> (parse (open-input-string "2 4 ^ n "))

-16

> (parse (open-input-string "1 2 3 / + * "))

a = #t

b = PRODUCT

c = #f

d = #(struct:position 11 1 10)

e = #(struct:position 12 1 11)

parser: Cannot continue after error

2.4 Language

We wrap it up with making a lex-yacc-example/rpcalc language.

2.4.1 Reader
#lang racket
 
(require "parser.rkt" )
 
(provide (rename-out [my-read read]
                     [my-read-syntax read-syntax]))
 
(define (my-read in)
  (syntax->datum
   (my-read-syntax #f in)))
 
(define (my-read-syntax path port)
  (datum->syntax
   #f
   `(module rpcalc-mod racket
      ,(parse port))))
 
2.4.2 lex-yacc-example/rpcalc

The file "rpcalc.rkt" in the top-level directory enables #lang lex-yacc-example/rpcalc

#lang racket
 
(module reader racket
  (require "rpcalc/reader.rkt")
  (provide read read-syntax))
 
2.4.3 Main

Optionally, going down to directory rpcalc, and running "raco pkg install" will enable #lang rpcalc

The file "main.rkt":

#lang racket
 
(module reader racket
  (require "reader.rkt")
  (provide read read-syntax))
 
2.4.4 Running rpcalc

File "rpcalc-test.rkt":

#lang lex-yacc-example/rpcalc
 
2 3 4 5 + + ^ n
 

Example:
> (require lex-yacc-example/rpcalc/rpcalc-test)

-4096

3 Infix Notation Calculator

Based on GNU Bison Infix Calc example.

https://www.gnu.org/software/bison/manual/bison.html#Infix-Calc

We now modify rpcalc to handle infix operators instead of postfix. Infix notation involves the concept of operator precedence and the need for parentheses nested to arbitrary depth.

3.1 BNF

 

input

 ::= 

input line

 

line

 ::= 

\n

 

  |  

exp \n

 

exp

 ::= 

number

 

  |  

exp + exp

 

  |  

exp - exp

 

  |  

exp * exp

 

  |  

exp / exp

 

  |  

exp ^ exp

 

  |  

- exp

 

  |  

( exp )

3.2 Lexer

Below is the full code for the lexer.

#lang racket
 
(require parser-tools/lex
         (prefix-in : parser-tools/lex-sre))
 
(define-tokens value-tokens (NUMBER))
(define-empty-tokens op-tokens (EOF ADD SUBTRACT PRODUCT DIVISION POWER NEG OP CP NEWLINE))
 
(define next-token
  (lexer-src-pos
   [(eof) (token-EOF)]
   [(:+ (:& (:~ #\newline) whitespace)) (return-without-pos (next-token input-port))]   
   [#\+ (token-ADD)]
   [#\- (token-SUBTRACT)]
   [#\* (token-PRODUCT)]
   [#\/ (token-DIVISION)]      
   [#\^ (token-POWER)]
   ;[#\n (token-NEG)]
   [#\( (token-OP)]
   [#\) (token-CP)]
   [#\newline (token-NEWLINE)]
   [(:: (:+ numeric) (:* (:: #\. (:+ numeric) ))) (token-NUMBER (string->number lexeme))]))
 
(provide value-tokens op-tokens next-token)
 

Changes from rpcalc: Newline is a token, whitespace without newline, and token for ’(’ and ’)’. Neg will not be used in lexer, but is defined because of use in parser later on.

3.3 Parser

Below is the full code for the parser.

#lang racket
 
(require parser-tools/yacc "lexer.rkt")
 
 
(define myparser
  (parser
 
   (start input)
   (end EOF)
   (tokens value-tokens op-tokens )
   (src-pos)
   (error (lambda (a b c d e) (begin (printf "a = ~a\nb = ~a\nc = ~a\nd = ~a\ne = ~a\n" a b c d e) (void))))   
 
   (precs
    (left ADD SUBTRACT)
    (left PRODUCT DIVISION)
    (nonassoc NEG)
    (right POWER))    
   
   (grammar
 
    (input [() '()]           
           [(input line) (append $1  $2)])    
 
    (line [(NEWLINE) '()]
          [(exp NEWLINE) (list $1)])
 
    (exp  [(NUMBER) $1]
          [(exp ADD exp) (+ $1 $3)]
          [(exp SUBTRACT exp) (- $1 $3)]
          [(exp PRODUCT exp) (* $1 $3)]
          [(exp DIVISION exp) (/ $1 $3)]
          [(SUBTRACT exp) (prec NEG) (- $2)]
          [(exp POWER exp) (expt $1 $3)]
          [(OP exp CP) $2]))))
          
             
(define (parse ip)
  (port-count-lines! ip)
  (myparser (lambda () (next-token ip))))
    
 
(provide parse )
 

There are two important new features shown in this code.

In the precs section, left declares token kinds and says they are left-associative operators. And right (right associativity).

Operator precedence is determined by the line ordering of the declarations. The higher the line number of the declaration (lower on the page or screen), the higher the precedence. Hence, exponentiation has the highest precedence, unary minus (NEG) is next, followed by ‘*’ and ‘/’, and so on. Unary minus is not associative, only precedence matters.

The other important new feature is the prec in the grammar section for the unary minus operator. The prec simply instructs Yacc that the rule (SUBTRACT exp) has the same precedence as NEG. In this case the next-to-highest.

3.3.1 Testing the parser

Examples:
> (parse (open-input-string "\n\n1 + 4*8 \n 6/10\n\n\n 5 + 6 +7 \n\n\n"))

'(33 3/5 18)

> (parse (open-input-string "\n\n(1 + 4)*8 \n 6/10\n\n\n 5 + 6 +7 \n\n\n"))

'(40 3/5 18)

> (parse (open-input-string "1 + 2^4 \n"))

'(17)

3.4 Language

We wrap it up with making a calc language.

3.4.1 Reader
#lang racket
 
(require "parser.rkt" )
 
(provide (rename-out [my-read read]
                     [my-read-syntax read-syntax]))
 
(define (my-read in)
  (syntax->datum
   (my-read-syntax #f in)))
 
(define (my-read-syntax path port)
  (datum->syntax
   #f
   `(module rpcalc-mod racket
      ',(parse port))))
 

Note the quote at: ',(parse port)

3.4.2 Main
#lang racket
 
(module reader racket
  (require "reader.rkt")
  (provide read read-syntax))
 
3.4.3 Running calc
 
#lang lex-yacc-example/calc
 
1 + 4 * 8
 
(1 + 4) * 8
 
6/10
 
2 ^ 4 + 100
 

The result should be '(33 40 3/5 116).

4 Multi-Function Calculator

Based on GNU Bison Multi-Function Calc example.

https://www.gnu.org/software/bison/manual/bison.html#Multi_002dfunction-Calc

Now that the basics of lexer and yacc have been discussed, it is time to move on to a more advanced problem. The above calculators provided only five functions, ‘+’, ‘-’, ‘*’, ‘/’ and ‘^’. It would be nice to have a calculator that provides other mathematical functions such as sin, cos, etc.

It is easy to add new operators to the infix calculator as long as they are only single-character literals. The lexer passes back all nonnumeric characters as tokens, so new grammar rules suffice for adding a new operator. But we want something more flexible: built-in functions whose syntax has this form:
function_name (argument)
At the same time, we will add memory to the calculator, by allowing you to create named variables, store values in them, and use them later.

4.1 BNF

 

input

 ::= 

input line

 

line

 ::= 

\n

 

  |  

exp \n

 

  |  

var = exp \n

 

exp

 ::= 

number

 

  |  

var

 

  |  

fun ( exp )

 

  |  

exp + exp

 

  |  

exp - exp

 

  |  

exp * exp

 

  |  

exp / exp

 

  |  

exp ^ exp

 

  |  

- exp

 

  |  

( exp )

4.2 Lexer

Below is the full code for the lexer.

#lang racket
 
(require parser-tools/lex
         (prefix-in : parser-tools/lex-sre)
         "funs.rkt")
 
(define-tokens value-tokens (NUMBER VAR FUN))
(define-empty-tokens op-tokens (EOF ADD SUBTRACT PRODUCT DIVISION POWER OP CP EQ NEG NEWLINE ))
 
(define next-token
  (lexer-src-pos
   [(eof) (token-EOF)]
   [(:+ (:& (:~ #\newline) whitespace)) (return-without-pos (next-token input-port))] 
   [#\+ (token-ADD)]
   [#\- (token-SUBTRACT)]
   [#\* (token-PRODUCT)]
   [#\/ (token-DIVISION)]      
   [#\^ (token-POWER)]
   [#\( (token-OP)]
   [#\) (token-CP)]
   [#\= (token-EQ)]
   [#\newline (token-NEWLINE)]
   [(:: (:+ numeric) (:* (:: #\. (:+ numeric) ))) (token-NUMBER (string->number lexeme))]
   [(:: alphabetic (:* (:or alphabetic numeric))) (let ([sym (string->symbol lexeme)])
                                                    (if (fun? sym)
                                                        (token-FUN sym)
                                                        (token-VAR sym)))]))                                                  
   
 
(provide value-tokens op-tokens next-token)
 

The lexer has to decide between FUN or VAR token. This is done by query if function is defined. The function fun? gives the answer. More about function and variable in next section.

4.3 funs.rkt

The new file where function and variable is handled.

#lang racket
 
(provide get-fun fun? get-var set!-var )
 
(define funs
  (hasheq 'atan atan
          'cos cos
          'exp expt
          'ln log
          'sin sin
          'sqrt sqrt))
 
(define vars (make-hash))
 
(define (fun? key)
  (hash-has-key? funs key))
  
(define (var? key)
  (hash-has-key? vars key))
 
(define (get-fun key)
  (if (fun? key)
      (hash-ref funs key)
      (error "fun: no such function. " key)))
  
(define (get-var key)
  (if (var? key)
      (hash-ref vars key)
      (error "var: no such variable. " key)))
 
(define (set!-var key val)
  (hash-set! vars key val))
 

Function is stored in immutable hash-table.

Variable is stored in mutable hash-table, the memory where variable’s is defined and changed.

4.4 Parser

Below is the full code for the parser.

#lang racket
 
(require parser-tools/yacc "lexer.rkt") 
 
(define myparser
  (parser
 
   (start input)
   (end EOF)
   (tokens value-tokens op-tokens )
   (src-pos)
   (error (lambda (a b c d e) (begin (printf "a = ~a\nb = ~a\nc = ~a\nd = ~a\ne = ~a\n" a b c d e) (void))))   
 
   (precs    
    (left ADD SUBTRACT)
    (left PRODUCT DIVISION)
    (nonassoc NEG)
    (left POWER))
   
   (grammar
 
    (input [() '()]
           [(input line) (append $1 $2)])
 
    (line [(NEWLINE) '()]
          [(exp NEWLINE) (list $1)]
          [(VAR EQ exp NEWLINE)  `((assign ',$1 ,$3))])
 
    (exp  [(NUMBER) $1]
          [(VAR) `(var ',$1)]          
          [(FUN OP exp CP) `(fun ',$1 ,$3)]
          [(exp ADD exp) `(add ,$1 ,$3)]
          [(exp SUBTRACT exp) `(subtract ,$1 ,$3)]
          [(exp PRODUCT exp) `(product ,$1 ,$3)]
          [(exp DIVISION exp) `(division ,$1 ,$3)]
          [(SUBTRACT exp) (prec NEG) `(neg ,$2)]
          [(exp POWER exp) `(power ,$1 ,$3)]
          [(OP exp CP) $2]))))
          
             
(define (parse ip)
  (port-count-lines! ip)  
  (myparser (lambda () (next-token ip))))
    
 
(provide parse )
 

The parser generate s-expression’s for input to the expander. This is an important step forward. When program’s becomes more complicated it is better to handle this separate from the parser. In Racket it is common that parsing is in the reader and runnable code is in the expander.

4.5 Language

We will making a mfcalc language, using reader and expander.

4.5.1 Expander
#lang racket
 
(require (for-syntax syntax/parse) "funs.rkt")
 
(provide 
 (rename-out [module-begin #%module-begin])
 #%top-interaction
 #%app
 #%datum
 quote
 add subtract product division power neg fun assign var)
         
(define-syntax (module-begin stx) 
  (syntax-parse stx
      [(module-begin expr ...)       
       #'(#%module-begin
          expr ...)]))  
 
(define (add x y)
  (+ x y))
 
(define (subtract x y)
  (- x y))
 
(define (product x y)
  (* x y))
 
(define (division x y)
  (/ x y))
 
(define (power x y)
  (expt x y))
 
(define (neg x)
  (- x))
 
(define (fun name x)
  ((get-fun name) x))
 
(define (assign varname value)
  (begin
    (set!-var varname value)
    value))
 
(define (var name)
  (get-var name))
 
4.5.2 Testing s-exp

The s-exp version of mfcalc can be tested by using the #lang s-exp declaration

#lang s-exp "expander.rkt"
 
(add 1 2)
(subtract (add 3 4) 5)
(product (power 2 4) 10)
(division (product (power 2 4) 10) 5)
(product 100 (fun 'sqrt 25))
(assign 'x 50)
(division (var 'x) (add 3 7))
(assign 'y 20)
(add (var 'x) (var 'y))
(assign 'x 60)
(add (var 'x) (var 'y))
 

Example:
> (require lex-yacc-example/mfcalc/s-exp-test)

3

2

160

32

500

50

5

20

70

60

80

4.5.3 Reader

The reader module are using "expander.rkt".

#lang racket
 
(require "parser.rkt" )
 
(provide (rename-out [my-read read]
                     [my-read-syntax read-syntax]))
 
(define (my-read in)
  (syntax->datum
   (my-read-syntax #f in)))
 
(define (my-read-syntax path port)
  (datum->syntax
   #f
   `(module rpcalc-mod "expander.rkt"
      ,@(parse port))))
 
4.5.4 Main
#lang racket
 
(module reader racket
  (require "reader.rkt")
  (provide read read-syntax ))
 
4.5.5 Running mfcalc

File "mfcalc-test.rkt":

#lang lex-yacc-example/mfcalc
 
1 + 2
3 * 4 + 10
3 * (4 + 10)
100 + 2 * 5
-5 + 3.5
2^4 / 2
sqrt(25)
sin( 0.5)
sqrt(16) + sqrt(25)
x = 10 + 5
y = 2.5
x * 1000
x - y
ln(6)
 

Example:
> (require lex-yacc-example/mfcalc/mfcalc-test)

3

22

42

110

-1.5

8

5

0.479425538604203

9

15

2.5

15000

12.5

1.791759469228055

5 Conclusion

Congratulation! You are now a lexer and yacc ninja!

Some other sources:

The GNU Bison manual covers many topics of yacc/bison parser. Useful even if you can not program in C.

https://www.gnu.org/software/bison/manual/bison.html

The racket parser have two examples: calc.rkt and read.rkt

https://github.com/racket/parser-tools/tree/master/parser-tools-lib/parser-tools/examples