A new Left Recursive PEG Parser Generator for rust

I’d like to introduce a new PEG parser generator for rust, called lrpeg. I’ve recently started working on this. This PEG parser generator allows left recursion which makes PEG grammars much more useful.

When you want to use a parser generator with rust, there are a few options. First of all, there is lalrpop, which uses LR(1) grammars. This means there can only be one token lookahead, and no backtracking. This can make it very hard to write some grammars; often you’ll be faced with the dreaded shift-reduce errors. The Solang Solidity Compiler uses an lalrpop grammar. The grammar contains all sorts of tricks to avoid shift-reduce errors, which make the grammar totally unreadable. I’ve read Parsing Techniques by Dick Grune just to fully grok what can be done with LR(1) grammars. LR(1) complexity makes it hard to reason about.

The other option you have is pest. pest uses PEG grammars, without left recursion. This means that if you want to write a simple calculator, you have to re-write rules in a non-left recursive way, in other words the left-most rule in a definition cannot be the rule itself. The rule expr

1
2
3
4
num = { ASCII_DIGIT+ }
op = _{ "+" | "-" | "*" | "/" }
expr = { expr ~ op ~ term | term }
term = _{ num }

is not permitted because expr has itself as the most left most rule. So you have to re-write this as:

1
2
3
4
num = { ASCII_DIGIT+ }
op = _{ "+" | "-" | "*" | "/" }
expr = { term ~ (op ~ term)* }
term = _{ num }

This grammar still is not complete yet. Operator precedence, i.e. 1+2*4 should be parsed as 1+(2*4), not (1+2)*4. The way pest deals with this is with the precedence climber. This is some code which modifies the parse tree after parsing to adjust the precedence of operators. It is difficult to follow and this is not expressed in the grammar itself.

Another option, although not available in rust right now, are GLR grammars which ANTLR and GNU Bison use. Essentially that’s LR(1) with backtracking, and that makes for a powerful parser generator. However, implementing GLR is quite tricky. GLR parsers and grammars are pretty tricky to understand, there even more so than LR(1). So I think that pest has the right idea by using PEG grammars. However, it is possible to write the parser generator to allow for left-recursive rules, like python now uses. So this is why I wrote lrpeg. lrpeg can deal with left recursion (and indirect left recursion, where a left recursion happens via another rule). With lrpeg parser generator the grammar for a calculator is simply:

1
2
3
4
5
6
7
8
9
10
expr <- expr "+" term
/ expr "-" term
/ term;

term <- term "*" num
/ term "/" num
/ "(" expr ")"
/ num;

num <- r"[0-9]+";

Note that the precedence of * over + is specified by the ordering of the terms; expr will first recurse to term and try * before fallback to +. A more complete example with more precedence is the peg grammar for IRP.

The lrpeg parser generator is a new code base. There are outstanding issues like it does not detect unreachable alternatives or infinite loops in the grammar (e.g. expr <- expr "a";). It is also not self-hosting yet, in other words it does not parse the peg grammar with peg, it actually still uses lalrpop for that. This is why rules must end with a “;”: LR(1) has only one token lookahead, and that is not enough to determine the end of a rule. On the other hand, it is a fully functional parser generator which can parse IRP.

PEG grammars offer another advantage over GLR grammars which I’m keen on. The Solidity language has a huge list of keywords. Keywords exist in programming language because GLR and LR(1) grammars have separate tokenizer or lexer, which handles keywords as a special case. The lexer does not have the context to know if a keyword is used as a variable name, so the parser just gives up. For example, in Solidity int seconds = 1; is not permitted. With PEG grammars, we can do away with all these problems. The lexer and the parser are in one big definition; so we can match seconds as a keyword only where it is expected; in other cases, it can be a variable name. This would mean an end to most keywords. That would be nice.