Programming language: Rust
Tags: Parser    


Parsing Expression Grammars in Rust

rust-peg is a simple yet flexible parser generator based on the Parsing Expression Grammar formalism. It provides a Rust macro that builds a recursive descent parser from a concise definition of the grammar.

Please see the release notes for updates.

Note: This documentation corresponds to the upcoming 1.0-beta version. For the latest release (which is a build script rather than a procedural macro), see crates.io.

The peg!{} macro encloses a grammar definition containing a set of rules which match components of your language. It expands to a Rust mod containing functions corresponding to each rule marked pub.

use peg::peg;

peg! {
  grammar my_grammar() for str {
    rule number() -> u32
      = n:$(['0'..='9']+) { n.parse().unwrap() }

    pub rule list() -> Vec<u32>
      = "[" l:number() ** "," "]" { l }
  }
}

fn main() {
  // Only rules marked `pub` are callable from outside; a parse returns a Result.
  assert_eq!(my_grammar::list("[1,1,2,3,5,8]"), Ok(vec![1,1,2,3,5,8]));
}
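
To make "recursive descent parser" concrete, here is a hand-written sketch of roughly what the `list` grammar above describes, using only the standard library. It is not the code the macro generates: the function signatures, the use of `Option` instead of a parse error type, and the decision to reject trailing input are assumptions made for illustration.

```rust
// Hand-written sketch of the `list` grammar; not the macro's actual output.

/// number = ['0'..='9']+
/// Returns the parsed value and the position just past the digits.
fn number(input: &str, pos: usize) -> Option<(u32, usize)> {
    let rest = &input[pos..];
    let len = rest.bytes().take_while(|b| b.is_ascii_digit()).count();
    if len == 0 {
        return None; // `+` requires at least one digit
    }
    // Unlike the grammar's `unwrap()`, overflow is treated as a failed match.
    let value = rest[..len].parse().ok()?;
    Some((value, pos + len))
}

/// list = "[" number() ** "," "]"
/// This sketch also requires the whole input to be consumed.
fn list(input: &str) -> Option<Vec<u32>> {
    let mut pos = 0;
    if !input[pos..].starts_with('[') {
        return None;
    }
    pos += 1;

    // Delimited repeat: zero or more numbers separated by ",".
    let mut items = Vec::new();
    if let Some((n, next)) = number(input, pos) {
        items.push(n);
        pos = next;
        while input[pos..].starts_with(',') {
            match number(input, pos + 1) {
                Some((n, next)) => {
                    items.push(n);
                    pos = next;
                }
                None => break, // leave the "," unconsumed
            }
        }
    }

    if !input[pos..].starts_with(']') {
        return None;
    }
    pos += 1;
    if pos == input.len() { Some(items) } else { None }
}
```

Each grammar rule becomes one function that takes a position and returns either a result with the new position or a failure, which is the shape the macro spares you from writing by hand.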

Expressions

  • "keyword" - Literal: match a literal string.
  • ['0'..='9'] - Pattern: match a single element that matches a Rust match-style pattern. (details)
  • some_rule() - Rule: match a rule defined elsewhere in the grammar and return its result.
  • e1 e2 e3 - Sequence: match expressions in sequence (e1 followed by e2 followed by e3).
  • e1 / e2 / e3 - Ordered choice: try to match e1. If the match succeeds, return its result, otherwise try e2, and so on.
  • expression? - Optional: match one or zero repetitions of expression. Returns an Option.
  • expression* - Repeat: match zero or more repetitions of expression and return the results as a Vec.
  • expression+ - One-or-more: match one or more repetitions of expression and return the results as a Vec.
  • expression*<n,m> - Range repeat: match between n and m repetitions of expression and return the results as a Vec. (details)
  • expression ** delim - Delimited repeat: match zero or more repetitions of expression delimited with delim and return the results as a Vec.
  • &expression - Positive lookahead: Match only if expression matches at this position, without consuming any characters.
  • !expression - Negative lookahead: Match only if expression does not match at this position, without consuming any characters.
  • a:e1 b:e2 c:e3 { rust } - Action: Match e1, e2, e3 in sequence. If they match successfully, run the Rust code in the block and return its return value. The variable names before the colons in the preceding sequence are bound to the results of the corresponding expressions.
  • a:e1 b:e2 c:e3 {? rust } - Conditional action: like the action above, but the Rust block returns a Result<T, &str> instead of a value directly. On Ok(v), the match succeeds and returns v. On Err(e), the match of the entire expression fails, and the parser tries alternatives or reports a parse error with the &str e.
  • $(e) - Slice: match the expression e, and return the &str slice of the input corresponding to the match.
  • position!() - Position: return a usize representing the current offset into the input, without consuming any characters.
  • quiet!{ e } - match expression, but don't report literals within it as "expected" in error messages.
  • expected!("something") - fail to match, and report the specified string as an expected symbol at the current location.
  • precedence!{ ... } - Parse infix, prefix, or postfix expressions by precedence climbing. (details)

Match expressions

The [pat] syntax expands into a Rust match pattern against the next character (or element) of the input.

This is commonly used for matching sets of characters, with Rust's ..= inclusive range pattern syntax and | to combine multiple patterns. For example, ['a'..='z' | 'A'..='Z'] matches an uppercase or lowercase ASCII letter.

If your input type is a slice of an enum type, a pattern could match an enum variant like [Token::Operator('+')] or even bind a variable with [Token::Identifier(i)].

[_] matches any single element. As this always matches except at end-of-file, combining it with negative lookahead as ![_] is the idiom for matching EOF in PEG.
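
As a rough illustration of the patterns described above, the following standard-library sketch writes the equivalent `match` by hand. It is not the macro's expansion. The function names and signatures are hypothetical, and the payload types of the `Token` enum are assumed, since the text only names its variants.

```rust
// Illustrative only: what the bracket patterns amount to, written by hand.

/// ['a'..='z' | 'A'..='Z'] on `str` input: succeeds with the new position if
/// the next character fits the pattern, otherwise fails without consuming.
fn ascii_letter(input: &str, pos: usize) -> Option<usize> {
    match input[pos..].chars().next() {
        Some('a'..='z' | 'A'..='Z') => Some(pos + 1), // ASCII letters are 1 byte
        _ => None,
    }
}

/// [_] : any single element; fails only at end of input.
fn any(input: &str, pos: usize) -> Option<usize> {
    input[pos..].chars().next().map(|c| pos + c.len_utf8())
}

/// ![_] : negative lookahead over [_]; succeeds only at end of input and
/// never advances the position.
fn eof(input: &str, pos: usize) -> Option<usize> {
    match any(input, pos) {
        Some(_) => None,
        None => Some(pos),
    }
}

/// Hypothetical token type for a slice input; payload types are assumed.
#[derive(Debug, PartialEq)]
enum Token {
    Identifier(String),
    Operator(char),
}

/// [Token::Identifier(i)] : match one variant and bind its payload.
fn identifier(input: &[Token], pos: usize) -> Option<(&str, usize)> {
    match input.get(pos) {
        Some(Token::Identifier(i)) => Some((i.as_str(), pos + 1)),
        _ => None,
    }
}
```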

Repeat ranges

The repeat operators * and ** can be followed by an optional range specification of the form <n> (exactly n), <n,> (at least n), <,m> (at most m) or <n,m> (between n and m), where n and m are either integer literals or Rust usize expressions enclosed in {}.
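
The semantics of a bounded repeat can be sketched with a small standard-library helper. This is an illustration under assumptions, not rust-peg's implementation: `repeat_range` and `digit` are hypothetical names, and the bounds are passed as a minimum plus an optional maximum.

```rust
/// Apply `elem` repeatedly starting at `pos`, stopping once `max` matches
/// have been collected (if a maximum is given). Succeeds with the matched
/// values and the new position when at least `min` repetitions matched.
fn repeat_range<T>(
    input: &str,
    mut pos: usize,
    min: usize,
    max: Option<usize>,
    elem: impl Fn(&str, usize) -> Option<(T, usize)>,
) -> Option<(Vec<T>, usize)> {
    let mut out = Vec::new();
    while max.map_or(true, |m| out.len() < m) {
        match elem(input, pos) {
            Some((value, next)) => {
                out.push(value);
                pos = next;
            }
            None => break,
        }
    }
    if out.len() >= min { Some((out, pos)) } else { None }
}

/// ['0'..='9'] : a single ASCII digit.
fn digit(input: &str, pos: usize) -> Option<(char, usize)> {
    match input[pos..].chars().next() {
        Some(c @ '0'..='9') => Some((c, pos + 1)),
        _ => None,
    }
}
```

In terms of this helper, <n> corresponds to `min = n, max = Some(n)`, <n,> to `min = n, max = None`, <,m> to `min = 0, max = Some(m)`, and <n,m> to `min = n, max = Some(m)`.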

Precedence climbing

precedence!{ rules... } provides a convenient way to parse infix, prefix, and postfix operators using the precedence climbing algorithm.

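pub rule arithmetic() -> i64 = precedence!{
  // Lowest precedence, left-associative
  x:(@) "+" y:@ { x + y }
  x:(@) "-" y:@ { x - y }
  --
  // Binds more tightly, left-associative
  x:(@) "*" y:@ { x * y }
  x:(@) "/" y:@ { x / y }
  --
  // Binds most tightly, right-associative
  x:@ "^" y:(@) { x.pow(y as u32) }
  --
  // Atoms
  n:number() { n }
}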

Each -- introduces a new precedence level that binds more tightly than previous precedence levels. The levels consist of one or more operator rules each followed by a Rust action expression.

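The (@) and @ are the operands, and the parentheses indicate associativity: (@) on the left operand makes the operator left-associative, and (@) on the right operand makes it right-associative. An operator rule beginning and ending with @ is an infix expression. A prefix rule has a single @ at the end, a postfix rule has a single @ at the beginning, and atoms do not include @.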
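
For readers unfamiliar with precedence climbing, the following is a minimal hand-written sketch of the algorithm for the same three operator levels, using only the standard library. It is not what precedence!{} expands to. The names `op_info`, `atom`, and `climb` are hypothetical, the input is assumed to contain no whitespace or parentheses, and integer overflow is not handled.

```rust
/// (precedence level, right-associative?) for each infix operator,
/// mirroring the three levels in the grammar above.
fn op_info(op: char) -> Option<(u8, bool)> {
    match op {
        '+' | '-' => Some((1, false)),
        '*' | '/' => Some((2, false)),
        '^' => Some((3, true)),
        _ => None,
    }
}

/// Atom: a run of ASCII digits.
fn atom(input: &str, pos: usize) -> Option<(i64, usize)> {
    let len = input[pos..].bytes().take_while(|b| b.is_ascii_digit()).count();
    if len == 0 {
        return None;
    }
    let n = input[pos..pos + len].parse().ok()?;
    Some((n, pos + len))
}

/// Parse an expression in which every operator has level >= `min_level`.
fn climb(input: &str, pos: usize, min_level: u8) -> Option<(i64, usize)> {
    let (mut lhs, mut pos) = atom(input, pos)?;
    while let Some(op) = input[pos..].chars().next() {
        let (level, right_assoc) = match op_info(op) {
            Some(info) if info.0 >= min_level => info,
            _ => break, // not an operator, or it binds too loosely
        };
        // Left-associative: the right operand may only contain tighter
        // operators. Right-associative: the same level may recur on the right.
        let next_min = if right_assoc { level } else { level + 1 };
        let (rhs, next) = climb(input, pos + 1, next_min)?;
        lhs = match op {
            '+' => lhs + rhs,
            '-' => lhs - rhs,
            '*' => lhs * rhs,
            '/' => lhs.checked_div(rhs)?, // division by zero fails the parse
            _ => lhs.pow(rhs as u32),     // '^'
        };
        pos = next;
    }
    Some((lhs, pos))
}

/// Entry point: the whole input must be one expression.
fn arithmetic(input: &str) -> Option<i64> {
    let (value, end) = climb(input, 0, 1)?;
    if end == input.len() { Some(value) } else { None }
}
```

The `next_min` choice plays the role of the parentheses in the grammar: it decides on which side an operator of the same level is allowed to nest.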

Custom input types

rust-peg handles input types through a series of traits, and comes with implementations for str, [u8], and [T].

  • Parse is the base trait for all inputs. The others are only required to use the corresponding expressions.
  • ParseElem implements the [_] pattern operator, with a method returning the next item of the input to match.
  • ParseLiteral implements matching against a "string" literal.
  • ParseSlice implements the $() operator, returning a slice from a span of indexes.

Error reporting

When a match fails, position information is automatically recorded to report a set of "expected" tokens that would have allowed the parser to advance further.

Some rules should never appear in error messages, and can be suppressed with quiet!{e}:

rule whitespace() = quiet!{[' ' | '\n' | '\t']+}

If you want the "expected" set to contain a more helpful string instead of character sets, you can use quiet!{} and expected!() together:

rule identifier()
  = quiet!{['a'..='z' | 'A'..='Z']['a'..='z' | 'A'..='Z' | '0'..='9']*}
  / expected!("identifier")
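
The idea behind the "expected" set, and the effect of quiet!{} and expected!() on it, can be sketched as follows. This is a simplified standard-library model under assumptions, not rust-peg's internals: the `ErrorState` type, its fields, and the helper functions are hypothetical, and the identifier rule here accepts letters only.

```rust
use std::collections::BTreeSet;

/// Hypothetical bookkeeping: the farthest position at which a match failed,
/// and what would have been accepted there.
#[derive(Default)]
struct ErrorState {
    max_pos: usize,
    expected: BTreeSet<&'static str>,
    quiet_depth: usize, // > 0 while inside a quiet block
}

impl ErrorState {
    fn fail(&mut self, pos: usize, what: &'static str) {
        if self.quiet_depth > 0 {
            return; // quiet: do not report this failure
        }
        if pos > self.max_pos {
            // A failure farther into the input replaces earlier ones.
            self.max_pos = pos;
            self.expected.clear();
        }
        if pos == self.max_pos {
            self.expected.insert(what);
        }
    }
}

/// Literal match that records itself as "expected" on failure.
fn literal(input: &str, pos: usize, lit: &'static str, err: &mut ErrorState) -> Option<usize> {
    if input[pos..].starts_with(lit) {
        Some(pos + lit.len())
    } else {
        err.fail(pos, lit);
        None
    }
}

/// Single ASCII letter; on failure it would report a character set.
fn letter(input: &str, pos: usize, err: &mut ErrorState) -> Option<usize> {
    match input[pos..].chars().next() {
        Some('a'..='z' | 'A'..='Z') => Some(pos + 1),
        _ => {
            err.fail(pos, "['a'..='z' | 'A'..='Z']");
            None
        }
    }
}

/// identifier = quiet!{ letter+ } / expected!("identifier")
fn identifier(input: &str, pos: usize, err: &mut ErrorState) -> Option<usize> {
    err.quiet_depth += 1; // enter the quiet block
    let mut end = pos;
    while let Some(next) = letter(input, end, err) {
        end = next;
    }
    err.quiet_depth -= 1; // leave the quiet block
    if end > pos {
        Some(end)
    } else {
        err.fail(pos, "identifier"); // the expected!() alternative
        None
    }
}
```

With this model, a failed identifier contributes the single string "identifier" to the error report rather than a character set.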

Imports

use super::name;

The grammar may begin with a series of use declarations, just like in Rust, which are included in the generated module. Unlike normal mod {} blocks, use super::* is inserted by default, so you don't have to deal with this most of the time.

Rustdoc comments

Rustdoc comments with /// before a grammar or pub rule are propagated to the resulting module or function:

/// Parse an array expression.
pub rule array() -> Vec<Expr> = ...

As with all procedural macros, non-doc comments are ignored by the lexer and can be used like in any other Rust code.

Tracing

If you pass the peg/trace feature to Cargo when building your project, a trace of the parse is written to stdout when the binary runs. For example:

$ cargo run --features peg/trace
...
[PEG_TRACE] Matched rule type at 8:5
[PEG_TRACE] Attempting to match rule ident at 8:12
[PEG_TRACE] Attempting to match rule letter at 8:12
[PEG_TRACE] Failed to match rule letter at 8:12
...