funcparserlib.parser

Functional parsing combinators.

Parsing combinators define an internal domain-specific language (DSL) for describing the parsing rules of a grammar. The DSL allows you to start with a few primitive parsers, then combine your parsers to get more complex ones, and finally cover the whole grammar you want to parse.

The structure of the language:

  • Class Parser
    • All the primitives and combinators of the language return Parser objects
    • It defines the main Parser.parse(tokens) method
  • Primitive parsers
    • tok(type, value), a(value), some(pred), forward_decl(), finished
  • Parser combinators
    • p1 + p2, p1 | p2, p >> f, -p, maybe(p), many(p), oneplus(p), skip(p)
  • Abstraction
    • Use regular Python variables to define new rules (non-terminals) of your grammar: p = ...  # Expression of type Parser

Every time you apply one of the combinators, you get a new Parser object. In other words, the set of Parser objects is closed under the means of combination.
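For example, since combined parsers are themselves Parser objects, you can pass them to further combinators (a minimal sketch):

>>> from funcparserlib.parser import a, many
>>> xy = a("x") | a("y")
>>> many(xy).parse("xyx")
['x', 'y', 'x']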

Note

We took the parsing combinators language from the book Introduction to Functional Programming and translated it from ML into Python.

funcparserlib.parser.Parser

Bases: Generic[_A, _B]

A parser object that can parse a sequence of tokens or can be combined with other parsers using +, |, >>, many(), and other parsing combinators.

Type: Parser[A, B]

The generic variables in the type are:

  • A — the type of the tokens in the sequence to parse
  • B — the type of the parsed value

In order to define a parser for your grammar:

  1. You start with primitive parsers such as a(value), some(pred), forward_decl(), and finished
  2. You use parsing combinators p1 + p2, p1 | p2, p >> f, many(p), and others to combine parsers into a more complex parser
  3. You can assign complex parsers to variables to define names that correspond to the rules of your grammar, as shown in the sketch below
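For instance, here is a minimal sketch that follows these three steps (it parses a plain string for brevity, although sequences of Token objects are the recommended input):

>>> from funcparserlib.parser import some, oneplus
>>> digit = some(lambda c: c.isdigit()).named("digit")  # 1. a primitive parser
>>> number = oneplus(digit) >> "".join                  # 2. combined and transformed
>>> number.parse("42")                                  # 3. used as a named grammar rule
'42'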

Note

The constructor Parser.__init__() is considered internal and may be changed in future versions. Use primitive parsers and parsing combinators to construct new parsers.

funcparserlib.parser.Parser.parse(tokens)

Parse the sequence of tokens and return the parsed value.

Type: (Sequence[A]) -> B

It takes a sequence of tokens of arbitrary type A and returns the parsed value of arbitrary type B.

If the parser fails to parse the tokens, it raises NoParseError.
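For example (a minimal sketch; the error message shown is the str() of the raised NoParseError):

>>> from funcparserlib.parser import a, NoParseError
>>> a("x").parse("x")
'x'
>>> try:
...     a("x").parse("y")
... except NoParseError as e:
...     print(e)
got unexpected token: 'y', expected: 'x'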

Note

Although Parser.parse() can parse sequences of any objects (including str which is a sequence of str chars), the recommended way is parsing sequences of Token objects.

You should use a regexp-based tokenizer make_tokenizer() defined in funcparserlib.lexer to convert your text into a sequence of Token objects before parsing it. You will get more readable parsing error messages (as Token objects contain their position in the source file) and good separation of the lexical and syntactic levels of the grammar.
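For illustration, here is a minimal sketch of this workflow, assuming funcparserlib 1.0 where the TokenSpec helper is available in funcparserlib.lexer (the token type names "number" and "space" are arbitrary choices for this example):

>>> from funcparserlib.lexer import make_tokenizer, TokenSpec
>>> from funcparserlib.parser import tok, many
>>> tokenize = make_tokenizer([
...     TokenSpec("number", r"\d+"),
...     TokenSpec("space", r"\s+"),
... ])
>>> tokens = [t for t in tokenize("1 2 3") if t.type != "space"]
>>> many(tok("number") >> int).parse(tokens)
[1, 2, 3]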

funcparserlib.parser.Parser.define(p)

Define the parser created earlier as a forward declaration.

Type: (Parser[A, B]) -> None

Use p = forward_decl() in combination with p.define(...) to define recursive parsers.

See the examples in the docs for forward_decl().

funcparserlib.parser.Parser.named(name)

Specify the name of the parser for easier debugging.

Type: (str) -> Parser[A, B]

This name is used in the debug-level parsing log. You can also get it via the Parser.name attribute.

Examples:

>>> expr = (a("x") + a("y")).named("expr")
>>> expr.name
'expr'
>>> expr = a("x") + a("y")
>>> expr.name
"('x', 'y')"

Note

You can enable the parsing log this way:

import logging
logging.basicConfig(level=logging.DEBUG)
import funcparserlib.parser
funcparserlib.parser.debug = True

The way to enable the parsing log may be changed in future versions.

Primitive Parsers

funcparserlib.parser.tok(type, value=None)

Return a parser that parses a Token and returns the string value of the token.

Type: (str, Optional[str]) -> Parser[Token, str]

You can match any token of the specified type or you can match a specific token by its type and value.

Examples:

>>> expr = tok("expr")
>>> expr.parse([Token("expr", "foo")])
'foo'
>>> expr.parse([Token("expr", "bar")])
'bar'
>>> expr.parse([Token("op", "=")])
Traceback (most recent call last):
    ...
parser.NoParseError: got unexpected token: '=', expected: expr
>>> expr = tok("op", "=")
>>> expr.parse([Token("op", "=")])
'='
>>> expr.parse([Token("op", "+")])
Traceback (most recent call last):
    ...
parser.NoParseError: got unexpected token: '+', expected: '='

Note

In order to convert your text to parse into a sequence of Token objects, use a regexp-based tokenizer make_tokenizer() defined in funcparserlib.lexer. You will get more readable parsing error messages (as Token objects contain their position in the source file) and good separation of the lexical and syntactic levels of the grammar.

funcparserlib.parser.a(value)

Return a parser that parses a token if it's equal to value.

Type: (A) -> Parser[A, A]

Examples:

>>> expr = a("x")
>>> expr.parse("x")
'x'
>>> expr.parse("y")
Traceback (most recent call last):
    ...
parser.NoParseError: got unexpected token: 'y', expected: 'x'

Note

Although Parser.parse() can parse sequences of any objects (including str which is a sequence of str chars), the recommended way is parsing sequences of Token objects.

You should use a regexp-based tokenizer make_tokenizer() defined in funcparserlib.lexer to convert your text into a sequence of Token objects before parsing it. You will get more readable parsing error messages (as Token objects contain their position in the source file) and good separation of the lexical and syntactic levels of the grammar.

funcparserlib.parser.some(pred)

Return a parser that parses a token if it satisfies the predicate pred.

Type: (Callable[[A], bool]) -> Parser[A, A]

Examples:

>>> expr = some(lambda s: s.isalpha()).named('alpha')
>>> expr.parse("x")
'x'
>>> expr.parse("y")
'y'
>>> expr.parse("1")
Traceback (most recent call last):
    ...
parser.NoParseError: got unexpected token: '1', expected: alpha

Warning

The some() combinator is quite slow and may be changed or removed in future versions. If you need a parser for a token by its type (e.g. any identifier) and maybe its value, use tok(type[, value]) instead. You should use make_tokenizer() from funcparserlib.lexer to tokenize your text first.

funcparserlib.parser.forward_decl()

Return an undefined parser that can be used as a forward declaration.

Type: Parser[Any, Any]

Use p = forward_decl() in combination with p.define(...) to define recursive parsers.

Examples:

>>> expr = forward_decl()
>>> expr.define(a("x") + maybe(expr) + a("y"))
>>> expr.parse("xxyy")  # noqa
('x', ('x', None, 'y'), 'y')
>>> expr.parse("xxy")
Traceback (most recent call last):
    ...
parser.NoParseError: got unexpected end of input, expected: 'y'

Note

If you care about static types, you should add a type hint for your forward declaration, so that your type checker can check types in p.define(...) later:

p: Parser[str, int] = forward_decl()
p.define(a("x"))  # Type checker error
p.define(a("1") >> int)  # OK

funcparserlib.parser.finished

A parser that raises NoParseError if there are any unparsed tokens left in the sequence.

Type: Parser[Any, None]

Examples:

>>> from funcparserlib.parser import a, finished
>>> expr = a("x") + finished
>>> expr.parse("x")
('x', None)
>>> expr = a("x") + finished
>>> expr.parse("xy")
Traceback (most recent call last):
    ...
funcparserlib.parser.NoParseError: got unexpected token: 'y', expected: end of input

Parser Combinators

funcparserlib.parser.Parser.__add__(other)

Sequential combination of parsers. It runs this parser, then the other parser.

The return value of the resulting parser is a tuple of each parsed value in the sum of parsers. We merge all parsing results of p1 + p2 + ... + pN into a single tuple. It means that the parsing result may be a 2-tuple, a 3-tuple, a 4-tuple, etc. of parsed values. You can avoid this by transforming the parsed tuple into a new value using the >> combinator (see the sketch after the examples below).

You can also skip some parsing results in your sum of parsers by using -p or skip(p) for the parsers whose results you want to drop. In that case the parsing result might be a single value, not a tuple of parsed values. See the docs for Parser.__neg__() for more examples.

Overloaded types (lots of them to provide stricter checking for the quite dynamic return type of this method):

  • (self: Parser[A, B], _IgnoredParser[A]) -> Parser[A, B]
  • (self: Parser[A, B], Parser[A, C]) -> _TupleParser[A, Tuple[B, C]]
  • (self: _TupleParser[A, B], _IgnoredParser[A]) -> _TupleParser[A, B]
  • (self: _TupleParser[A, B], Parser[A, Any]) -> Parser[A, Any]
  • (self: _IgnoredParser[A], _IgnoredParser[A]) -> _IgnoredParser[A]
  • (self: _IgnoredParser[A], Parser[A, C]) -> Parser[A, C]

Examples:

>>> expr = a("x") + a("y")
>>> expr.parse("xy")
('x', 'y')
>>> expr = a("x") + a("y") + a("z")
>>> expr.parse("xyz")
('x', 'y', 'z')
>>> expr = a("x") + a("y")
>>> expr.parse("xz")
Traceback (most recent call last):
    ...
parser.NoParseError: got unexpected token: 'z', expected: 'y'
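As a sketch of flattening the tuple via >>, you can transform the parsed pair into any value right after the sum of parsers:

>>> expr = (a("x") + a("y")) >> (lambda pair: pair[0] + pair[1])
>>> expr.parse("xy")
'xy'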

funcparserlib.parser.Parser.__neg__()

Return a parser that parses the same tokens, but its parsing result is ignored by the sequential + combinator.

Type: (Parser[A, B]) -> _IgnoredParser[A]

You can use it for throwing away elements of concrete syntax (e.g. ",", ";").

Examples:

>>> expr = -a("x") + a("y")
>>> expr.parse("xy")
'y'
>>> expr = a("x") + -a("y")
>>> expr.parse("xy")
'x'
>>> expr = a("x") + -a("y") + a("z")
>>> expr.parse("xyz")
('x', 'z')
>>> expr = -a("x") + a("y") + -a("z")
>>> expr.parse("xyz")
'y'
>>> expr = -a("x") + a("y")
>>> expr.parse("yz")
Traceback (most recent call last):
    ...
parser.NoParseError: got unexpected token: 'y', expected: 'x'
>>> expr = a("x") + -a("y")
>>> expr.parse("xz")
Traceback (most recent call last):
    ...
parser.NoParseError: got unexpected token: 'z', expected: 'y'

Note

You should not pass the resulting parser to any combinators other than +. You should have at least one non-skipped value in your p1 + p2 + ... + pN. The parsed value of -p is an internal _Ignored object, not intended for actual use.

funcparserlib.parser.Parser.__or__(other)

Choice combination of parsers.

Type: (Parser[A, C]) -> Parser[A, Union[B, C]]

It runs this parser and returns its result. If the parser fails, it runs the other parser instead.

Examples:

>>> expr = a("x") | a("y")
>>> expr.parse("x")
'x'
>>> expr.parse("y")
'y'
>>> expr.parse("z")
Traceback (most recent call last):
    ...
parser.NoParseError: got unexpected token: 'z', expected: 'x' or 'y'

funcparserlib.parser.Parser.__rshift__(f)

Transform the parsing result by applying the specified function.

Type: (Callable[[B], C]) -> Parser[A, C]

You can use it for transforming the parsed value into another value before including it into the parse tree (the AST).

Examples:

>>> def make_canonical_name(s):
...     return s.lower()
>>> expr = (a("D") | a("d")) >> make_canonical_name
>>> expr.parse("D")
'd'
>>> expr.parse("d")
'd'

funcparserlib.parser.maybe(p)

Return a parser that returns None if the parser p fails.

Type: (Parser[A, B]) -> Parser[A, Optional[B]]

Examples:

>>> expr = maybe(a("x"))
>>> expr.parse("x")
'x'
>>> expr.parse("y") is None
True

funcparserlib.parser.many(p)

Return a parser that applies the parser p as many times as it succeeds at parsing the tokens.

Type: (Parser[A, B]) -> Parser[A, List[B]]

The resulting parser repeatedly applies p to the input sequence of tokens for as long as p succeeds. The parsed value is a list of the sequentially parsed values.

Examples:

>>> expr = many(a("x"))
>>> expr.parse("x")
['x']
>>> expr.parse("xx")
['x', 'x']
>>> expr.parse("xxxy")  # noqa
['x', 'x', 'x']
>>> expr.parse("y")
[]

funcparserlib.parser.oneplus(p)

Return a parser that applies the parser p one or more times.

Type: (Parser[A, B]) -> Parser[A, List[B]]

The similar combinator many(p) applies p zero or more times, whereas oneplus(p) applies it one or more times.

Examples:

>>> expr = oneplus(a("x"))
>>> expr.parse("x")
['x']
>>> expr.parse("xx")
['x', 'x']
>>> expr.parse("y")
Traceback (most recent call last):
    ...
parser.NoParseError: got unexpected token: 'y', expected: 'x'

funcparserlib.parser.skip(p)

An alias for -p.

Type: (Parser[A, B]) -> _IgnoredParser[A]

See also the docs for Parser.__neg__().
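For example (mirroring the -p examples above):

>>> from funcparserlib.parser import a, skip
>>> expr = skip(a("x")) + a("y")
>>> expr.parse("xy")
'y'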

Extra: Parser Monad

As a functional programmer, you might be pleased to know that parsers in funcparserlib form a monad, with Parser.bind() as >>= and pure() as return.

We could have expressed the other parsing combinators in terms of bind(), but it would be inefficient in Python:

# noinspection PyUnresolvedReferences
class Parser:
    def __add__(self, other):
        return self.bind(lambda x: other.bind(lambda y: pure((x, y))))

    def __rshift__(self, f):
        return self.bind(lambda x: pure(f(x)))

funcparserlib.parser.Parser.bind(f)

Bind the parser to a monadic function that returns a new parser.

Type: (Callable[[B], Parser[A, C]]) -> Parser[A, C]

Also known as >>= in Haskell.
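For example, here is a sketch of a context-sensitive rule that the regular combinators cannot express: parse any token, then require the next token to be equal to it (the name twice is just an example):

>>> from funcparserlib.parser import some, a
>>> twice = some(lambda t: True).bind(lambda x: a(x))
>>> twice.parse("xx")
'x'
>>> twice.parse("xy")
Traceback (most recent call last):
    ...
parser.NoParseError: got unexpected token: 'y', expected: 'x'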

Note

You can parse any context-free grammar without resorting to bind(). Due to its poor performance, please use it only when you really need it.

funcparserlib.parser.pure(x)

Wrap any object into a parser.

Type: (A) -> Parser[A, A]

A pure parser doesn't touch the token sequence; it just returns the value x.

Also known as return in Haskell.
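For example (a minimal sketch), a pure parser succeeds on any input without consuming any tokens:

>>> from funcparserlib.parser import pure
>>> pure(42).parse("xyz")
42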