funcparserlib.parser — Functional parsing combinators
Parsing combinators define an internal domain-specific language (DSL) for describing the parsing rules of a grammar. The DSL allows you to start with a few primitive parsers, then combine your parsers to get more complex ones, and finally cover the whole grammar you want to parse.
The structure of the language:
- Class Parser
  - All the primitives and combinators of the language return Parser objects
  - It defines the main Parser.parse(tokens) method
- Primitive parsers: tok(type, value), a(value), some(pred), forward_decl(), finished
- Parser combinators: p1 + p2, p1 | p2, p >> f, -p, maybe(p), many(p), oneplus(p), skip(p)
- Abstraction: use regular Python variables (p = ...  # Expression of type Parser) to define new rules (non-terminals) of your grammar
Every time you apply one of the combinators, you get a new Parser object. In
other words, the set of Parser objects is closed under the means of
combination.
Note
We took the parsing combinators language from the book Introduction to Functional Programming and translated it from ML into Python.
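For example, here is a minimal sketch of this workflow (the names a_s and rule are made up for illustration): start with the primitive a(), combine with many() and +, and bind the results to ordinary variables:
from funcparserlib.parser import a, many, finished

a_s = many(a("a"))               # zero or more "a" tokens
rule = a_s + a("b") + -finished  # the whole input: "a"* followed by one "b"

print(rule.parse("aab"))  # (['a', 'a'], 'b')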
funcparserlib.parser.Parser
Bases: Generic[_A, _B]
A parser object that can parse a sequence of tokens or can be combined with
other parsers using +, |, >>, many(), and other parsing combinators.
Type: Parser[A, B]
The generic variables in the type are: A — the type of the tokens in the
sequence to parse, B — the type of the parsed value.
In order to define a parser for your grammar:
- You start with primitive parsers by calling a(value), some(pred), forward_decl(), finished
- You use parsing combinators p1 + p2, p1 | p2, p >> f, many(p), and others to combine parsers into a more complex parser
- You can assign complex parsers to variables to define names that correspond to the rules of your grammar
Note
The constructor Parser.__init__() is considered internal and may be changed in
future versions. Use primitive parsers and parsing combinators to construct
new parsers.
funcparserlib.parser.Parser.parse(tokens)
Parse the sequence of tokens and return the parsed value.
Type: (Sequence[A]) -> B
It takes a sequence of tokens of arbitrary type A and returns the parsed value
of arbitrary type B.
If the parser fails to parse the tokens, it raises NoParseError.
Note
Although Parser.parse() can parse sequences of any objects (including str,
which is a sequence of str chars), the recommended way is parsing sequences of
Token objects.
You should use a regexp-based tokenizer make_tokenizer() defined in
funcparserlib.lexer to convert your text into a sequence of Token objects
before parsing it. You will get more readable parsing error messages (as Token
objects contain their position in the source file) and good separation of the
lexical and syntactic levels of the grammar.
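Here is a sketch of this tokenize-then-parse workflow. The token types space, number, and op are hypothetical, and it assumes the TokenSpec class from funcparserlib 1.0 (older versions use (name, (regexp,)) tuples instead):
from funcparserlib.lexer import TokenSpec, make_tokenizer
from funcparserlib.parser import tok, many, finished

def tokenize(s):
    tokenizer = make_tokenizer([
        TokenSpec("space", r"\s+"),
        TokenSpec("number", r"\d+"),
        TokenSpec("op", r"[+\-]"),
    ])
    # The tokenizer yields whitespace tokens too, so filter them out here
    return [t for t in tokenizer(s) if t.type != "space"]

number = tok("number") >> int
expr = number + many(tok("op") + number) + -finished
print(expr.parse(tokenize("1 + 2 - 3")))  # (1, [('+', 2), ('-', 3)])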
funcparserlib.parser.Parser.define(p)
Define the parser created earlier as a forward declaration.
Type: (Parser[A, B]) -> None
Use p = forward_decl() in combination with p.define(...) to define recursive
parsers.
See the examples in the docs for forward_decl().
funcparserlib.parser.Parser.named(name)
Specify the name of the parser for easier debugging.
Type: (str) -> Parser[A, B]
This name is used in the debug-level parsing log. You can also get it via the
Parser.name attribute.
Examples:
>>> expr = (a("x") + a("y")).named("expr")
>>> expr.name
'expr'
>>> expr = a("x") + a("y")
>>> expr.name
"('x', 'y')"
Note
You can enable the parsing log this way:
import logging
logging.basicConfig(level=logging.DEBUG)
import funcparserlib.parser
funcparserlib.parser.debug = True
The way to enable the parsing log may be changed in future versions.
Primitive Parsers
funcparserlib.parser.tok(type, value=None)
Return a parser that parses a Token and returns the string value of the token.
Type: (str, Optional[str]) -> Parser[Token, str]
You can match any token of the specified type, or you can match a specific
token by its type and value.
Examples:
>>> expr = tok("expr")
>>> expr.parse([Token("expr", "foo")])
'foo'
>>> expr.parse([Token("expr", "bar")])
'bar'
>>> expr.parse([Token("op", "=")])
Traceback (most recent call last):
...
parser.NoParseError: got unexpected token: '=', expected: expr
>>> expr = tok("op", "=")
>>> expr.parse([Token("op", "=")])
'='
>>> expr.parse([Token("op", "+")])
Traceback (most recent call last):
...
parser.NoParseError: got unexpected token: '+', expected: '='
Note
In order to convert your text to parse into a sequence of Token objects, use a
regexp-based tokenizer make_tokenizer() defined in funcparserlib.lexer. You
will get more readable parsing error messages (as Token objects contain their
position in the source file) and good separation of the lexical and syntactic
levels of the grammar.
funcparserlib.parser.a(value)
Return a parser that parses a token if it's equal to value.
Type: (A) -> Parser[A, A]
Examples:
>>> expr = a("x")
>>> expr.parse("x")
'x'
>>> expr.parse("y")
Traceback (most recent call last):
...
parser.NoParseError: got unexpected token: 'y', expected: 'x'
Note
Although Parser.parse() can parse sequences of any objects (including str,
which is a sequence of str chars), the recommended way is parsing sequences of
Token objects.
You should use a regexp-based tokenizer make_tokenizer() defined in
funcparserlib.lexer to convert your text into a sequence of Token objects
before parsing it. You will get more readable parsing error messages (as Token
objects contain their position in the source file) and good separation of the
lexical and syntactic levels of the grammar.
funcparserlib.parser.some(pred)
Return a parser that parses a token if it satisfies the predicate pred.
Type: (Callable[[A], bool]) -> Parser[A, A]
Examples:
>>> expr = some(lambda s: s.isalpha()).named('alpha')
>>> expr.parse("x")
'x'
>>> expr.parse("y")
'y'
>>> expr.parse("1")
Traceback (most recent call last):
...
parser.NoParseError: got unexpected token: '1', expected: alpha
Warning
The some() combinator is quite slow and may be changed or removed in future
versions. If you need a parser for a token by its type (e.g. any identifier)
and maybe its value, use tok(type[, value]) instead. You should use
make_tokenizer() from funcparserlib.lexer to tokenize your text first.
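For example (a sketch; the name token type is hypothetical), some() parses to the whole Token object, while tok() is faster and parses to just its string value:
from funcparserlib.lexer import Token
from funcparserlib.parser import some, tok

name_slow = some(lambda t: t.type == "name").named("name")
name_fast = tok("name")

tokens = [Token("name", "foo")]
print(name_slow.parse(tokens).value)  # 'foo' (parses to the Token itself)
print(name_fast.parse(tokens))        # 'foo' (parses to the token value)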
funcparserlib.parser.forward_decl()
Return an undefined parser that can be used as a forward declaration.
Type: Parser[Any, Any]
Use p = forward_decl() in combination with p.define(...) to define recursive
parsers.
Examples:
>>> expr = forward_decl()
>>> expr.define(a("x") + maybe(expr) + a("y"))
>>> expr.parse("xxyy") # noqa
('x', ('x', None, 'y'), 'y')
>>> expr.parse("xxy")
Traceback (most recent call last):
...
parser.NoParseError: got unexpected end of input, expected: 'y'
Note
If you care about static types, you should add a type hint for your forward
declaration, so that your type checker can check types in p.define(...) later:
p: Parser[str, int] = forward_decl()
p.define(a("x")) # Type checker error
p.define(a("1") >> int) # OK
funcparserlib.parser.finished
A parser that throws an exception if there are any unparsed tokens left in the sequence.
Type: Parser[Any, None]
Examples:
>>> from funcparserlib.parser import a, finished
>>> expr = a("x") + finished
>>> expr.parse("x")
('x', None)
>>> expr = a("x") + finished
>>> expr.parse("xy")
Traceback (most recent call last):
...
funcparserlib.parser.NoParseError: got unexpected token: 'y', expected: end of input
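Since the parsed value of finished is None, a common idiom is to skip it with -finished (see Parser.__neg__() below), so it only checks for the end of input without adding None to the result:
>>> expr = a("x") + -finished
>>> expr.parse("x")
'x'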
Parser Combinators
funcparserlib.parser.Parser.__add__(other)
Sequential combination of parsers. It runs this parser, then the other parser.
The return value of the resulting parser is a tuple of each parsed value in
the sum of parsers. We merge all the parsing results of p1 + p2 + ... + pN into
a single tuple. It means that the parsing result may be a 2-tuple, a 3-tuple,
a 4-tuple, etc. of parsed values. You can avoid this by transforming the parsed
tuple into a new value using the >> combinator.
You can also skip some parsing results in the resulting parser by using -p or
skip(p) for some parsers in your sum of parsers. It means that the parsing
result might be a single value, not a tuple of parsed values. See the docs for
Parser.__neg__() for more examples.
Overloaded types (lots of them to provide stricter checking for the quite dynamic return type of this method):
(self: Parser[A, B], _IgnoredParser[A]) -> Parser[A, B]
(self: Parser[A, B], Parser[A, C]) -> _TupleParser[A, Tuple[B, C]]
(self: _TupleParser[A, B], _IgnoredParser[A]) -> _TupleParser[A, B]
(self: _TupleParser[A, B], Parser[A, Any]) -> Parser[A, Any]
(self: _IgnoredParser[A], _IgnoredParser[A]) -> _IgnoredParser[A]
(self: _IgnoredParser[A], Parser[A, C]) -> Parser[A, C]
Examples:
>>> expr = a("x") + a("y")
>>> expr.parse("xy")
('x', 'y')
>>> expr = a("x") + a("y") + a("z")
>>> expr.parse("xyz")
('x', 'y', 'z')
>>> expr = a("x") + a("y")
>>> expr.parse("xz")
Traceback (most recent call last):
...
parser.NoParseError: got unexpected token: 'z', expected: 'y'
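For instance, here is a sketch of combining +, -p, and >> to build an AST node instead of a bare tuple (the Assign class and the name, op, and number token types are made up for illustration):
from collections import namedtuple
from funcparserlib.lexer import Token
from funcparserlib.parser import tok

Assign = namedtuple("Assign", ["name", "value"])

# name "=" number, with "=" skipped, so the parsed value is a 2-tuple
assign = tok("name") + -tok("op", "=") + tok("number") >> (
    lambda pair: Assign(pair[0], int(pair[1]))
)

tokens = [Token("name", "x"), Token("op", "="), Token("number", "42")]
print(assign.parse(tokens))  # Assign(name='x', value=42)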
funcparserlib.parser.Parser.__neg__()
Return a parser that parses the same tokens, but its parsing result is ignored
by the sequential + combinator.
Type: (Parser[A, B]) -> _IgnoredParser[A]
You can use it for throwing away elements of concrete syntax (e.g. ",", ";").
Examples:
>>> expr = -a("x") + a("y")
>>> expr.parse("xy")
'y'
>>> expr = a("x") + -a("y")
>>> expr.parse("xy")
'x'
>>> expr = a("x") + -a("y") + a("z")
>>> expr.parse("xyz")
('x', 'z')
>>> expr = -a("x") + a("y") + -a("z")
>>> expr.parse("xyz")
'y'
>>> expr = -a("x") + a("y")
>>> expr.parse("yz")
Traceback (most recent call last):
...
parser.NoParseError: got unexpected token: 'y', expected: 'x'
>>> expr = a("x") + -a("y")
>>> expr.parse("xz")
Traceback (most recent call last):
...
parser.NoParseError: got unexpected token: 'z', expected: 'y'
Note
You should not pass the resulting parser to any combinators other than +. You
should have at least one non-skipped value in your p1 + p2 + ... + pN. The
parsed value of -p is an internal _Ignored object, not intended for actual use.
funcparserlib.parser.Parser.__or__(other)
Choice combination of parsers.
It runs this parser and returns its result. If the parser fails, it runs the other parser.
Examples:
>>> expr = a("x") | a("y")
>>> expr.parse("x")
'x'
>>> expr.parse("y")
'y'
>>> expr.parse("z")
Traceback (most recent call last):
...
parser.NoParseError: got unexpected token: 'z', expected: 'x' or 'y'
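Alternatives are tried in order, and a failed alternative does not consume any input, so alternatives may share a common prefix. A minimal sketch:
>>> expr = (a("x") + a("y")) | a("x")
>>> expr.parse("xy")
('x', 'y')
>>> expr.parse("x")
'x'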
funcparserlib.parser.Parser.__rshift__(f)
Transform the parsing result by applying the specified function.
Type: (Callable[[B], C]) -> Parser[A, C]
You can use it for transforming the parsed value into another value before including it into the parse tree (the AST).
Examples:
>>> def make_canonical_name(s):
... return s.lower()
>>> expr = (a("D") | a("d")) >> make_canonical_name
>>> expr.parse("D")
'd'
>>> expr.parse("d")
'd'
funcparserlib.parser.maybe(p)
Return a parser that returns None if the parser p fails.
Examples:
>>> expr = maybe(a("x"))
>>> expr.parse("x")
'x'
>>> expr.parse("y") is None
True
funcparserlib.parser.many(p)
Return a parser that applies the parser p to the input sequence of tokens as
many times as it succeeds at parsing them. The parsed value is a list of the
sequentially parsed values.
Examples:
>>> expr = many(a("x"))
>>> expr.parse("x")
['x']
>>> expr.parse("xx")
['x', 'x']
>>> expr.parse("xxxy") # noqa
['x', 'x', 'x']
>>> expr.parse("y")
[]
funcparserlib.parser.oneplus(p)
Return a parser that applies the parser p one or more times.
A similar parser combinator many(p) means apply p zero or more times, whereas
oneplus(p) means apply p one or more times.
Examples:
>>> expr = oneplus(a("x"))
>>> expr.parse("x")
['x']
>>> expr.parse("xx")
['x', 'x']
>>> expr.parse("y")
Traceback (most recent call last):
...
parser.NoParseError: got unexpected token: 'y', expected: 'x'
funcparserlib.parser.skip(p)
An alias for -p.
See also the docs for Parser.__neg__().
Extra: Parser Monad
As a functional programmer, you might be pleased to know that parsers in
funcparserlib form a monad with Parser.bind() as >>= and pure() as return.
We could have expressed other parsing combinators in terms of bind(), but it
would be inefficient in Python:
# noinspection PyUnresolvedReferences
class Parser:
    def __add__(self, other):
        return self.bind(lambda x: other.bind(lambda y: pure((x, y))))

    def __rshift__(self, f):
        return self.bind(lambda x: pure(f(x)))
funcparserlib.parser.Parser.bind(f)
Bind the parser to a monadic function that returns a new parser.
Type: (Callable[[B], Parser[A, C]]) -> Parser[A, C]
Also known as >>= in Haskell.
Note
You can parse any context-free grammar without resorting to bind(). Due to its
poor performance, please use it only when you really need it.
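Still, here is a sketch of a context-sensitive rule where bind() genuinely helps: the closing quote must match the opening one, so the second parser is chosen based on the first parsed value (the quoted "x" content is just for illustration):
from funcparserlib.parser import a, some

quote = some(lambda c: c in "'\"")
quoted_x = quote.bind(lambda q: a("x") + a(q) >> (lambda pair: pair[0]))

print(quoted_x.parse("'x'"))  # 'x'
print(quoted_x.parse('"x"'))  # 'x'
quoted_x.parse("'x\"")        # raises NoParseError: mismatched quotes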
funcparserlib.parser.pure(x)
Wrap any object into a parser.
Type: (A) -> Parser[A, A]
A pure parser doesn't touch the token sequence; it just returns its pure x
value.
Also known as return in Haskell.
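A minimal illustration: since a pure parser consumes nothing, it succeeds even on an empty sequence:
>>> pure(42).parse([])
42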