Preparing the Parse Tree
So far we have defined the parser for our calculator expressions language:
>>> from typing import List
>>> from funcparserlib.lexer import make_tokenizer, TokenSpec, Token
>>> from funcparserlib.parser import tok, Parser, many, forward_decl, finished
>>> def tokenize(s: str) -> List[Token]:
... specs = [
... TokenSpec("whitespace", r"\s+"),
... TokenSpec("float", r"[+\-]?\d+\.\d*([Ee][+\-]?\d+)*"),
... TokenSpec("int", r"[+\-]?\d+"),
... TokenSpec("op", r"(\*\*)|[+\-*/()]"),
... ]
... tokenizer = make_tokenizer(specs)
... return [t for t in tokenizer(s) if t.type != "whitespace"]
>>> def op(name: str) -> Parser[Token, str]:
... return tok("op", name)
>>> int_str = tok("int")
>>> float_str = tok("float")
>>> number = int_str | float_str
>>> expr = forward_decl()
>>> parenthesized = op("(") + expr + op(")")
>>> primary = number | parenthesized
>>> power = primary + many(op("**") + primary)
>>> expr.define(power)
>>> document = expr + finished
Here is how its parse results look so far:
>>> document.parse(tokenize("2 ** (3 ** 4)"))
('2', [('**', ('(', ('3', [('**', '4')]), ')'))], None)
p >> f
: Transforming Parse Results
Let's start improving our parse results by converting numbers from str
to int
or float
. We will use the Parser.__rshift__
combinator for that. p >> f
takes a parser p
and a function f
of a single argument and returns a new parser, that applies f
to the parse result of p
.
An integer parser that returns int
values:
>>> int_num: Parser[Token, int] = tok("int") >> int
Note
We specify the type hint for the parser only for clarity here. We wanted to highlight that >>
here changes the output type of the parser from str
to int
. You may omit type hints for parsers and rely on type inference features of your text editor and type checker to get code completion and linting warnings:
>>> int_num = tok("int") >> int
The only combinator which type is not inferrable is forward_decl()
. You should specify its type explicitly to get your parser fully type checked.
Try it:
>>> int_num.parse(tokenize("42"))
42
Let's redefine our number
parser so that it returns either int
or float
:
>>> from typing import Union
>>> float_num: Parser[Token, float] = tok("float") >> float
>>> number: Parser[Token, Union[int, float]] = int_num | float_num
Test it:
>>> number.parse(tokenize("42"))
42
>>> number.parse(tokenize("3.14"))
3.14
-p
: Skipping Parse Results
Let's recall our nested parenthesized numbers example:
>>> p = forward_decl()
>>> p.define(number | (op("(") + p + op(")")))
Test it:
>>> p.parse(tokenize("((1))"))
('(', ('(', 1, ')'), ')')
We have successfully parsed numbers in nested parentheses, but we don't want to see parentheses in the parsing results. Let's skip them using the Parser.__neg__
combinator. It allows you to skip any parts of a sequence of parsers concatenated via p1 + p2 + ... + pN
by using a unary -p
operator on the ones you want to skip:
>>> p = forward_decl()
>>> p.define(number | (-op("(") + p + -op(")")))
The result is cleaner now:
>>> p.parse(tokenize("1"))
1
>>> p.parse(tokenize("(1)"))
1
>>> p.parse(tokenize("((1))"))
1
Let's re-define our grammar using the Parser.__neg__
combinator to get rid of extra parentheses in the parse results, as well as of extra None
returned by finished
:
>>> expr = forward_decl()
>>> parenthesized = -op("(") + expr + -op(")")
>>> primary = number | parenthesized
>>> power = primary + many(op("**") + primary)
>>> expr.define(power)
>>> document = expr + -finished
Test it:
>>> document.parse(tokenize("2 ** (3 ** 4)"))
(2, [('**', (3, [('**', 4)]))])
User-Defined Classes for the Parse Tree
We have many types of binary operators in our grammar, but we've defined only the **
power operator so far. Let's define them for *
, /
, +
, -
as well:
>>> expr = forward_decl()
>>> parenthesized = -op("(") + expr + -op(")")
>>> primary = number | parenthesized
>>> power = primary + many(op("**") + primary)
>>> term = power + many((op("*") | op("/")) + power)
>>> sum = term + many((op("+") | op("-")) + term)
>>> expr.define(sum)
>>> document = expr + -finished
Here we've introduced a hierarchy of nested parsers: expr -> sum -> term -> power -> primary -> parenthesized -> expr -> ...
to reflect the order of calculations set by our operator priorities: +
< *
< **
< ()
.
Test it:
>>> document.parse(tokenize("1 * (2 + 0) ** 3"))
(1, [], [('*', (2, [], [], [('+', (0, [], []))], [('**', 3)]))], [])
It's hard to understand the results without proper user-defined classes for our expression types. We actually have 3 expression types:
- Integer numbers
- Floating point numbers
- Binary expressions
For integers and floats we will use Python int
and float
classes. For binary expressions we'll introduce the BinaryExpr
class:
>>> from dataclasses import dataclass
>>> @dataclass
... class BinaryExpr:
... op: str
... left: "Expr"
... right: "Expr"
Since we don't use a common base class for our expressions, we have to define Expr
as a union of possible expression types:
>>> Expr = Union[BinaryExpr, int, float]
Now let's define a function to transform the parse results of our binary operators into BinaryExpr
objects. Take a look at our parsers of various binary expressions. You can infer that each of them returns (expression, list of (operator, expression)). We will transform these nested tuples and lists into a tree of nested expressions by defining a function to_expr(args)
and applying >> to_expr
to our expression parsers:
>>> from typing import Tuple
>>> def to_expr(args: Tuple[Expr, List[Tuple[str, Expr]]]) -> Expr:
... first, rest = args
... result = first
... for op, expr in rest:
... result = BinaryExpr(op, result, expr)
... return result
Let's re-define our grammar using this transformation:
>>> expr: Parser[Token, Expr] = forward_decl()
>>> parenthesized = -op("(") + expr + -op(")")
>>> primary = number | parenthesized
>>> power = primary + many(op("**") + primary) >> to_expr
>>> term = power + many((op("*") | op("/")) + power) >> to_expr
>>> sum = term + many((op("+") | op("-")) + term) >> to_expr
>>> expr.define(sum)
>>> document = expr + -finished
Test it:
>>> document.parse(tokenize("3.1415926 * (2 + 7.18281828e-1) * 42"))
BinaryExpr(op='*', left=BinaryExpr(op='*', left=3.1415926, right=BinaryExpr(op='+', left=2, right=0.718281828)), right=42)
Let's pretty-print it using the pretty_tree(x, kids, show)
function:
>>> from funcparserlib.util import pretty_tree
>>> def pretty_expr(expr: Expr) -> str:
...
... def kids(expr: Expr) -> List[Expr]:
... if isinstance(expr, BinaryExpr):
... return [expr.left, expr.right]
... else:
... return []
...
... def show(expr: Expr) -> str:
... if isinstance(expr, BinaryExpr):
... return f"BinaryExpr({expr.op!r})"
... else:
... return repr(expr)
...
... return pretty_tree(expr, kids, show)
Test it:
>>> print(pretty_expr(document.parse(tokenize("3.1415926 * (2 + 7.18281828e-1) * 42"))))
BinaryExpr('*')
|-- BinaryExpr('*')
| |-- 3.1415926
| `-- BinaryExpr('+')
| |-- 2
| `-- 0.718281828
`-- 42
Finally, we have a proper parse tree that is easy to understand and work with!
Next
We've finished writing our numeric expressions parser.
If you want to learn more, let's discuss a few tips and tricks about parsing in the next chapter.