funcparserlib.lexer — Regexp-based tokenizer
funcparserlib.lexer.make_tokenizer(specs)
Make a function that tokenizes text based on the regexp specs.
Type: (Sequence[TokenSpec | Tuple]) -> Callable[[str], Iterable[Token]]
A token spec is a TokenSpec instance.
Note
For legacy reasons, a token spec may also be a tuple of (type, args), where type sets the value of Token.type for the token, and args are the positional arguments for re.compile(): either just (pattern,) or (pattern, flags). See the sketch after the examples below.
It returns a tokenizer function that takes a string and returns an iterable of Token objects, or raises LexerError if it cannot tokenize the string according to its token specs.
Examples:
>>> tokenize = make_tokenizer([
... TokenSpec("space", r"\s+"),
... TokenSpec("id", r"\w+"),
... TokenSpec("op", r"[,!]"),
... ])
>>> text = "Hello, World!"
>>> [t for t in tokenize(text) if t.type != "space"] # noqa
[Token('id', 'Hello'), Token('op', ','), Token('id', 'World'), Token('op', '!')]
>>> text = "Bye?"
>>> list(tokenize(text))
Traceback (most recent call last):
...
lexer.LexerError: cannot tokenize data: 1,4: "Bye?"
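The legacy tuple-based specs described in the note above can be sketched like this, reusing the token types and regexps from the example. This is a sketch assuming the tuple form compiles to the same patterns as the TokenSpec version:
>>> tokenize = make_tokenizer([
... ("space", (r"\s+",)),
... ("id", (r"\w+",)),
... ("op", (r"[,!]",)),
... ])
>>> [t for t in tokenize("Hello, World!") if t.type != "space"] # noqa
[Token('id', 'Hello'), Token('op', ','), Token('id', 'World'), Token('op', '!')]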
funcparserlib.lexer.TokenSpec
A token specification for generating a lexer via make_tokenizer().
funcparserlib.lexer.TokenSpec.__init__(type, pattern, flags=0)
Initialize a TokenSpec object.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
type | str | User-defined type of the token (e.g. "id", "op") | required |
pattern | str | Regexp for matching this token type | required |
flags | int | Regexp flags, the second argument of re.compile() | 0 |
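For illustration, a minimal sketch of a spec that uses the flags argument; the identifier pattern and flag choice here are assumptions for the example, not part of the library:
>>> import re
>>> ident = TokenSpec("id", r"[a-z_][a-z0-9_]*", re.IGNORECASE)
>>> tokenize = make_tokenizer([TokenSpec("space", r"\s+"), ident])
>>> [t for t in tokenize("Hello world") if t.type != "space"] # noqa
[Token('id', 'Hello'), Token('id', 'world')]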
funcparserlib.lexer.Token
A token object that represents a substring of a certain type in your text.
You can compare tokens for equality using the == operator. Tokens also define custom repr() and str().
Attributes:
Name | Type | Description |
---|---|---|
type | str | User-defined type of the token (e.g. "id", "op") |
value | str | Text value of the token |
start | Optional[Tuple[int, int]] | Start position (line, column) |
end | Optional[Tuple[int, int]] | End position (line, column) |