baron
Move backend to lib2to3
Thanks to Guido's extensive advertisement (meh.), I've realised that a lossless AST (parse tree) is built into the Python standard library under the name "lib2to3", for the 2to3 tool.
Since this parser is way more efficient than the current one, we should move to it without breaking the API (not much of a problem, we have gazillions of tests).
Since lib2to3 is not really well documented (see bottom of the page), here are 2 snippets showing the high-level API (taken from pythoscope):
from lib2to3 import pygram, pytree
from lib2to3.pgen2 import driver
from lib2to3.pgen2.parse import ParseError
from lib2to3.pygram import python_symbols as syms
from lib2to3.pytree import Leaf, Node

# `log` and `quoted_block` are pythoscope helpers, not shown here.

def parse(code):
    """String -> AST

    Parse the string and return its AST representation. May raise
    a ParseError exception.
    """
    added_newline = False
    if not code.endswith("\n"):
        code += "\n"
        added_newline = True

    try:
        drv = driver.Driver(pygram.python_grammar, pytree.convert)
        result = drv.parse_string(code, True)
    except ParseError:
        log.debug("Had problems parsing:\n%s\n" % quoted_block(code))
        raise

    # Always return a Node, not a Leaf.
    if isinstance(result, Leaf):
        result = Node(syms.file_input, [result])

    result.added_newline = added_newline

    return result

def regenerate(tree):
    """AST -> String

    Regenerate the source code from the AST tree.
    """
    if hasattr(tree, 'added_newline') and tree.added_newline:
        return str(tree)[:-1]
    else:
        return str(tree)
And the str method of Node and Leaf:
class Node(Base):
    # ...
    def __unicode__(self):
        """
        Return a pretty string representation.

        This reproduces the input source exactly.
        """
        return "".join(map(str, self.children))

    if sys.version_info > (3, 0):
        __str__ = __unicode__

class Leaf(Base):
    # ...
    def __unicode__(self):
        """
        Return a pretty string representation.

        This reproduces the input source exactly.
        """
        return self.prefix + str(self.value)

    if sys.version_info > (3, 0):
        __str__ = __unicode__
As you can see, every leaf is responsible for holding the formatting that precedes it, in its prefix (one approach I've considered, I don't remember why I didn't follow it).
Because it's directly generated from the grammar file, the resulting AST is quite rough and doesn't seem much fun to use (I haven't found any pretty printer for it and pprint doesn't work on it):
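To make the prefix idea concrete, here is a minimal sketch. The `Leaf` and `Node` classes below are simplified stand-ins written for this example, not lib2to3's real classes, and the tree is built by hand rather than produced by a parser. Each leaf stores the whitespace and comments that come before it in `prefix`, so joining the leaves gives back the source text.

```python
class Leaf:
    """Stand-in token: `prefix` is the formatting that precedes `value`."""

    def __init__(self, value, prefix=""):
        self.value = value
        self.prefix = prefix

    def __str__(self):
        return self.prefix + self.value


class Node:
    """Stand-in inner node: it holds children only, no formatting of its own."""

    def __init__(self, children):
        self.children = children

    def __str__(self):
        return "".join(map(str, self.children))


# Hand-built tree for the line: x  = 1  # answer
tree = Node([
    Node([
        Leaf("x"),
        Leaf("=", prefix="  "),
        Leaf("1", prefix=" "),
    ]),
    # In this sketch the trailing comment lives in the newline's prefix.
    Leaf("\n", prefix="  # answer"),
])

source = str(tree)
print(repr(source))
```

Changing a `value` while leaving every `prefix` untouched keeps the original spacing and comments, which is the property a refactoring tool needs.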
In [3]: parse(open("/home/psycojoker/code/python/redbaron/pu.py", "r").read())
Out[3]: Node(file_input, [Node(simple_stmt, [Node(import_name, [Leaf(1, 'import'), Leaf(1, 'json')]),
Leaf(4, '\n')]), Node(simple_stmt, [Node(import_from, [Leaf(1, 'from'), Leaf(1, 'redbaron'),
Leaf(1, 'import'), Leaf(1, 'RedBaron')]), Leaf(4, '\n')]), Node(simple_stmt, [Node(expr_stmt,
[Leaf(1, 'red'), Leaf(22, '='), Node(power, [Leaf(1, 'RedBaron'), Node(trailer, [Leaf(7, '('),
Leaf(3, '"try:\\n pass\\nfinally:\\n pass"'), Leaf(8, ')')])])]), Leaf(4, '\n')]), Node(simple_stmt,
[Node(print_stmt, [Leaf(1, 'print'), Node(power, [Leaf(1, 'json'), Node(trailer, [Leaf(23, '.'),
Leaf(1, 'dumps')]), Node(trailer, [Leaf(7, '('), Node(arglist, [Node(power, [Leaf(1, 'red'),
Node(trailer, [Leaf(9, '['), Leaf(2, '0'), Leaf(10, ']')]), Node(trailer, [Leaf(23, '.'), Leaf(1,
'_render')]), Node(trailer, [Leaf(7, '('), Leaf(8, ')')])]), Leaf(12, ','), Node(argument, [Leaf(1,
'indent'), Leaf(22, '='), Leaf(2, '4')])]), Leaf(8, ')')])])]), Leaf(4, '\n')]), Node(simple_stmt,
[Node(import_from, [Leaf(1, 'from'), Leaf(1, 'ipdb'), Leaf(1, 'import'), Leaf(1, 'set_trace')]),
Leaf(13, ';'), Node(power, [Leaf(1, 'set_trace'), Node(trailer, [Leaf(7, '('), Leaf(8, ')')])]),
Leaf(4, '\n')]), Leaf(0, '')])
For reference, the parsed file content:
import json
from redbaron import RedBaron
red = RedBaron("try:\n pass\nfinally:\n pass")
print json.dumps(red[0]._render(), indent=4)
from ipdb import set_trace; set_trace()
Simpler parsing:
In [7]: pprint(parse("1 + 1"))
Node(file_input, [Node(simple_stmt, [Node(arith_expr, [Leaf(2, '1'), Leaf(14, '+'), Leaf(2, '1')]), Leaf(4, '\n')]), Leaf(0, '')])
In [8]: pprint(parse("import a"))
Node(file_input, [Node(simple_stmt, [Node(import_name, [Leaf(1, 'import'), Leaf(1, 'a')]), Leaf(4, '\n')]), Leaf(0, '')])
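Since pprint prints these trees on a single line, a small recursive helper can make them easier to read. The sketch below is only an illustration: `dump` is a hypothetical helper, and the `Leaf` and `Node` classes are simplified stand-ins that mimic the `type`, `value`, `prefix` and `children` attributes seen in the snippets above. It has not been tested against real lib2to3 trees.

```python
class Leaf:
    """Stand-in for a lib2to3-style leaf (assumption, not the real class)."""

    def __init__(self, type, value, prefix=""):
        self.type, self.value, self.prefix = type, value, prefix


class Node:
    """Stand-in for a lib2to3-style node (assumption, not the real class)."""

    def __init__(self, type, children):
        self.type, self.children = type, children


def dump(node, depth=0):
    """Return one line per node, indented by depth in the tree."""
    pad = "  " * depth
    if hasattr(node, "value"):
        # Leaves carry a value; inner nodes in this sketch do not.
        return "%sLeaf(%r, %r)" % (pad, node.type, node.value)
    lines = ["%sNode(%s)" % (pad, node.type)]
    for child in node.children:
        lines.append(dump(child, depth + 1))
    return "\n".join(lines)


# Hand-built equivalent of the "1 + 1" tree shown above.
tree = Node("file_input", [
    Node("simple_stmt", [
        Node("arith_expr", [
            Leaf(2, "1"),
            Leaf(14, "+", prefix=" "),
            Leaf(2, "1", prefix=" "),
        ]),
        Leaf(4, "\n"),
    ]),
    Leaf(0, ""),
])

rendered = dump(tree)
print(rendered)
```

With real lib2to3 nodes, `type` is an integer rather than a name, so the labels would need to be mapped back to symbol names before this is useful.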
For reference, lib2to3 can be used to write custom refactorings: http://python3porting.com/fixers.html
The result is ... well ... I'll let you make up your own mind.
So the plan is to investigate how we can use this. Current state of my thinking:
- let's drop the lexer and the parser generator
- I'm pretty sure that the fact that the formatting is handled differently will cause trouble
- one fast approach would be to convert the resulting AST into Baron's AST; that would be simple but probably not that efficient
- another approach would be to modify lib2to3's AST generator to directly generate Baron's AST; that would probably be more efficient but probably way harder to develop.
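The first conversion approach could look roughly like the sketch below. It walks a lib2to3-style tree and moves each leaf's prefix into its own formatting entry, which is where the "formatting is handled differently" problem shows up in practice. Everything here is an assumption made for illustration: the `Leaf` and `Node` stand-ins, the `to_flat_tokens` name, and the output dict shape, which is a simplification and not Baron's actual schema.

```python
class Leaf:
    """Stand-in leaf: formatting before the token is kept in `prefix`."""

    def __init__(self, value, prefix=""):
        self.value, self.prefix = value, prefix


class Node:
    """Stand-in inner node holding only children."""

    def __init__(self, children):
        self.children = children


def to_flat_tokens(node):
    """Flatten the tree into dicts, splitting each prefix out of its leaf."""
    tokens = []
    if hasattr(node, "value"):
        if node.prefix:
            # Formatting becomes an explicit entry instead of a prefix.
            tokens.append({"type": "formatting", "value": node.prefix})
        if node.value:
            tokens.append({"type": "token", "value": node.value})
        return tokens
    for child in node.children:
        tokens.extend(to_flat_tokens(child))
    return tokens


# Hand-built tree for "import a\n".
tree = Node([
    Node([Leaf("import"), Leaf("a", prefix=" ")]),
    Leaf("\n"),
    Leaf(""),  # empty end marker, like Leaf(0, '') in the output above
])

tokens = to_flat_tokens(tree)
regenerated = "".join(t["value"] for t in tokens)
print(tokens)
```

A real converter would also have to rebuild nesting and decide which Baron node each piece of formatting belongs to; this flat version only shows that no text is lost when the prefix is split out.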
Today was not a fun day at all.
You might like to look at yapf which uses lib2to3 to make a similar F-ishST (ast + spliced in comments). The objective there is not to keep the full syntax tree (e.g. no blank lines) but to infer the "best" whitespace/formatting.
Here's a test file which 2to3 does not do well with (something to think about when porting):
from __future__ import print_function
# Assigning print
x = print
# Generator expression with trailing comma in function call
print(
    y for y in (1, 2, 3),
)
# class with a name of "nonlocal" (py3 keyword)
class nonlocal: pass
# File does not end with a newline
Thank you both for providing this information :)
Has any decision been made here on whether to use lib2to3? I would love to hear whether you learned anything if you went deeper into lib2to3, and whether you found it useful or not.
Hello @alexbw,
Nope, I don't have the time anymore to do this kind of heavy development; it would require me to commit full-time to (red)baron again, which I'm not in a position to do right now :/