
Move backend to lib2to3

Open Psycojoker opened this issue 11 years ago • 5 comments

Thanks to Guido's extensive advertisement (meh.), I've realised that a lossless AST (parse tree) is built into the Python standard library under the name "lib2to3", for the 2to3 tool.

Since this parser is way more efficient than the current one, we should move to it without breaking the API (not much of a problem, we have gazillions of tests).

Since lib2to3 is not really well documented (see bottom of the page), here are 2 snippets showing the high-level API (taken from pythoscope):

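from lib2to3 import pygram, pytree
from lib2to3.pgen2 import driver
from lib2to3.pgen2.parse import ParseError
from lib2to3.pygram import python_symbols as syms
from lib2to3.pytree import Leaf, Node

# `log` and `quoted_block` are pythoscope helpers, not part of lib2to3.

def parse(code):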
    """String -> AST

    Parse the string and return its AST representation. May raise
    a ParseError exception.
    """
    added_newline = False
    if not code.endswith("\n"):
        code += "\n"
        added_newline = True

    try:
        drv = driver.Driver(pygram.python_grammar, pytree.convert)
        result = drv.parse_string(code, True)
    except ParseError:
        log.debug("Had problems parsing:\n%s\n" % quoted_block(code))
        raise

    # Always return a Node, not a Leaf.
    if isinstance(result, Leaf):
        result = Node(syms.file_input, [result])

    result.added_newline = added_newline

    return result


def regenerate(tree):
    """AST -> String

    Regenerate the source code from the AST tree.
    """
    if hasattr(tree, 'added_newline') and tree.added_newline:
        return str(tree)[:-1]
    else:
        return str(tree)

And the __str__ method of Node and Leaf:

class Node(Base):
    # ...
    def __unicode__(self):
        """
        Return a pretty string representation.

        This reproduces the input source exactly.
        """
        return "".join(map(str, self.children))

    if sys.version_info > (3, 0):
        __str__ = __unicode__

class Leaf(Base):
    # ...
    def __unicode__(self):
        """ 
        Return a pretty string representation.

        This reproduces the input source exactly.
        """
        return self.prefix + str(self.value)

    if sys.version_info > (3, 0): 
        __str__ = __unicode__

As you can see, every node is responsible for holding the formatting that comes before it, in its prefix (an approach I had considered; I don't remember why I didn't follow it).
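
To make the prefix idea concrete, here is a minimal sketch. The `Leaf` and `Node` classes below are stand-ins written for this example, not lib2to3's real classes; they only mimic the two methods quoted above. Each leaf stores the whitespace and comments that precede its token, so concatenating the leaves gives back the source unchanged.

```python
class Leaf:
    """Stand-in for a token: `prefix` holds the whitespace/comments before it."""

    def __init__(self, value, prefix=""):
        self.value = value
        self.prefix = prefix

    def __str__(self):
        return self.prefix + self.value


class Node:
    """Stand-in for an inner node: it only concatenates its children."""

    def __init__(self, children):
        self.children = children

    def __str__(self):
        return "".join(map(str, self.children))


# An import statement with unusual spacing, kept in the prefix of `a`.
tree = Node([
    Node([Leaf("import"), Leaf("a", prefix=" " * 12)]),
    Leaf("\n"),
])

source = str(tree)
```

Inner nodes hold no formatting of their own here; all of it lives on the leaves.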

Because it's generated directly from the grammar file, the resulting AST is quite rough and doesn't seem very fun to use (I haven't found any pretty printer for it and pprint doesn't work on it):

In [3]: parse(open("/home/psycojoker/code/python/redbaron/pu.py", "r").read())
Out[3]: Node(file_input, [Node(simple_stmt, [Node(import_name, [Leaf(1, 'import'), Leaf(1, 'json')]),
           Leaf(4, '\n')]), Node(simple_stmt, [Node(import_from, [Leaf(1, 'from'), Leaf(1, 'redbaron'),
           Leaf(1, 'import'), Leaf(1, 'RedBaron')]), Leaf(4, '\n')]), Node(simple_stmt, [Node(expr_stmt, 
           [Leaf(1, 'red'), Leaf(22, '='), Node(power, [Leaf(1, 'RedBaron'), Node(trailer, [Leaf(7, '('), 
           Leaf(3, '"try:\\n pass\\nfinally:\\n pass"'), Leaf(8, ')')])])]), Leaf(4, '\n')]), Node(simple_stmt, 
           [Node(print_stmt, [Leaf(1, 'print'), Node(power, [Leaf(1, 'json'), Node(trailer, [Leaf(23, '.'), 
           Leaf(1, 'dumps')]), Node(trailer, [Leaf(7, '('), Node(arglist, [Node(power, [Leaf(1, 'red'), 
           Node(trailer, [Leaf(9, '['), Leaf(2, '0'), Leaf(10, ']')]), Node(trailer, [Leaf(23, '.'), Leaf(1, 
           '_render')]), Node(trailer, [Leaf(7, '('), Leaf(8, ')')])]), Leaf(12, ','), Node(argument, [Leaf(1, 
           'indent'), Leaf(22, '='), Leaf(2, '4')])]), Leaf(8, ')')])])]), Leaf(4, '\n')]), Node(simple_stmt, 
           [Node(import_from, [Leaf(1, 'from'), Leaf(1, 'ipdb'), Leaf(1, 'import'), Leaf(1, 'set_trace')]), 
           Leaf(13, ';'), Node(power, [Leaf(1, 'set_trace'), Node(trailer, [Leaf(7, '('), Leaf(8, ')')])]), 
           Leaf(4, '\n')]), Leaf(0, '')])

For reference, the parsed file content:

import json
from redbaron import RedBaron

red = RedBaron("try:\n pass\nfinally:\n pass")

print json.dumps(red[0]._render(), indent=4)
from ipdb import set_trace; set_trace()

Simpler parsing:

In [7]: pprint(parse("1 + 1"))
Node(file_input, [Node(simple_stmt, [Node(arith_expr, [Leaf(2, '1'), Leaf(14, '+'), Leaf(2, '1')]), Leaf(4, '\n')]), Leaf(0, '')])

In [8]: pprint(parse("import            a"))
Node(file_input, [Node(simple_stmt, [Node(import_name, [Leaf(1, 'import'), Leaf(1, 'a')]), Leaf(4, '\n')]), Leaf(0, '')])
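
Since pprint does not help here, an indented dump is easy to write by hand. The sketch below is duck-typed: it assumes inner nodes expose `type` and `children` and leaves expose `type` and `value`, which matches the repr output above. The `Leaf` and `Node` classes are stand-ins so the example runs without lib2to3, and `dump` is a hypothetical helper name.

```python
class Leaf:
    def __init__(self, type, value):
        self.type = type
        self.value = value


class Node:
    def __init__(self, type, children):
        self.type = type
        self.children = children


def dump(node, depth=0):
    """Return one line per node, indented by two spaces per level."""
    pad = "  " * depth
    if hasattr(node, "value"):  # leaf: show token type and text
        return ["%sLeaf(%r, %r)" % (pad, node.type, node.value)]
    lines = ["%sNode(%s)" % (pad, node.type)]
    for child in node.children:
        lines.extend(dump(child, depth + 1))
    return lines


# Same shape as the tree shown above for "1 + 1".
tree = Node("file_input", [
    Node("simple_stmt", [
        Node("arith_expr", [Leaf(2, "1"), Leaf(14, "+"), Leaf(2, "1")]),
        Leaf(4, "\n"),
    ]),
    Leaf(0, ""),
])

print("\n".join(dump(tree)))
```

The stand-ins use symbol names as node types for readability; on a real lib2to3 tree the type is a number, so the node line would need the same name lookup that repr performs.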

For reference, lib2to3 can be used to write custom refactorings: http://python3porting.com/fixers.html

The result is ... well ... I'll let you make up your own mind.

So the plan is to investigate how we can use this. Current state of my thinking:

  • let's drop the lexer and the parser generator
  • I'm pretty sure that the fact that the formatting is handled differently will cause trouble
  • one fast approach would be to convert the resulting AST into Baron AST; that would be simple but probably not that efficient
  • another approach would be to modify the lib2to3 AST generator to directly generate Baron AST; that would probably be more efficient but probably way harder to develop.
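
The "convert after parsing" approach could look roughly like the sketch below. Everything in it is an assumption for illustration: the `Leaf` and `Node` stand-ins mimic the shape of the lib2to3 tree, and the output dictionaries (`type`, `value`, `formatting`) are a hypothetical simplification, not Baron's actual node format. The point is only that formatting moves from each leaf's prefix into an explicit field.

```python
class Leaf:
    def __init__(self, value, prefix=""):
        self.value = value
        self.prefix = prefix


class Node:
    def __init__(self, type, children):
        self.type = type
        self.children = children


def convert(node):
    """Recursively turn a prefix-carrying tree into plain dictionaries."""
    if isinstance(node, Leaf):
        return {"type": "token", "value": node.value, "formatting": node.prefix}
    return {"type": node.type, "value": [convert(c) for c in node.children]}


def render(converted):
    """Rebuild the source from the converted tree, to check nothing was lost."""
    if converted["type"] == "token":
        return converted["formatting"] + converted["value"]
    return "".join(render(child) for child in converted["value"])


tree = Node("import_name", [Leaf("import"), Leaf("a", prefix="   ")])
converted = convert(tree)
```

A conversion like this walks every node a second time after parsing, which is where the efficiency concern comes from.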

Today was not a fun day at all.

Sep 14 '14 15:09 Psycojoker

You might like to look at yapf which uses lib2to3 to make a similar F-ishST (ast + spliced in comments). The objective there is not to keep the full syntax tree (e.g. no blank lines) but to infer the "best" whitespace/formatting.

Apr 23 '15 05:04 hayd

Here's a test file that 2to3 does not handle well (something to think about when porting):

from __future__ import print_function

# Assigning print
x = print

# Generator expression with trailing comma in function call
print(
    y for y in (1, 2, 3),
)

# class with a name of "nonlocal" (py3 keyword)
class nonlocal: pass

# File does not end with a newline

May 26 '16 19:05 asottile

Thanks to you both for providing this information :)

May 27 '16 00:05 Psycojoker

Has any decision been made here on whether to use lib2to3? I would love to hear what you learned if you went deeper into lib2to3, whether you found it useful or not.

Mar 12 '17 13:03 alexbw

Hello @alexbw,

Nope, I no longer have the time for this kind of heavy development. It would require me to commit full-time to (red)baron again, which I'm not in a position to do right now :/

Mar 28 '17 02:03 Psycojoker