LibCST icon indicating copy to clipboard operation
LibCST copied to clipboard

[BUG] Generated Code not equivalent to AST

Open sk- opened this issue 4 years ago • 11 comments

When doing replacements in a node, it's easy to end up with a tree, which won't get correctly converted to source code. The issue is that libcst is not adding the necessary parentheses.

For example, the code below

import libcst.matchers as m
import libcst

module = libcst.parse_module('x * y')

matcher = m.Name(value='x')
new_module = m.replace(module, matcher, libcst.parse_expression("1 + 2"))
print(new_module.code)

will output

1 + 2 * y

whereas the expected code is:

(1 + 2) * y

I think it's reasonable to expect that a module is equivalent to parse_module(module.code).

sk- avatar Jul 12 '20 17:07 sk-

Good catch! The parentheses are added by attributes lpar/rpar and their values are a sequence. For parentheses, we didn't introduce something like MaybeSentinel used on other whilespace attributes. To fix this issue, we can automatically add the required parentheses in _visit_and_replace_children of BinaryOperation by checking the operator precedence. https://docs.python.org/3/reference/expressions.html#operator-precedence That way, when any of the children of BinaryOperation is updated, the missing parentheses will be automatically added if needed. Feel free to submit a PR to fix it and I'm more than happy to review it.

jimmylai avatar Jul 20 '20 02:07 jimmylai

@jimmylai I'm not sure _visit_and_replace_children is the right place to add this logic, as one could programmatically create a Module without using the visitors and then calling module.code, which won't trigger a visit, so the nodes won't be updated.

Maybe an option would be to change _codegen_impl and _BaseParenthesizedNode._parenthesize to receive the parent node.

Then _parerenthesize could be something like:

@contextmanager
def _parenthesize(self, parent: CSTNode, state: CodegenState) -> Generator[None, None, None]:
    need_pars = need_parentheses(self, parent)                 # new code
    if need_pars:                                              # new code
        LeftParen()._codegen(state)                            # new code
    for lpar in self.lpar:
        lpar._codegen(state)
    with state.record_syntactic_position(self):
        yield
    for rpar in self.rpar:
        rpar._codegen(state)
    if need_pars:                                              # new code
        RightParen()._codegen(state)                           # new code

Note that this does not only affect BinaryOperation, but also UnaryOperation, Comparison, etc. There are also two special cases:

  1. generator expressions, which need to be parenthesized when they are in a call with 2 or more arguments
  2. tuples when they become part of an expression (addition, call argument, comparison, etc).

I do have a custom implementation of replace which checks the precedence and the edge cases above when replacing the node, but it requires the ParentNodeProvider. This of course has its limitations as metadata is not recomputed. I'm missing though the same logic with a transformer. Would happily contribute it.

sk- avatar Jul 21 '20 23:07 sk-

@jimmylai I'm not sure _visit_and_replace_children is the right place to add this logic, as one could programmatically create a Module without using the visitors and then calling module.code, which won't trigger a visit, so the nodes won't be updated.

In our design, every LibCST node is an immutable dataclass instance and thus we only provide transformer patter for modifying the tree.

E.g. https://github.com/Instagram/LibCST/blob/master/libcst/_nodes/module.py#L33-L35

The matcher replace() API also uses a transformer to replace target nodes.

Do you still think adding the logic in _visit_and_replace_children won't work?

jimmylai avatar Jul 22 '20 23:07 jimmylai

Hi @jimmylai, I still think that _visit_and_replace_children won't work, as one can generate an AST without using the visitor pattern.

For example, the code below also triggers the bug:

import libcst

module = libcst.Module(body=[
    libcst.BinaryOperation(
        left = libcst.parse_expression("1 + 2"),
        operator = libcst.Multiply(), 
        right = libcst.Integer("3")
    )
])
print(module.code)

producing the output 1 + 2 * 3, but it should have been (1 + 2) * 3.

In this case _visit_and_replace_children won't be called. I checked it by adding print statements to the method definition in both Module and BinaryOperation

sk- avatar Jul 22 '20 23:07 sk-

@jimmylai WDYT?

sk- avatar Aug 05 '20 19:08 sk-

Yeah, the example tree you provided is created from bottom up and it's possible to create a tree like that. Use ParentProvider is not ideal. We don't use metadata in codegen and tree validation so far. Maybe we can run the transformer you mentioned when user calls code_for_node? That is the entry point of code generation. https://github.com/Instagram/LibCST/blob/c023fa7c4caff3fd2b3946080f9a58b539b10363/libcst/_nodes/module.py#L128

jimmylai avatar Aug 05 '20 20:08 jimmylai

I'm not suggesting to use the metadata in codegen, just to change _codegen_impl and _parenthesize to receive the parent, which is already available as the traversal is top to bottom.

With your proposal you will be traversing the tree twice, once to fix the parentheses and a second time to generate the code.

sk- avatar Aug 05 '20 20:08 sk-

@jimmylai Any chance to revisit this? It'd be great if we could discuss and agree on a proposal to fix this.

After looking deeper into how other syntactic elements are treated, I think we should also discuss whether parens should accept a MaybeSentinel as a default, or whether an empty sequence would always mean MaybeSentinel.

Also related to how some other elements are generated, another alternative is to check in the parent whether a child needs parentheses.

sk- avatar Sep 21 '20 15:09 sk-

I just ran into this bug as well.

It may not be pretty but a simple work around is:

class FixParens(cst.CSTTransformer):
    def _wrap_with_parens(self, node):
        return node.with_changes(
                lpar=[cst.LeftParen()],
                rpar=[cst.RightParen()],
            )

    def leave_BinaryOperation(self,
            original_node: cst.BinaryOperation,
            update_node: cst.BinaryOperation
            ) -> cst.BinaryOperation:
        if not update_node.lpar:
            assert not update_node.rpar
            return self._wrap_with_parens(update_node)
        return update_node
    
    # similar for leave_UnaryOperation

Now this has the downside that it inserts parens where they are not needed but does guarantee hand built expressions are properly grouped.

cdonovick avatar Oct 27 '20 00:10 cdonovick

That solution would make it impracticable to use in a code refactoring tool, as lots of unneeded parentheses would be added and unfortunately code formatters do not remove extra parens.

I do have a more comprehensive transformer that only adds the parenthesis when needed but the problem is that it requires the parent and ParentNodeProvider does not recompute the metadata, that's why I'm asking for a baked in solution.

sk- avatar Oct 27 '20 14:10 sk-

@sk- agreed. I actually use an additional guard (updated_node is not original_node) to minimize the "damage" it does. Certainly not ideal though. Definitely something that should be handled in the code generator.

cdonovick avatar Oct 27 '20 20:10 cdonovick