LibCST icon indicating copy to clipboard operation
LibCST copied to clipboard

Add InsertStatementsVisitor

Open willcrichton opened this issue 5 years ago • 19 comments
trafficstars

Summary

This PR adds a CSTTransformer mixin that allows sub-classes to use insert_statements_before_current and insert_statements_after_current, as discussed in #263. For example, this transformer:

class InsertAssignAroundAssignVisitor(InsertStatementsVisitor):
    def leave_Integer(
        self, original_node: cst.Integer, updated_node: cst.Integer
    ) -> cst.Integer:
        self.insert_statements_before_current([cst.parse_statement("y = 1")])
        self.insert_statements_after_current([cst.parse_statement("z = 1")])
        return updated_node

Turns this code:

x = 1
if True:
    x = 1  

Into this code:

y = 1
x = 1
z = 1
if True:
    y = 1
    x = 1
    z = 1

This PR is not finalized. Specifically, my primary concern: in the previous iteration of the idea, I used on_leave with isinstance(node, STATEMENT_TYPE) to check whether to map the statement accumulator to node. However, @jimmylai suggested only using the specialize leave_* methods due to efficiency concerns. There are 8 different statements that inherit from BaseCompoundStatement, meaning I would need to implement the leave method for each of them, generating a lot of redundant code. Is that still worth it?

Beyond that, any comments on the design are welcome. Once we resolve the question above, I will add more unit tests.

Test Plan

I have added unit tests to _test_insert_statements.py.

willcrichton avatar Mar 18 '20 23:03 willcrichton

Yeah, implement visit_ and leave_ function for every statement-like node types can add a lot of code. We can simplify it by adding a helper and just calls the helper in those visit_/leave_ functions. So far, we only implement on_visit and on_leave when we implement a framework (e.g. matcher) or the on_visit can be highly utilized. Our codebases are large, several million lines of code in 10k+ files. We try to avoid introduce more function calls to keep a codemod run faster. I'd also like to hear other people's thought on this.

jimmylai avatar Mar 19 '20 00:03 jimmylai

I also think this should be implemented in the pattern of AddImportsVisitor and RemoveImportsVisitor for consistency. That means insert_statements_before_current and insert_statements_after_current will be static methods.

The other type of visitor helpers (GatherImportsVisitor, GatherExportsVisitor ) don't have static methods since they intended to apply on entire module to collect data. They are a different type of visitors.

Yes, currently AddImportsVisitor and RemoveImportsVisitor are implicitly run in CodemodCommand. We can consider introduce a declaration API like the METADATA_DEPENDENCY to allow explicit declaration and make it more applicable to general type of transformer.

jimmylai avatar Mar 19 '20 14:03 jimmylai

I don't think I understand the codemod design fully yet. How would insert_statements_before_* work if they're static methods? They necessarily rely on knowing where a visitor currently is in the AST (to know which node current refers to).

willcrichton avatar Mar 19 '20 18:03 willcrichton

@willcrichton

I don't think I understand the codemod design fully yet. How would insert_statements_before_* work if they're static methods?

No, I think you understand just fine, and I didn't think through my suggestion carefully enough :) Unlike add-imports and remove-imports, the caller of insert_statements_*_current doesn't have all the context needed directly at hand when calling. The temporary accumulators and leave_SimpleStatementLine pretty much have to live in the same visitor that is calling insert_statements_*_current. Given that, I don't see any choice other than to introduce a third type of visitor that is intended for subclassing. Hopefully the documentation can make this usage difference clear. Unless @jimmylai you have some other suggestion that I'm missing?

@jimmylai

We can consider introduce a declaration API like the METADATA_DEPENDENCY to allow explicit declaration and make it more applicable to general type of transformer.

I think it would be even better to introduce a well-known context key that is a list of additional visitors to post-run, so a post-run visitor can simply register itself in the context when you call its staticmethod. That way, a user of e.g. AddImportsVisitor doesn't have to worry about explicitly declaring that anywhere (and having silent failure if they forget), they can just call AddImportsVisitor.add_import(...) and that will automatically ensure the visitor runs. This design would also allow a post-run visitor to automatically register another post-run visitor, which seems like it could be useful in future.

carljm avatar Mar 19 '20 19:03 carljm

In my idea, when you call the staticmethod, you need to provide a CSTNode included in the current statement as a reference point to insert new statements before of after it. The static method can look up the statement collection (either body of Module or a BaseSuite subclass) to be inserted and store the insertion request in the context. e.g. https://github.com/Instagram/LibCST/blob/master/libcst/codemod/visitors/_add_imports.py#L73-L101 To look up the statement collection, we can refactor the logic you previously build as another visitor to keep it simple and call the visitor in the staticmethod.

Then, in your implementation of transformer, it can read data from context and insert statements in the target statement collection nodes.

jimmylai avatar Mar 19 '20 19:03 jimmylai

@jimmylai I'd be quite concerned about the perf impact of running a separate visitor pass on every call to insert_statements_*. Seems with heavy use of those methods, this would be a lot worse for perf than e.g. the on_leave usage you were concerned about.

carljm avatar Mar 19 '20 20:03 carljm

I don't have a strong opinion on this, since I'm not using the codemod interface anyway. But let me know what y'all decide and I can finalize the PR.

Small changelog:

  • Changed the structure a bit to fix some issues w/ inserting into nested statements and to deal with deleted statements.
  • Added more unit tests to reflect these changes.
  • Added the remaining leave and visit methods for BaseStatement types.
  • Added documentation to the visitor and its methods.

willcrichton avatar Mar 19 '20 21:03 willcrichton

@jimmylai I'd be quite concerned about the perf impact of running a separate visitor pass on every call to insert_statements_*. Seems with heavy use of those methods, this would be a lot worse for perf than e.g. the on_leave usage you were concerned about.

That does has perf impact when we call the helper multiple times in the same Module. To make it more efficient and do it in one pass, we can probably do this:

  1. insert_statements_* only records the target CSTNode and the statements to be inserted.
  2. In InsertStatementsVisitor.visit_StatementCollectionALikeNode, it records the statement stack. In on_visit, we would need to check if the current node is one of the target nodes from step 1. If it is, we record the current statement in a map.
  3. In InsertStatementsVisitor.leave_StatementCollectionALikeNode, we insert statements.

However, this requires overwrite on_visit in order to implement it easier. Let me know whether you like this idea or not.

jimmylai avatar Mar 19 '20 22:03 jimmylai

Yeah, using ParentNodeProvider is a good idea. It will be good if we can avoid the need to remember to call super() and have more consistent usage pattern for all provided additional helper visitors.

carljm avatar Mar 25 '20 20:03 carljm

I did a small proof of concept for the alternative design, and I have not figured out a way to make it work.

Implementation: https://gist.github.com/willcrichton/8c69b29665f91a5fdc1b7cda953e80e3 Unit test / usage example: https://gist.github.com/willcrichton/b91bed25aec24085b65c5bffe1454721

@jimmylai the issue I'm having is that the ContextAwareTransformer doesn't guarantee that the Module seen by InsertStatementsVisitor is the identity-equivalent object seen by the user of the insert_statements_before_current static method. This is because MetadataWrapper copies the tree. (Even if that wasn't an issue, other transformers could run before the inserter.)

This means that if I map the CSTNode to a list of statements to insert in a static method, then down the line the InsertStatementsVisitor has no way to know which node in the current tree that original node corresponds to.

Does that make sense? Do y'all have any ideas on how to address this issue?

Perhaps the broader issue is that it is difficult to refer to places in an AST without a literal pointer to the object in memory. Despite the immutability of the AST, pointers are fragile as node identifiers since they can change for many reasons (e.g. the MetadataWrapper). Ideally you could have a handle on a node that would somehow be preserved across tree transformations.

willcrichton avatar Mar 25 '20 21:03 willcrichton

@willcrichton I tried your code. Yes MetadataWrapper copies the tree which makes it tricky. We can disable the copy by passing unsafe_skip_copy=True to: https://github.com/Instagram/LibCST/blob/379c9d06e766ec48d283bd622d82034c41c178f0/libcst/codemod/_codemod.py#L85 https://github.com/Instagram/LibCST/blob/379c9d06e766ec48d283bd622d82034c41c178f0/libcst/codemod/_codemod.py#L90

Found some issues in your code, The first parameter to get_metadata() should be ParentNodeProvider instead of a node. To insert the statements to the target location, we need to overwrite the parent of current statement. It can be Module, SimpleStatementSuite or IndentedBlock which has a body contains a collection of statements. So we should record the statement suite node in the insertions and implement leave_Module, leave_SimpleStatementSuite and leave_IndentedBlock to modify the body attribute when the current node is in insertions. We also need to store the current statement node, so when we iterate over the statements in body attribute, we can know where to insert new statements.

To test your change easier, you can do something like RemoveUnusedImportsCommandTest which leverage a RemoveUnusedImportsCommand (implements CodemodCommand) to call the static function RemoveImportsVisitor.remove_unused_import_by_node. https://github.com/Instagram/LibCST/blob/master/libcst/codemod/commands/tests/test_remove_unused_imports.py And the ContextAwareTransformer implementation (RemoveImportsVisitor) is added to CodemodCommand. https://github.com/Instagram/LibCST/blob/0ff9a6c5b7f2380dfeca4be01644c9d7e9146a30/libcst/codemod/_command.py#L82

When we designed LibCST, we intended to make it immutable to prevent mess up the tree structure and keep tree modification simple. I've try apply the above suggestions on top of your change locally and it works fine. Let me know if you have more questions.

jimmylai avatar Mar 26 '20 05:03 jimmylai

@jimmylai still having trouble. I added the two unsafe_skip_copy checks, but the module identity is still changing. At your recommendation, I changed the test to have this visitor:

class AddAssignTransformer(VisitorBasedCodemodCommand):
    def visit_Assign(self, node: cst.Assign) -> None:
        InsertStatementsVisitor.insert_statements_before_node(
            self.context, node, [cst.parse_statement("y = 2")])

It seems that the module output identity of this visitor is different than the module input identity. Any ideas why that might be?

willcrichton avatar Mar 26 '20 18:03 willcrichton

Oh, it's because CSTNode.visit always changes the identity of the module, even if no changes occur. Example:

import libcst as cst

class Dummy(cst.CSTTransformer):
    pass

m = cst.parse_module('pass')
assert id(m) != id(m.visit(Foo())))

Why is this the case?

willcrichton avatar Mar 26 '20 19:03 willcrichton

...I see, it's because CSTNode._visit_and_replace_children always constructs a new node even if there are no changes to the AST.

@jimmylai this seems like a more fundamental issue. If a VisitorBasedCodemodCommand stores a pointer to the AST in InsertStatementsVisitor, that pointer will be invalidated after the initial visitor has transformed the tree, even if there are no actual transformations.

One possibility is to introduce a VisitorBasedCodemodCommand that inherits from ContextAwareVisitor instead of ContextAwareTransformer, which will then not cause any tree identity changes. What do you think?

willcrichton avatar Mar 26 '20 19:03 willcrichton

I'll look into the issue in the next few days.

jimmylai avatar Mar 27 '20 18:03 jimmylai

@willcrichton sorry, I didn't get enough time to look into detail of this change. I've talked to @carljm and he may be able to help you later.

jimmylai avatar Apr 09 '20 20:04 jimmylai

Thanks for your patience with this one, @willcrichton. Took some time to look into this more carefully today, and here are my conclusions so far:

  1. Immutable trees and re-creation on transform is pretty fundamental to LibCST's design. To make _visit_and_replace_children preserve node identity in unchanged cases would require rewriting every such method so that it first calls that method on all its children, checks whether they changed, and then decides whether to return self or a new instance accordingly. This is doable in principle, but questionably useful here; immutability of nodes would still mean that any change deeper in the tree would cause changed-identity to bubble up and prevent InsertStatementsVisitor from working on any ancestor node of anything that changed. Basically, I don't think it will be productive to pursue any strategy that relies on preservation of CSTNode identity across sequential transforms.

  2. The chained-transform pattern of e.g. AddImportsVisitor and RemoveImportsVisitor works well for composing independent transforms, where the input context for chained transforms can be described independently of node identity. But InsertStatementsVisitor isn't an independent transform like adding or removing imports; it's really a utility abstraction for use within the current transform. I don't think it will work to try to force InsertStatementsVisitor into the chained-transform pattern.

Given those conclusions, here are a few ideas I have for how this could move forward:

  1. Just go ahead with the mixin approach. I don't like this much, and clearly the super() gotchas will be a problem. It is, however, the simplest and easiest option.

  2. Make insert-after-current-statement a built-in utility of CSTTransformer, so it is always available. This is not ideal, since it means every transform will pay some cost even if it doesn't use the feature.

  3. Explore adding "batchable transformers" to LibCST, where multiple independent transformers can run on the same tree in a single pass, with their visit/leave methods called in sequence when defined, similar to how BatchableCSTVisitor works, but with the ability to also return a new sub-tree from leave methods. Then a utility like InsertStatements could be a transformer which can be batched with your main transform to provide the insert-statements feature. Effectively this would give us two different models for composing transforms: sequential composition (the add/remove-imports pattern) and parallel composition, for in-transform utilities.

cc @jimmylai for thoughts.

carljm avatar May 12 '20 17:05 carljm

Would you guys consider adding a more general approach to this problem. instead of adding this mixin with insert_statements_before_current and insert_statements_after_current, I'm thinking of having a FlattenSentinel,which would work not only with statements, but also with any kind of nodes accepting a sequence of nodes as children.

This strategy would be specially useful when dealing with wildcard matchers, as in this way we would not require to know what the parent is in order to modify the node.

For example then we could do something like:

class AddStamentsVisitor(...):
    def leave_Statement(
        self, original_node: cst.Statement, updated_node: cst.Statement
    ):
        return cst.FlattenSentinel([cst.parse_statement("y = 1"), updated_node, cst.parse_statement("z = 1")])

which would convert code like:

if x:
  print('x')

into:

y = 1
if x:
  y = 1
  print('x')
  z = 1
z = 1

or in a call.

class DuplicateArgsVisitor(...):
    def leave_Arg(
        self, original_node: cst.Arg, updated_node: cst.Arg
    ):
        return cst.FlattenSentinel([updated_node, updated_node])

would convert f(x, y) into f(x, x, y, y).

sk- avatar Aug 12 '20 13:08 sk-

I really like the idea of the flatten sentinel. I think I will look at implementing this.

cdonovick avatar Nov 02 '20 19:11 cdonovick