sourceror Add a function to extract patches from a changed tree

This is a follow up of https://github.com/doorgan/sourceror/issues/13#issuecomment-864484485

Sourceror.patch_string/2 is able to apply textual patches to multiple ranges of text, but it doesn't check for overlapping ranges, meaning the user has to generate a list of non overlapping patches first, or use some other approach if multiple patches are needed.

It would be useful if Sourceror was able to know which nodes were changed during a traversal, such that a second traversal could be performed to collect the old ranges of the modified nodes, and then create a list of non overlapping patches.

For example, if we had some code like this:

unless not allowed? do
  String.to_atom(foo)
end

And a traversal changes the tree such that String.to_atom gets converted to String.to_existing_atom and unless not allowed? gets converted to if allowed?, we would have two overlapping patches, since the inner change would override changes to the outer one and viceversa, depending on the order they are applied.

However, if changed nodes are marked as such, it becomes trivial to generate a single patch that does both changes but affects only the biggest range, namely the outer one.

We can leverage the Sourceror.prewalk/postwalk functions for this: we keep a copy of a node before visiting it, and compare it with the node after being visited. If the nodes are different, then the node was changed and we add a sourceror: [changed: %{range: original_node_range}] field to the node's metadata.

If instead of pre/postwalk we're using the Zipper API, then functions that change a node should add the same metadata to the node. On deletion, the marker must be added to the parent node metadata before going a step back. On addition of new nodes, the marker must be added to the parent node as well. Users are encouraged to modify nodes through the Zipper API instead of replacing them directly in the zipper.

In both types of traversal, if the node does not have metadata(because we are in a leaf or in a literal for the case of the regular ast), the marker metadata must be added to the parent node metadata if possible.

If the tree does not have the line and column information necessary to generate the ranges, then it makes no sense to try to generate a patch, so in these cases nothing should be done. This is inconsistent behavior, but really I don't see the point of marking a node as changed if we can't make anything useful with that information.

After all the changed nodes are marked, a Sourceror.patches_from_ast/1 can take the tree, traverse it, and produce patches from the node and the old range. After we get the full list of ranges, if we find overlapping ranges we just discard the ones with a smaller range, as they are already contained by a bigger one.

This way we can get a list of patches suitable for use with Sourceror.patch_string/2

Jun 21 '21 21:06 doorgan

Producing patches like this has some downsides as well. For example, inserting a sibling node could just produce a patch after that node instead of producing a patch for the whole parent node.

Jul 03 '21 13:07 doorgan

In the Ruby world, what we ended up doing is organizing string patches as a tree of nested ranges with replacements and/or insertions at either end. Crossing ranges (e.g. 1..10 and 5..15) generate errors. A more detailed explanation is in this PR.

Note: I've just started to look at sourceror today, looks super intersesting and I'm a newcomer to Elixir. I wrote the rewriter for the Ruby parser gem above and am a core member of RuboCop. I am wondering how to port RuboCop's node pattern to elixir; I wrote the compiler for Ruby.

Jul 03 '21 23:07 marcandre

@marcandre - welcome on board - another AST guru! Given your RuboCop experience, you might be interested in the work we are doing now to integrate with Credo. This integration will be a big project - lessons learned and feedback are welcome.

@doorgan - in Rfx we've been fiddling with approaches for diff and patch. It would probably be good if Rfx and Sourceror used identical or compatible diff/patch strategies.

Jul 04 '21 00:07 andyl

In the Ruby world, what we ended up doing is organizing string patches as a tree of nested ranges with replacements and/or insertions at either end. Crossing ranges (e.g. 1..10 and 5..15) generate errors. A more detailed explanation is in this PR.

Hi @marcandre! Your tree approach looks really interesting, I'll keep digging into it, thanks for the link! My concern in this issue is that I'm looking for a way to infer the patches to be generated based on a modified tree, but I see the approach from the parser gem is to have both text and AST representations, use the AST to navigate the code and then schedule patches on offsets in the source code, but the AST remains unchanged. Is this correct?

Note: I've just started to look at sourceror today, looks super intersesting and I'm a newcomer to Elixir. I wrote the rewriter for the Ruby parser gem above and am a core member of RuboCop. I am wondering how to port RuboCop's node pattern to elixir; I wrote the compiler for Ruby.

The Elixir AST is a really simple data structure, I have an article that you can use for reference, and I also wrote an article that describes Sourceror's approach. In essence(and specially in Sourceror's case) an AST node is a tuple with 3 elements like {:def, metadata, arguments} for a function definition, {:%{}, metadata, elements} for a map, and so on. Elixir also has pattern matching, so walking the AST looking for a specific node is trivial once you get used to it. For example, doing a pre-order traversal looking for all integer nodes:

Macro.prewalk(ast, fn {:__block__, _metadata, [integer]} when is_integer(integer) -> IO.puts("Found an integer!") end)

Now, I see that RuboCop's NodePath has some operators like ^ to move up in the tree, which is not possible by using this kind of traversal functions. Moreover, because nodes only know about their children and don't have properties like parent as they would have in other languages, moving around a tree in any arbitrary order can be complicated. For these use cases Sourceror provides a Zipper API, I wrote an explanation of Zippers in this document and you can see it in action towards the end of this other one. In essence, your position in the tree is stored in the tree itself, so the ^ from NodePath would be evaluated by the Zipper.up(ast) function.

I believe a port of NodePath should be achievable using that Zipper API. I've been entertaining similar ideas while looking at projects such as grasp, but I don't have anything concrete. I'm happy to discuss more about how this could be done and Elixir in general in this repo's discussions section if you're interested 🙂

Jul 04 '21 17:07 doorgan

in Rfx we've been fiddling with approaches for diff and patch. It would probably be good if Rfx and Sourceror used identical or compatible diff/patch strategies.

@andyl of course! Patches in this context are meant to be the textual changes derived from an change in the AST. In Rfx, diffs and patches are performed over the original and the modified text, is this correct?

What I'm exploring in this issue is a case like this one:

if not allowed? do
  # ...
end

A function can find the if with the negated condition, and then rename the node to unless and remove the negation. Now how do we translate this to a textual change? One option is to convert the whole node to string and return that. The issue here is that, because Sourceror depends on the Elixir formatter, there's a chance the full contents inside of the if will lose the original formatting. The other option is to craft a list of changes, like "replace the if range with unless, replace the negation range with nothing". When these changes are applied, we then get the expected modified string.

This modified string is what, to my understanding, would be used in Rfx to generate a diff and apply patches. I have been looking into how Rfx structures everything, and what I understood is that Rfx needs to produce a TextReq for a particular text change, and it only needs the full modified source code from where it then would generate the diffs.

The patch in Sourceror is merely a map consisting of a range(start and end positions with the same format as the one used in the AST metadata), and a change field which is the text to put in that range(or a function if you have some complex use case where Sourceror is not enough). It's meant to be used by the transformation function to avoid this "messing with the rest of the formatting". So this would happen at an earlier stage, and what Rfx needs is the result of these patches applied to the original code.

Did I get it right?

Jul 04 '21 17:07 doorgan

In the Ruby world, what we ended up doing is organizing string patches as a tree of nested ranges with replacements and/or insertions at either end. Crossing ranges (e.g. 1..10 and 5..15) generate errors. A more detailed explanation is in this PR.

Hi @marcandre! Your tree approach looks really interesting, I'll keep digging into it, thanks for the link! My concern in this issue is that I'm looking for a way to infer the patches to be generated based on a modified tree, but I see the approach from the parser gem is to have both text and AST representations, use the AST to navigate the code and then schedule patches on offsets in the source code, but the AST remains unchanged. Is this correct?

That is correct (although technically you don't even need to navigate the AST to schedule patches, for example you could just want to check the comments which are stored separately).

Rest replied in discussion

Jul 04 '21 17:07 marcandre

Patches in this context are meant to be the textual changes derived from an change in the AST

Yes. My first implementation was to shell out to linux diff/patch utilities. Would like to migrate to a pure-elixir solution.

Jul 04 '21 20:07 andyl

sourceror sourceror copied to clipboard

Add a function to extract patches from a changed tree

sourceror
sourceror copied to clipboard