code_diff
Information about the position of the modified lines / tokens
Hi, thank you very much for implementing this library.
I would like to have more information about its use for the following scenario: given two code snippets, I would like to know the location of the lines and tokens that have been changed or added. Let's take this case as an example:
Code
output = cd.difference(
    '''
        String var = "Hello";
        int x = x + 1 + 8;
        List<String> list = new ArrayList<>();
    ''',
    '''
        String var1 = "Hello";
        int x = x + 1;
        List<String> list = new ArrayList<>();
        List<String> list2 = new ArrayList<>();
    ''',
    lang = "java")
As you can see, the difference between the two code snippets lies in line 1 (renaming the variable 'var' to 'var1'), line 2 (adding a new addition), and finally, line 4 (adding a new statement).
The output of this execution is as follows:
Output
[
Insert((local_variable_declaration, N0), (program, line 1:8 - 4:4), 3),
Insert((generic_type, N1), N0, 0),
Insert((variable_declarator, N2), N0, 1),
Insert(;:;, N0, 2),
Update((identifier:var, line 1:15 - 1:18), var1),
Move((binary_expression, line 2:16 - 2:21), (variable_declarator, line 2:12 - 2:25), 2),
Insert(type_identifier:List, N1, 0),
Insert((type_arguments, N3), N1, 1),
Insert(identifier:list2, N2, 0),
Insert(=:=, N2, 1),
Insert((object_creation_expression, N4), N2, 2),
Insert(<:<, N3, 0),
Insert(type_identifier:String, N3, 1),
Insert(>:>, N3, 2),
Insert(new:new, N4, 0),
Insert((generic_type, N5), N4, 1),
Insert((argument_list, N6), N4, 2),
Insert(type_identifier:ArrayList, N5, 0),
Insert((type_arguments, N7), N5, 1),
Insert((:(, N6, 0),
Insert():), N6, 1),
Insert(<:<, N7, 0),
Insert(>:>, N7, 1),
Delete((+:+, line 2:22 - 2:23)),
Delete((decimal_integer_literal:8, line 2:24 - 2:25)),
Delete((binary_expression, line 2:16 - 2:25))
]
From this output, however, I cannot get the information I mentioned above. Does the library support this functionality or is there a way to get this information?
Thanks again for the support.
Hey! I am not so sure what output is expected here. Could you elaborate a bit?
Currently, the AST edit contains all information you are mentioning:
line 1 (renaming the variable 'var' to 'var1')
Update((identifier:var, line 1:15 - 1:18), var1)
line 2 (adding a new addition)
You probably mean the deletion of the addition: Move((binary_expression, line 2:16 - 2:21), (variable_declarator, line 2:12 - 2:25), 2), Delete((+:+, line 2:22 - 2:23)), Delete((decimal_integer_literal:8, line 2:24 - 2:25)), Delete((binary_expression, line 2:16 - 2:25))
line 4 (adding a new statement)
Insert((local_variable_declaration, N0), (program, line 1:8 - 4:4), 3), ... (remaining part of the edit)
Since it is an AST edit, it builds the AST tree of the new statement (with root N0).
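To illustrate how the placeholder nodes (N0, N1, ...) in the first output fit together, here is a minimal sketch that copies the inserted subtree into a plain dictionary by hand and flattens it into a token sequence. The dictionary and the `flatten` helper are hypothetical stand-ins for illustration only; they are not part of the code_diff API.

```python
# Hand-copied from the Insert operations in the first output above:
# each placeholder maps to its children in position order.
# Plain strings are tokens; "N..." keys are placeholder nodes.
inserted = {
    "N0": ["N1", "N2", ";"],          # local_variable_declaration
    "N1": ["List", "N3"],             # generic_type
    "N2": ["list2", "=", "N4"],       # variable_declarator
    "N3": ["<", "String", ">"],       # type_arguments
    "N4": ["new", "N5", "N6"],        # object_creation_expression
    "N5": ["ArrayList", "N7"],        # generic_type
    "N6": ["(", ")"],                 # argument_list
    "N7": ["<", ">"],                 # type_arguments
}


def flatten(node, tree):
    """Depth-first walk that returns the leaf tokens below `node`."""
    if node not in tree:
        return [node]  # a leaf token
    tokens = []
    for child in tree[node]:
        tokens.extend(flatten(child, tree))
    return tokens


print(" ".join(flatten("N0", inserted)))
# List < String > list2 = new ArrayList < > ( ) ;
```

Read this way, the long run of Insert operations describes a single new statement rooted at N0.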
Thank you very much for the timely response. What I wish to get from the output is:
- figure out which lines have been changed
- get the position of each modified token (relative to the origin)
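For operations that already carry a position, the line and column can be read from their printed form. The sketch below assumes the `line R:C - R:C` format seen in the outputs in this thread and works on the printed text only, not on the library's objects; `spans_in` is a hypothetical helper name.

```python
import re

# Matches spans printed as "line 1:15 - 1:18" (format assumed from this thread).
SPAN = re.compile(r"line (\d+):(\d+) - (\d+):(\d+)")


def spans_in(op_text):
    """Return every (start_line, start_col, end_line, end_col) found in op_text."""
    return [tuple(int(g) for g in m.groups()) for m in SPAN.finditer(op_text)]


printed_ops = [
    "Update((identifier:var, line 1:15 - 1:18), var1)",
    "Delete((+:+, line 2:22 - 2:23))",
    "Insert((generic_type, N1), N0, 0)",  # placeholder target: no span
]

for text in printed_ops:
    kind = text.split("(", 1)[0]
    print(kind, spans_in(text))
```

Operations whose target is a placeholder (N0, N1, ...) carry no span, which is why the inserted `+ 8` cannot be located this way.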
Consider this other example (with better formatting):
output = cd.difference(
    '''
        String var = "Hello";
        int x = x + 1;
        List<String> list = new ArrayList<>();
    ''',
    '''
        String var1 = "Hello";
        int x = x + 1 + 8;
        List<String> list = new ArrayList<>();
        List<String> list2 = new ArrayList<>();
    ''',
    lang = "java")
With output:
[
Insert((local_variable_declaration, N0), (program, line 1:8 - 4:4), 3),
Insert((generic_type, N1), N0, 0),
Insert((variable_declarator, N2), N0, 1),
Insert(;:;, N0, 2),
Update((identifier:var, line 1:15 - 1:18), var1),
Insert((binary_expression, N3), (variable_declarator, line 2:12 - 2:21), 2),
Insert(type_identifier:List, N1, 0),
Insert((type_arguments, N4), N1, 1),
Insert(identifier:list2, N2, 0),
Insert(=:=, N2, 1),
Insert((object_creation_expression, N5), N2, 2),
Move((binary_expression, line 2:16 - 2:21), N3, 0),
Insert(+:+, N3, 1),
Insert(decimal_integer_literal:8, N3, 2),
Insert(<:<, N4, 0),
Insert(type_identifier:String, N4, 1),
Insert(>:>, N4, 2),
Insert(new:new, N5, 0),
Insert((generic_type, N6), N5, 1),
Insert((argument_list, N7), N5, 2),
Insert(type_identifier:ArrayList, N6, 0),
Insert((type_arguments, N8), N6, 1),
Insert((:(, N7, 0),
Insert():), N7, 1),
Insert(<:<, N8, 0),
Insert(>:>, N8, 1)
]
What I wish to know is:
- in line 1, position 15, I change 'var' to 'var1' (I get this information from "Update((identifier:var, line 1:15 - 1:18), var1)")
- in line 2, starting at position 21, I add '+ 8' (I cannot extract this information from the output)
- in line 4, I add a new statement.
So, starting from this output, the parser I am working on should return the following result:
- line 1 position 15
- line 2 position 21
- line 4 position 0 (the whole line)
Do you think it is possible to get these results from the output of cd.difference() ?
I thank you again for your support.
code_diff sadly does not implement functionality that directly supports your use case.
Do you think it is possible to get these results from the output of cd.difference() ?
This should still be feasible. Here is a very hacky solution:
from code_diff.gumtree import ops

def _subtrees(script):
    subtrees = {}
    for action in script:
        if not isinstance(action, (ops.Insert, ops.Move)): continue
        target, node, position = action.target_node, action.node, action.position
        if isinstance(action, ops.Insert):
            _, text = node
            insert_content = text if text is not None else action.insert_id
        elif isinstance(action, ops.Move):
            insert_content = node
        if hasattr(target, "node_id"):
            target_id = target.node_id
            if target_id not in subtrees: subtrees[target_id] = []
            subtrees[target_id].insert(position, insert_content)
    return subtrees

def _serialize_tree(subtrees, node_id):
    result = []
    stack = [node_id]
    while len(stack) > 0:
        element = stack.pop(0)
        if isinstance(element, int):
            stack = subtrees.get(element, []) + stack
        else:
            result.append(element)
    return result

def flatten_script(script):
    result_script = []
    subtrees = _subtrees(script)
    for action in script:
        if isinstance(action, ops.Insert):
            if hasattr(action.target_node, "node_id"): continue # Ignore because we flatten
            new_node = _serialize_tree(subtrees, action.insert_id)
            result_script.append(ops.Insert(action.target_node, new_node, position = action.position, insert_id=action.insert_id))
        elif isinstance(action, ops.Move) and hasattr(action.target_node, "node_id"):
            result_script.append(ops.Delete(action.node))
        else:
            result_script.append(action)
    return result_script

def synthesize_rewrite_script(script):
    # Flatten the script: Build and parse the subtrees that are inserted or moved
    flat_script = flatten_script(script)
    # Generate new actions of the form (replace_span, token_seq)
    # You can transform the source by replacing each span with the token sequence
    result = []
    for action in flat_script:
        target_node = action.target_node
        if isinstance(action, ops.Insert):
            if action.position == len(target_node.children):
                (start_line, start_pos), (end_line, end_pos) = target_node.position[1], target_node.position[1]
            else:
                predecessor = target_node.children[action.position]
                (start_line, start_pos), (end_line, end_pos) = predecessor.position[1], predecessor.position[1]
            result.append(((start_line, start_pos, end_line, end_pos), action.node))
        elif isinstance(action, ops.Update):
            (start_line, start_pos), (end_line, end_pos) = target_node.position
            result.append(((start_line, start_pos, end_line, end_pos), [action.value]))
        elif isinstance(action, ops.Delete):
            (start_line, start_pos), (end_line, end_pos) = target_node.position
            result.append(((start_line, start_pos, end_line, end_pos), []))
    return result
synthesize_rewrite_script(edit_script) generates a sequence of edit operations, each of which is a text span replacement. Since the generated trees are folded into flat token sequences, the result should contain only the changes that are interesting for you.
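As a rough illustration of what "text span replacement" means here, the sketch below applies a list of `((start_line, start_col, end_line, end_col), tokens)` pairs to a source string. It is a sketch under assumptions, not part of code_diff: `apply_rewrites` is a hypothetical helper, lines and columns are taken to be 0-based, spans must not overlap, and every token is a plain string joined by single spaces.

```python
def apply_rewrites(source, rewrites):
    """Replace each (start_line, start_col, end_line, end_col) span with its tokens.

    Assumes 0-based lines and columns, non-overlapping spans and string tokens.
    """
    lines = source.split("\n")
    # Absolute character offset at which each line starts.
    line_starts = [0]
    for line in lines[:-1]:
        line_starts.append(line_starts[-1] + len(line) + 1)  # +1 for the "\n"

    edits = [
        (line_starts[sl] + sc, line_starts[el] + ec, " ".join(tokens))
        for (sl, sc, el, ec), tokens in rewrites
    ]
    # Apply from the end of the text backwards so earlier offsets stay valid.
    for start, end, text in sorted(edits, key=lambda e: e[0], reverse=True):
        source = source[:start] + text + source[end:]
    return source


before = 'String var = "Hello";\nint x = x + 1 + 8;\n'
rewrites = [
    ((0, 7, 0, 10), ["var1"]),  # rename var -> var1
    ((1, 13, 1, 17), []),       # delete " + 8"
]
print(apply_rewrites(before, rewrites))
```

Applying the edits back to front avoids recomputing positions after each replacement. Entries that contain an ASTNode (such as the moved `x + 1` expression in the example that follows) would first have to be turned into text, which this sketch does not attempt.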
For example:
output = cd.difference(
'''
String var = "Hello";
int x = x + 1;
List list = new ArrayList<>();
''',
'''
String var1 = "Hello";
int x = x + 1 + 8;
List list = new ArrayList<>();
List list2 = new ArrayList<>();
''',
lang = "java")
script = output.edit_script()
rewrite_script = synthesize_rewrite_script(script)
# rewrite_script ==
# [
# ((4, 0, 4, 0), ['List', 'list2', '=', 'new', 'ArrayList', '<', '>', '(', ')', ';']), // Insert statement in line 4
# ((1, 7, 1, 10), ['var1']), // Update var in line 1
# ((2, 13, 2, 13), [ASTNode(type=binary_expression), '+', '8']), // Insert x + 1 + 8 in line 2 after x + 1
# ((2, 8, 2, 13), []) // Delete x + 1
#]
Thank you very much! This is exactly what I was looking for. I will try to use this solution you proposed :smile:
Since this functionality might be useful for others as well, I would like to suggest a change to your code.
Considering this input:
output = cd.difference(
'''
String var = "Hello";
int x = x + 1;
List list = new ArrayList<>();
''',
'''
int x = x + 1 + 8;
String var1 = "Hello";
List list = new ArrayList<>();
List list2 = new ArrayList<>();
''',
lang = "java")
The current code returns the following exception:
IndexError                                Traceback (most recent call last)
<ipython-input-78-a26ab0e73672> in <cell line: 96>()
     94
     95 script = output.edit_script()
---> 96 rewrite_script = synthesize_rewrite_script(script)
     97 rewrite_script

<ipython-input-78-a26ab0e73672> in synthesize_rewrite_script(script)
     62                 (start_line, start_pos), (end_line, end_pos) = target_node.position[1], target_node.position[1]
     63             else:
---> 64                 predecessor = target_node.children[action.position]
     65                 (start_line, start_pos), (end_line, end_pos) = predecessor.position[1], predecessor.position[1]
     66

IndexError: list index out of range
Currently, I have solved the problem by returning the action position as a line. For the column, however, the value 0 is sufficient. This is the changed code:
def synthesize_rewrite_script(script):
    # Flatten the script: Build and parse the subtrees that are inserted or moved
    flat_script = flatten_script(script)
    # Generate new actions of the form (replace_span, token_seq)
    # You can transform the source by replacing each span with the token sequence
    result = []
    for action in flat_script:
        target_node = action.target_node
        if isinstance(action, ops.Insert):
            if action.position == len(target_node.children):
                (start_line, start_pos), (end_line, end_pos) = target_node.position[1], target_node.position[1]
            elif len(target_node.children) >= action.position:
                predecessor = target_node.children[action.position]
                (start_line, start_pos), (end_line, end_pos) = predecessor.position[1], predecessor.position[1]
            else:
                # The position lies beyond the existing children (e.g. there are none):
                # use the position as the line (the column is not needed)
                (start_line, start_pos) = (end_line, end_pos) = (action.position, 0)
            result.append(((start_line, start_pos, end_line, end_pos), action.node))
        elif isinstance(action, ops.Update):
            (start_line, start_pos), (end_line, end_pos) = target_node.position
            result.append(((start_line, start_pos, end_line, end_pos), [action.value]))
        elif isinstance(action, ops.Delete):
            (start_line, start_pos), (end_line, end_pos) = target_node.position
            result.append(((start_line, start_pos, end_line, end_pos), []))
    return result
Now the output of the script is:
[((4, 0, 4, 0),
['List', 'list2', '=', 'new', 'ArrayList', '<', '>', '(', ')', ';']),
((2, 13, 2, 13), [ASTNode(type=binary_expression), '+', '8']),
((1, 7, 1, 10), ['var1']),
((2, 8, 2, 13), [])]
I have tried inserting lines at different positions, and this modification seems to handle the general case well.
The only problem I still cannot solve is the following. As you can see, the modification of the line "String var1 = "Hello";" is reported at line 1 (instead of line 2), which is its line in the source code. The same goes for the modification of "int x = x + 1 + 8;", which is reported at line 2 instead of line 1 (the start and end columns are also wrong). Is it possible to report the actual line in the target code?
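One possible workaround, independent of code_diff, is to compute the changed target lines with a plain line-based diff from Python's standard `difflib` module. This swaps in a textual diff for the AST edit script, so it reports lines only (no token positions) and treats a moved line as changed; `changed_target_lines` is a hypothetical helper name and the returned indices are 0-based.

```python
import difflib


def changed_target_lines(source, target):
    """Return 0-based indices of target lines that were replaced or inserted."""
    src_lines = source.splitlines()
    tgt_lines = target.splitlines()
    matcher = difflib.SequenceMatcher(a=src_lines, b=tgt_lines, autojunk=False)
    changed = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "insert"):
            changed.extend(range(j1, j2))
    return changed


source = (
    'String var = "Hello";\n'
    "int x = x + 1;\n"
    "List list = new ArrayList<>();\n"
)
target = (
    "int x = x + 1 + 8;\n"
    'String var1 = "Hello";\n'
    "List list = new ArrayList<>();\n"
    "List list2 = new ArrayList<>();\n"
)
print(changed_target_lines(source, target))  # [0, 1, 3]
```

This could be combined with the rewrite script: the AST-based output says what changed, while the line diff says where the result ends up in the target.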