Information about the position of the modified lines / tokens

Open Devy99 opened this issue 1 year ago • 5 comments

Hi, thank you very much for implementing this library.

I would like to have more information about its use for the following scenario: given two code snippets, I would like to know the location of the lines and tokens that have been changed or added. Let's take this case as an example:

Code

output = cd.difference(
    '''
        String var = "Hello";
        int x = x + 1 + 8;
        List<String> list = new ArrayList<>();
    ''',
    '''
        String var1 = "Hello";
        int x = x + 1;
        List<String> list = new ArrayList<>();
        List<String> list2 = new ArrayList<>();
    ''',
lang = "java")

As you can see, the difference between the two code snippets lies in line 1 (renaming the variable 'var' to 'var1'), line 2 (adding a new addition), and finally, line 4 (adding a new statement).

The output of this execution is as follows:

Output

[
  Insert((local_variable_declaration, N0), (program, line 1:8 - 4:4), 3),
  Insert((generic_type, N1), N0, 0),
  Insert((variable_declarator, N2), N0, 1),
  Insert(;:;, N0, 2),
  Update((identifier:var, line 1:15 - 1:18), var1),
  Move((binary_expression, line 2:16 - 2:21), (variable_declarator, line 2:12 - 2:25), 2),
  Insert(type_identifier:List, N1, 0),
  Insert((type_arguments, N3), N1, 1),
  Insert(identifier:list2, N2, 0),
  Insert(=:=, N2, 1),
  Insert((object_creation_expression, N4), N2, 2),
  Insert(<:<, N3, 0),
  Insert(type_identifier:String, N3, 1),
  Insert(>:>, N3, 2),
  Insert(new:new, N4, 0),
  Insert((generic_type, N5), N4, 1),
  Insert((argument_list, N6), N4, 2),
  Insert(type_identifier:ArrayList, N5, 0),
  Insert((type_arguments, N7), N5, 1),
  Insert((:(, N6, 0),
  Insert():), N6, 1),
  Insert(<:<, N7, 0),
  Insert(>:>, N7, 1),
  Delete((+:+, line 2:22 - 2:23)),
  Delete((decimal_integer_literal:8, line 2:24 - 2:25)),
  Delete((binary_expression, line 2:16 - 2:25))
]

From this output, however, I cannot get the information I mentioned above. Does the library support this functionality or is there a way to get this information?

Thanks again for the support.

Devy99 • Jan 08 '24 20:01

Hey! I am not so sure what output is expected here. Could you elaborate a bit?

Currently, the AST edit contains all information you are mentioning:

line 1 (renaming the variable 'var' to 'var1')

Update((identifier:var, line 1:15 - 1:18), var1)

line 2 (adding a new addition)

You probably mean the deletion of the addition: Move((binary_expression, line 2:16 - 2:21), (variable_declarator, line 2:12 - 2:25), 2), Delete((+:+, line 2:22 - 2:23)), Delete((decimal_integer_literal:8, line 2:24 - 2:25)), Delete((binary_expression, line 2:16 - 2:25))

line 4 (adding a new statement)

Insert((local_variable_declaration, N0), (program, line 1:8 - 4:4), 3), ... (remaining part of the edit)

Since it is an AST edit, it builds the AST of the new statement (with root N0).
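The spans in the operations quoted above follow a regular `line R:C - R:C` pattern, so they can be read out of the printed form of an operation. The following is only a minimal sketch under that assumption: `SPAN`, `first_span`, and `printed_ops` are names made up for illustration and are not part of code.diff, and parsing printed text is brittle if the format ever changes.

```python
import re

# Matches the "line R:C - R:C" span that appears in the printed form of an
# operation, e.g. "Update((identifier:var, line 1:15 - 1:18), var1)".
# Assumption: the printed format is exactly the one shown in this thread.
SPAN = re.compile(r"line (\d+):(\d+) - (\d+):(\d+)")


def first_span(op_text):
    """Return (start_line, start_col, end_line, end_col) of the first span, or None."""
    match = SPAN.search(op_text)
    if match is None:
        return None
    return tuple(int(group) for group in match.groups())


# Hypothetical input: operations copied as plain strings from the outputs above.
printed_ops = [
    "Update((identifier:var, line 1:15 - 1:18), var1)",
    "Insert((binary_expression, N3), (variable_declarator, line 2:12 - 2:21), 2)",
    "Insert(+:+, N3, 1)",
]

spans = [first_span(text) for text in printed_ops]
for text, span in zip(printed_ops, spans):
    print(span, "<-", text)
```

Note that for an Insert the span belongs to the existing parent node, not to the inserted node, and operations that target a placeholder such as N3 carry no span at all, which is why the last entry yields None.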

cedricrupb • Jan 09 '24 10:01

Thank you very much for the timely response. What I wish to get from the output is:

  • figure out which lines have been changed
  • get the position of each modified token (relative to the origin)

Consider this other example (with better formatting):

output = cd.difference(
    '''
        String var = "Hello";
        int x = x + 1;
        List<String> list = new ArrayList<>();
    ''',
    '''
        String var1 = "Hello";
        int x = x + 1 + 8;
        List<String> list = new ArrayList<>();
        List<String> list2 = new ArrayList<>();
    ''',
lang = "java")

With output:

[
  Insert((local_variable_declaration, N0), (program, line 1:8 - 4:4), 3),
  Insert((generic_type, N1), N0, 0),
  Insert((variable_declarator, N2), N0, 1),
  Insert(;:;, N0, 2),
  Update((identifier:var, line 1:15 - 1:18), var1),
  Insert((binary_expression, N3), (variable_declarator, line 2:12 - 2:21), 2),
  Insert(type_identifier:List, N1, 0),
  Insert((type_arguments, N4), N1, 1),
  Insert(identifier:list2, N2, 0),
  Insert(=:=, N2, 1),
  Insert((object_creation_expression, N5), N2, 2),
  Move((binary_expression, line 2:16 - 2:21), N3, 0),
  Insert(+:+, N3, 1),
  Insert(decimal_integer_literal:8, N3, 2),
  Insert(<:<, N4, 0),
  Insert(type_identifier:String, N4, 1),
  Insert(>:>, N4, 2),
  Insert(new:new, N5, 0),
  Insert((generic_type, N6), N5, 1),
  Insert((argument_list, N7), N5, 2),
  Insert(type_identifier:ArrayList, N6, 0),
  Insert((type_arguments, N8), N6, 1),
  Insert((:(, N7, 0),
  Insert():), N7, 1),
  Insert(<:<, N8, 0),
  Insert(>:>, N8, 1)
]

What I wish to know is:

  • in line 1, position 15, I change 'var' to 'var1' (I get this information from "Update((identifier:var, line 1:15 - 1:18), var1)")
  • from line 2, position 21, I add '+ 8' (I cannot extract this information from the output)
  • in line 4, I add a new statement

So, starting from this output, the parser I am working on should return the following result:

  • line 1 position 15
  • line 2 position 21
  • line 4 position 0 (the whole line)

Do you think it is possible to get these results from the output of cd.difference() ?

I thank you again for your support.

Devy99 • Jan 09 '24 10:01

code.diff sadly does not implement functionality that directly supports your use case.

Do you think it is possible to get these results from the output of cd.difference() ?

This should still be feasible. Here is a very hacky solution:

from code_diff.gumtree import ops

def _subtrees(script):
    subtrees = {}
    for action in script:
        if not isinstance(action, (ops.Insert, ops.Move)): continue
        target, node, position = action.target_node, action.node, action.position

        if isinstance(action, ops.Insert):
            _, text = node
            insert_content = text if text is not None else action.insert_id
        elif isinstance(action, ops.Move):
            insert_content = node

        if hasattr(target, "node_id"):
            target_id = target.node_id
            if target_id not in subtrees: subtrees[target_id] = []
            subtrees[target_id].insert(position, insert_content)
    
    return subtrees

def _serialize_tree(subtrees, node_id):
    result = []
    stack  = [node_id]

    while len(stack) > 0:
        element = stack.pop(0)
        if isinstance(element, int):
            stack = subtrees.get(element, []) + stack
        else:
            result.append(element)

    return result

def flatten_script(script):
    result_script = []
    subtrees = _subtrees(script)

    for action in script:
        if isinstance(action, ops.Insert):
            if hasattr(action.target_node, "node_id"): continue # Ignore because we flatten
            new_node = _serialize_tree(subtrees, action.insert_id)
            result_script.append(ops.Insert(action.target_node, new_node, position = action.position, insert_id=action.insert_id))
        elif isinstance(action, ops.Move) and hasattr(action.target_node, "node_id"):
            result_script.append(ops.Delete(action.node))
        else:
            result_script.append(action)

    return result_script

def synthesize_rewrite_script(script):
    # Flatten the script: Build and parse the subtrees that are inserted or moved
    flat_script = flatten_script(script)

    # Generate new actions of the form (replace_span, token_seq)
    # You can transform the source by replacing each span with the token sequence 
    result = []
    for action in flat_script:
        target_node = action.target_node
        if isinstance(action, ops.Insert):
            if action.position == len(target_node.children):
                (start_line, start_pos), (end_line, end_pos) = target_node.position[1], target_node.position[1]
            else:
                predecessor = target_node.children[action.position]
                (start_line, start_pos), (end_line, end_pos) = predecessor.position[1], predecessor.position[1]

            result.append(((start_line, start_pos, end_line, end_pos), action.node))
        elif isinstance(action, ops.Update):
            (start_line, start_pos), (end_line, end_pos) = target_node.position
            result.append(((start_line, start_pos, end_line, end_pos), [action.value]))
        elif isinstance(action, ops.Delete):
            (start_line, start_pos), (end_line, end_pos) = target_node.position
            result.append(((start_line, start_pos, end_line, end_pos), []))

    return result

synthesize_rewrite_script(edit_script) generates a sequence of edit operations that are text span replacements. Since the inserted subtrees are folded into flat token sequences, the result should contain only the changes that are interesting for you.

For example:

output = cd.difference(
'''
String var = "Hello";
int x = x + 1;
List list = new ArrayList<>();
''',
'''
String var1 = "Hello";
int x = x + 1 + 8;
List list = new ArrayList<>();
List list2 = new ArrayList<>();
''',
lang = "java")

script = output.edit_script()
rewrite_script = synthesize_rewrite_script(script)

# rewrite_script ==
# [
#   ((4, 0, 4, 0), ['List', 'list2', '=', 'new', 'ArrayList', '<', '>', '(', ')', ';']),  # Insert statement in line 4
#   ((1, 7, 1, 10), ['var1']),  # Update var in line 1
#   ((2, 13, 2, 13), [ASTNode(type=binary_expression), '+', '8']),  # Insert x + 1 + 8 in line 2 after x + 1
#   ((2, 8, 2, 13), [])  # Delete x + 1
# ]
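To illustrate how such span replacements could be applied to the source text, here is a rough sketch. It rests on several assumptions that the thread does not confirm: lines and columns are 0-based, with line 0 being the empty line after the opening ''' (this is inferred from (1, 7, 1, 10) covering "var"), spans do not overlap, and every token is a plain string. `apply_rewrites` is a hypothetical helper, and the rewrite list is hand-simplified rather than real library output.

```python
def apply_rewrites(source, rewrites):
    """Apply ((start_line, start_col, end_line, end_col), tokens) replacements to source."""
    # Absolute offset of the first character of every line.
    line_starts = [0]
    for line in source.split("\n")[:-1]:
        line_starts.append(line_starts[-1] + len(line) + 1)

    result = source
    # Work from the end of the text backwards so earlier offsets stay valid.
    for (sl, sc, el, ec), tokens in sorted(rewrites, key=lambda r: r[0], reverse=True):
        start = line_starts[sl] + sc
        end = line_starts[el] + ec
        # Tokens are joined with single spaces; original whitespace is not recovered.
        result = result[:start] + " ".join(tokens) + result[end:]
    return result


source = '\nString var = "Hello";\nint x = x + 1;\nList list = new ArrayList<>();\n'

# Hand-simplified rewrite list (string tokens only), not real library output.
rewrites = [
    ((4, 0, 4, 0), ["List", "list2", "=", "new", "ArrayList", "<", ">", "(", ")", ";"]),
    ((1, 7, 1, 10), ["var1"]),
    ((2, 8, 2, 13), ["x", "+", "1", "+", "8"]),
]

patched = apply_rewrites(source, rewrites)
print(patched)
```

Applying the replacements back to front avoids having to shift later spans after each edit. A real script may also contain non-string entries such as the moved ASTNode above, which this sketch does not handle.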

cedricrupb • Jan 09 '24 12:01

Thank you very much! This is exactly what I was looking for. I will try to use this solution you proposed :smile:

Devy99 • Jan 09 '24 13:01

Since this functionality might be useful for others as well, I would like to point out some changes to your code.

Considering this input:

output = cd.difference(
'''
String var = "Hello";
int x = x + 1;
List list = new ArrayList<>();
''',
'''
int x = x + 1 + 8;
String var1 = "Hello";
List list = new ArrayList<>();
List list2 = new ArrayList<>();
''',
lang = "java")

The current code raises the following exception:

IndexError                                Traceback (most recent call last)
<ipython-input-78-a26ab0e73672> in <cell line: 96>()
     94 
     95 script = output.edit_script()
---> 96 rewrite_script = synthesize_rewrite_script(script)
     97 rewrite_script

<ipython-input-78-a26ab0e73672> in synthesize_rewrite_script(script)
     62                 (start_line, start_pos), (end_line, end_pos) = target_node.position[1], target_node.position[1]
     63             else:
---> 64                 predecessor = target_node.children[action.position]
     65                 (start_line, start_pos), (end_line, end_pos) = predecessor.position[1], predecessor.position[1]
     66 

IndexError: list index out of range

For now, I have worked around the problem by using the action position as the line when the target node has no child at that index. For the column, the value 0 is sufficient. This is the changed code:

def synthesize_rewrite_script(script):
    # Flatten the script: Build and parse the subtrees that are inserted or moved
    flat_script = flatten_script(script)

    # Generate new actions of the form (replace_span, token_seq)
    # You can transform the source by replacing each span with the token sequence 
    result = []
    for action in flat_script:
        target_node = action.target_node
        if isinstance(action, ops.Insert):
            if action.position == len(target_node.children):
                (start_line, start_pos), (end_line, end_pos) = target_node.position[1], target_node.position[1]
            elif action.position < len(target_node.children):
                predecessor = target_node.children[action.position]
                (start_line, start_pos), (end_line, end_pos) = predecessor.position[1], predecessor.position[1]
            else:
                # No child at this index: use the action position as the line (the column is not needed)
                (start_line, start_pos) = (end_line, end_pos) = (action.position, 0)

            result.append(((start_line, start_pos, end_line, end_pos), action.node))
        elif isinstance(action, ops.Update):
            (start_line, start_pos), (end_line, end_pos) = target_node.position
            result.append(((start_line, start_pos, end_line, end_pos), [action.value]))
            
        elif isinstance(action, ops.Delete):
            (start_line, start_pos), (end_line, end_pos) = target_node.position
            result.append(((start_line, start_pos, end_line, end_pos), []))
            
    return result

Now the output of the script is:

[((4, 0, 4, 0),
  ['List', 'list2', '=', 'new', 'ArrayList', '<', '>', '(', ')', ';']),
 ((2, 13, 2, 13), [ASTNode(type=binary_expression), '+', '8']),
 ((1, 7, 1, 10), ['var1']),
 ((2, 8, 2, 13), [])]

I have tried inserting lines at different positions, and this modification seems to handle the general case well.

The only remaining problem that I cannot solve is the following. As you can see, the modification of the line "String var1 = "Hello";" is reported at line 1 (instead of line 2), which is its line in the source code. The same goes for the modification of "int x = x + 1 + 8;", which is reported at line 2 instead of line 1 (the start and end columns are also wrong). Is it possible to report the actual line in the target code?

Devy99 • Jan 09 '24 15:01
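The thread leaves the last question open. One possible workaround, outside of code.diff and not AST-based, is to pair each source line with its most similar target line and translate the reported line numbers afterwards. The sketch below uses Python's standard difflib for this. It is a heuristic only: `map_source_to_target_lines` and its `threshold` parameter are hypothetical names, line numbers are assumed to be 0-based as in the spans above, and duplicated or heavily rewritten lines may be paired incorrectly.

```python
import difflib


def map_source_to_target_lines(source, target, threshold=0.6):
    """Heuristically pair each non-blank source line with its most similar target line."""
    source_lines = source.split("\n")
    target_lines = target.split("\n")
    mapping = {}
    for src_no, src_line in enumerate(source_lines):
        if not src_line.strip():
            continue  # blank lines carry no information for matching
        best_no, best_ratio = None, threshold
        for tgt_no, tgt_line in enumerate(target_lines):
            matcher = difflib.SequenceMatcher(None, src_line.strip(), tgt_line.strip())
            ratio = matcher.ratio()
            if ratio > best_ratio:
                best_no, best_ratio = tgt_no, ratio
        if best_no is not None:
            mapping[src_no] = best_no  # source line number -> target line number
    return mapping


source = '\nString var = "Hello";\nint x = x + 1;\nList list = new ArrayList<>();\n'
target = (
    '\nint x = x + 1 + 8;\nString var1 = "Hello";\n'
    'List list = new ArrayList<>();\nList list2 = new ArrayList<>();\n'
)

mapping = map_source_to_target_lines(source, target)
print(mapping)
```

With such a mapping, the line of an Update span could be looked up to obtain the corresponding target line. Columns would still refer to the source line, so they are not corrected by this approach.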