
gpt-4-turbo refactoring benchmark questions

few opened this issue 1 year ago · 1 comment

After reading https://aider.chat/2024/04/09/gpt-4-turbo.html I was wondering if there is a way to improve the new model's score on the refactoring benchmark. I executed a small number of the tests and they all failed, but not in the way I had expected (i.e. lazy edits like "# rest of the code remains unchanged"). They all failed in the same way. I created this minimal test case to reproduce the failure:

Given a class that looks like this:

class MyClass:
    def f1(self):
        self.x = self.f2()
        self.y = self.f3()
    
    def f2(self):
        return 1

    def f3(self):
        return 2

gpt-4-turbo would generate a valid diff that results in this code:

class MyClass:
    def f1(self):
        self.x = f2()
        self.y = self.f3()
    
def f2(self):
    return 1

    def f3(self):
        return 2

It simply out-indents f2, ignoring the fact that f3 now no longer belongs to MyClass. If the model is given the code and the task without aider's prompt, it generates the correct code, from which I concluded that the model understands the task but is unable to express it in the udiff format.
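For reference, here is what I believe the correct output should look like (assuming the task is to hoist f2 out of the class as a module-level function; the exact benchmark task may differ): f3 must stay inside MyClass.

```python
# Hypothetical correct result of the refactor (assuming the task was to
# extract f2 as a module-level function). Note f3 remains a method.
def f2():
    return 1

class MyClass:
    def f1(self):
        self.x = f2()       # now calls the free function
        self.y = self.f3()  # f3 is still a method

    def f3(self):
        return 2
```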

I then gave the model a diff generated by another model that produces the correct output, and asked it to revise the prompt. It did (it basically simplified aider's prompt), and with the new prompt gpt-4-turbo was able to correctly solve the minimal test case. But it still failed on the larger cases. The simplified prompt in coders/udiff_prompts.py was:

system_reminder = """# File editing rules:
    
Return edits similar to unified diffs that `diff -U0` would produce, including:
- The first 2 lines with the file paths without timestamps.
- Start each hunk of changes with `@@ ... @@`.
- Mark all lines that need to be removed or changed with `-` and all new or modified lines with `+`.
- Indentation matters in the diffs.
- Start a new hunk for each section of the file that needs changes.
- Output hunks in whatever order makes the most sense.
- When editing a function, method, loop, etc., use a hunk to replace the entire code block.
- To move code within a file, use 2 hunks: one to delete it from its current location, one to insert it in the new location.
- To make a new file, show a diff from `--- /dev/null` to `+++ path/to/new/file.ext`.
"""
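As an aside, the "move code within a file, use 2 hunks" rule corresponds to what `diff -U0` itself emits for such an edit. A sketch with Python's standard difflib (file names here are made up) shows the two-hunk shape on my minimal test case:

```python
import difflib

# The minimal "move f2 out of the class" edit, expressed as the kind of
# zero-context unified diff the prompt asks the model to produce.
before = """\
class MyClass:
    def f1(self):
        self.x = self.f2()

    def f2(self):
        return 1
""".splitlines(keepends=True)

after = """\
def f2():
    return 1

class MyClass:
    def f1(self):
        self.x = f2()
""".splitlines(keepends=True)

# n=0 means zero lines of context, so each changed region becomes its own
# hunk: one inserts f2 at module level, another removes it from the class.
diff = "".join(difflib.unified_diff(
    before, after, fromfile="a/my_class.py", tofile="b/my_class.py", n=0))
print(diff)
```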

My questions:

  1. Did you publish the raw benchmark results somewhere? If not, would you consider publishing them? I'd like to look at more of the failures and check what the causes are.
  2. Assuming a large part of the failures are caused by the problem described above, do you think it's worthwhile to explore different prompts to "fix" these failures? Maybe you already have some idea of what prompting technique might make the model better understand the udiff format.

few · Apr 14 '24 18:04

Thanks for trying aider and filing this issue.

I haven't published all the benchmark transcripts, but you can replicate them yourself; see the README:

https://github.com/paul-gauthier/aider/blob/main/benchmark/README.md

Yes, there is probably some prompt tuning that can help the newest GPT-4 turbo. I'm looking into it.

paul-gauthier · Apr 15 '24 20:04

I'm going to close this for now, given that GPT-4o seems better across the board. But feel free to add a comment here and I will re-open this or file a new issue any time.

paul-gauthier · May 23 '24 21:05