[Experiment] Add symbol navigation commands into the editor
**End-user friendly description of the problem this fixes or functionality that this introduces**

- [ ] Include this change in the Release Notes. If checked, you must provide an **end-user friendly description** for your change below

**Give a summary of what the PR does, explaining any non-trivial design decisions**
This PR is to:
- [x] Integrate the two navigation commands into the editor: `jump_to_definition` and `find_references` (https://github.com/All-Hands-AI/openhands-aci/pull/5); a toy illustration of the two commands follows below
- [x] Run eval
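For intuition only, here is a toy, single-file sketch of what the two commands do, using Python's `ast` module. This is not the openhands-aci implementation, just an illustration of the behaviour the agent gets access to:

```python
# Toy, single-file illustration of the two navigation commands.
# NOT the openhands-aci implementation -- sketch for intuition only.
import ast

def jump_to_definition(source: str, symbol: str) -> list[int]:
    """Line numbers where `symbol` is defined as a function or class."""
    return [
        node.lineno
        for node in ast.walk(ast.parse(source))
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))
        and node.name == symbol
    ]

def find_references(source: str, symbol: str) -> list[int]:
    """Line numbers where `symbol` is referenced by name."""
    return sorted(
        node.lineno
        for node in ast.walk(ast.parse(source))
        if isinstance(node, ast.Name) and node.id == symbol
    )

code = "def foo():\n    return 1\n\nx = foo() + foo()\n"
print(jump_to_definition(code, "foo"))  # [1]
print(find_references(code, "foo"))     # [4, 4]
```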
**Link of any specific issues this addresses**
Eval results for the PR on a subset of swe-bench-lite:
| Model | PR resolved | Baseline |
|---|---|---|
| claude-3-5-sonnet-20241022 | 35/59 (submitted: 59, empty patch: 0, resolved: 35, unresolved: 24, error: 0) | 35/59 |
| deepseek-chat | 8/59 (submitted: 59, empty patch: 14, resolved: 8, unresolved: 51, error: 0) | 7/59 |
| gpt-4o-2024-05-13 (without in-context example) | 17/43 (submitted: 43, empty patch: 5, resolved: 17, unresolved: 26, error: 0) | 19/43 |
Some issues I noticed when looking into the trajectories for gpt-4o:
- A few instances got stuck because the model couldn't include a long enough context in `old_str` to ensure uniqueness (see the sketch after this list). Some overcame that, but then got stuck trying to fix wrong indentation in `str_replace` calls where `new_str` spans multiple lines.
- Many instances hit a runtime error, e.g. `Unknown tool call: create/view...`, which causes the whole trajectory to crash. This happens in both the baseline and the PR. I'm not sure if we should retry here, or catch the error and inform the LLM.
- The other instances' trajectories look normal to me, and I can't spot any obvious issues. One thing I noticed is that 9 of the 19 resolved instances in the baseline DID encounter runtime errors during execution, which IMO contributes to the uncertainty and variation in the results.
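For reference, a toy sketch of the uniqueness rule behind the first bullet. This is not the actual editor code, just the behaviour the model keeps hitting: `str_replace` only applies when `old_str` occurs exactly once in the file, so the model has to include enough surrounding lines to disambiguate.

```python
# Toy version of the str_replace uniqueness check (not the real editor code).
def str_replace(content: str, old_str: str, new_str: str) -> str:
    occurrences = content.count(old_str)
    if occurrences == 0:
        raise ValueError("old_str not found in file")
    if occurrences > 1:
        raise ValueError(
            f"old_str matched {occurrences} times; add more context to make it unique"
        )
    return content.replace(old_str, new_str)

text = "x = 1\ny = 2\nx = 1\n"
# Fails: "x = 1" appears twice.
# str_replace(text, "x = 1", "x = 10")
# Succeeds: the extra line makes the match unique.
print(str_replace(text, "x = 1\ny = 2", "x = 10\ny = 2"))
```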
Just to clarify, do you mean 17/43 vs 19/43? It's always worth looking into what happens, just in case, but the usual differences in what the LLM "decides" to do are much higher than this.
> but the usual differences in what the LLM "decides" to do are much higher than this.
Yeah I agree. Although it's not desirable, sometimes just a small change in the action/output of a single step can lead to a completely different ending for the same instance.
> Just to clarify, do you mean 17/43 vs 19/43? It's always worth looking into what happens, just in case

I think the simple approach is maybe to just compare the resolve rate: if it's worse than baseline, then we should look into what happened. Can you clarify your suggestion a bit here?
> > Just to clarify, do you mean 17/43 vs 19/43? It's always worth looking into what happens, just in case
>
> I think the simple approach is maybe to just compare the resolve rate: if it's worse than baseline, then we should look into what happened. Can you clarify your suggestion a bit here?
Oh definitely. I would make a script to tell me what succeeded in one run and failed in the other, both ways (it can happen), then look at the actual events for anything suspicious (rough sketch below).

I'm not suggesting anything different; I'm saying that 2/43 is within the range of possible "normal" different outcomes. It's worth looking into, but it's not obvious it means "degradation". Same for deepseek-chat: 8/59 vs 7/59 is good to see, but it might just be random and not a "real" improvement.
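Concretely, something like the sketch below, assuming each run wrote a report JSON with the list of resolved instance ids. The `resolved_ids` field name and file layout are assumptions here, so adjust to the actual eval output.

```python
# Sketch of a "what resolved where" diff between two eval runs.
# Field names are assumptions -- adapt to the real report format.
import json
import sys

def resolved_ids(report_path: str) -> set[str]:
    with open(report_path) as f:
        return set(json.load(f)["resolved_ids"])

def main(baseline_report: str, pr_report: str) -> None:
    baseline, pr = resolved_ids(baseline_report), resolved_ids(pr_report)
    print("resolved only in baseline:", sorted(baseline - pr))
    print("resolved only in PR run:  ", sorted(pr - baseline))
    print("resolved in both:         ", len(baseline & pr))

if __name__ == "__main__":
    main(sys.argv[1], sys.argv[2])
```

Run it as e.g. `python compare_runs.py baseline/report.json pr/report.json`, then pull up the trajectories of the instances that flipped in either direction.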
Just to note on this:
> Many instances hit a runtime error, e.g. `Unknown tool call: create/view...`, which causes the whole trajectory to crash. This happens in both the baseline and the PR. I'm not sure if we should retry here, or catch the error and inform the LLM.

This sounds like something we could catch and inform the LLM about. If we can do that, it would be useful outside evals too, in regular use.
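A minimal sketch of the catch-and-inform idea, assuming a single dispatch point where tool calls are routed by name; all names below are illustrative, not the actual OpenHands agent code:

```python
# Sketch: turn an unknown tool name into an error observation the LLM can
# react to, instead of raising and killing the whole trajectory.
from typing import Any, Callable

def _fake_editor(args: dict[str, Any]) -> str:
    return "edit applied"  # stand-in for the real tool executor

# Illustrative registry -- the real agent keeps its own tool list.
KNOWN_TOOLS: dict[str, Callable[[dict[str, Any]], str]] = {
    "str_replace_editor": _fake_editor,
}

def dispatch_tool_call(name: str, arguments: dict[str, Any]) -> str:
    tool = KNOWN_TOOLS.get(name)
    if tool is None:
        # Feed the mistake back to the model instead of crashing the run.
        return (
            f"ERROR: unknown tool call '{name}'. "
            f"Available tools: {sorted(KNOWN_TOOLS)}. Please retry with one of them."
        )
    return tool(arguments)

print(dispatch_tool_call("create/view", {}))       # error observation, no crash
print(dispatch_tool_call("str_replace_editor", {}))
```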
> > Many instances hit a runtime error, e.g. `Unknown tool call: create/view...`, which causes the whole trajectory to crash. This happens in both the baseline and the PR. I'm not sure if we should retry here, or catch the error and inform the LLM.
>
> This sounds like something we could catch and inform the LLM about. If we can do that, it would be useful outside evals too, in regular use.
I believe this should be fixed in https://github.com/All-Hands-AI/OpenHands/pull/5113?
> I believe this should be fixed in #5113?
Oh thanks! I thought I had seen something, but I couldn't remember where in the code. Maybe the run was on a branch based on a main older than four days ago.
Yeah, it seems like my PR didn't include that change, unfortunately. Also, thanks @enyst for the comment, that makes sense. We may be able to tell more confidently with more instances run, but it's not very economical 🥹
Took the chance to run a full eval on swe-bench-lite for claude -- fortunately we got comparable performance to the baseline v2.2 (130/300) and v2.1 on the leaderboard (125/300). At least it's not degrading performance much, so maybe we can continue improving on this with https://github.com/All-Hands-AI/openhands-aci/pull/19:
```
05:00:07 - openhands:INFO: eval_infer.py:418 - # resolved: 127 / 300. (42.33%)
05:00:07 - openhands:INFO: eval_infer.py:418 - # failed_apply_patch: 0 / 300. (0.00%)
05:00:07 - openhands:INFO: eval_infer.py:418 - # error_eval: 0 / 300. (0.00%)
05:00:07 - openhands:INFO: eval_infer.py:418 - # empty_generation: 3 / 300. (1.00%)
```
There are only a bit over 20 instances where the agent called either of the two new navigation commands (~7%), so it makes sense that the result is not very different. I'm gonna look more closely at the results and post a more detailed comparison here.
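For reference, the count can be reproduced with something like the sketch below over the run's `output.jsonl` (one JSON record per instance is an assumption of mine; it just scans each raw line for the command names rather than relying on the exact event schema):

```python
# Rough count of instances whose trajectory used the new navigation commands.
import sys

TARGETS = ("jump_to_definition", "find_references")

def count_usage(output_jsonl: str) -> dict[str, int]:
    counts = {name: 0 for name in TARGETS}
    counts["either"] = 0
    with open(output_jsonl) as f:
        for line in f:  # one serialized instance record per line (assumed)
            hits = [name for name in TARGETS if name in line]
            for name in hits:
                counts[name] += 1
            if hits:
                counts["either"] += 1
    return counts

if __name__ == "__main__":
    print(count_usage(sys.argv[1]))
```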
Took a look at the results; I can't find anything significant/interesting for now, possibly due to the small difference in the results. Some plots:
- Comparing v2.1, v2.2 and the PR: the difference between v2.1 and v2.2 seems pretty weird; I don't think we have many changes between the two versions.
| v2.1 vs. v2.2 | PR vs. v2.1 | PR vs. v2.2 |
|---|---|---|
| (plot) | (plot) | (plot) |
- Regarding instances where the two commands are called: there are 21 instances where either command was called:
| # instances with `jump_to_definition` vs. `find_references` called | # resolved instances in the 21-instance subset with the 2 commands called, PR vs. v2.2 |
|---|---|
| (plot) | (plot) |
- I thought about comparing resolved instances per difficulty as well, but there are 69/300 Lite instances without gold annotations, and many resolved instances fall into this subset, so the plots are not very interpretable.
Going to assume this is still in progress?
Yes, I'm working on a refactor and will circle back to this PR soon!
I ran the eval again and the results are not improving much. Given we have other work with higher priority (e.g. model routing), I'll close this PR for now and circle back to it later.