[Experiment] Add symbol navigation commands into the editor
**End-user friendly description of the problem this fixes or functionality that this introduces**

- [ ] Include this change in the Release Notes. If checked, you must provide an **end-user friendly description** for your change below

**Give a summary of what the PR does, explaining any non-trivial design decisions**
This PR is to:
- [x] Integrate the two navigation commands into the editor: `jump_to_definition` and `find_references` (https://github.com/All-Hands-AI/openhands-aci/pull/5); a toy illustration of the two commands follows below
- [x] Run eval
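For intuition only, here is a toy, single-file sketch of what the two commands do, using Python's `ast` module. This is not the openhands-aci implementation, just an illustration of the behaviour the agent gets access to:

```python
# Toy, single-file illustration of the two navigation commands.
# NOT the openhands-aci implementation -- sketch for intuition only.
import ast

def jump_to_definition(source: str, symbol: str) -> list[int]:
    """Line numbers where `symbol` is defined as a function or class."""
    return [
        node.lineno
        for node in ast.walk(ast.parse(source))
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))
        and node.name == symbol
    ]

def find_references(source: str, symbol: str) -> list[int]:
    """Line numbers where `symbol` is referenced by name."""
    return sorted(
        node.lineno
        for node in ast.walk(ast.parse(source))
        if isinstance(node, ast.Name) and node.id == symbol
    )

code = "def foo():\n    return 1\n\nx = foo() + foo()\n"
print(jump_to_definition(code, "foo"))  # [1]
print(find_references(code, "foo"))     # [4, 4]
```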
**Link of any specific issues this addresses**
Eval results for the PR on a subset of swe-bench-lite:
| Model | PR resolved | Baseline |
|---|---|---|
| claude-3-5-sonnet-20241022 | 35/59 (submitted: 59, empty patch: 0, resolved: 35, unresolved: 24, error: 0) | 35/59 |
| deepseek-chat | 8/59 (submitted: 59, empty patch: 14, resolved: 8, unresolved: 51, error: 0) | 7/59 |
| gpt-4o-2024-05-13 (without in-context example) | 17/43 (submitted: 43, empty patch: 5, resolved: 17, unresolved: 26, error: 0) | 19/43 |
Some issues I noticed when looking into the trajectories for gpt-4o:
- A few instances got stuck because the model couldn't include a long enough context in `old_str` to ensure uniqueness (see the sketch after this list). Some overcame that, but then got stuck trying to fix wrong indentation in `str_replace` calls where `new_str` spans multiple lines.
- Many instances hit a runtime error, e.g. `Unknown tool call: create/view...`, which causes the whole trajectory to crash. This happens in both the baseline and the PR. I'm not sure if we should retry here, or catch the error and inform the LLM.
- The other instances' trajectories look normal to me, and I can't spot any obvious issues. One thing I noticed is that 9 of the 19 resolved instances in the baseline DID encounter runtime errors during execution, which IMO contributes to the uncertainty and variation in the results.
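For reference, a toy sketch of the uniqueness rule behind the first bullet. This is not the actual editor code, just the behaviour the model keeps hitting: `str_replace` only applies when `old_str` occurs exactly once in the file, so the model has to include enough surrounding lines to disambiguate.

```python
# Toy version of the str_replace uniqueness check (not the real editor code).
def str_replace(content: str, old_str: str, new_str: str) -> str:
    occurrences = content.count(old_str)
    if occurrences == 0:
        raise ValueError("old_str not found in file")
    if occurrences > 1:
        raise ValueError(
            f"old_str matched {occurrences} times; add more context to make it unique"
        )
    return content.replace(old_str, new_str)

text = "x = 1\ny = 2\nx = 1\n"
# Fails: "x = 1" appears twice.
# str_replace(text, "x = 1", "x = 10")
# Succeeds: the extra line makes the match unique.
print(str_replace(text, "x = 1\ny = 2", "x = 10\ny = 2"))
```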
Just to clarify, do you mean 17/43 vs 19/43? It's always worth looking into what happens, just in case, but the usual differences in what the LLM "decides" to do are much higher than this.
> but the usual differences in what the LLM "decides" to do are much higher than this.
Yeah I agree. Although it's not desirable, sometimes just a small change in the action/output of a single step can lead to a completely different ending for the same instance.
> Just to clarify, do you mean 17/43 vs 19/43? It's always worth looking into what happens, just in case

I think the simple approach is maybe to just compare the resolve rate: if it's worse than baseline, then we should look into what happened. Can you clarify your suggestion a bit here?
> > Just to clarify, do you mean 17/43 vs 19/43? It's always worth looking into what happens, just in case
>
> I think the simple approach is maybe to just compare the resolve rate: if it's worse than baseline, then we should look into what happened. Can you clarify your suggestion a bit here?
Oh definitely. I would make a script to tell me what succeeded in one run and failed in the other, both ways (it can happen), then look at the actual events for anything suspicious (rough sketch below).

I'm not suggesting anything different; I'm saying that 2/43 is within the range of possible "normal" different outcomes. It's worth looking into, but it's not obvious it means "degradation". Same for deepseek-chat: 8/59 vs 7/59 is good to see, but it might just be random and not a "real" improvement.
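Concretely, something like the sketch below, assuming each run wrote a report JSON with the list of resolved instance ids. The `resolved_ids` field name and file layout are assumptions here, so adjust to the actual eval output.

```python
# Sketch of a "what resolved where" diff between two eval runs.
# Field names are assumptions -- adapt to the real report format.
import json
import sys

def resolved_ids(report_path: str) -> set[str]:
    with open(report_path) as f:
        return set(json.load(f)["resolved_ids"])

def main(baseline_report: str, pr_report: str) -> None:
    baseline, pr = resolved_ids(baseline_report), resolved_ids(pr_report)
    print("resolved only in baseline:", sorted(baseline - pr))
    print("resolved only in PR run:  ", sorted(pr - baseline))
    print("resolved in both:         ", len(baseline & pr))

if __name__ == "__main__":
    main(sys.argv[1], sys.argv[2])
```

Run it as e.g. `python compare_runs.py baseline/report.json pr/report.json`, then pull up the trajectories of the instances that flipped in either direction.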
Just to note on this:
> Many instances hit a runtime error, e.g. `Unknown tool call: create/view...`, which causes the whole trajectory to crash. This happens in both the baseline and the PR. I'm not sure if we should retry here, or catch the error and inform the LLM.

This sounds like something we could catch and inform the LLM about. If we can do that, it would be useful outside evals too, in regular use.
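A minimal sketch of the catch-and-inform idea, assuming a single dispatch point where tool calls are routed by name; all names below are illustrative, not the actual OpenHands agent code:

```python
# Sketch: turn an unknown tool name into an error observation the LLM can
# react to, instead of raising and killing the whole trajectory.
from typing import Any, Callable

def _fake_editor(args: dict[str, Any]) -> str:
    return "edit applied"  # stand-in for the real tool executor

# Illustrative registry -- the real agent keeps its own tool list.
KNOWN_TOOLS: dict[str, Callable[[dict[str, Any]], str]] = {
    "str_replace_editor": _fake_editor,
}

def dispatch_tool_call(name: str, arguments: dict[str, Any]) -> str:
    tool = KNOWN_TOOLS.get(name)
    if tool is None:
        # Feed the mistake back to the model instead of crashing the run.
        return (
            f"ERROR: unknown tool call '{name}'. "
            f"Available tools: {sorted(KNOWN_TOOLS)}. Please retry with one of them."
        )
    return tool(arguments)

print(dispatch_tool_call("create/view", {}))       # error observation, no crash
print(dispatch_tool_call("str_replace_editor", {}))
```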
> > Many instances hit a runtime error, e.g. `Unknown tool call: create/view...`, which causes the whole trajectory to crash. This happens in both the baseline and the PR. I'm not sure if we should retry here, or catch the error and inform the LLM.
>
> This sounds like something we could catch and inform the LLM about. If we can do that, it would be useful outside evals too, in regular use.
I believe this should be fixed in https://github.com/All-Hands-AI/OpenHands/pull/5113?
> I believe this should be fixed in #5113?
Oh thanks! I thought I had seen something, but I couldn't remember where in the code. Maybe the run was on a branch based on a main older than four days ago.
Yeah, it seems like my PR didn't include that change, unfortunately. Also, thanks @enyst for the comment, that makes sense. We may be able to tell more confidently with more instances run, but it's not very economical 🥹
Took the chance to run a full eval on swe-bench-lite for claude -- fortunately we got comparable performance to the baseline v2.2 (130/300) and v2.1 on the leaderboard (125/300). At least it's not degrading performance much, so maybe we can continue improving on this with https://github.com/All-Hands-AI/openhands-aci/pull/19:
```
05:00:07 - openhands:INFO: eval_infer.py:418 - # resolved: 127 / 300. (42.33%)
05:00:07 - openhands:INFO: eval_infer.py:418 - # failed_apply_patch: 0 / 300. (0.00%)
05:00:07 - openhands:INFO: eval_infer.py:418 - # error_eval: 0 / 300. (0.00%)
05:00:07 - openhands:INFO: eval_infer.py:418 - # empty_generation: 3 / 300. (1.00%)
```
There are only a bit over 20 instances where the agent called either of the two new navigation commands (~7%), so it makes sense that the result is not very different. I'm gonna look more closely at the results and post a more detailed comparison here.
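For reference, the count can be reproduced with something like the sketch below over the run's `output.jsonl` (one JSON record per instance is an assumption of mine; it just scans each raw line for the command names rather than relying on the exact event schema):

```python
# Rough count of instances whose trajectory used the new navigation commands.
import sys

TARGETS = ("jump_to_definition", "find_references")

def count_usage(output_jsonl: str) -> dict[str, int]:
    counts = {name: 0 for name in TARGETS}
    counts["either"] = 0
    with open(output_jsonl) as f:
        for line in f:  # one serialized instance record per line (assumed)
            hits = [name for name in TARGETS if name in line]
            for name in hits:
                counts[name] += 1
            if hits:
                counts["either"] += 1
    return counts

if __name__ == "__main__":
    print(count_usage(sys.argv[1]))
```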
Took a look at the results; I can't find anything significant/interesting for now, possibly due to the small difference in the results. Some plots:
- Comparing v2.1, v2.2 and the PR: the difference between v2.1 and v2.2 seems pretty weird; I don't think we have many changes between the two versions.
| v2.1 vs. v2.2 | PR vs. v2.1 | PR vs. v2.2 |
|---|---|---|
| (plot) | (plot) | (plot) |
- Regarding instances where the two commands are called: there are 21 instances where either command was called:
| # instances with `jump_to_definition` vs. `find_references` called | # resolved instances in the 21-instance subset with the 2 commands called, PR vs. v2.2 |
|---|---|
| (plot) | (plot) |
- I thought about comparing resolved instances per difficulty as well, but there are 69/300 Lite instances without gold annotations, and many resolved instances fall into this subset, so the plots are not very interpretable.
Going to assume this is still in progress?
Yes, I'm working on a refactor and will circle back to this PR soon!
I ran the eval again and the results are not improving much. Given we have other work with higher priority (e.g. model routing), I'll close this PR for now and circle back to it later.