Ryan H. Tran
Yep sounds good, I can do that!
Eval results for the PR on a subset of swe-bench-lite:

| Model | PR resolved | Baseline |
| ------ | ------------ | -------- |
| `claude-3-5-sonnet-20241022` | *35/59* -...
> but the usual differences in what the LLM "decides" to do are much higher than this.

Yeah I agree. Although it's not desirable, sometimes just a small change in...
Yeah, seems like my PR didn't include that change, unfortunately. Also thanks @enyst for the comment, that makes sense. We may be able to tell more confidently with more instances run,...
Took the chance to run a full eval on `swe-bench-lite` for Claude -- fortunately we got performance comparable to baseline v2.2 (130/300) and to v2.1 in the leaderboard (125/300). At...
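For a rough sense of how close the two quoted baselines are, here is a minimal sketch that turns 130/300 and 125/300 into resolve rates and runs a pooled two-proportion z-test. The choice of test and the helper names (`resolve_rate`, `two_proportion_z`) are assumptions for illustration, not something used in the eval harness. Since both runs cover the same instances, a paired comparison would be more precise, so treat this only as a sanity check.

```python
import math


def resolve_rate(resolved: int, total: int) -> float:
    """Fraction of instances resolved."""
    return resolved / total


def two_proportion_z(x1: int, n1: int, x2: int, n2: int) -> float:
    """Pooled two-proportion z statistic (assumes independent samples)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se


# Numbers quoted in the comment above: baseline v2.2 and leaderboard v2.1.
v22_rate = resolve_rate(130, 300)
v21_rate = resolve_rate(125, 300)
z = two_proportion_z(130, 300, 125, 300)

print(f"v2.2: {v22_rate:.1%}, v2.1: {v21_rate:.1%}, z = {z:.2f}")
# |z| well below 1.96 means the gap is within noise at the usual 5% level.
print("within noise" if abs(z) < 1.96 else "likely a real difference")
```

A five-instance gap out of 300 stays well under the 1.96 threshold, which matches the "comparable" reading above.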
Took a look at the result, I can't find any significant/interesting things for now, possibly due to the small difference in the result. Some plots: - Comparing v2.1, v2.2 and...
Yes, I'm working on a refactor and will circle back to this PR soon!
I ran the eval and the results are not improving much. Given we have some other work with higher priority (e.g. model routing), I'll close this PR for now and circle back...
The issue in trajectory (1) comes from a bug in the `aci`; I made a fix for it [here](https://github.com/All-Hands-AI/openhands-aci/pull/15/commits/3a5655d0eede026ed9c8299b71dbe0264fa8ac4f). Not too sure what happened with the other 2 trajectories.
Can you check the logs in terminal to see what errors happened that caused the state to change? You can also set `export DEBUG=1` to have more details visible.