refactor: Replace pexpect with libtmux in BashSession
- Simplified implementation using libtmux instead of pexpect
- Added proper handling of command errors, interactive commands, and timeouts
- Added test suite to verify behavior
- Improved output handling and error detection
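A rough sketch of how a libtmux-backed session can detect command completion and exit codes (hypothetical names and sentinel format; the actual BashSession in this PR embeds metadata in PS1 and differs in detail). The `pane` argument is assumed to be a `libtmux.Pane`, which provides `send_keys()` and `capture_pane()`:

```python
import re
import time

# Hypothetical sentinel; the real implementation uses PS1-embedded metadata.
SENTINEL = "__CMD_DONE__"

def parse_captured(lines, sentinel=SENTINEL):
    """Split captured pane lines into (output_lines, exit_code).

    The sentinel line is expected to look like '__CMD_DONE__ <exit_code>'.
    Returns (lines, None) if the command has not finished yet.
    """
    for i in range(len(lines) - 1, -1, -1):
        m = re.match(rf"{re.escape(sentinel)} (\d+)$", lines[i].strip())
        if m:
            return lines[:i], int(m.group(1))
    return lines, None

def run_command(pane, command, timeout=30.0, poll=0.2):
    """Run `command` in a tmux pane, polling capture_pane() until done."""
    # Appending '; echo SENTINEL $?' makes completion and the exit code
    # visible in the captured screen text.
    pane.send_keys(f"{command}; echo {SENTINEL} $?", enter=True)
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        output, exit_code = parse_captured(pane.capture_pane())
        if exit_code is not None:
            return "\n".join(output), exit_code
        time.sleep(poll)
    raise TimeoutError(f"command did not finish within {timeout}s")
```

Polling a screen capture (rather than reading a pty stream as pexpect does) is what makes timeouts and hung interactive commands straightforward to handle: the loop can always see the current screen state and give up cleanly.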
- [ ] Include this change in the Release Notes. If checked, you must provide an end-user friendly description for your change below
Collaborated with OpenHands:
- Write tests for PS1 parsing https://www.all-hands.dev/share?share_id=41932d06ced712b1112e880ad3d3ed757308205438e0a6c6fd86048f22a5511d
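Tests for PS1 parsing might look roughly like the following (a hypothetical sketch, not the PR's actual test suite: the sentinel markers and JSON field names are assumptions):

```python
import json
import re

# Hypothetical markers; the real PS1 format in the PR may differ.
PS1_START = "###PS1JSON###"
PS1_END = "###PS1END###"

PS1_PATTERN = re.compile(
    re.escape(PS1_START) + r"(.*?)" + re.escape(PS1_END), re.DOTALL
)

def make_ps1(username, cwd):
    """Build a PS1 whose prompt expansion carries machine-readable metadata.

    '$?' is left literal here; the shell expands it each time the prompt
    is printed, so every prompt reports the previous command's exit code.
    """
    meta = {"exit_code": "$?", "username": username, "cwd": cwd}
    return PS1_START + json.dumps(meta) + PS1_END

def parse_ps1(screen_text):
    """Extract the most recent PS1 metadata block from captured text."""
    matches = PS1_PATTERN.findall(screen_text)
    if not matches:
        return None
    return json.loads(matches[-1])

def test_parse_ps1_roundtrip():
    text = "some output\n" + PS1_START + '{"exit_code": "0", "cwd": "/tmp"}' + PS1_END
    assert parse_ps1(text) == {"exit_code": "0", "cwd": "/tmp"}

def test_parse_ps1_picks_last_prompt():
    first = PS1_START + '{"exit_code": "1"}' + PS1_END
    text = first + "\nmore output\n" + PS1_START + '{"exit_code": "0"}' + PS1_END
    assert parse_ps1(text)["exit_code"] == "0"
```

Parsing the last match matters because a captured tmux screen typically contains several old prompts above the current one.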
To run this PR locally, use the following command:
docker run -it --rm -p 3000:3000 -v /var/run/docker.sock:/var/run/docker.sock --add-host host.docker.internal:host-gateway -e SANDBOX_RUNTIME_CONTAINER_IMAGE=docker.all-hands.dev/all-hands-ai/runtime:eb27320-nikolaik --name openhands-app-eb27320 docker.all-hands.dev/all-hands-ai/openhands:eb27320
Wow, this is really happening! Thank you so much! With tmux, I suspect it will be easier to support "multi-terminals" like what Devin has, and to have higher-fidelity terminal "screen" captures to render in the frontend.
We got around 41% on SWE-Bench Lite (well, in my first run); then I fixed one minor issue and retried, and it got us around 40%.
Running SWE-Bench verify to validate 🏃
68 files changed seems overwhelming, but a lot of that is due to removing command_id and keep_prompt from CmdRunAction and to the refactor of exceptions. I love the introduction of an exception hierarchy. 🙌
Questions:
- Do you foresee changes in build times due to installing tmux?
- Are we using alpha versions as a way of simplifying dependencies? (0.15.3a0)
- Can you explain the remote_runtime_resource_factor a little bit more to me?
I wonder what the empty generations were? I noticed you used the proxy with the general model, and it's possible it hit a weird "not found" error we've seen before. I think I've seen that result in empty patches. 🤔
@enyst I think that should be exposed as an .error attribute, right? I didn't really see something like that 🤔
I think another possibility is that the agent had trouble reproducing the error and poked around with the reproduce script, and that script ended up in /workspace instead of /workspace/xxx, which is where we collect the diff.
I've seen it not reported as an error in a recent eval... I'll have to verify what's up; it might just be the summary script that missed flagging it as an error. The completions of the eval, for any instance with an empty patch, would show that error.
Finally figured out the resolve rate degradation issue! Now we get comparable performance on SWE-Bench Verified: 238 (tmux, this PR) vs 241 (reproduce-commit from 53% verified) resolved.
I will clean up the code and try to get this finally merged :D
Very excited to finally merge this almost 2-month-old PR with nearly 300 commits 😄
Congrats on landing this one!