refactor: Replace pexpect with libtmux in BashSession
- Simplified implementation using libtmux instead of pexpect
- Added proper handling of command errors, interactive commands, and timeouts
- Added test suite to verify behavior
- Improved output handling and error detection
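A rough sketch of how a libtmux-backed session can detect command completion and exit codes (hypothetical names and sentinel format; the actual BashSession in this PR embeds metadata in PS1 and differs in detail). The `pane` argument is assumed to be a `libtmux.Pane`, which provides `send_keys()` and `capture_pane()`:

```python
import re
import time

# Hypothetical sentinel; the real implementation uses PS1-embedded metadata.
SENTINEL = "__CMD_DONE__"

def parse_captured(lines, sentinel=SENTINEL):
    """Split captured pane lines into (output_lines, exit_code).

    The sentinel line is expected to look like '__CMD_DONE__ <exit_code>'.
    Returns (lines, None) if the command has not finished yet.
    """
    for i in range(len(lines) - 1, -1, -1):
        m = re.match(rf"{re.escape(sentinel)} (\d+)$", lines[i].strip())
        if m:
            return lines[:i], int(m.group(1))
    return lines, None

def run_command(pane, command, timeout=30.0, poll=0.2):
    """Run `command` in a tmux pane, polling capture_pane() until done."""
    # Appending '; echo SENTINEL $?' makes completion and the exit code
    # visible in the captured screen text.
    pane.send_keys(f"{command}; echo {SENTINEL} $?", enter=True)
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        output, exit_code = parse_captured(pane.capture_pane())
        if exit_code is not None:
            return "\n".join(output), exit_code
        time.sleep(poll)
    raise TimeoutError(f"command did not finish within {timeout}s")
```

Polling a screen capture (rather than reading a pty stream as pexpect does) is what makes timeouts and hung interactive commands straightforward to handle: the loop can always see the current screen state and give up cleanly.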
- [ ] Include this change in the Release Notes. If checked, you must provide an end-user friendly description for your change below
Collaborated with OpenHands:
- Write tests for PS1 parsing https://www.all-hands.dev/share?share_id=41932d06ced712b1112e880ad3d3ed757308205438e0a6c6fd86048f22a5511d
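Tests for PS1 parsing might look roughly like the following (a hypothetical sketch, not the PR's actual test suite: the sentinel markers and JSON field names are assumptions):

```python
import json
import re

# Hypothetical markers; the real PS1 format in the PR may differ.
PS1_START = "###PS1JSON###"
PS1_END = "###PS1END###"

PS1_PATTERN = re.compile(
    re.escape(PS1_START) + r"(.*?)" + re.escape(PS1_END), re.DOTALL
)

def make_ps1(username, cwd):
    """Build a PS1 whose prompt expansion carries machine-readable metadata.

    '$?' is left literal here; the shell expands it each time the prompt
    is printed, so every prompt reports the previous command's exit code.
    """
    meta = {"exit_code": "$?", "username": username, "cwd": cwd}
    return PS1_START + json.dumps(meta) + PS1_END

def parse_ps1(screen_text):
    """Extract the most recent PS1 metadata block from captured text."""
    matches = PS1_PATTERN.findall(screen_text)
    if not matches:
        return None
    return json.loads(matches[-1])

def test_parse_ps1_roundtrip():
    text = "some output\n" + PS1_START + '{"exit_code": "0", "cwd": "/tmp"}' + PS1_END
    assert parse_ps1(text) == {"exit_code": "0", "cwd": "/tmp"}

def test_parse_ps1_picks_last_prompt():
    first = PS1_START + '{"exit_code": "1"}' + PS1_END
    text = first + "\nmore output\n" + PS1_START + '{"exit_code": "0"}' + PS1_END
    assert parse_ps1(text)["exit_code"] == "0"
```

Parsing the last match matters because a captured tmux screen typically contains several old prompts above the current one.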
To run this PR locally, use the following command:
docker run -it --rm -p 3000:3000 -v /var/run/docker.sock:/var/run/docker.sock --add-host host.docker.internal:host-gateway -e SANDBOX_RUNTIME_CONTAINER_IMAGE=docker.all-hands.dev/all-hands-ai/runtime:eb27320-nikolaik --name openhands-app-eb27320 docker.all-hands.dev/all-hands-ai/openhands:eb27320
Wow, this is really happening! Thank you so much! With tmux, I suspect it will be easier to support "multi-terminals" like what Devin has, and to have higher-fidelity terminal "screen" captures to render in the frontend.
We got around 41% on SWE-Bench Lite (well, in my first run); then I fixed one minor issue and retried, and it got us around 40%.
Running SWE-Bench verify to validate 🏃
68 files changed seems overwhelming, but a lot of that is due to removing command_id and keep_prompt from CmdRunAction and to the refactor of exceptions. I love the introduction of an exception hierarchy. 🙌
Questions:
- Do you foresee changes in build times due to installing tmux?
- Are we using alpha versions as a way of simplifying dependencies? (0.15.3a0)
- Can you explain the remote_runtime_resource_factor a little bit more to me?
I wonder what the empty generations were? I noticed you used the proxy with the general model, and it's possible it hit a weird "not found" error we've seen before. I think I've seen that result in empty patches. 🤔
@enyst I think that should be exposed as an .error attribute, right? I didn't really see something like that 🤔
I think another possibility is that the agent had trouble reproducing the error and poked around with the reproduce script, and that script ended up in /workspace instead of /workspace/xxx, which is where we collect the diff.
I've seen it not reported as an error in a recent eval... I'll have to verify what's up; it might just be the summary script that missed flagging it as an error. The completions of the eval, for any instance with an empty patch, would show that error.
Finally figured out the resolve rate degradation issue! Now we get comparable performance on SWE-Bench Verified: 238 (tmux, this PR) vs 241 (reproduce-commit from 53% verified) resolved.
I will clean up the code and try to get this finally merged :D
Very excited to finally merge this almost 2-month-old PR with nearly 300 commits 😄
Congrats on landing this one!