fix(backend): changes to improve Command-R+ behavior, plus file I/O error improvements.
With these changes, Command-R+ runs alright. The prompt changes should also help other LLMs.
The absolute path changes have been removed; instead, the current sandbox path is fetched when calculating paths for fileio actions. I also added a "SANDBOX_TIMEOUT" configuration that is shared with the LLM in its prompt, so the agent expects commands to time out if they execute for too long.
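A simplified sketch of the idea (the function names and config plumbing here are illustrative, not the exact code in this PR):

```python
import os

def resolve_sandbox_path(sandbox_cwd: str, path: str) -> str:
    """Resolve a fileio action path against the sandbox's current
    working directory instead of a precomputed absolute path."""
    if os.path.isabs(path):
        return path
    return os.path.normpath(os.path.join(sandbox_cwd, path))

def sandbox_timeout_note(timeout_seconds: int) -> str:
    """Line added to the agent prompt so the LLM expects long-running
    commands to be killed after SANDBOX_TIMEOUT seconds."""
    return (
        f"Commands that run for more than {timeout_seconds} seconds "
        "will time out and be terminated."
    )
```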
Looks like the integration test failure is due to the prompt messages changing.
@computer-whisperer Glad to see you fixed the integration tests after the prompt changes. Did you encounter any difficulty? Did the README doc help? Anything you think could be improved, including but not limited to the docs?
It was definitely a pain to diagnose and fix as-is. I had to get the tests running locally with a debugger before I realized what was going wrong, and I had to manually edit all of the prompt_00x.log files to match the new prompt text. At the very least, I think the error should be a lot more informative about why it can't find a prompt response.
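For example, the lookup failure could say exactly what it was comparing (a sketch only; the file layout and names are my assumptions about the test harness, not its actual code):

```python
import os

def check_recorded_prompt(log_dir: str, index: int, actual_prompt: str) -> None:
    """Verify the index-th prompt against its recorded log, failing loudly."""
    path = os.path.join(log_dir, f"prompt_{index:03d}.log")
    if not os.path.exists(path):
        raise FileNotFoundError(
            f"No recorded prompt file at {path}; "
            "regenerate the integration test logs."
        )
    with open(path) as f:
        recorded = f.read()
    if recorded.strip() != actual_prompt.strip():
        raise AssertionError(
            f"Prompt #{index} no longer matches {path}.\n"
            f"--- recorded ---\n{recorded}\n--- actual ---\n{actual_prompt}\n"
            "If the prompt change is intentional, regenerate the logs."
        )
```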
I also don't like how making small changes to the agent prompt text can force you to hunt down and edit many different test-case .log files, especially if you intend the system to be expanded with many more test sequences. Maybe a script to regenerate them all using an LLM would help? I'm not sure.
@computer-whisperer Did you get a chance to read https://github.com/OpenDevin/OpenDevin/blob/main/tests/integration/README.md?
You should be able to do
poetry run python ./opendevin/main.py -i 10 -t "Write a shell script 'hello.sh' that prints 'hello'." -c "MonologueAgent" -d "./workspace"
and simply replace the test folder with the new logs generated under the logs folder. This process will, however, be more complicated when we have more tests.
It shouldn't take more than a few minutes to do so. Manually editing prompt_00x.log files is definitely a huge pain.
Initially I tried using a local vector DB to mock the LLM, so that tiny changes to prompts wouldn't require any changes to test files. That didn't work well: the vector DB wasn't smart enough to retrieve the correct response.
I think going forward, when we have more tests, we need a script to run poetry run python ./opendevin/main.py -i 10 -t "<task>" -c <agent> -d "./workspace" in a batch.
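Something like this could work as that batch script (a sketch; the task table is hypothetical and would need to mirror the real test cases):

```python
import subprocess

# Hypothetical (task, agent) table; one entry per integration test.
TESTS = [
    ("Write a shell script 'hello.sh' that prints 'hello'.", "MonologueAgent"),
]

for task, agent in TESTS:
    subprocess.run(
        ["poetry", "run", "python", "./opendevin/main.py",
         "-i", "10", "-t", task, "-c", agent, "-d", "./workspace"],
        check=True,
    )
```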
@li-boxuan it would be awesome if you could run something like
REGENERATE_TEST_FILES=true poetry run python ./opendevin/main.py -i 10 -t "Write a shell script 'hello.sh' that prints 'hello'." -c "MonologueAgent" -d "./workspace"
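and have the mock honor that flag, recording logs instead of asserting against them (again just a sketch of how it could be wired; REGENERATE_TEST_FILES isn't implemented yet):

```python
import os

REGENERATE = os.environ.get("REGENERATE_TEST_FILES", "").lower() == "true"

def handle_prompt(log_path: str, prompt: str) -> None:
    if REGENERATE:
        # Overwrite the recorded prompt instead of comparing against it.
        with open(log_path, "w") as f:
            f.write(prompt)
        return
    with open(log_path) as f:
        assert f.read().strip() == prompt.strip(), f"stale log: {log_path}"
```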