Add integration test framework with mock LLM
This PR creates backend integration tests that rely on a mock LLM. It serves two purposes:
- Safeguard the quality of development across the OpenDevin control framework, agents, and sandboxes.
- Help contributors (and users) learn the OpenDevin workflow and see examples of real interactions with a (powerful) LLM, without spending real money.
Why don't we launch an open-source model, e.g. Llama 3, during the tests? There are two reasons:
- LLMs cannot guarantee deterministic output, so test behavior might change from run to run.
- CI machines are not powerful enough to run any LLM that is sophisticated enough to finish the tasks defined in tests.
Note: integration tests are orthogonal to evaluations/benchmarks, as they serve different purposes. Benchmarks can also surface bugs, including some that tests miss, but they require real LLMs, which are non-deterministic and costly. We run the integration test suite on every single commit, which is not feasible with benchmarks.
Known limitations:
- To reduce the impact of non-determinism, we remove all special characters and numbers (often used as PIDs) before comparing prompts. If two prompts for the same task and agent differ only in non-alphabetic characters, the wrong mock response might be picked up.
- The agent and sandbox must not do anything non-deterministic in the tests, e.g. printing the current date.
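The first limitation can be illustrated with a minimal sketch. This is not the code in this PR: the names `normalize`, `MockLLM`, and `completion` are hypothetical, and the sketch assumes that prompts are matched by stripping everything except ASCII letters and looking up the result in a table of recorded responses.

```python
import re


def normalize(prompt: str) -> str:
    """Drop every character that is not an ASCII letter.

    Assumption for illustration: digits (PIDs, ports), whitespace, and
    punctuation are all discarded before prompts are compared.
    """
    return re.sub(r"[^A-Za-z]", "", prompt)


class MockLLM:
    """Hypothetical mock that replays recorded responses by normalized prompt."""

    def __init__(self, recorded: dict[str, str]) -> None:
        # Index the recorded prompt/response pairs by normalized prompt.
        self._responses = {normalize(p): r for p, r in recorded.items()}

    def completion(self, prompt: str) -> str:
        key = normalize(prompt)
        if key not in self._responses:
            raise KeyError("no recorded response matches this prompt")
        return self._responses[key]


llm = MockLLM({"Process 1234 exited. What next?": "Run ./hello.sh"})

# A different PID still matches the recorded prompt.
reply = llm.completion("Process 98765 exited. What next?")

# The known limitation: prompts that differ only in digits or symbols collide.
collides = normalize("sleep 1") == normalize("sleep 100")
```

Under these assumptions, `reply` is the recorded response even though the PID changed, and `collides` is true, which is exactly the case where a wrong mock response could be returned.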
@xingyaoww I cannot get CodeActAgent working with this simple task: "Write a shell script 'hello.sh' that prints 'hello'." I am using GPT4-Turbo. I'll skip the test for CodeActAgent in this PR, but please let me know if there's some other config I should try to make it work.
@li-boxuan, thanks for the notice! That's completely fine with me! CodeActAgent on main is broken by some recent architecture changes and is under heavy construction right now. A lot of changes are expected in the next few days. I can let you know when I finish it and can also help with these integration tests!
This is awesome! Thanks @li-boxuan