OpenHands: Add integration test framework with mock LLM

This PR creates backend integration tests that rely on a mock LLM. It serves two purposes:

Ensure the quality of development, including the OpenDevin control framework, agents, and sandboxes.
Help contributors (and users) learn the OpenDevin workflow and see examples of real interactions with a (powerful) LLM, without spending real money.

Why don't we launch an open-source model during the test, e.g. LLAMA3? There are two reasons:

LLMs cannot guarantee determinism, meaning the test behavior might change.
CI machines are not powerful enough to run any LLM that is sophisticated enough to finish the tasks defined in tests.

Note: integration tests are orthogonal to evaluations/benchmarks, as they serve different purposes. Benchmarks can also catch bugs, including some that tests miss, but they require real LLMs, which are non-deterministic and costly. We run the integration test suite on every single commit, which is not feasible with benchmarks.

Known limitations:

To avoid the potential impact of non-determinism, we remove all special characters and numbers (often used as PIDs) when doing the comparison. If two prompts for the same task and agent differ only in non-alphabetic characters, the wrong mock response might be picked up.
It is required that the agent and sandbox don't do anything non-deterministic in the tests, e.g. printing out the current date.
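
The first limitation is easier to see with a small sketch of the prompt comparison. This is only an illustration under stated assumptions, not the code in this PR: the names `normalize`, `find_mock_response`, and `recorded` are hypothetical, and the sketch assumes that "special characters and numbers" means everything other than ASCII letters.

```python
import re


def normalize(prompt: str) -> str:
    """Keep only ASCII letters so PIDs and punctuation do not affect matching.

    Hypothetical helper: the exact character set removed in the PR may differ.
    """
    return re.sub(r"[^a-zA-Z]", "", prompt)


def find_mock_response(prompt: str, recorded: dict[str, str]) -> str:
    """Return the recorded response whose normalized prompt equals the input's."""
    key = normalize(prompt)
    for recorded_prompt, response in recorded.items():
        if normalize(recorded_prompt) == key:
            return response
    raise LookupError("no recorded response matches this prompt")


# Hypothetical recorded prompt/response pair for one task and agent.
recorded = {
    "Run the server (PID 4312) and report status.": "The server is running.",
}

# A different PID still matches the recorded prompt.
reply = find_mock_response("Run the server (PID 977) and report status.", recorded)
assert reply == "The server is running."

# The limitation: prompts that differ only in digits normalize to the same key.
assert normalize("Print the first 3 lines") == normalize("Print the first 5 lines")
```

The last assertion shows why two prompts that differ only in non-alphabetic characters cannot be told apart: after normalization, whichever recorded prompt is checked first supplies the response.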

Apr 23 '24 08:04 li-boxuan

@xingyaoww I cannot get CodeActAgent working with this simple task: "Write a shell script 'hello.sh' that prints 'hello'." I am using GPT-4 Turbo. I'll skip the test for CodeActAgent in this PR, but please let me know if there's some other config I should try to make it work.

Apr 24 '24 04:04 li-boxuan

@li-boxuan, thanks for the notice! That's completely fine with me! CodeActAgent on main is broken by some recent architecture changes and is under heavy construction right now. A lot of changes are expected in the next few days. I can let you know when I finish it and can also help with these integration tests!

Apr 24 '24 04:04 xingyaoww

This is awesome! Thanks @li-boxuan

Apr 25 '24 14:04 rbren