Automate regression test checking

Open rbren opened this issue 1 year ago • 5 comments

What problem or use case are you trying to solve?

We can run the regression tests with ./evaluation/regression/run.sh, but it's hard to tell how well each agent does or whether it accomplishes the task.

Describe the UX of the solution you'd like

I'd like to see a score showing how many tests each agent completed successfully.

Do you have thoughts on the technical implementation?

We should add a test.sh to each test case and expect it to exit with status 0 when the agent has accomplished the task.

Each agent should then get a score of how many tests it passed.
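A minimal sketch of how that scoring could work, not an existing part of the repo. It assumes a hypothetical layout in which each test case is a directory directly under some root and contains its own test.sh. The names run_test_script and score_agent are made up for illustration.

```python
from __future__ import annotations

import subprocess
from pathlib import Path
from typing import Callable


def run_test_script(case_dir: Path) -> bool:
    """Run the case's test.sh with bash and report whether it exited 0."""
    result = subprocess.run(["bash", "test.sh"], cwd=case_dir)
    return result.returncode == 0


def score_agent(
    cases_root: Path,
    run: Callable[[Path], bool] = run_test_script,
) -> tuple[int, int]:
    """Return (passed, total) over every case directory that has a test.sh.

    Assumption: cases live one level below cases_root. Directories without
    a test.sh are skipped rather than counted as failures.
    """
    case_dirs = sorted(p.parent for p in cases_root.glob("*/test.sh"))
    passed = sum(1 for case_dir in case_dirs if run(case_dir))
    return passed, len(case_dirs)
```

Calling score_agent once per agent, on the directory holding that agent's outputs, would give a "passed / total" figure that run.sh could print at the end. The run parameter is only there so the check can be swapped out, for example in unit tests.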

Describe alternatives you've considered

Additional context

rbren, Mar 27 '24 21:03