Add TestGenEval benchmark
End-user friendly description of the problem this fixes or functionality that this introduces
Adds a new unit test generation benchmark TestGenEval: https://arxiv.org/abs/2410.00752
Give a summary of what the PR does, explaining any non-trivial design decisions
This PR includes changes to:
- Measure coverage (see the illustrative sketch after this list)
- Measure mutation score
- Push Docker images for TestGenEval with testing dependencies
- Add prompts for measuring CodeAct performance
- Compute a wide range of lexical metrics (ROUGE, CodeBLEU, readability, etc.)
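For concreteness, here is a minimal sketch (not the PR's actual harness) of how line coverage for a generated test file might be computed, assuming pytest and coverage.py are available inside the instance image; the function name and paths are illustrative:

```python
# Illustrative only: run a generated test file under coverage.py and return
# the overall line-coverage percentage for the file under test.
import json
import subprocess


def measure_coverage(test_file: str, source_file: str) -> float:
    # Run the generated tests under coverage, restricted to the file under test.
    # check=False because the suite may partially fail; we still want coverage data.
    subprocess.run(
        ["coverage", "run", f"--include={source_file}", "-m", "pytest", test_file],
        check=False,
    )
    # Emit a JSON report and read the percent of lines covered.
    subprocess.run(["coverage", "json", "-o", "coverage.json"], check=True)
    with open("coverage.json") as f:
        report = json.load(f)
    return report["totals"]["percent_covered"]
```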
Note: This is a clean version of PR #5534 that contains only the TestGenEval changes.
Hmm, I tried today and am not able to reproduce this. I'm wondering what may be causing it?
I don't think this has to do with the testgeneval dependencies either (it's caused by the llama group dependencies, which pin torch==2.5.1).
Hmm, I'll take another look.
Sorry again that this took me so long, but I'm looking at this now. I got past my previous issue but ran into the problem below:
...
File "/Users/gneubig/work/OpenHands/evaluation/benchmarks/testgeneval/run_infer.py", line 118, in truncate_prompt
encoding = tiktoken.encoding_for_model(model)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/gneubig/Library/Caches/pypoetry/virtualenvs/openhands-ai-mHbwbJq6-py3.12/lib/python3.12/site-packages/tiktoken/model.py", line 105, in encoding_for_model
return get_encoding(encoding_name_for_model(model_name))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/gneubig/Library/Caches/pypoetry/virtualenvs/openhands-ai-mHbwbJq6-py3.12/lib/python3.12/site-packages/tiktoken/model.py", line 92, in encoding_name_for_model
raise KeyError(
KeyError: 'Could not automatically map openai/claude-3-5-sonnet-20241022 to a tokeniser. Please use `tiktoken.get_encoding` to explicitly get the tokeniser you expect.'
This was due to prompt truncation. If truncation is necessary in OpenHands, I think it's something we should handle on the OpenHands side, not the benchmark side, so I removed the code for now; things seem to be working OK with Claude (although it failed on some instances). I'll update once I've run a full eval.
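If truncation does get re-added on the OpenHands side, a minimal sketch of a tokenizer lookup with an explicit fallback instead of relying on tiktoken's model map (the function name and token limit here are illustrative, not the PR's code):

```python
# Illustrative only: resolve a tiktoken encoding with an explicit fallback for
# model names tiktoken does not know (e.g. openai/claude-3-5-sonnet-20241022).
import tiktoken


def get_encoding_for(model: str):
    try:
        return tiktoken.encoding_for_model(model)
    except KeyError:
        # Fall back to a generic encoding; we only need approximate token counts.
        return tiktoken.get_encoding("cl100k_base")


def truncate_prompt(prompt: str, model: str, max_tokens: int = 8192) -> str:
    enc = get_encoding_for(model)
    tokens = enc.encode(prompt)
    return enc.decode(tokens[:max_tokens])
```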
OK, run_infer.py seems to be working, but I'm not sure about evaluation.
The README says to use ./evaluation/benchmarks/testgeneval/scripts/eval_infer.sh, but this file does not exist; only ./evaluation/benchmarks/testgeneval/scripts/eval_infer_remote.sh does. @kjain14, could you elaborate on how you ran evaluation?
Hi @kjain14, I think this is getting pretty close, but now I'm having an issue with codebleu:
poetry run python evaluation/benchmarks/testgeneval/eval_infer.py --eval-num-workers 1 --input-file evaluation/evaluation_outputs/outputs/ --dataset kjain14/testgenevallite --split test
/Users/gneubig/Library/Caches/pypoetry/virtualenvs/openhands-ai-mHbwbJq6-py3.12/lib/python3.12/site-packages/tree_sitter/__init__.py:36: FutureWarning: Language(path, name) is deprecated. Use Language(ptr, name) instead.
  warn("{} is deprecated. Use {} instead.".format(old, new), FutureWarning)
Traceback (most recent call last):
  File "/Users/gneubig/work/OpenHands/evaluation/benchmarks/testgeneval/eval_infer.py", line 22, in <module>
    from evaluation.benchmarks.testgeneval.metrics import (
  File "/Users/gneubig/work/OpenHands/evaluation/benchmarks/testgeneval/metrics.py", line 305, in <module>
    "Java8": Evaluator("java"),
             ^^^^^^^^^^^^^^^^^
  File "/Users/gneubig/work/OpenHands/evaluation/benchmarks/testgeneval/CodeBLEU/Evaluator.py", line 38, in __init__
    self.parser_language = Language(this_dir / 'parser' / 'my-languages.so', lang)
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/gneubig/Library/Caches/pypoetry/virtualenvs/openhands-ai-mHbwbJq6-py3.12/lib/python3.12/site-packages/tree_sitter/__init__.py", line 132, in __init__
    self.lib = cdll.LoadLibrary(fspath(path_or_ptr))
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/gneubig/miniconda3/envs/openhands/lib/python3.12/ctypes/__init__.py", line 460, in LoadLibrary
    return self._dlltype(name)
           ^^^^^^^^^^^^^^^^^^^
  File "/Users/gneubig/miniconda3/envs/openhands/lib/python3.12/ctypes/__init__.py", line 379, in __init__
    self._handle = _dlopen(self._name, mode)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^
OSError: dlopen(/Users/gneubig/work/OpenHands/evaluation/benchmarks/testgeneval/CodeBLEU/parser/my-languages.so, 0x0006): tried: '/Users/gneubig/work/OpenHands/evaluation/benchmarks/testgeneval/CodeBLEU/parser/my-languages.so' (no such file), '/System/Volumes/Preboot/Cryptexes/OS/Users/gneubig/work/OpenHands/evaluation/benchmarks/testgeneval/CodeBLEU/parser/my-languages.so' (no such file), '/Users/gneubig/work/OpenHands/evaluation/benchmarks/testgeneval/CodeBLEU/parser/my-languages.so' (no such file)
This should be fixed now (was being gitignored previously)
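For context, this is roughly how a combined grammar library like my-languages.so is built and loaded with the pre-0.22 py-tree-sitter API (the grammar checkout paths below are hypothetical, and this is not necessarily how the vendored CodeBLEU code generates the file):

```python
# Illustrative only: build a combined tree-sitter grammar library and load one
# language from it using the old (pre-0.22) py-tree-sitter API, where
# Language.build_library and Language(path, name) still exist.
from tree_sitter import Language, Parser

Language.build_library(
    "evaluation/benchmarks/testgeneval/CodeBLEU/parser/my-languages.so",
    ["vendor/tree-sitter-python", "vendor/tree-sitter-java"],  # hypothetical checkouts
)

PY_LANGUAGE = Language(
    "evaluation/benchmarks/testgeneval/CodeBLEU/parser/my-languages.so", "python"
)
parser = Parser()
parser.set_language(PY_LANGUAGE)
```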
Just a thought about this addition:
Could we have the codebleu library as a regular Python dependency?
In general, we have optional dependencies for evaluation in the poetry 'evaluation' group. Do you think it can be done that way?
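For illustration, if the PyPI codebleu package were used, the metric computation would look roughly like this (a sketch; the reference and prediction snippets are placeholders):

```python
# Illustrative only: compute CodeBLEU via the PyPI `codebleu` package instead
# of the vendored CodeBLEU/ directory.
from codebleu import calc_codebleu

reference = "def add(a, b):\n    return a + b\n"
prediction = "def add(x, y):\n    return x + y\n"

result = calc_codebleu([reference], [prediction], lang="python",
                       weights=(0.25, 0.25, 0.25, 0.25))
print(result["codebleu"])  # overall CodeBLEU score in [0, 1]
```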
This is possible, but it requires upgrading the tree-sitter version (is there a reason it is currently pinned?).
I'm working on upgrading the tree-sitter version!
@kjain14, tree-sitter was updated in main; could you check whether it works now?
It seems like the codebleu package only works with a narrow range of tree-sitter versions (higher than the previous v0.21.0 but lower than the current version). Could we adjust it to work with this version (or alternatively we can just use the code I have)?
Looks like there is a PR to do this on the codebleu repo, but no response yet: https://github.com/k4black/codebleu/pull/76
Because codebleu (0.7.0) depends on tree-sitter (>=0.22.0,<0.23.0)
and no versions of codebleu match >0.7.0,<0.8.0, codebleu (>=0.7.0,<0.8.0) requires tree-sitter (>=0.22.0,<0.23.0).
So, because openhands-ai depends on both tree-sitter (>=0.24.0,<0.25.0) and codebleu (^0.7.0), version solving failed.
Hey @kjain14, sorry this is taking so long, but maybe we could just remove the codebleu package? Looking at the paper, CodeBLEU isn't even mentioned, so I'm guessing it's not super important?
Sorry for the delay on this, I can remove the codebleu package.
Thank you!
@openhands please do the following:
- check the diff with the base branch and revert all changes outside of the evaluation/benchmarks/testgeneval/ directory
- merge the main branch of the repo
- remove the dependency on codebleu and any code that calculates codebleu while making minimal changes
OpenHands is working. @neubig can track my progress at all-hands.dev.