Add TestGenEval benchmark
End-user friendly description of the problem this fixes or functionality that this introduces
Adds a new unit test generation benchmark TestGenEval: https://arxiv.org/abs/2410.00752
Give a summary of what the PR does, explaining any non-trivial design decisions
This PR includes changes to:
- Measure coverage (see the illustrative sketch after this list)
- Measure mutation score
- Push Docker images for TestGenEval with testing dependencies
- Add prompts for measuring CodeAct performance
- Compute a wide range of lexical metrics (ROUGE, CodeBLEU, readability, etc.)
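For concreteness, here is a minimal sketch (not the PR's actual harness) of how line coverage for a generated test file might be computed, assuming pytest and coverage.py are available inside the instance image; the function name and paths are illustrative:

```python
# Illustrative only: run a generated test file under coverage.py and return
# the overall line-coverage percentage for the file under test.
import json
import subprocess


def measure_coverage(test_file: str, source_file: str) -> float:
    # Run the generated tests under coverage, restricted to the file under test.
    # check=False because the suite may partially fail; we still want coverage data.
    subprocess.run(
        ["coverage", "run", f"--include={source_file}", "-m", "pytest", test_file],
        check=False,
    )
    # Emit a JSON report and read the percent of lines covered.
    subprocess.run(["coverage", "json", "-o", "coverage.json"], check=True)
    with open("coverage.json") as f:
        report = json.load(f)
    return report["totals"]["percent_covered"]
```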
Note: This is a clean version of PR #5534 that contains only the TestGenEval changes.
Hmm, I tried today and am not able to reproduce this. I'm wondering what may be causing it?
I don't think this has to do with the testgeneval dependencies either (it's caused by the llama group dependencies, which pin torch==2.5.1).
Hmm, I'll take another look.
Sorry again that this took me so long, but I'm looking at this now. I got past my previous issue but ran into the problem below:
...
File "/Users/gneubig/work/OpenHands/evaluation/benchmarks/testgeneval/run_infer.py", line 118, in truncate_prompt
encoding = tiktoken.encoding_for_model(model)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/gneubig/Library/Caches/pypoetry/virtualenvs/openhands-ai-mHbwbJq6-py3.12/lib/python3.12/site-packages/tiktoken/model.py", line 105, in encoding_for_model
return get_encoding(encoding_name_for_model(model_name))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/gneubig/Library/Caches/pypoetry/virtualenvs/openhands-ai-mHbwbJq6-py3.12/lib/python3.12/site-packages/tiktoken/model.py", line 92, in encoding_name_for_model
raise KeyError(
KeyError: 'Could not automatically map openai/claude-3-5-sonnet-20241022 to a tokeniser. Please use `tiktoken.get_encoding` to explicitly get the tokeniser you expect.'
This was due to prompt truncation. If truncation is necessary in OpenHands, I think it's something we should handle on the OpenHands side, not the benchmark side, so I removed the code for now; things seem to be working OK with Claude (although it failed on some instances). I'll update once I've run a full eval.
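If truncation does get re-added on the OpenHands side, a minimal sketch of a tokenizer lookup with an explicit fallback instead of relying on tiktoken's model map (the function name and token limit here are illustrative, not the PR's code):

```python
# Illustrative only: resolve a tiktoken encoding with an explicit fallback for
# model names tiktoken does not know (e.g. openai/claude-3-5-sonnet-20241022).
import tiktoken


def get_encoding_for(model: str):
    try:
        return tiktoken.encoding_for_model(model)
    except KeyError:
        # Fall back to a generic encoding; we only need approximate token counts.
        return tiktoken.get_encoding("cl100k_base")


def truncate_prompt(prompt: str, model: str, max_tokens: int = 8192) -> str:
    enc = get_encoding_for(model)
    tokens = enc.encode(prompt)
    return enc.decode(tokens[:max_tokens])
```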
OK, run_infer.py seems to be working, but I'm not sure about evaluation.
The README says to use ./evaluation/benchmarks/testgeneval/scripts/eval_infer.sh, but this file does not exist; only ./evaluation/benchmarks/testgeneval/scripts/eval_infer_remote.sh does. @kjain14, could you elaborate on how you ran evaluation?
Hi @kjain14, I think this is getting pretty close, but now I'm having an issue with codebleu:
poetry run python evaluation/benchmarks/testgeneval/eval_infer.py --eval-num-workers 1 --input-file evaluation/evaluation_outputs/outputs/ --dataset kjain14/testgenevallite --split test
/Users/gneubig/Library/Caches/pypoetry/virtualenvs/openhands-ai-mHbwbJq6-py3.12/lib/python3.12/site-packages/tree_sitter/__init__.py:36: FutureWarning: Language(path, name) is deprecated. Use Language(ptr, name) instead.
  warn("{} is deprecated. Use {} instead.".format(old, new), FutureWarning)
Traceback (most recent call last):
  File "/Users/gneubig/work/OpenHands/evaluation/benchmarks/testgeneval/eval_infer.py", line 22, in <module>
    from evaluation.benchmarks.testgeneval.metrics import (
  File "/Users/gneubig/work/OpenHands/evaluation/benchmarks/testgeneval/metrics.py", line 305, in <module>
    "Java8": Evaluator("java"),
             ^^^^^^^^^^^^^^^^^
  File "/Users/gneubig/work/OpenHands/evaluation/benchmarks/testgeneval/CodeBLEU/Evaluator.py", line 38, in __init__
    self.parser_language = Language(this_dir / 'parser' / 'my-languages.so', lang)
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/gneubig/Library/Caches/pypoetry/virtualenvs/openhands-ai-mHbwbJq6-py3.12/lib/python3.12/site-packages/tree_sitter/__init__.py", line 132, in __init__
    self.lib = cdll.LoadLibrary(fspath(path_or_ptr))
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/gneubig/miniconda3/envs/openhands/lib/python3.12/ctypes/__init__.py", line 460, in LoadLibrary
    return self._dlltype(name)
           ^^^^^^^^^^^^^^^^^^^
  File "/Users/gneubig/miniconda3/envs/openhands/lib/python3.12/ctypes/__init__.py", line 379, in __init__
    self._handle = _dlopen(self._name, mode)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^
OSError: dlopen(/Users/gneubig/work/OpenHands/evaluation/benchmarks/testgeneval/CodeBLEU/parser/my-languages.so, 0x0006): tried: '/Users/gneubig/work/OpenHands/evaluation/benchmarks/testgeneval/CodeBLEU/parser/my-languages.so' (no such file), '/System/Volumes/Preboot/Cryptexes/OS/Users/gneubig/work/OpenHands/evaluation/benchmarks/testgeneval/CodeBLEU/parser/my-languages.so' (no such file), '/Users/gneubig/work/OpenHands/evaluation/benchmarks/testgeneval/CodeBLEU/parser/my-languages.so' (no such file)
This should be fixed now (was being gitignored previously)
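For context, this is roughly how a combined grammar library like my-languages.so is built and loaded with the pre-0.22 py-tree-sitter API (the grammar checkout paths below are hypothetical, and this is not necessarily how the vendored CodeBLEU code generates the file):

```python
# Illustrative only: build a combined tree-sitter grammar library and load one
# language from it using the old (pre-0.22) py-tree-sitter API, where
# Language.build_library and Language(path, name) still exist.
from tree_sitter import Language, Parser

Language.build_library(
    "evaluation/benchmarks/testgeneval/CodeBLEU/parser/my-languages.so",
    ["vendor/tree-sitter-python", "vendor/tree-sitter-java"],  # hypothetical checkouts
)

PY_LANGUAGE = Language(
    "evaluation/benchmarks/testgeneval/CodeBLEU/parser/my-languages.so", "python"
)
parser = Parser()
parser.set_language(PY_LANGUAGE)
```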
Just a thought about this addition:
Could we have the codebleu library as a regular Python dependency?
In general, we have optional dependencies for evaluation in the poetry 'evaluation' group. Do you think it can be done that way?
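For illustration, if the PyPI codebleu package were used, the metric computation would look roughly like this (a sketch; the reference and prediction snippets are placeholders):

```python
# Illustrative only: compute CodeBLEU via the PyPI `codebleu` package instead
# of the vendored CodeBLEU/ directory.
from codebleu import calc_codebleu

reference = "def add(a, b):\n    return a + b\n"
prediction = "def add(x, y):\n    return x + y\n"

result = calc_codebleu([reference], [prediction], lang="python",
                       weights=(0.25, 0.25, 0.25, 0.25))
print(result["codebleu"])  # overall CodeBLEU score in [0, 1]
```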
This is possible, but it requires upgrading the tree-sitter version (is there a reason it is currently pinned?).
I'm working on upgrading the tree-sitter version!
@kjain14, tree-sitter was updated in main; could you check whether it works now?
It seems like the codebleu package only works with a narrow range of tree-sitter versions (higher than the previous v0.21.0 but lower than the current version). Could we adjust it to work with this version (or alternatively we can just use the code I have)?
Looks like there is a PR to do this on the codebleu repo, but no response yet: https://github.com/k4black/codebleu/pull/76
Because codebleu (0.7.0) depends on tree-sitter (>=0.22.0,<0.23.0)
and no versions of codebleu match >0.7.0,<0.8.0, codebleu (>=0.7.0,<0.8.0) requires tree-sitter (>=0.22.0,<0.23.0).
So, because openhands-ai depends on both tree-sitter (>=0.24.0,<0.25.0) and codebleu (^0.7.0), version solving failed.
Hey @kjain14, sorry this is taking so long, but maybe we could just remove the codebleu package? Looking at the paper, CodeBLEU isn't even mentioned, so I'm guessing it's not super important?
Sorry for the delay on this, I can remove the codebleu package.
Thank you!
@openhands please do the following:
- check the diff with the base branch and revert all changes outside of the evaluation/benchmarks/testgeneval/ directory
- merge the main branch of the repo
- remove the dependency on codebleu and any code that calculates codebleu while making minimal changes
OpenHands is working. @neubig can track my progress at all-hands.dev.