
Add TestGenEval benchmark

Open kjain14 opened this issue 1 year ago • 14 comments

End-user friendly description of the problem this fixes or functionality that this introduces

Adds a new unit test generation benchmark TestGenEval: https://arxiv.org/abs/2410.00752


Give a summary of what the PR does, explaining any non-trivial design decisions

This PR adds:

  • Coverage measurement
  • Mutation score measurement
  • Docker images for TestGenEval with testing dependencies, pushed to a registry
  • Prompts for measuring CodeAct performance
  • A wide range of lexical metrics (ROUGE, CodeBLEU, readability, etc.)

Note: This is a clean version of PR #5534 that contains only the TestGenEval changes.

kjain14 avatar Dec 11 '24 22:12 kjain14

Hmm, I tried today and am not able to reproduce this. I wonder what may be causing it?

I also don't think this is related to the testgeneval dependencies (it is caused by the llama group dependencies, which pin torch==2.5.1).

kjain14 avatar Dec 27 '24 18:12 kjain14

Hmm, I'll take another look.

neubig avatar Dec 28 '24 19:12 neubig

Sorry again that this took me so long, but I'm looking at it now. I got past my previous issue but ran into the problem below:

...
  File "/Users/gneubig/work/OpenHands/evaluation/benchmarks/testgeneval/run_infer.py", line 118, in truncate_prompt
    encoding = tiktoken.encoding_for_model(model)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/gneubig/Library/Caches/pypoetry/virtualenvs/openhands-ai-mHbwbJq6-py3.12/lib/python3.12/site-packages/tiktoken/model.py", line 105, in encoding_for_model
    return get_encoding(encoding_name_for_model(model_name))
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/gneubig/Library/Caches/pypoetry/virtualenvs/openhands-ai-mHbwbJq6-py3.12/lib/python3.12/site-packages/tiktoken/model.py", line 92, in encoding_name_for_model
    raise KeyError(
KeyError: 'Could not automatically map openai/claude-3-5-sonnet-20241022 to a tokeniser. Please use `tiktoken.get_encoding` to explicitly get the tokeniser you expect.'

This was due to prompt truncation. If this is necessary in OpenHands, I think it's something we should handle on the OpenHands side, not the benchmark side, so I removed the code for now and things seem to be working OK with Claude (although it failed on some instances). I'll update once I've run a full eval.

neubig avatar Feb 09 '25 12:02 neubig
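One way to avoid this class of failure, sketched here as an assumption rather than the PR's actual fix: wrap `tiktoken.encoding_for_model` in a try/except and fall back to an explicit encoding via `tiktoken.get_encoding`. The function name `safe_encoding` and the choice of `cl100k_base` as the fallback are illustrative:

```python
try:
    import tiktoken  # optional dependency; guard the import for this sketch
except ImportError:
    tiktoken = None


def safe_encoding(model: str):
    """Return a tiktoken encoding for `model`, with an explicit fallback.

    tiktoken.encoding_for_model raises KeyError for names it cannot map,
    such as provider-prefixed ones like "openai/claude-3-5-sonnet-20241022".
    """
    if tiktoken is None:
        return None  # tokenizer unavailable; caller should skip truncation
    try:
        return tiktoken.encoding_for_model(model)
    except KeyError:
        # Fall back to a fixed encoding rather than crashing run_infer.py.
        return tiktoken.get_encoding("cl100k_base")
```

With a pattern like this, truncation would degrade to an approximate token count for unmapped model names instead of raising a KeyError.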

OK, run_infer.py seems to be working, but I'm not sure about evaluation.

The README says to use ./evaluation/benchmarks/testgeneval/scripts/eval_infer.sh, but this file does not exist; only ./evaluation/benchmarks/testgeneval/scripts/eval_infer_remote.sh does. @kjain14 , could you elaborate on how you ran evaluation?

neubig avatar Feb 09 '25 13:02 neubig

Hi @kjain14 , I think this is getting pretty close, but now I'm having an issue with codebleu:

poetry run python evaluation/benchmarks/testgeneval/eval_infer.py --eval-num-workers 1 --input-file evaluation/evaluation_outputs/outputs/ --dataset kjain14/testgenevallite --split test
/Users/gneubig/Library/Caches/pypoetry/virtualenvs/openhands-ai-mHbwbJq6-py3.12/lib/python3.12/site-packages/tree_sitter/__init__.py:36: FutureWarning: Language(path, name) is deprecated. Use Language(ptr, name) instead.
  warn("{} is deprecated. Use {} instead.".format(old, new), FutureWarning)
Traceback (most recent call last):
  File "/Users/gneubig/work/OpenHands/evaluation/benchmarks/testgeneval/eval_infer.py", line 22, in <module>
    from evaluation.benchmarks.testgeneval.metrics import (
  File "/Users/gneubig/work/OpenHands/evaluation/benchmarks/testgeneval/metrics.py", line 305, in <module>
    "Java8": Evaluator("java"),
             ^^^^^^^^^^^^^^^^^
  File "/Users/gneubig/work/OpenHands/evaluation/benchmarks/testgeneval/CodeBLEU/Evaluator.py", line 38, in __init__
    self.parser_language = Language(this_dir / 'parser' / 'my-languages.so', lang)
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/gneubig/Library/Caches/pypoetry/virtualenvs/openhands-ai-mHbwbJq6-py3.12/lib/python3.12/site-packages/tree_sitter/__init__.py", line 132, in __init__
    self.lib = cdll.LoadLibrary(fspath(path_or_ptr))
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/gneubig/miniconda3/envs/openhands/lib/python3.12/ctypes/__init__.py", line 460, in LoadLibrary
    return self._dlltype(name)
           ^^^^^^^^^^^^^^^^^^^
  File "/Users/gneubig/miniconda3/envs/openhands/lib/python3.12/ctypes/__init__.py", line 379, in __init__
    self._handle = _dlopen(self._name, mode)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^
OSError: dlopen(/Users/gneubig/work/OpenHands/evaluation/benchmarks/testgeneval/CodeBLEU/parser/my-languages.so, 0x0006): tried: '/Users/gneubig/work/OpenHands/evaluation/benchmarks/testgeneval/CodeBLEU/parser/my-languages.so' (no such file), '/System/Volumes/Preboot/Cryptexes/OS/Users/gneubig/work/OpenHands/evaluation/benchmarks/testgeneval/CodeBLEU/parser/my-languages.so' (no such file), '/Users/gneubig/work/OpenHands/evaluation/benchmarks/testgeneval/CodeBLEU/parser/my-languages.so' (no such file)

neubig avatar Feb 10 '25 16:02 neubig

This should be fixed now (was being gitignored previously)

kjain14 avatar Feb 11 '25 11:02 kjain14
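As a side note, the opaque dlopen error above could be turned into an actionable one with a small guard before loading the grammar library. The helper name, path, and message below are illustrative, not the PR's actual code:

```python
from pathlib import Path


def require_parser_lib(lib_path: str) -> Path:
    """Fail fast with a clear message if the prebuilt tree-sitter grammar
    library is missing (e.g. because it was gitignored and never committed)."""
    path = Path(lib_path)
    if not path.exists():
        raise FileNotFoundError(
            f"{path} not found; it may be gitignored (force-add it with "
            f"`git add -f {path}`) or it may need to be rebuilt"
        )
    return path
```

Checking for the file up front replaces the three-line ctypes traceback with a message that points at the likely cause.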

Just a thought about this addition: [image]

Could we have the codebleu library as a regular Python dependency?

In general, we have optional dependencies for evaluation in the poetry 'evaluation' group. Do you think it can be done that way?

enyst avatar Feb 11 '25 12:02 enyst
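For context, a hedged sketch of what an optional poetry dependency group looks like; the group name matches the thread, but the exact pin is illustrative:

```toml
# Hypothetical pyproject.toml fragment: optional dependency group for evaluation.
[tool.poetry.group.evaluation]
optional = true

[tool.poetry.group.evaluation.dependencies]
codebleu = "^0.7.0"  # illustrative pin; this range conflicts with newer tree-sitter
```

Such a group is only installed on demand, e.g. with `poetry install --with evaluation`, so regular users do not pay for evaluation-only dependencies.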

This is possible, but it requires upgrading the tree-sitter version (is there a reason it is currently pinned?)

kjain14 avatar Feb 13 '25 17:02 kjain14

I'm working on upgrading the tree-sitter version!

neubig avatar Feb 16 '25 23:02 neubig

@kjain14 tree-sitter was updated in main, you may want to see if it works now?

enyst avatar Feb 17 '25 19:02 enyst

It seems like the codebleu package only works with a very specific tree-sitter version (higher than the previous v0.21.0 but lower than the current one). Could we adjust it to work with that version, or alternatively just use the code I have?

Looks like there is a PR on the codebleu repo to do this, but no response yet: https://github.com/k4black/codebleu/pull/76

Because codebleu (0.7.0) depends on tree-sitter (>=0.22.0,<0.23.0)
 and no versions of codebleu match >0.7.0,<0.8.0, codebleu (>=0.7.0,<0.8.0) requires tree-sitter (>=0.22.0,<0.23.0).
So, because openhands-ai depends on both tree-sitter (>=0.24.0,<0.25.0) and codebleu (^0.7.0), version solving failed.

kjain14 avatar Feb 17 '25 19:02 kjain14

Hey @kjain14 , sorry this is taking so long, but maybe we could just remove the codebleu package? It isn't even mentioned in the paper, so I'm guessing it's not super important?

neubig avatar Mar 07 '25 13:03 neubig

Sorry for the delay on this, I can remove the codebleu package.

kjain14 avatar Mar 11 '25 11:03 kjain14

Thank you!

neubig avatar Mar 12 '25 13:03 neubig

@openhands please do the following:

  1. check the diff with the base branch and revert all changes outside of the evaluation/benchmarks/testgeneval/ directory
  2. merge the main branch of the repo
  3. remove the dependency on codebleu and any code that calculates codebleu while making minimal changes

neubig avatar Mar 17 '25 16:03 neubig
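Step 1 above can be sketched as a git recipe. The toy repository below only demonstrates the idea of reverting all changes outside one directory back to the base branch; the branch names and file paths mirror the thread, and none of this is the bot's actual procedure:

```shell
#!/bin/sh
# Toy demo: revert every change outside evaluation/benchmarks/testgeneval/
# back to the base branch, keeping changes inside that directory.
set -eu
tmp=$(mktemp -d)
cd "$tmp"
git init -q
git checkout -qb main                 # fix the base branch name for the demo
git config user.email test@example.com
git config user.name test
mkdir -p evaluation/benchmarks/testgeneval other
echo base > other/file.txt
echo base > evaluation/benchmarks/testgeneval/run_infer.py
git add -A && git commit -qm base
git checkout -qb testgeneval-pr
echo changed > other/file.txt
echo changed > evaluation/benchmarks/testgeneval/run_infer.py
git commit -qam "pr changes"
# List changed files outside the benchmark directory using an exclude
# pathspec, then restore each one to its base-branch version.
git diff --name-only main -- . ':(exclude)evaluation/benchmarks/testgeneval' \
  | while read -r f; do git checkout -q main -- "$f"; done
git commit -qam "revert changes outside testgeneval"
```

After this runs, other/file.txt is back to its base-branch content while the file inside evaluation/benchmarks/testgeneval/ keeps the branch's changes.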

Openhands is working, @neubig can track my progress at all-hands.dev

openhands-ai[bot] avatar Mar 17 '25 16:03 openhands-ai[bot]