lm-evaluation-harness
Support empty response for Completions and ChatCompletions API
This PR fixes a bug where evaluation fails if any of the text/content responses from the API under test is None.
This happens because the parse_generations functions are annotated as returning List[str], but actually return List[Optional[str]] when the API response is None.
This error scenario is more likely with reasoning models.
Some frameworks put the reasoning tokens in a separate field, and without a high enough max_gen_toks the entire token budget is spent on reasoning tokens, leaving the text/content field as None.
I've hit this error scenario when testing gpt-oss-20b on vLLM with the gsm8k_cot_llama dataset.
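For context, a minimal sketch of the failure mode; the response shape and field names (e.g. reasoning_content) are illustrative and depend on the serving framework, not taken from the harness itself:

```python
# Hypothetical ChatCompletions-style response from a reasoning model whose
# whole max_gen_toks budget was consumed by reasoning tokens: the visible
# answer field comes back as None.
response = {
    "choices": [
        {
            "message": {
                "content": None,  # no final answer was produced
                "reasoning_content": "Let's work through the problem step by step ...",
            }
        }
    ]
}

content = response["choices"][0]["message"]["content"]
# Any downstream code that assumes a str then breaks, e.g.:
#   content.strip()  ->  AttributeError: 'NoneType' object has no attribute 'strip'
```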
Thanks for the PR! I left a comment about adding a warning in case an empty response is unexpected. Also, if you could run pre-commit for the formatting:
pip install pre-commit
pre-commit run --all-files
Thanks for the feedback. I've added a warning, tested that it's being emitted, and ran `pre-commit run --all-files`.
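For reference, a minimal sketch of the shape of the fix; the helper name coerce_generation and logger name are hypothetical, not the actual patch:

```python
import logging
from typing import Optional

eval_logger = logging.getLogger("lm-eval")


def coerce_generation(text: Optional[str]) -> str:
    """Hypothetical helper: replace a None completion with an empty string
    and warn, since a None usually means the token budget was exhausted
    (e.g. spent entirely on reasoning tokens)."""
    if text is None:
        eval_logger.warning(
            "Received a None text/content field from the API; substituting an "
            "empty string. If this is unexpected, consider raising max_gen_toks."
        )
        return ""
    return text
```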
Hey, nice to see this being fixed; we already bumped into this issue before and patched it similarly.
Our approaches differ slightly: I would consider not using an empty string in the case of a None answer (tokens exhausted during reasoning), because a None can also be produced after/during response filtering. We used a fallback string ([invalidanswer]), imitating what we found in some regex extractions.
see https://github.com/EleutherAI/lm-evaluation-harness/issues/3244
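Sketched, the alternative described above would look roughly like this (again a hypothetical helper, not the actual patch):

```python
from typing import Optional

# Sentinel mirroring the fallback string some regex-based answer extractions use.
INVALID_ANSWER = "[invalidanswer]"


def coerce_generation(text: Optional[str]) -> str:
    """Return a recognizable sentinel instead of "" so a missing answer is never
    confused with a genuinely empty (but valid) completion, and so metrics that
    regex-match the answer still see a well-defined string."""
    return text if text is not None else INVALID_ANSWER
```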
Great, I would be happy to see this fixed one way or another.