
Support empty response for Completions and ChatCompletions API

Open tboerstad opened this issue 2 months ago • 3 comments

This PR fixes a bug where evaluation fails if any of the text/content responses from the API under test is None. This happens because the parse_generations functions are annotated as returning List[str], but actually return List[Optional[str]] when the API response is None.

This error scenario is more likely with reasoning models. Some frameworks put the reasoning tokens in a separate field, and if max_gen_toks is not high enough, the entire token budget is spent on reasoning tokens, leaving the text/content field as None.

I've hit this error scenario when testing gpt-oss-20b on vLLM with the gsm8k_cot_llama dataset.
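The fix described above can be sketched as follows. This is a minimal illustration, not the PR's actual diff; the function name coalesce_responses and the logger setup are assumptions for the example:

```python
import logging
from typing import List, Optional

logger = logging.getLogger(__name__)


def coalesce_responses(texts: List[Optional[str]]) -> List[str]:
    """Replace None entries with an empty string so downstream code
    that expects List[str] does not crash.

    A None entry can occur when the model spends its entire token
    budget on reasoning tokens, leaving the text/content field empty.
    """
    out: List[str] = []
    for text in texts:
        if text is None:
            # Warn so an unexpected empty response is not silently swallowed.
            logger.warning(
                "Received a None text/content response from the API; "
                "substituting an empty string. Consider raising max_gen_toks."
            )
            out.append("")
        else:
            out.append(text)
    return out
```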

tboerstad avatar Sep 22 '25 22:09 tboerstad

CLA assistant check
All committers have signed the CLA.

CLAassistant avatar Sep 22 '25 22:09 CLAassistant

Thanks for the PR! Left a comment to add a warning in case an empty response is unexpected. Also, if you could run pre-commit for the formatting:

pip install pre-commit
pre-commit run --all-files

baberabb avatar Oct 02 '25 20:10 baberabb

Thanks for the feedback. I've added a warning, verified that it's being emitted, and also ran pre-commit run --all-files.

tboerstad avatar Oct 03 '25 09:10 tboerstad

Hey, nice to see this being fixed. We bumped into this issue before and patched it similarly.

Our approaches differ slightly: I would consider not using an empty string for a None answer (tokens exhausted during reasoning), because a None can also be produced during or after response filtering. We used a fallback string ([invalidanswer]), imitating what we found in some regex extractions.

see https://github.com/EleutherAI/lm-evaluation-harness/issues/3244
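The sentinel-string alternative described above can be sketched like this. The function name normalize_response is hypothetical; only the [invalidanswer] sentinel comes from the comment:

```python
from typing import Optional

# Sentinel mirroring the fallback used by some regex answer extractions,
# so a None response is distinguishable from a genuinely empty answer.
INVALID_ANSWER = "[invalidanswer]"


def normalize_response(text: Optional[str]) -> str:
    """Map a None response to a sentinel string instead of "".

    An empty string can be ambiguous because a None may be produced
    during or after response filtering; the sentinel keeps such cases
    visibly wrong rather than silently blank.
    """
    return text if text is not None else INVALID_ANSWER
```

The trade-off versus substituting "" is that the sentinel is guaranteed not to match any answer-extraction regex, so exhausted-budget responses are scored as wrong rather than possibly matching an empty pattern.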

RawthiL avatar Nov 24 '25 18:11 RawthiL

> Hey, nice to see this being fixed. We bumped into this issue before and patched it similarly.
>
> Our approaches differ slightly: I would consider not using an empty string for a None answer (tokens exhausted during reasoning), because a None can also be produced during or after response filtering. We used a fallback string ([invalidanswer]), imitating what we found in some regex extractions.
>
> see #3244

Great, I would be happy to see this fixed one way or another.

tboerstad avatar Nov 24 '25 20:11 tboerstad