LongBench
Why are empty responses ignored in LongBench v2?
We noticed that in pred.py (lines 98-99), empty responses are ignored and not included in the final score. Is this approach reasonable? We are concerned that models might exploit this by simply not responding to questions they are unsure about.
Hi! Empty responses only occur when an exception is raised during model calls, as seen here: https://github.com/THUDM/LongBench/blob/main/pred.py#L54. During evaluation, models always output some response, even when unsure, and never return an empty string.
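To make the concern and the reply concrete, here is a minimal sketch of the pattern being discussed. It is not the actual code from pred.py: the names `query_model`, `accuracy`, and `count_empty_as_wrong`, the retry count, and the record layout are all hypothetical, chosen only for illustration. It assumes an empty string is produced solely when the model call raises, and it compares scoring that skips empty responses with scoring that counts them as wrong.

```python
from typing import Callable, Dict, List


def query_model(call: Callable[[str], str], prompt: str, max_tries: int = 3) -> str:
    """Return the model's response, or '' if every attempt raises.

    Hypothetical helper: the retry count and the signature are assumptions.
    """
    for _ in range(max_tries):
        try:
            return call(prompt)
        except Exception:
            # API or network failure: try again, fall through to '' at the end.
            continue
    return ""


def accuracy(records: List[Dict], count_empty_as_wrong: bool = False) -> float:
    """Compute accuracy over records shaped like {'response': str, 'correct': bool}.

    By default, records with an empty response are skipped entirely.
    With count_empty_as_wrong=True they stay in the denominator as failures.
    """
    if count_empty_as_wrong:
        scored = records
    else:
        scored = [r for r in records if r["response"] != ""]
    if not scored:
        return 0.0
    n_correct = sum(1 for r in scored if r["response"] != "" and r["correct"])
    return n_correct / len(scored)
```

If empty responses really come only from failed calls, the two modes differ only by how many calls failed, so reporting the number of skipped items next to the score would let readers check that the gap is negligible.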