[BUG] MATH eval is calculated with non-greedy sampling
**Describe the bug**
Two metrics are reported for MATH: `Metrics.quasi_exact_match_math` and `Metrics.maj_at_4_math`. The first should be calculated using greedy generation, whereas the second uses sampling and majority voting. However, both metrics fall into the categories `MetricCategory.GENERATIVE` or `MetricCategory.GENERATIVE_SAMPLING`, which means that when the requests are built for this task, the two metrics are grouped into a single request in the `construct_requests` method:
```python
if (
    self.has_metric_category[MetricCategory.GENERATIVE_SAMPLING]
    or self.has_metric_category[MetricCategory.GENERATIVE]
    or self.has_metric_category[MetricCategory.GENERATIVE_LOGPROB]
):
    # All these tasks require the same generation process - we can do them in one step
    use_logits = self.has_metric_category[MetricCategory.GENERATIVE_LOGPROB]
    requests[RequestType.GREEDY_UNTIL] += [
        GreedyUntilRequest(
            task_name=current_task_name,
            sample_index=document_id_seed,
            request_index=0,
            context=context,
            stop_sequence=self.stop_sequence,
            generation_size=self.generation_size,
            generation_grammar=self.generation_grammar,
            num_samples=max(self.num_samples),  # If we have several samplings to apply, we use the max
            use_logits=use_logits,
            metric_categories=[
                c
                for c in [
                    MetricCategory.GENERATIVE_SAMPLING,
                    MetricCategory.GENERATIVE,
                    MetricCategory.GENERATIVE_LOGPROB,
                ]
                if self.has_metric_category[c]
            ],
        )
    ]
```
**To Reproduce**
```bash
lighteval accelerate --model_args=pretrained=Qwen/Qwen2.5-Math-1.5B-instruct,dtype=float16,use_chat_template=True --tasks="lighteval|math_cot:algebra|0|0" --output_dir=./evals/math_cot
```
I added some breakpoints and looked at the requests, generations, etc.
**Expected behavior**
I would expect separate requests to be generated for these two metrics rather than grouping `GENERATIVE_SAMPLING` and `GENERATIVE` together; the fix may be as trivial as separating these into two `if` statements (see the sketch below), though I am unsure of the implications for other tasks.
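For illustration, a hypothetical sketch of that split, reusing the fields from the snippet above (untested, and not an actual patch):

```python
# Hypothetical: emit a dedicated greedy request for GENERATIVE,
# and a separate sampling request for GENERATIVE_SAMPLING.
if self.has_metric_category[MetricCategory.GENERATIVE]:
    requests[RequestType.GREEDY_UNTIL] += [
        GreedyUntilRequest(
            task_name=current_task_name,
            sample_index=document_id_seed,
            request_index=0,
            context=context,
            stop_sequence=self.stop_sequence,
            generation_size=self.generation_size,
            generation_grammar=self.generation_grammar,
            num_samples=1,  # greedy: a single deterministic generation
            use_logits=self.has_metric_category[MetricCategory.GENERATIVE_LOGPROB],
            metric_categories=[MetricCategory.GENERATIVE],
        )
    ]
if self.has_metric_category[MetricCategory.GENERATIVE_SAMPLING]:
    requests[RequestType.GREEDY_UNTIL] += [
        GreedyUntilRequest(
            task_name=current_task_name,
            sample_index=document_id_seed,
            request_index=0,
            context=context,
            stop_sequence=self.stop_sequence,
            generation_size=self.generation_size,
            generation_grammar=self.generation_grammar,
            num_samples=max(self.num_samples),  # sampled generations for majority voting
            use_logits=False,
            metric_categories=[MetricCategory.GENERATIVE_SAMPLING],
        )
    ]
```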
cc @lewtun
Hi!
Grouping the requests when building them is not a bug: it allows us to run inference only `max_samples` times instead of `max_samples + 1` times.
So for `quasi_exact_match`, we keep only the first sample (which is the one we would normally have gotten with greedy generation), and for `maj@4` we keep samples 1 to 4 for the metric computations.
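In other words, roughly this (hypothetical variable names, not the actual lighteval code):

```python
from collections import Counter

# Hypothetical example: 4 decoded outputs from the single grouped request
# for one document.
generations = ["x = 2", "x = 3", "x = 2", "x = 2"]

# quasi_exact_match should score only the first sample, which is expected
# to be the greedy continuation.
greedy_prediction = generations[0]

# maj_at_4 takes a majority vote over samples 1 to 4.
maj_at_4_prediction = Counter(generations[:4]).most_common(1)[0][0]

print(greedy_prediction, maj_at_4_prediction)  # -> x = 2 x = 2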
Did you observe that `quasi_exact_match` was computed on 4 samples? (If yes, then we have a bug.)
Hi @clefourrier, isn't the first of the 4 samples computed using sampling with a temperature, rather than greedily?
Ok, got you! Yep, I checked the code and ran a small experiment, and indeed sampling with 1 sample does not give the same results as greedy generation (which is obvious a posteriori) :sweat_smile: I'll fix this ASAP, thanks for the report!
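For anyone who wants to verify this independently, a minimal sketch with transformers (the model choice is illustrative, this is not lighteval's actual inference code):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = tokenizer("The derivative of x^2 is", return_tensors="pt")

# Greedy decoding: deterministic, always picks the argmax token.
greedy = model.generate(**inputs, do_sample=False, max_new_tokens=20)

# Sampling a single sequence at t=1.0: draws from the softmax, so the one
# sample generally differs from the greedy continuation.
torch.manual_seed(0)
sampled = model.generate(
    **inputs, do_sample=True, temperature=1.0, num_return_sequences=1, max_new_tokens=20
)

print(tokenizer.decode(greedy[0]))
print(tokenizer.decode(sampled[0]))
```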
Thanks! I think that with the vllm backend, sampling with t=1.0 is also used even when `num_samples == 1`; in that case it should default to greedy to match the transformers backend (I believe). I am still confirming, but I will open another issue for that.
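A minimal sketch of the fallback I have in mind, assuming the public vLLM API (not lighteval's actual backend code):

```python
from vllm import SamplingParams

def make_sampling_params(num_samples: int, temperature: float = 1.0) -> SamplingParams:
    if num_samples == 1:
        # A single "sample" should be the greedy continuation:
        # temperature=0 makes vLLM decode greedily.
        return SamplingParams(n=1, temperature=0.0)
    return SamplingParams(n=num_samples, temperature=temperature)
```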
Thanks a lot!