Evals not showing grouping properly

Open aliasaria opened this issue 5 months ago • 1 comment

Run hellaswag, piqa, and winogrande using common-eleuther-ai-lm-eval-harness-mlx.
See the attached video: some rows show three reported metrics, some two, and some one.

https://github.com/user-attachments/assets/4079a238-87a6-4ff2-b0d7-c7d0840b6386

Jul 10 '25 18:07 aliasaria

Adding a note here: this only happens when we run a comparison with the harness, because each metric has its own test set and those test sets vary in size. The actual task here is to determine, based on the plugin, which eval reports should be grouped and which should not.
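
A rough sketch of what plugin-based grouping could look like, purely for illustration. The record keys (`plugin`, `task`, `metric`, `score`), the `UNGROUPED_PLUGINS` set, and the `build_rows` helper are all hypothetical and not taken from the codebase. Which plugins belong in the set is also an assumption made for the example.

```python
from collections import defaultdict

# Assumption: plugins whose metrics come from differently sized test sets
# are listed here and get one row per metric instead of a grouped row.
UNGROUPED_PLUGINS = {"common-eleuther-ai-lm-eval-harness-mlx"}


def build_rows(records):
    """Turn flat eval records into display rows.

    Each record is a dict with hypothetical keys:
    plugin, task, metric, score.
    """
    rows = []
    grouped = defaultdict(dict)

    for r in records:
        if r["plugin"] in UNGROUPED_PLUGINS:
            # One row per metric: no grouping for this plugin.
            rows.append({
                "plugin": r["plugin"],
                "task": r["task"],
                "metrics": {r["metric"]: r["score"]},
            })
        else:
            # Collect every metric of the same (plugin, task) into one row.
            grouped[(r["plugin"], r["task"])][r["metric"]] = r["score"]

    for (plugin, task), metrics in grouped.items():
        rows.append({"plugin": plugin, "task": task, "metrics": metrics})

    return rows
```

With this shape, every row from a given plugin carries a consistent number of metrics, so the comparison view would not mix rows of one, two, and three metrics for the same plugin.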

Jul 10 '25 18:07 deep1401