transformerlab-app
Evals not showing grouping properly
- Run hellaswag, piqa, and winogrande using common-eleuther-ai-lm-eval-harness-mlx
- See the attached video: some rows report three metrics, some two, and some only one
https://github.com/user-attachments/assets/4079a238-87a6-4ff2-b0d7-c7d0840b6386
Adding a note here: this only happens when we compare evals run with the harness, because each metric has its own test set and those test sets vary in size. The actual task is to determine, based on the plugin, which eval reports should be grouped and which should not.
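
One possible way to make that decision is sketched below. This is only an illustration, not the app's actual code: the row shape (`metric`, `test_case_id`), the `UNGROUPED_PLUGINS` set, and the `should_group` helper are all hypothetical names. The idea is to skip grouping for plugins known to use a separate test set per metric, and otherwise to group only when every metric covers the same test cases.

```python
from collections import defaultdict

# Hypothetical opt-out list: plugins whose metrics each run on their own test set.
UNGROUPED_PLUGINS = {"common-eleuther-ai-lm-eval-harness-mlx"}


def should_group(plugin_name, rows):
    """Return True if report rows can be grouped by test case.

    `rows` is assumed to be a list of dicts with "metric" and
    "test_case_id" keys (an assumed shape, not the real report schema).
    """
    if plugin_name in UNGROUPED_PLUGINS:
        return False

    # Collect the set of test case IDs that each metric reports on.
    cases_by_metric = defaultdict(set)
    for row in rows:
        cases_by_metric[row["metric"]].add(row["test_case_id"])

    # Group only when every metric covers exactly the same test cases.
    case_sets = list(cases_by_metric.values())
    return all(cases == case_sets[0] for cases in case_sets)
```

With a check like this, the three harness tasks above would be shown ungrouped, while a plugin that scores every test case on every metric would keep the grouped view.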