lm-evaluation-harness
Evaluation of encoder and decoder models on SuperGLUE
Hi guys,
I want to evaluate models such as ModernBERT, Llama, and many others on SuperGLUE as well as my own benchmark. In my setting, every model has to be fine-tuned for the specific task, including decoder models; a minimal sketch of what I mean is below.
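For concreteness, here is a minimal sketch (assuming Hugging Face Transformers and Datasets, which the harness itself depends on) of the per-task fine-tuning I have in mind, shown on the BoolQ subtask. The checkpoint name and hyperparameters are placeholders, and the same `AutoModelForSequenceClassification` route also works for decoder models like Llama once a padding token is set:

```python
# Minimal sketch, not harness code: per-task fine-tuning on a SuperGLUE
# subtask with Hugging Face Transformers. Checkpoint, hyperparameters,
# and output path are placeholders.
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    DataCollatorWithPadding,
    Trainer,
    TrainingArguments,
)

checkpoint = "answerdotai/ModernBERT-base"  # assumed; swap in any encoder/decoder
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

dataset = load_dataset("super_glue", "boolq")

def tokenize(batch):
    # BoolQ pairs a passage with a yes/no question; labels are 0/1.
    return tokenizer(batch["passage"], batch["question"], truncation=True)

tokenized = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="boolq-finetune", num_train_epochs=3),
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=DataCollatorWithPadding(tokenizer),
)
trainer.train()
print(trainer.evaluate())  # reports eval loss; add compute_metrics for accuracy
```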
Is this currently supported by the harness? From reading the code, my impression is that evaluation is done only by prompting (zero- or few-shot), with no fine-tuning step.
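For reference, the flow I do see supported is the prompting path, e.g. through the Python API. This is a sketch against v0.4.x; the checkpoint is a placeholder and the exact SuperGLUE task names may differ between harness versions:

```python
# What I currently see supported: zero-/few-shot prompting evaluation
# via the lm-eval Python API (v0.4.x). Task names and the checkpoint
# are assumptions and may differ across versions.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=meta-llama/Llama-3.2-1B",  # placeholder checkpoint
    tasks=["boolq", "cb", "copa", "rte"],  # SuperGLUE subtasks
    num_fewshot=0,
    batch_size=8,
)
print(results["results"])
```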
Thanks.