Questions regarding the WSC evaluation results
Hi,
I've recently been running lm-eval on the Pythia models using the benchmarks listed in the paper. All of the benchmarks give results similar to those reported in the paper, except WSC. The paper reports WSC scores of 0.3 to 0.5 for the Pythia models, while the same models easily reach 0.6 to 0.8 accuracy on the WSC273 task from lm-eval. Could you confirm which WSC task is reported in the paper and how it is evaluated?
Thanks!