lm-evaluation-harness
Add xnli
Adding xnli to lm-evaluation-harness
I would love to be part of this conversation as well. Right now the multilingual modeling group is trying to run evaluation on non-English tasks, and it seems like we have to fork both Eval-Harness and PromptSource to extend the prompt-based evaluation to non-EN tasks. Am I right?
Hi @yongzx ! That's one way to do it. You'd have to:
- Fork this big-science/lm-evaluation-harness repo and set up the Python environment:

  ```bash
  git clone https://github.com/{fork-name}/lm-evaluation-harness
  cd lm-evaluation-harness
  pip install -e ".[dev]"
  ```
- Add the XNLI changes in this PR.
- Fork promptsource and work from the eval-hackathon branch. (To tighten things up, you can later make this a submodule of lm-eval; see this harness fork that uses custom templates for a custom task.)

  ```bash
  pip uninstall promptsource  # Remove the version installed by the harness setup.
  git clone --single-branch --branch eval-hackathon https://github.com/{fork-name}/promptsource
  pip install -e ./promptsource
  ```
- Dump your prompt templates for the non-English subsets into the promptsource xnli template dir (a programmatic alternative is sketched right after this list).
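If you'd rather not hand-edit the templates.yaml files, here's a minimal sketch of registering a template through promptsource's Python API instead. It assumes the eval-hackathon branch keeps upstream's `DatasetTemplates`/`Template`/`add_template` interface; the template name and the French wording below are made up for illustration, so adapt them to whatever you actually want to add.

```python
# Minimal sketch: add a French XNLI template programmatically.
# Assumes the eval-hackathon promptsource branch exposes the same
# DatasetTemplates / Template API as upstream promptsource.
from promptsource.templates import DatasetTemplates, Template

# Templates are keyed by (dataset, subset); "fr" is the French XNLI subset.
fr_templates = DatasetTemplates("xnli", "fr")

# In a Jinja template, text before "|||" is the prompt and text after is the target.
jinja = (
    "{{premise}} Question : {{hypothesis}} Vrai, Faux, ou Ni l'un ni l'autre ? "
    "||| {{ answer_choices[label] }}"
)

template = Template(
    name="GPT-3 style (fr)",  # hypothetical name, for illustration only
    jinja=jinja,
    reference="French GPT-3 style prompt",
    # XNLI labels: 0 = entailment, 1 = neutral, 2 = contradiction.
    answer_choices="Vrai ||| Ni l'un ni l'autre ||| Faux",
)

fr_templates.add_template(template)  # persists the template into the xnli template dir
print(fr_templates.all_template_names)
```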
Lastly, make sure your templates can be accessed from the harness. For example, using the XNLI French subset, run the following in a Python interpreter:

```python
import lm_eval
print(lm_eval.list_templates("xnli_fr"))
```
Once you see the templates listed, you should be ready to evaluate as usual.
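If you want to eyeball a rendered prompt before launching a full run, something like the sketch below can help. It goes through promptsource directly rather than the harness, and it assumes the Hub dataset xnli (config fr) plus a template named "GPT-3 style" exist in your checkout; substitute whatever name `list_templates` actually reports.

```python
# Sketch: render one French XNLI validation example with a promptsource template.
from datasets import load_dataset
from promptsource.templates import DatasetTemplates

example = load_dataset("xnli", "fr", split="validation")[0]

# "GPT-3 style" is assumed to exist for xnli/fr; pick a name from list_templates() otherwise.
template = DatasetTemplates("xnli", "fr")["GPT-3 style"]

# apply() returns the rendered pieces split on "|||": [prompt, target] for original-task templates.
prompt, target = template.apply(example)
print(prompt)
print("Target:", target)
```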
Let me know if you run into any issues (most problems stem from setting up a consistent Python virtual environment). I'll be glad to help!
@jon-tow The Prompt Engineering WG has been successfully running non-English prompted tasks with the eval harness on BLOOM. @Muennighoff, can you explain how you’ve been running non-English prompts?
I didn't use the eval harness; I used https://github.com/Muennighoff/t-zero/blob/muennighoff/upgrdps/evaluation/run_eval.py
Thanks @jon-tow and @Muennighoff!!
@jon-tow I actually did what you suggested. For instance:
```python
>>> import lm_eval
>>> print(lm_eval.list_templates("xnli_de"))
['GPT-3 style', 'MNLI crowdsource', 'always/sometimes/never', 'based on the previous passage', 'can we infer', 'claim true/false/inconclusive', 'consider always/sometimes/never', 'does it follow that', 'does this imply', 'guaranteed true', 'guaranteed/possible/impossible', 'justified in saying', 'must be true', 'should assume', 'take the following as truth']
```
But strangely, evaluating xnli_en (English) with the BLOOM-560m model on the GPT-3 style prompt gives 33.3% accuracy, which is no better than chance (XNLI has three balanced classes, so random accuracy is 1/3).
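For reference, here's a quick check (assuming the standard xnli dataset on the Hub) that 33.3% really is just the chance level for this task:

```python
# Sanity check: XNLI has three classes, so a model that ignores the input
# should land around 1/3 accuracy on a balanced split.
from collections import Counter
from datasets import load_dataset

labels = load_dataset("xnli", "en", split="validation")["label"]
counts = Counter(labels)
chance = max(counts.values()) / len(labels)  # best constant-prediction accuracy
print(counts, f"chance-level accuracy = {chance:.3f}")
```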
Will try with Niklas' repo.
Thanks for the updates, @yongzx ! Did you obtain significantly different accuracies when using Niklas's repo? Re:
> But strangely, evaluating xnli_en (English) with the BLOOM-560m model on the GPT-3 style prompt gives 33.3% accuracy, which is no better than chance.
I obtained the same accuracies with the BLOOM model, but with Niklas's repo I have gotten better accuracies using a different model (BLOOMZ). I haven't tried BLOOMZ with the eval harness yet.