
Add xnli

Open gentaiscool opened this issue 2 years ago • 9 comments

Adding xnli to lm-evaluation-harness

gentaiscool avatar Oct 07 '22 01:10 gentaiscool

I would love to be part of this conversation as well. Right now the multilingual modeling group is trying to run evaluations on non-English tasks, and it seems like we have to fork both Eval-Harness and PromptSource to extend prompt-based evaluation to non-English tasks. Am I right?

yongzx avatar Oct 07 '22 06:10 yongzx

Hi @yongzx! That's one way to do it. You'd have to:

  1. Fork this big-science/lm-evaluation-harness repo and set up the Python environment.
git clone https://github.com/{fork-name}/lm-evaluation-harness
cd lm-evaluation-harness
pip install -e ".[dev]"
  2. Add the XNLI changes in this PR.
  3. Fork promptsource and work from the eval-hackathon branch here. (To tighten things up, you can later make this a submodule of lm-eval. See this harness fork that uses custom templates for a custom task.)
pip uninstall promptsource  # Remove the version installed by the harness setup.
git clone --single-branch --branch eval-hackathon https://github.com/{fork-name}/promptsource
pip install -e ./promptsource
  4. Dump your prompt templates for the non-English subsets into the promptsource xnli template dir (see the sketch after this list for one way to add them programmatically).
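
For step 4, one way to dump the non-English templates without hand-editing YAML is to add them through promptsource itself and let it write the file under promptsource/templates/xnli/<lang>/. This is only a minimal sketch: it assumes promptsource's DatasetTemplates/Template API (argument names and order may differ slightly in your checkout), and the French prompt text and answer choices are illustrative placeholders, not vetted translations.

from promptsource.templates import DatasetTemplates, Template

# Template collection for the French XNLI subset; add_template() persists it
# to promptsource/templates/xnli/fr/templates.yaml in your promptsource checkout.
xnli_fr = DatasetTemplates("xnli", "fr")

# XNLI labels: 0 = entailment, 1 = neutral, 2 = contradiction.
# The "|||" in the jinja string separates the prompt from the expected target text.
template = Template(
    name="GPT-3 style (fr)",  # hypothetical template name, for illustration only
    jinja=(
        "{{premise}}\nQuestion : {{hypothesis}} Vrai, Faux, ou Ni l'un ni l'autre ?"
        " |||\n{{ answer_choices[label] }}"
    ),
    reference="Illustrative French adaptation of the English 'GPT-3 style' prompt.",
    answer_choices="Vrai ||| Ni l'un ni l'autre ||| Faux",
)
xnli_fr.add_template(template)

Repeat for each language subset and prompt you want to evaluate.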

Lastly, make sure your templates can be accessed from the harness. For example, using the XNLI French subset, run the following in a Python interpreter:

import lm_eval
print(lm_eval.list_templates("xnli_fr"))

Once you see the templates listed, you should be ready to evaluate as usual.
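
For reference, here is a rough sketch of the "evaluate as usual" step from Python. The get_model / get_task_list / evaluate entry points and their arguments are assumptions about this fork's convenience API (list_templates above comes from the same module); double-check the names in your checkout or run python main.py --help, and treat the model and template names below as placeholders.

import lm_eval

# Assumed convenience API of the big-science fork; verify against your checkout.
model = lm_eval.get_model("hf-causal", pretrained="bigscience/bloom-560m", device="cuda")

# Build the French XNLI task with the template(s) you want to score.
tasks = lm_eval.get_task_list("xnli_fr", template_names=["GPT-3 style (fr)"])

# Run the evaluation and inspect per-template metrics.
results = lm_eval.evaluate(model=model, tasks=tasks)
print(results)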

Let me know if you run into any issues (most problems stem from setting up a consistent Python virtual environment). I'll be glad to help!

jon-tow avatar Oct 07 '22 06:10 jon-tow

@jon-tow The Prompt Engineering WG has been successfully running non-English prompted tasks with the eval harness on BLOOM. @Muennighoff, can you explain how you’ve been running non-English prompts?

StellaAthena avatar Oct 07 '22 19:10 StellaAthena

> @jon-tow The Prompt Engineering WG has been successfully running non-English prompted tasks with the eval harness on BLOOM. @Muennighoff, can you explain how you’ve been running non-English prompts?

I didn't use the eval harness; I used https://github.com/Muennighoff/t-zero/blob/muennighoff/upgrdps/evaluation/run_eval.py

Muennighoff avatar Oct 07 '22 19:10 Muennighoff

Thanks @jon-tow and @Muennighoff!!

yongzx avatar Oct 12 '22 18:10 yongzx

@jon-tow I actually did what you suggested. For instance:

>>> import lm_eval
>>> print(lm_eval.list_templates("xnli_de"))
['GPT-3 style', 'MNLI crowdsource', 'always/sometimes/never', 'based on the previous passage', 'can we infer', 'claim true/false/inconclusive', 'consider always/sometimes/never', 'does it follow that', 'does this imply', 'guaranteed true', 'guaranteed/possible/impossible', 'justified in saying', 'must be true', 'should assume', 'take the following as truth']

yongzx avatar Oct 12 '22 18:10 yongzx

But strangely, evaluating xnli_en (English) with the BLOOM-560m model on the GPT-3 style prompt gives 33.3% accuracy (no better than random for a three-way classification task).

I'll try Niklas's repo.

yongzx avatar Oct 12 '22 18:10 yongzx

Thanks for the updates, @yongzx! Did you obtain significantly different accuracies when using Niklas's repo? Re:

> But strangely, evaluating xnli_en (English) with the BLOOM-560m model on the GPT-3 style prompt gives 33.3% accuracy (no better than random for a three-way classification task).

jon-tow avatar Oct 13 '22 14:10 jon-tow

I obtained the same accuracies with the BLOOM model, but with Niklas's repo I got better accuracies using a different model (BLOOMZ). I haven't tried BLOOMZ with the eval harness yet.

yongzx avatar Oct 13 '22 14:10 yongzx