Add XLSum evaluation / unify eval script
Submitting a PR from a fork because I may not have edit access to this repo.
In this PR: added `adapters_eval.py`, a script that can be used to evaluate on XLSum or XNLI based on the `dataset` flag.
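For context, here is a minimal sketch of what that dispatch might look like; the flag name, choices, and helper functions are assumptions for illustration, not the actual contents of `adapters_eval.py`:

```python
# Hypothetical sketch only: the flag name and helper functions below are
# assumptions about the structure, not the actual adapters_eval.py.
import argparse


def evaluate_xlsum(args):
    ...  # summarization evaluation (ROUGE)


def evaluate_xnli(args):
    ...  # NLI evaluation (accuracy)


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--dataset", choices=["xlsum", "xnli"], required=True)
    args = parser.parse_args()

    if args.dataset == "xlsum":
        evaluate_xlsum(args)
    else:
        evaluate_xnli(args)


if __name__ == "__main__":
    main()
```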
Also working on adding DeepSpeed compatibility via the Hugging Face Trainer / command line.
TODO/needs checking:
- The ROUGE `compute_metrics` function could be wrong. I will try to check this (see the sketch after this list).
- Make sure the logic within `load_model` for setting adapters to train / adding adapters is correct.
- Has the FIXME in `adapters_xnli_de.py` been dealt with?
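For discussion, a rough sketch of what a ROUGE `compute_metrics` could look like, assuming the `rouge_score` package and decoded token ids from a Trainer-style prediction loop; this is not the PR's actual implementation:

```python
# Sketch only: assumes the `rouge_score` package and a Trainer-style
# (predictions, labels) pair of token-id arrays.
import numpy as np
from rouge_score import rouge_scorer


def build_compute_metrics(tokenizer):
    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

    def compute_metrics(eval_pred):
        preds, labels = eval_pred
        # Label ids set to -100 are ignored by the loss; restore them before decoding.
        labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
        pred_texts = tokenizer.batch_decode(preds, skip_special_tokens=True)
        label_texts = tokenizer.batch_decode(labels, skip_special_tokens=True)

        scores = {"rouge1": [], "rouge2": [], "rougeL": []}
        for pred, ref in zip(pred_texts, label_texts):
            result = scorer.score(ref, pred)
            for key in scores:
                # Reporting F-measure; .precision and .recall are also available.
                scores[key].append(result[key].fmeasure)
        return {key: float(np.mean(vals)) for key, vals in scores.items()}

    return compute_metrics
```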
Thanks Hailey!
(Referring to #11) Will resolve this PR once Vassilina and I have finalized our evaluation script on XNLI. Apologies for the delay.
The remaining TODOs for this script are:
- The logic for loading adapters in `load_model` needs to be checked (it was unclear to me whether the XNLI script's logic was correct or if it was still a work in progress). See the sketch after this list.
- Which ROUGE metrics to report? Currently reporting F-measure for all ROUGE metrics, but if precision and recall are desired, this can be easily changed.
- EDIT: And also, what to set for `max_generation_length` in prediction and `max_length` in tokenization.
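To make the first item concrete, this is the generic pattern I had in mind when checking `load_model`, assuming the adapter-transformers fork of `transformers`; the checkpoint, adapter name, flag, and path are placeholders, not the script's values:

```python
# Generic adapter-transformers pattern, shown for discussion; the checkpoint,
# adapter name, flag, and path below are placeholders.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-560m")

train_new_adapter = True           # placeholder flag
adapter_path = "path/to/adapter"   # placeholder path

if train_new_adapter:
    # Insert a fresh, randomly initialized adapter.
    model.add_adapter("xlsum_adapter")
else:
    # Load a previously trained adapter from disk (or the Hub).
    model.load_adapter(adapter_path, load_as="xlsum_adapter")

# Freeze the base model weights and mark only the adapter as trainable...
model.train_adapter("xlsum_adapter")
# ...and make the adapter active in the forward pass.
model.set_active_adapters("xlsum_adapter")
```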
Apologies for reviewing this PR late. I have made some comments, but in the end, I think I will create another PR based on your committed files and request Vassilina's and your review again.
Please don't push any changes if that's okay.
Edit: Commenting on this PR with the to-dos for integration:
- Use the Hugging Face `evaluate` library (see the sketch after this list).
- Test run the code.
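A minimal sketch of the `evaluate`-based ROUGE computation; the predictions and references below are placeholders:

```python
# Minimal sketch of switching ROUGE computation to the Hugging Face `evaluate`
# library; predictions and references are placeholders.
import evaluate

rouge = evaluate.load("rouge")

predictions = ["the cat sat on the mat"]
references = ["a cat was sitting on the mat"]

# By default this returns aggregated F-measures for rouge1, rouge2, rougeL, rougeLsum.
results = rouge.compute(predictions=predictions, references=references)
print(results)
```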
> Which ROUGE metrics to report? Currently reporting F-measure for all ROUGE metrics, but if precision and recall are desired, this can be easily changed. And also, what to set for `max_generation_length` in prediction and `max_length` in tokenization.
From the paper, "Due to computational constraints, we used the base model (600M parameters) and had to truncate the inputs to 512 tokens and the outputs to 64 tokens. We used the ROUGE-1, ROUGE-2 and ROUGE-L (Lin, 2004) scores for automatic evaluation. For inference, we used beam search with beam size 4 and length penalty of α = 0.6 (Wu et al., 2016)."
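If we follow those numbers, the tokenization and generation settings would look roughly like this; the checkpoint and input text are placeholders, and `max_new_tokens` stands in for `max_generation_length`:

```python
# Sketch of settings mirroring the quoted paper setup (512-token inputs,
# 64-token outputs, beam size 4, length penalty 0.6); checkpoint and input
# text are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "bigscience/bloom-560m"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

article = "Document text to summarize ..."  # placeholder input
inputs = tokenizer(article, max_length=512, truncation=True, return_tensors="pt")

summary_ids = model.generate(
    **inputs,
    max_new_tokens=64,   # i.e. max_generation_length in the script
    num_beams=4,
    length_penalty=0.6,
)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```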
@haileyschoelkopf Can you help review b0a23c5? Thank you!
I've tested it, and the training and evaluation (on baseline BLOOM and GPT-2 models) are working. The only minor issue is that the evaluation step that uses `model.generate` takes quite a long time (even with `num_beams=1`).
Yes, I can! I might only get to it tomorrow, though.