Add XLSum evaluation / unify eval script
Submitting a PR from a fork because I may not have edit access to this repo.
In this PR: added `adapters_eval.py`, a script that can be used to evaluate on XLSum or XNLI based on the `dataset` flag.
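For context, here is a minimal sketch of what that dispatch might look like; the flag name, choices, and helper functions are assumptions for illustration, not the actual contents of `adapters_eval.py`:

```python
# Hypothetical sketch only: the flag name and helper functions below are
# assumptions about the structure, not the actual adapters_eval.py.
import argparse


def evaluate_xlsum(args):
    ...  # summarization evaluation (ROUGE)


def evaluate_xnli(args):
    ...  # NLI evaluation (accuracy)


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--dataset", choices=["xlsum", "xnli"], required=True)
    args = parser.parse_args()

    if args.dataset == "xlsum":
        evaluate_xlsum(args)
    else:
        evaluate_xnli(args)


if __name__ == "__main__":
    main()
```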
Also working on adding DeepSpeed compatibility via the Hugging Face Trainer / command line.
TODO/needs checking:
- The ROUGE `compute_metrics` function could be wrong. I will try to check this (see the sketch after this list).
- Make sure the logic within `load_model` for setting adapters to train / adding adapters is correct.
- Has the FIXME in `adapters_xnli_de.py` been dealt with?
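For discussion, a rough sketch of what a ROUGE `compute_metrics` could look like, assuming the `rouge_score` package and decoded token ids from a Trainer-style prediction loop; this is not the PR's actual implementation:

```python
# Sketch only: assumes the `rouge_score` package and a Trainer-style
# (predictions, labels) pair of token-id arrays.
import numpy as np
from rouge_score import rouge_scorer


def build_compute_metrics(tokenizer):
    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

    def compute_metrics(eval_pred):
        preds, labels = eval_pred
        # Label ids set to -100 are ignored by the loss; restore them before decoding.
        labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
        pred_texts = tokenizer.batch_decode(preds, skip_special_tokens=True)
        label_texts = tokenizer.batch_decode(labels, skip_special_tokens=True)

        scores = {"rouge1": [], "rouge2": [], "rougeL": []}
        for pred, ref in zip(pred_texts, label_texts):
            result = scorer.score(ref, pred)
            for key in scores:
                # Reporting F-measure; .precision and .recall are also available.
                scores[key].append(result[key].fmeasure)
        return {key: float(np.mean(vals)) for key, vals in scores.items()}

    return compute_metrics
```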
Thanks Hailey!
(Referring to #11) Will resolve this PR once Vassilina and I have finalized our evaluation script on XNLI. Apologies for the delay.
The remaining TODOs for this script are:
- The logic for loading adapters in `load_model` needs to be checked (it was unclear to me whether the XNLI script's logic was correct or if it was still a work in progress). See the sketch after this list.
- Which ROUGE metrics to report? Currently reporting F-measure for all ROUGE metrics, but if precision and recall are desired, this can be easily changed.
- EDIT: And also, what to set for `max_generation_length` in prediction and `max_length` in tokenization.
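To make the first item concrete, this is the generic pattern I had in mind when checking `load_model`, assuming the adapter-transformers fork of `transformers`; the checkpoint, adapter name, flag, and path are placeholders, not the script's values:

```python
# Generic adapter-transformers pattern, shown for discussion; the checkpoint,
# adapter name, flag, and path below are placeholders.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-560m")

train_new_adapter = True           # placeholder flag
adapter_path = "path/to/adapter"   # placeholder path

if train_new_adapter:
    # Insert a fresh, randomly initialized adapter.
    model.add_adapter("xlsum_adapter")
else:
    # Load a previously trained adapter from disk (or the Hub).
    model.load_adapter(adapter_path, load_as="xlsum_adapter")

# Freeze the base model weights and mark only the adapter as trainable...
model.train_adapter("xlsum_adapter")
# ...and make the adapter active in the forward pass.
model.set_active_adapters("xlsum_adapter")
```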
Apologies for reviewing this PR late. I have made some comments, but in the end, I think I will create another PR based on your committed files and request Vassilina's and your review again.
Please don't push any changes if that's okay.
Edit: Commenting on this PR with the to-dos for integration:
- Use the Hugging Face `evaluate` library (see the sketch after this list).
- Test run the code.
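A minimal sketch of the `evaluate`-based ROUGE computation; the predictions and references below are placeholders:

```python
# Minimal sketch of switching ROUGE computation to the Hugging Face `evaluate`
# library; predictions and references are placeholders.
import evaluate

rouge = evaluate.load("rouge")

predictions = ["the cat sat on the mat"]
references = ["a cat was sitting on the mat"]

# By default this returns aggregated F-measures for rouge1, rouge2, rougeL, rougeLsum.
results = rouge.compute(predictions=predictions, references=references)
print(results)
```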
> Which ROUGE metrics to report? Currently reporting F-measure for all ROUGE metrics, but if precision and recall are desired, this can be easily changed. And also, what to set for `max_generation_length` in prediction and `max_length` in tokenization.
From the paper, "Due to computational constraints, we used the base model (600M parameters) and had to truncate the inputs to 512 tokens and the outputs to 64 tokens. We used the ROUGE-1, ROUGE-2 and ROUGE-L (Lin, 2004) scores for automatic evaluation. For inference, we used beam search with beam size 4 and length penalty of α = 0.6 (Wu et al., 2016)."
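If we follow those numbers, the tokenization and generation settings would look roughly like this; the checkpoint and input text are placeholders, and `max_new_tokens` stands in for `max_generation_length`:

```python
# Sketch of settings mirroring the quoted paper setup (512-token inputs,
# 64-token outputs, beam size 4, length penalty 0.6); checkpoint and input
# text are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "bigscience/bloom-560m"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

article = "Document text to summarize ..."  # placeholder input
inputs = tokenizer(article, max_length=512, truncation=True, return_tensors="pt")

summary_ids = model.generate(
    **inputs,
    max_new_tokens=64,   # i.e. max_generation_length in the script
    num_beams=4,
    length_penalty=0.6,
)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```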
@haileyschoelkopf Can you help review b0a23c5? Thank you!
I've tested it, and the training and evaluation (on baseline BLOOM and GPT-2 models) are working. The only minor issue is that the evaluation step that uses `model.generate` takes quite a long time (even with `num_beams=1`).
Yes, I can! I might only get to it tomorrow, though.