The tokenizer `add_special_tokens` parameter for T5 models on the lambada task
When we run lambada_openai on google/flan-t5-xl, both the input tokens and the labels end with EOS, because `add_special_tokens=True` by default for Seq2Seq models; however, the output of `_model_call` does not end with EOS, so the accuracy is always 0. Since the lambada input is not a full sentence, can we set `add_special_tokens=False` when running lambada for T5 models? Or please suggest how to get correct results on the lambada task for T5 models.
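For illustration, a minimal sketch of the tokenizer behavior in question (assuming a standard `transformers` install; the exact token ids depend on the checkpoint):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("google/flan-t5-xl")

# T5 tokenizers append EOS (</s>, id 1) when add_special_tokens=True (the default)
with_eos = tok("and smiled at", add_special_tokens=True)["input_ids"]
without_eos = tok("and smiled at", add_special_tokens=False)["input_ids"]

print(with_eos[-1] == tok.eos_token_id)     # True: context ends with EOS
print(without_eos[-1] == tok.eos_token_id)  # False: no EOS appended
```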
If you have reference data for google/flan-t5-xl on lambada_openai, please kindly share it with me. Thanks!
```python
# assuming the harness modules are imported as below; task_names and args
# come from the surrounding script
from lm_eval import tasks, models, evaluator

task_dict = tasks.get_task_dict(task_names)
model = models.huggingface.AutoSeq2SeqLM(args.model, device=args.device, batch_size=1)
results = evaluator.evaluate(
    model,
    task_dict,
    limit=100,
)
```
```
{'input_ids': tensor([[ 105, 6936, 2298, 29715, 9439, 1239, 37, 388, 3993, 26, 44, 376, 5, 19783, 737, 22, 17, 43, 3, 9, 11354, 125, 47, 352, 30, 5, 216, 2299, 12, 112, 2743, 5, 105, 3696, 51, 270, 22, 7, 131, 2301, 139, 8, 629, 44, 8, 2007, 13, 8, 9956, 1239, 105, 15046, 269, 1239, 105, 8952, 670, 192, 6, 2087, 386, 6, 2286, 550, 642, 3059, 243, 11, 3993, 26, 44, 1]]),
 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
(Pdb) p targets_tokens
{'input_ids': tensor([[19783, 1]]), 'attention_mask': tensor([[1, 1]])}
...
(Pdb) p greedy_tokens
tensor([19783, 5])
(Pdb) p target_tokens
tensor([19783, 1])
```
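The dump above shows the failure mode directly: the model gets the target word (19783) right but then predicts "." (id 5) where the target ends with EOS (id 1), so the strict token-by-token equality check fails. A minimal reproduction of that comparison:

```python
import torch

# greedy vs. target from the pdb session above: the word matches, but the
# final token differs (5 = "." vs 1 = </s>), so the example counts as wrong
greedy_tokens = torch.tensor([19783, 5])
target_tokens = torch.tensor([19783, 1])
print((greedy_tokens == target_tokens).all().item())  # False
```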
This seems pretty reasonable to me. Do you get the expected results with the flag set to false?
Hi @StellaAthena, the ppl and accuracy of google/flan-t5-xl on lambada that I got with `add_special_tokens=False` are:
| Task           | Version | Metric | Value    |   | Stderr  |
|----------------|---------|--------|----------|---|---------|
| lambada_openai | 0       | ppl    | 360.4850 | ± | 28.7851 |
|                |         | acc    | 0.2987   | ± | 0.0064  |
and with `add_special_tokens=True`:
| Task           | Version | Metric | Value    |   | Stderr  |
|----------------|---------|--------|----------|---|---------|
| lambada_openai | 0       | ppl    | 913.6121 | ± | 40.5159 |
|                |         | acc    | 0.0076   | ± | 0.0012  |
However, I cannot find the expected lambada accuracy and ppl in the model card (https://huggingface.co/google/flan-t5-xl) or the paper (https://arxiv.org/pdf/2210.11416.pdf). Since lambada is part of the finetuning data according to the model card, 29.8% accuracy still seems very low. If you have the SOTA of google/flan-t5-xl on lambada, please share it with me. Thanks a lot!
I don't have any information on this. As far as I am aware, this is the correct value. If you want to study it further, you can examine the per-example generations and see if anything looks weird.
Thanks @StellaAthena, do you mean to call generate and check the output? I will give it a try. I also sent an email to the T5 authors; I hope we can get feedback.
Yes, you can see how to do this in the eval harness here.
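A quick way to spot-check a single generation outside the harness might look like this (illustrative snippet; the prompt is just a lambada-style example, not drawn from the dataset):

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("google/flan-t5-xl")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-xl")

# a lambada-style prefix whose last word the model should complete
context = "He turned to look at his teacher and, despite everything, the man smiled"
inputs = tok(context, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=5)
print(tok.decode(out[0], skip_special_tokens=True))
```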
I checked with the google/flan-t5-xl authors: the recommended way to run lambada on this model is to append EOS to both the inputs and the targets in `_model_call()`, but when computing word accuracy and word perplexity from the outputs, to compare only the last word and ignore the EOS. @StellaAthena, to implement this we would need to keep `add_special_tokens=True` and customize the `_loglikelihood_tokens` function to skip the EOS when computing ppl and accuracy. What are your suggestions?
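A rough sketch of what that customization might look like (hypothetical helper, not the harness's actual implementation; it assumes `logits` are already log-softmaxed over the vocabulary and sliced to the target positions, as in `_loglikelihood_tokens`):

```python
import torch

def score_without_eos(logits, target_tokens, eos_token_id=1):
    """Drop a trailing EOS before computing the accuracy and ppl terms."""
    if target_tokens[-1].item() == eos_token_id:
        target_tokens = target_tokens[:-1]
        logits = logits[: len(target_tokens)]
    greedy_tokens = logits.argmax(dim=-1)
    max_equal = (greedy_tokens == target_tokens).all()  # word accuracy term
    loglikelihood = torch.gather(                       # ppl numerator term
        logits, 1, target_tokens.unsqueeze(-1)
    ).sum()
    return loglikelihood, max_equal
```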
I am also meeting problems when using the 'exact match' metric on T5-XXL and T5-XL (BBH evaluation; I get low-quality generations but have no idea why so far).
@daisyden Hi, may I know how you solved this issue? I'm trying to run the lambada evaluation on t5-base and get the same issue: the perplexity is extremely high and the accuracy is almost 0.
Me too. mt5-xl has very high word-level perplexity.
@lintangsutawika is traveling this week but might have thoughts on this when he's back as he's worked a lot with T5 and T5-like models!
Separately from possible issues re: the ppl computation or special tokens, though: are the T5 models you are evaluating trained to perform left-to-right language modeling, @wangyanbao666 @djstrong? I'd expect a T5 model trained only on span denoising / MLM to perform quite poorly on language modeling tasks like lambada or wikitext perplexities.
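To make the distinction concrete (illustrative strings, assumed for this sketch rather than taken from the pretraining data): T5's span-corruption objective looks nothing like the next-word prediction that lambada requires:

```python
# T5 span-corruption pair: masked spans are replaced by sentinel tokens
denoising_input  = "The man <extra_id_0> at his teacher ."
denoising_target = "<extra_id_0> smiled <extra_id_1>"

# lambada-style left-to-right continuation, which vanilla T5 never trains on
lm_input  = "He turned to his teacher and the man smiled"
lm_target = "at"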
They are MLM models, so maybe it doesn't make sense, or a MASK token would need to be added at the end of each subsequence. I will remove these models from the perplexity calculations.