The tokenizer `add_special_tokens` parameter for T5 models on the lambada task
When we run lambada_openai on google/flan-t5-xl, both the input tokens and the labels end with EOS, because `add_special_tokens=True` by default for Seq2Seq models; however, the output of `_model_call` does not end with EOS, so the accuracy is always 0. Since the lambada input is not a full sentence, can we set `add_special_tokens=False` when running lambada for T5 models? Or please suggest how to get correct results on the lambada task for T5 models.
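For illustration, a minimal sketch of the tokenizer behavior in question (assuming a standard `transformers` install; the exact token ids depend on the checkpoint):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("google/flan-t5-xl")

# T5 tokenizers append EOS (</s>, id 1) when add_special_tokens=True (the default)
with_eos = tok("and smiled at", add_special_tokens=True)["input_ids"]
without_eos = tok("and smiled at", add_special_tokens=False)["input_ids"]

print(with_eos[-1] == tok.eos_token_id)     # True: context ends with EOS
print(without_eos[-1] == tok.eos_token_id)  # False: no EOS appended
```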
If you have reference data for google/flan-t5-xl on lambada_openai, please kindly share it with me. Thanks!
```python
# assuming the harness modules are imported as below; task_names and args
# come from the surrounding script
from lm_eval import tasks, models, evaluator

task_dict = tasks.get_task_dict(task_names)
model = models.huggingface.AutoSeq2SeqLM(args.model, device=args.device, batch_size=1)
results = evaluator.evaluate(
    model,
    task_dict,
    limit=100,
)
```
```
{'input_ids': tensor([[ 105, 6936, 2298, 29715, 9439, 1239, 37, 388, 3993, 26, 44, 376, 5, 19783, 737, 22, 17, 43, 3, 9, 11354, 125, 47, 352, 30, 5, 216, 2299, 12, 112, 2743, 5, 105, 3696, 51, 270, 22, 7, 131, 2301, 139, 8, 629, 44, 8, 2007, 13, 8, 9956, 1239, 105, 15046, 269, 1239, 105, 8952, 670, 192, 6, 2087, 386, 6, 2286, 550, 642, 3059, 243, 11, 3993, 26, 44, 1]]),
 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}
(Pdb) p targets_tokens
{'input_ids': tensor([[19783, 1]]), 'attention_mask': tensor([[1, 1]])}
...
(Pdb) p greedy_tokens
tensor([19783, 5])
(Pdb) p target_tokens
tensor([19783, 1])
```
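The dump above shows the failure mode directly: the model gets the target word (19783) right but then predicts "." (id 5) where the target ends with EOS (id 1), so the strict token-by-token equality check fails. A minimal reproduction of that comparison:

```python
import torch

# greedy vs. target from the pdb session above: the word matches, but the
# final token differs (5 = "." vs 1 = </s>), so the example counts as wrong
greedy_tokens = torch.tensor([19783, 5])
target_tokens = torch.tensor([19783, 1])
print((greedy_tokens == target_tokens).all().item())  # False
```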
This seems pretty reasonable to me. Do you get the expected results with the flag set to false?
Hi @StellaAthena, the ppl and accuracy of google/flan-t5-xl on lambada that I got with `add_special_tokens=False` are:
| Task           | Version | Metric | Value    |   | Stderr  |
|----------------|---------|--------|----------|---|---------|
| lambada_openai | 0       | ppl    | 360.4850 | ± | 28.7851 |
|                |         | acc    | 0.2987   | ± | 0.0064  |
and with `add_special_tokens=True`:
| Task           | Version | Metric | Value    |   | Stderr  |
|----------------|---------|--------|----------|---|---------|
| lambada_openai | 0       | ppl    | 913.6121 | ± | 40.5159 |
|                |         | acc    | 0.0076   | ± | 0.0012  |
However, I cannot find the expected lambada accuracy and ppl in the model card (https://huggingface.co/google/flan-t5-xl) or the paper (https://arxiv.org/pdf/2210.11416.pdf). Since lambada is part of the finetuning data according to the model card, 29.8% accuracy still seems very low. If you have the SOTA of google/flan-t5-xl on lambada, please share it with me. Thanks a lot!
I don't have any information on this. As far as I am aware, this is the correct value. If you want to study it further, you can examine the per-example generations and see if anything looks weird.
Thanks @StellaAthena, do you mean to call generate and check the output? I will give it a try. I also sent an email to the T5 authors; I hope we can get feedback.
Yes, you can see how to do this in the eval harness here.
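A quick way to spot-check a single generation outside the harness might look like this (illustrative snippet; the prompt is just a lambada-style example, not drawn from the dataset):

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("google/flan-t5-xl")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-xl")

# a lambada-style prefix whose last word the model should complete
context = "He turned to look at his teacher and, despite everything, the man smiled"
inputs = tok(context, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=5)
print(tok.decode(out[0], skip_special_tokens=True))
```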
I checked with the google/flan-t5-xl authors: the recommended way to run lambada on this model is to append EOS to both the inputs and the targets in `_model_call()`, but when computing word accuracy and word perplexity from the outputs, to compare only the last word and ignore the EOS. @StellaAthena, to implement this we would need to keep `add_special_tokens=True` and customize the `_loglikelihood_tokens` function to skip the EOS when computing ppl and accuracy. What are your suggestions?
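A rough sketch of what that customization might look like (hypothetical helper, not the harness's actual implementation; it assumes `logits` are already log-softmaxed over the vocabulary and sliced to the target positions, as in `_loglikelihood_tokens`):

```python
import torch

def score_without_eos(logits, target_tokens, eos_token_id=1):
    """Drop a trailing EOS before computing the accuracy and ppl terms."""
    if target_tokens[-1].item() == eos_token_id:
        target_tokens = target_tokens[:-1]
        logits = logits[: len(target_tokens)]
    greedy_tokens = logits.argmax(dim=-1)
    max_equal = (greedy_tokens == target_tokens).all()  # word accuracy term
    loglikelihood = torch.gather(                       # ppl numerator term
        logits, 1, target_tokens.unsqueeze(-1)
    ).sum()
    return loglikelihood, max_equal
```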
I am also meeting problems when using the 'exact match' metric on T5-XXL and T5-XL (BBH evaluation; I get low-quality generations but have no idea why so far).
@daisyden Hi, may I know how you solved this issue? I'm trying to run the lambada evaluation on t5-base and get the same issue: the perplexity is extremely high and the accuracy is almost 0.
Me too. mt5-xl has very high word-level perplexity.
@lintangsutawika is traveling this week but might have thoughts on this when he's back as he's worked a lot with T5 and T5-like models!
Separately from possible issues re: the ppl computation or special tokens, though: are the T5 models you are evaluating trained to perform left-to-right language modeling, @wangyanbao666 @djstrong? I'd expect a T5 model trained only on span denoising / MLM to perform quite poorly on language modeling tasks like lambada or wikitext perplexities.
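To make the distinction concrete (illustrative strings, assumed for this sketch rather than taken from the pretraining data): T5's span-corruption objective looks nothing like the next-word prediction that lambada requires:

```python
# T5 span-corruption pair: masked spans are replaced by sentinel tokens
denoising_input  = "The man <extra_id_0> at his teacher ."
denoising_target = "<extra_id_0> smiled <extra_id_1>"

# lambada-style left-to-right continuation, which vanilla T5 never trains on
lm_input  = "He turned to his teacher and the man smiled"
lm_target = "at"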
They are MLM models, so maybe it doesn't make sense, or a MASK token would need to be added at the end of each subsequence. I will remove these models from the perplexity calculations.