Pretraining-T5-PyTorch-Lightning

Can I use this code for pretraining other types of T5?

Open iamcrysun opened this issue 2 years ago • 4 comments

Hi there! I have prepared a dataset using your code, but now I have a problem training another T5 model (cointegrated/rut5-base-multitask): IndexError: index out of range in self. Where is my error?

iamcrysun avatar Nov 22 '23 12:11 iamcrysun

It should work with any seq2seq model, although the separators are T5-specific. What is your complete backtrace for the error?
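To illustrate what "the separators are T5-specific" means: T5's span-corruption objective marks masked spans with sentinel tokens, which by convention are named <extra_id_0> through <extra_id_99>. A sketch of how they are generated (the token naming is standard T5; no library calls needed):

```python
# T5 span-corruption sentinels: one per masked span, highest index first
# in the corrupted input. Other seq2seq vocabularies may not define these.
sentinels = [f"<extra_id_{i}>" for i in range(100)]

print(sentinels[0], sentinels[-1])  # → <extra_id_0> <extra_id_99>
```

A tokenizer for a non-T5 seq2seq model may not contain these tokens at all, which is why the data-preparation code is only guaranteed to work with T5-family vocabularies.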

manueldeprada avatar Nov 22 '23 14:11 manueldeprada

Before this error appeared, there were some other problems that I was able to fix. This error occurs when trying to retrain some other T5 models: output (3).txt

iamcrysun avatar Nov 23 '23 07:11 iamcrysun

It looks like something is wrong with the tokenization: a token_id is being generated that does not exist in the model. Have you changed the tokenizer to match the new T5 model? Check whether there is a hardcoded T5Tokenizer or pad token somewhere. The problem almost certainly lies there!
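This is exactly how the reported error arises in PyTorch: the model's input embedding has one row per vocabulary id, and looking up an id from a different (larger) tokenizer vocabulary fails. A minimal reproduction with a bare nn.Embedding (the vocab size is just an example value):

```python
import torch

vocab_size = 32100  # example: size of the model's embedding table
emb = torch.nn.Embedding(vocab_size, 16)

valid = torch.tensor([5, 32099])   # ids inside the table
out = emb(valid)                   # works: shape (2, 16)

bad = torch.tensor([32128])        # id from a mismatched tokenizer's vocab
try:
    emb(bad)
except IndexError:
    # PyTorch raises "IndexError: index out of range in self" here,
    # the same error seen in the attached log.
    pass
```

So whenever the tokenizer produces ids >= the checkpoint's embedding size, training fails at the very first embedding lookup.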

manueldeprada avatar Dec 01 '23 17:12 manueldeprada

Probably this is the issue: https://github.com/manueldeprada/Pretraining-T5-PyTorch-Lightning/blob/1c24be36ec77b22c74fc956ff2728e71db374d91/prepare_dataset.py#L115

Can you change it to the tokenizer class of your model, or to AutoTokenizer, and check whether it works?
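After swapping the tokenizer, a cheap guard catches the mismatch up front instead of failing mid-training with "index out of range in self". A sketch (the helper name and the final usage line are illustrative, not part of the repo):

```python
def check_vocab_fits(tokenizer_len: int, embedding_rows: int) -> None:
    """Fail fast if the tokenizer can emit ids the embedding table lacks.

    tokenizer_len:   number of ids the tokenizer can produce, e.g. len(tokenizer)
    embedding_rows:  rows in the model's input embedding, e.g.
                     model.get_input_embeddings().num_embeddings
    """
    if tokenizer_len > embedding_rows:
        raise ValueError(
            f"tokenizer defines {tokenizer_len} ids but the embedding has "
            f"only {embedding_rows} rows; use the checkpoint's own tokenizer "
            f"or resize the model's token embeddings"
        )

# Example usage (hypothetical variable names):
# check_vocab_fits(len(tokenizer), model.get_input_embeddings().num_embeddings)
```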

manueldeprada avatar Dec 01 '23 17:12 manueldeprada