
mT5 fine-tune for en-my got "NaN" in training loss and validation loss

Open learnercat opened this issue 4 years ago • 2 comments

I tried to fine-tune mT5 for English→Myanmar translation on the Tatoeba-Challenge dataset, following this notebook example for en-ro translation. I used "google/mt5-small" as `model_checkpoint` and tested training for 1–4 epochs. These are my training arguments; I reduced `batch_size` to 4:

```python
batch_size = 4
args = Seq2SeqTrainingArguments(
    "mt5-translate-en-my",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=1,
    predict_with_generate=True,
    fp16=True,
)
```

I got "NaN" in training loss and validation loss as below:

[screenshot `mt5_error`: training loss and validation loss both reported as NaN]

Can you please help me figure out what I am doing wrong? Thanks in advance.

learnercat avatar Apr 25 '21 08:04 learnercat

What kind of hardware are you using? Do you get the same issue if you set fp16=False?

msaroufim avatar Aug 04 '21 06:08 msaroufim
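For reference, the suggested change is just flipping the `fp16` flag in the training arguments from the original post. A minimal sketch (argument names as in `transformers.Seq2SeqTrainingArguments`; the rest of the values are unchanged from the question):

```python
from transformers import Seq2SeqTrainingArguments

batch_size = 4
args = Seq2SeqTrainingArguments(
    "mt5-translate-en-my",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=1,
    predict_with_generate=True,
    fp16=False,  # mT5/T5 checkpoints are known to overflow in float16
)
```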

@msaroufim Thank you very much, it worked for me on Colab. The warning about `lr_scheduler.step()` being called before `optimizer.step()` also disappeared. But why do I have to set fp16=False even though I have an A100 GPU on Colab?

Majdoddin avatar Apr 25 '23 14:04 Majdoddin
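The NaN is not about the GPU but about the number format: float16 can only represent values up to about 65504, and mT5/T5 checkpoints produce intermediate activations larger than that, so casting them to float16 overflows to `inf`, which then propagates as NaN into the loss. A minimal NumPy demonstration of the mechanism (illustrative values, not taken from the actual model):

```python
import numpy as np

# float16 tops out at ~65504; float32 goes up to ~3.4e38.
fp16_max = np.finfo(np.float16).max   # 65504.0

# A moderately large activation that is perfectly fine in float32:
act32 = np.float32(1.0e5)

# Cast it to float16, as happens under fp16 mixed-precision training:
act16 = act32.astype(np.float16)      # overflows to inf

# Once an inf appears, ordinary arithmetic produces NaN,
# which then shows up in the reported loss:
diff = act16 - act16                  # inf - inf = nan
print(fp16_max, act16, diff)
```

On an A100 (Ampere) the usual workaround is `bf16=True` instead of `fp16=True` in `Seq2SeqTrainingArguments`: bfloat16 has the same exponent range as float32, so these overflows do not occur, while still giving the memory and speed benefits of 16-bit training.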