
LayoutLMv2 nan training loss and eval

Open victor-ab opened this issue 3 years ago • 11 comments

Describe the bug: The model I am using is LayoutLMv2 with a custom dataset.

The problem arises when using:

  • [x] the official example scripts: I am using the unmodified run_funsd.py, but with a modified dataset.

To Reproduce: Steps to reproduce the behavior:

run_funsd.py --do_eval=True --do_predict=True --do_train=True --early_stop_patience=4 --evaluation_strategy=epoch --fp16=True --load_best_model_at_end=True --max_train_samples=1000 --model_name_or_path=microsoft/layoutlmv2-base-uncased --num_train_epochs=30 --output_dir=/tmp/test-ner --overwrite_output_dir=True --report_to=wandb --save_strategy=epoch --save_total_limit=1 --warmup_ratio=0.1

Fortunately, I recorded everything with wandb.

[wandb chart: training/eval loss and F1 score over epochs]

After 8 epochs, the training and eval loss went to NaN, while the F1 score dropped suddenly. The samples-per-second rate also increased significantly.

  • Platform:
  • Python version: 3.7.1
  • PyTorch version (GPU?): GPU: Tesla T4

victor-ab avatar Jun 06 '21 14:06 victor-ab

@victor-ab I was trying to use a dataset with sample annotations like https://github.com/doc-analysis/DocBank/blob/master/DocBank_samples/DocBank_samples/10.tar_1701.04170.gz_TPNL_afterglow_evo_8.txt. Are there any pointers on how to convert this dataset to the format the run_funsd.py script expects?
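A hedged sketch of such a converter (this is not an official format spec: it assumes each DocBank .txt line holds ten tab-separated fields, namely the token, four bbox coordinates normalised to a 0-1000 grid, an RGB colour, a font name, and a structure label, which matches the linked sample; the output keys words/bboxes/ner_tags are just the column names a FUNSD-style loader typically expects):

```python
# Hypothetical converter sketch, not an official format spec: it assumes
# each DocBank .txt line holds ten tab-separated fields
# (token, x0, y0, x1, y1, R, G, B, fontname, label), with coordinates
# already normalised to the 0-1000 grid LayoutLM-style models expect.
def docbank_to_funsd_example(lines):
    words, bboxes, ner_tags = [], [], []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 10:
            continue  # skip malformed lines
        words.append(fields[0])
        bboxes.append([int(c) for c in fields[1:5]])
        ner_tags.append(fields[9])
    return {"words": words, "bboxes": bboxes, "ner_tags": ner_tags}

sample = ["afterglow\t100\t200\t180\t212\t0\t0\t0\ttimes\tparagraph"]
print(docbank_to_funsd_example(sample))
# -> {'words': ['afterglow'], 'bboxes': [[100, 200, 180, 212]], 'ner_tags': ['paragraph']}
```

The string labels would still need to be mapped to the integer class IDs used by the token-classification head.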

saksham-s avatar Jun 10 '21 00:06 saksham-s

@victor-ab Have you figured out where the problem is? I also used the official example script run_funsd.py with my own custom dataset and got a NaN training loss. The script runs perfectly well on the FUNSD dataset itself.

xianshu1 avatar Jun 15 '21 05:06 xianshu1

Same question here; maybe something is wrong with the spatial-aware self-attention.

XueAdas avatar Jul 28 '21 02:07 XueAdas

Hey, where is the option to provide a custom dataset path in the run_funsd.py file?

kbrajwani avatar Jul 29 '21 15:07 kbrajwani

@victor-ab @xianshu1 @XueAdas I had a similar issue. I tried a lower learning_rate (0.00001) and it works for me now, but training takes quite a long time to get the loss down to the level I want. I guess that is the price to pay.

bkwapong avatar Aug 25 '21 18:08 bkwapong

I am seeing this behavior too, and after fighting it for a whole day I want to share my progress. For me, the NaNs originate in the sequence output of LayoutLMv2, which already contains NaN values when doing something like the following:

outputs = self.layoutlmv2(
        input_ids=input_ids,
        bbox=bbox,
        image=image,
        attention_mask=attention_mask,
        token_type_ids=token_type_ids,
        position_ids=position_ids,
        head_mask=head_mask,
        inputs_embeds=inputs_embeds,
        output_attentions=output_attentions,
        output_hidden_states=output_hidden_states,
        return_dict=return_dict,
)
seq_length = input_ids.size(1)
sequence_output, image_output = (
        outputs[0][:, :seq_length],
        outputs[0][:, seq_length:],
)

The loss being NaN seems to be a secondary symptom 🤔

It still seems to happen at random, but I am trying to narrow it down right now.

So far I have ruled out the following root causes:

  • None of the parameters are NaN
  • None of the parameters are 0
  • None of the inputs are NaN
  • The input IDs are all within the range of the embedding table

As mentioned by @bkwapong, lowering the learning rate below 5e-5 seems to resolve this issue 😉 I hope this proves useful to some of you; I will keep you posted if I find an actual solution 🙂
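Applied to the run_funsd.py invocation from the original report, the workaround amounts to one extra flag (assuming the script forwards the standard TrainingArguments options, which the HfArgumentParser-based examples do; the Trainer default learning rate is 5e-5):

```shell
# Same invocation as in the original report, with only the learning rate lowered
run_funsd.py --do_eval=True --do_predict=True --do_train=True \
  --early_stop_patience=4 --evaluation_strategy=epoch --fp16=True \
  --load_best_model_at_end=True --max_train_samples=1000 \
  --model_name_or_path=microsoft/layoutlmv2-base-uncased \
  --num_train_epochs=30 --output_dir=/tmp/test-ner --overwrite_output_dir=True \
  --report_to=wandb --save_strategy=epoch --save_total_limit=1 \
  --warmup_ratio=0.1 --learning_rate=1e-5
```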

jendrikjoe avatar Oct 13 '21 14:10 jendrikjoe

Okay, continuing my debugging process here: in LayoutLMv2ForTokenClassification, the loss is calculated in the following manner:

loss = None
if labels is not None:
    loss_fct = CrossEntropyLoss()

    if attention_mask is not None:
        active_loss = attention_mask.view(-1) == 1
        active_logits = logits.view(-1, self.num_labels)[active_loss]
        active_labels = labels.view(-1)[active_loss]
        loss = loss_fct(active_logits, active_labels)
    else:
        loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))

The problem arises when active_labels contains many -100 values, which happens when only_label_first_subword is set to True in the tokenizer. CrossEntropyLoss ignores every position whose label is -100 and averages the losses over the remaining positions. Consider the extreme case where only one of the 512 labels is not -100: the "average" is then that single loss divided by one, so the gradient flowing through that one active output is roughly 512 times larger than expected. If instead all labels differ from -100, the loss is the sum over all positions divided by 512, and the gradient is scaled down by 512 as well.
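A tiny pure-Python illustration of this scaling argument (no torch needed; the 1/n weights below mimic what mean vs. sum reduction contribute to each active position's gradient):

```python
# Toy illustration: the weight each *active* token's loss receives in the
# gradient, under the default CrossEntropyLoss() mean reduction versus the
# sum-divided-by-sequence-length fix. -100 is the ignore_index, as in the
# Hugging Face tokenizers.
IGNORE = -100
SEQ_LEN = 512

def weight_default_mean(labels):
    # CrossEntropyLoss() averages over positions whose label != -100,
    # so each kept position contributes with weight 1 / n_kept.
    n_kept = sum(1 for y in labels if y != IGNORE)
    return 1.0 / n_kept

def weight_sum_over_max(labels):
    # The fix: sum reduction divided by the total number of positions,
    # so each kept position always contributes with weight 1 / 512.
    return 1.0 / len(labels)

one_active = [IGNORE] * SEQ_LEN
one_active[0] = 3                       # only the first subword is labelled
all_active = [3] * SEQ_LEN              # every position is labelled

print(weight_default_mean(one_active))  # 1.0 -> gradient 512x larger
print(weight_sum_over_max(one_active))  # 0.001953125 (= 1/512)
print(weight_default_mean(all_active))  # 0.001953125 (= 1/512)
```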

What solved the problem for me is to compute the loss with sum reduction and divide by the maximum number of labels:

loss_fct = CrossEntropyLoss(reduction="sum")
if attention_mask is not None:
    active_loss = attention_mask.view(-1) == 1
    active_logits = logits.view(-1, self.num_labels)[active_loss]
    active_labels = labels.view(-1)[active_loss]
    class_loss = loss_fct(active_logits, active_labels)
else:
    class_loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))

class_loss = class_loss / len(labels.view(-1))

This way the gradients are independent of the number of labels that are -100.

I hope this helps others 🙂 If my assumptions are wrong, I would love some input 👍

jendrikjoe avatar Oct 13 '21 20:10 jendrikjoe

Hey, I am having a similar issue where the logits output by the model after training are NaN. Any idea why this is happening? I am training on a custom dataset with 29 classes and 40,000 data points. The steps I followed are identical to this notebook, except for a few tweaks: https://github.com/NielsRogge/Transformers-Tutorials/blob/master/LayoutLMv2/RVL-CDIP/Fine_tuning_LayoutLMv2ForSequenceClassification_on_RVL_CDIP.ipynb

I do not know if this matters, but I am training on multiple GPUs using Hugging Face's Accelerate API. Any help is much appreciated.

sathwikacharya avatar Feb 14 '22 03:02 sathwikacharya

ย  ๆ‚จๅ‘็š„้‚ฎไปถๅทฒๆ”ถๅˆฐ๏ผŒ่ฐข่ฐข๏ผ ย  Your email has been received, thank you! Ihre e - mail bekommen, danke! ใ‚ใชใŸใฎใƒกใƒผใƒซใŒๅฑŠใใพใ—ใŸใŒใ€ใ‚ใ‚ŠใŒใจใ†ใ”ใ–ใ„ใพใ™๏ผโ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”Xue Xu ย ย ย Tel: @.***

XueAdas avatar Feb 14 '22 03:02 XueAdas

Perhaps there is no problem with the loss-calculation code after all. In my case, I got NaN values only when calculating the loss inside autocast(); once I stopped using AMP, the NaNs disappeared. I hope this is helpful to you.

NaN with AMP is a known issue. https://github.com/pytorch/pytorch/issues/40497
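A minimal illustration of why half precision overflows so easily (pure Python; struct's 'e' format is IEEE float16):

```python
import math
import struct

# float16 tops out at 65504; it round-trips fine through the 'e' format...
assert struct.unpack('e', struct.pack('e', 65504.0))[0] == 65504.0

# ...but anything beyond that range cannot even be packed:
try:
    struct.pack('e', 1e5)
except OverflowError as exc:
    print("overflow:", exc)

# Under AMP, such an overflow shows up as inf instead of an exception, and
# subtracting two infs (e.g. inside a softmax normalisation) yields NaN:
print(math.inf - math.inf)  # nan
```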

magataro avatar May 20 '22 01:05 magataro

ย  ๆ‚จๅ‘็š„้‚ฎไปถๅทฒๆ”ถๅˆฐ๏ผŒ่ฐข่ฐข๏ผ ย  Your email has been received, thank you! Ihre e - mail bekommen, danke! ใ‚ใชใŸใฎใƒกใƒผใƒซใŒๅฑŠใใพใ—ใŸใŒใ€ใ‚ใ‚ŠใŒใจใ†ใ”ใ–ใ„ใพใ™๏ผโ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”Xue Xu ย ย ย Tel: @.***

XueAdas avatar May 20 '22 01:05 XueAdas