
LayoutLMv2 nan training loss and eval

Open victor-ab opened this issue 3 years ago • 11 comments

Describe the bug: The model I am using is LayoutLMv2 with a custom dataset.

The problem arises when using:

  • [x] the official example scripts: I am using the unmodified run_funsd.py, but with a modified dataset.

To Reproduce: Steps to reproduce the behavior:

run_funsd.py --do_eval=True --do_predict=True --do_train=True --early_stop_patience=4 --evaluation_strategy=epoch --fp16=True --load_best_model_at_end=True --max_train_samples=1000 --model_name_or_path=microsoft/layoutlmv2-base-uncased --num_train_epochs=30 --output_dir=/tmp/test-ner --overwrite_output_dir=True --report_to=wandb --save_strategy=epoch --save_total_limit=1 --warmup_ratio=0.1

Fortunately, I recorded everything with wandb.

[wandb chart: training/eval loss and F1 score over epochs]

After 8 epochs, the training and eval loss went to NaN, while the F1 score dropped suddenly. The samples-per-second rate also increased significantly.

  • Platform:
  • Python version: 3.7.1
  • PyTorch version (GPU?): GPU: Tesla T4

victor-ab avatar Jun 06 '21 14:06 victor-ab

@victor-ab I was trying to use a dataset with sample annotations like https://github.com/doc-analysis/DocBank/blob/master/DocBank_samples/DocBank_samples/10.tar_1701.04170.gz_TPNL_afterglow_evo_8.txt. Are there any pointers on how to convert this dataset to the format the run_funsd.py script expects?
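A hedged sketch of such a converter (this is not an official format spec: it assumes each DocBank .txt line holds ten tab-separated fields, namely the token, four bbox coordinates normalised to a 0-1000 grid, an RGB colour, a font name, and a structure label, which matches the linked sample; the output keys words/bboxes/ner_tags are just the column names a FUNSD-style loader typically expects):

```python
# Hypothetical converter sketch, not an official format spec: it assumes
# each DocBank .txt line holds ten tab-separated fields
# (token, x0, y0, x1, y1, R, G, B, fontname, label), with coordinates
# already normalised to the 0-1000 grid LayoutLM-style models expect.
def docbank_to_funsd_example(lines):
    words, bboxes, ner_tags = [], [], []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 10:
            continue  # skip malformed lines
        words.append(fields[0])
        bboxes.append([int(c) for c in fields[1:5]])
        ner_tags.append(fields[9])
    return {"words": words, "bboxes": bboxes, "ner_tags": ner_tags}

sample = ["afterglow\t100\t200\t180\t212\t0\t0\t0\ttimes\tparagraph"]
print(docbank_to_funsd_example(sample))
# -> {'words': ['afterglow'], 'bboxes': [[100, 200, 180, 212]], 'ner_tags': ['paragraph']}
```

The string labels would still need to be mapped to the integer class IDs used by the token-classification head.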

saksham-s avatar Jun 10 '21 00:06 saksham-s

@victor-ab Have you figured out where the problem is? I also used the official example script run_funsd.py with my own custom dataset and got a NaN training loss. The script runs perfectly well on the FUNSD dataset itself.

xianshu1 avatar Jun 15 '21 05:06 xianshu1

Same question here; maybe something is wrong with the spatial-aware self-attention.

XueAdas avatar Jul 28 '21 02:07 XueAdas

Hey, where is the option to provide a custom dataset path in the run_funsd.py file?

kbrajwani avatar Jul 29 '21 15:07 kbrajwani

@victor-ab @xianshu1 @XueAdas I had a similar issue. I tried a lower learning_rate (0.00001) and it works for me now, but training takes quite a long time to get the loss down to the level I want. I guess that is the price to pay.

bkwapong avatar Aug 25 '21 18:08 bkwapong

I am seeing this behavior too, and after fighting it for a whole day I want to share my progress. For me, the NaNs originate in the sequence output of LayoutLMv2, which already contains NaN values when doing something like the following:

outputs = self.layoutlmv2(
        input_ids=input_ids,
        bbox=bbox,
        image=image,
        attention_mask=attention_mask,
        token_type_ids=token_type_ids,
        position_ids=position_ids,
        head_mask=head_mask,
        inputs_embeds=inputs_embeds,
        output_attentions=output_attentions,
        output_hidden_states=output_hidden_states,
        return_dict=return_dict,
)
seq_length = input_ids.size(1)
sequence_output, image_output = (
        outputs[0][:, :seq_length],
        outputs[0][:, seq_length:],
)

The loss being NaN seems to be a secondary symptom 🤔

It still seems to happen at random, but I am trying to narrow it down right now.

So far I have ruled out the following root causes:

  • None of the parameters are NaN
  • None of the parameters are 0
  • None of the inputs are NaN
  • The input IDs are all within the range of the embedding table

As mentioned by @bkwapong, lowering the learning rate below 5e-5 seems to resolve this issue 😉 I hope this proves useful to some of you; I will keep you posted if I find an actual solution 🙂
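Applied to the run_funsd.py invocation from the original report, the workaround amounts to one extra flag (assuming the script forwards the standard TrainingArguments options, which the HfArgumentParser-based examples do; the Trainer default learning rate is 5e-5):

```shell
# Same invocation as in the original report, with only the learning rate lowered
run_funsd.py --do_eval=True --do_predict=True --do_train=True \
  --early_stop_patience=4 --evaluation_strategy=epoch --fp16=True \
  --load_best_model_at_end=True --max_train_samples=1000 \
  --model_name_or_path=microsoft/layoutlmv2-base-uncased \
  --num_train_epochs=30 --output_dir=/tmp/test-ner --overwrite_output_dir=True \
  --report_to=wandb --save_strategy=epoch --save_total_limit=1 \
  --warmup_ratio=0.1 --learning_rate=1e-5
```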

jendrikjoe avatar Oct 13 '21 14:10 jendrikjoe

Okay, continuing my debugging process here: in LayoutLMv2ForTokenClassification, the loss is calculated in the following manner:

loss = None
if labels is not None:
    loss_fct = CrossEntropyLoss()

    if attention_mask is not None:
        active_loss = attention_mask.view(-1) == 1
        active_logits = logits.view(-1, self.num_labels)[active_loss]
        active_labels = labels.view(-1)[active_loss]
        loss = loss_fct(active_logits, active_labels)
    else:
        loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))

The problem arises when active_labels contains many -100 values, which happens when only_label_first_subword is set to True in the tokenizer. CrossEntropyLoss ignores every position whose label is -100 and averages the losses over the remaining positions. Consider the extreme case where only one of the 512 labels is not -100: the "average" is then that single loss divided by one, so the gradient flowing through that one active output is roughly 512 times larger than expected. If instead all labels differ from -100, the loss is the sum over all positions divided by 512, and the gradient is scaled down by 512 as well.
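A tiny pure-Python illustration of this scaling argument (no torch needed; the 1/n weights below mimic what mean vs. sum reduction contribute to each active position's gradient):

```python
# Toy illustration: the weight each *active* token's loss receives in the
# gradient, under the default CrossEntropyLoss() mean reduction versus the
# sum-divided-by-sequence-length fix. -100 is the ignore_index, as in the
# Hugging Face tokenizers.
IGNORE = -100
SEQ_LEN = 512

def weight_default_mean(labels):
    # CrossEntropyLoss() averages over positions whose label != -100,
    # so each kept position contributes with weight 1 / n_kept.
    n_kept = sum(1 for y in labels if y != IGNORE)
    return 1.0 / n_kept

def weight_sum_over_max(labels):
    # The fix: sum reduction divided by the total number of positions,
    # so each kept position always contributes with weight 1 / 512.
    return 1.0 / len(labels)

one_active = [IGNORE] * SEQ_LEN
one_active[0] = 3                       # only the first subword is labelled
all_active = [3] * SEQ_LEN              # every position is labelled

print(weight_default_mean(one_active))  # 1.0 -> gradient 512x larger
print(weight_sum_over_max(one_active))  # 0.001953125 (= 1/512)
print(weight_default_mean(all_active))  # 0.001953125 (= 1/512)
```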

What solved the problem for me is to compute the loss with sum reduction and divide by the maximum number of labels:

loss_fct = CrossEntropyLoss(reduction="sum")
if attention_mask is not None:
    active_loss = attention_mask.view(-1) == 1
    active_logits = logits.view(-1, self.num_labels)[active_loss]
    active_labels = labels.view(-1)[active_loss]
    class_loss = loss_fct(active_logits, active_labels)
else:
    class_loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))

class_loss = class_loss / len(labels.view(-1))

This way the gradients are independent of the number of labels that are -100.

I hope this helps others 🙂 If my assumptions are wrong, I would love some input 👍

jendrikjoe avatar Oct 13 '21 20:10 jendrikjoe

Hey, I am having a similar issue where the logits output by the model after training are NaN. Any idea why this is happening? I am training on a custom dataset with 29 classes and 40,000 data points. The steps I followed are identical to this notebook, except for a few tweaks: https://github.com/NielsRogge/Transformers-Tutorials/blob/master/LayoutLMv2/RVL-CDIP/Fine_tuning_LayoutLMv2ForSequenceClassification_on_RVL_CDIP.ipynb

I do not know if this matters, but I am training on multiple GPUs using Hugging Face's Accelerate API. Any help is much appreciated.

sathwikacharya avatar Feb 14 '22 03:02 sathwikacharya

ย  ๆ‚จๅ‘็š„้‚ฎไปถๅทฒๆ”ถๅˆฐ๏ผŒ่ฐข่ฐข๏ผ ย  Your email has been received, thank you! Ihre e - mail bekommen, danke! ใ‚ใชใŸใฎใƒกใƒผใƒซใŒๅฑŠใใพใ—ใŸใŒใ€ใ‚ใ‚ŠใŒใจใ†ใ”ใ–ใ„ใพใ™๏ผโ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”Xue Xu ย ย ย Tel: @.***

XueAdas avatar Feb 14 '22 03:02 XueAdas

Perhaps there is no problem with the loss-calculation code after all. In my case, I got NaN values only when calculating the loss inside autocast(); once I stopped using AMP, the NaNs disappeared. I hope this is helpful to you.

NaN with AMP is a known issue. https://github.com/pytorch/pytorch/issues/40497
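A minimal illustration of why half precision overflows so easily (pure Python; struct's 'e' format is IEEE float16):

```python
import math
import struct

# float16 tops out at 65504; it round-trips fine through the 'e' format...
assert struct.unpack('e', struct.pack('e', 65504.0))[0] == 65504.0

# ...but anything beyond that range cannot even be packed:
try:
    struct.pack('e', 1e5)
except OverflowError as exc:
    print("overflow:", exc)

# Under AMP, such an overflow shows up as inf instead of an exception, and
# subtracting two infs (e.g. inside a softmax normalisation) yields NaN:
print(math.inf - math.inf)  # nan
```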

magataro avatar May 20 '22 01:05 magataro

ย  ๆ‚จๅ‘็š„้‚ฎไปถๅทฒๆ”ถๅˆฐ๏ผŒ่ฐข่ฐข๏ผ ย  Your email has been received, thank you! Ihre e - mail bekommen, danke! ใ‚ใชใŸใฎใƒกใƒผใƒซใŒๅฑŠใใพใ—ใŸใŒใ€ใ‚ใ‚ŠใŒใจใ†ใ”ใ–ใ„ใพใ™๏ผโ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”โ€”Xue Xu ย ย ย Tel: @.***

XueAdas avatar May 20 '22 01:05 XueAdas