
Irregular Loss Pattern; getting "Loss: nan"

Open · SoshyHayami opened this issue 6 months ago · 1 comment

TL;DR:

  • Encountering frequent NaN values mainly for the Loss, during training with a large JPN dataset (10.5 million rows).
  • No such issues with another, albeit much smaller, dataset (800,000 rows).
  • Should I ignore the NaN values or revert to the other dataset, considering how much smaller it is?
  • Attempted to disable mixed precision during training, but the issue remains unresolved.

Hi. I'm trying to train PL-BERT on Japanese. I used the entirety of this dataset for that purpose.

Somehow I'm getting a lot of NaNs for the Loss, while the Vocab Loss (for the most part; on some rare occasions I also get NaN for it) and the Token Loss seem to be doing fine. I've also tried using the whole vocab size of the tokenizer in case something was wrong with the way I pruned it, but no, I'm still getting the same thing.

If I decrease the log interval (to 10, for instance), I see the Loss sit around 2 to 3, then go to NaN, and back and forth.

Step [5100/1000000], Loss: nan, Vocab Loss: 1.12363, Token Loss: 2.01707
Step [5200/1000000], Loss: nan, Vocab Loss: 1.15805, Token Loss: 1.97737
Step [5300/1000000], Loss: nan, Vocab Loss: 1.24844, Token Loss: 1.88506
Step [5400/1000000], Loss: nan, Vocab Loss: 1.18666, Token Loss: 1.90820
Step [5500/1000000], Loss: nan, Vocab Loss: 1.33804, Token Loss: 2.04283
Step [5600/1000000], Loss: nan, Vocab Loss: 1.18824, Token Loss: 1.99786
Step [5700/1000000], Loss: nan, Vocab Loss: 0.98660, Token Loss: 1.84933
Step [5800/1000000], Loss: nan, Vocab Loss: 1.19794, Token Loss: 2.06009
Step [5900/1000000], Loss: nan, Vocab Loss: 1.12529, Token Loss: 2.08546
Step [6000/1000000], Loss: nan, Vocab Loss: 1.10970, Token Loss: 1.98083
Step [6100/1000000], Loss: nan, Vocab Loss: nan, Token Loss: 1.96394
Step [6200/1000000], Loss: nan, Vocab Loss: 1.10657, Token Loss: 1.97735
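
To narrow this down, I'm thinking of guarding the training step and dumping any batch whose logged losses go non-finite. This is only a sketch against a generic PyTorch loop; the dictionary keys ("loss", "vocab_loss", "token_loss") and the place it would be called from are my assumptions, not the actual PL-BERT training code:

import os
import torch

def check_losses(step, batch, losses, out_dir="nan_batches"):
    """Dump the batch whenever any logged loss is non-finite (diagnostic sketch).

    `losses` is assumed to be a dict such as
    {"loss": ..., "vocab_loss": ..., "token_loss": ...}; the keys and the
    call site are guesses about the training loop, not the real PL-BERT code.
    """
    os.makedirs(out_dir, exist_ok=True)
    bad = [k for k, v in losses.items()
           if not torch.isfinite(torch.as_tensor(v)).all()]
    if bad:
        print(f"step {step}: non-finite values in {bad}")
        torch.save(batch, os.path.join(out_dir, f"batch_{step}.pt"))
        return False  # the caller can skip optimizer.step() for this batch
    return True

The idea is just to get my hands on the exact batches that poison the Loss, so I can inspect their rows instead of guessing.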

I should say that I'm seeing this pattern only on this particular dataset. I ran a short test session on this one (the smaller dataset), keeping everything else constant and unchanged, and it seems to work fine. Should I simply ignore the NaN, or should I switch back to the other dataset? (The problematic dataset is roughly 10.5M rows; if a good model can be trained with the 800k rows that work fine, then I guess I should do that?)

I have also tried disabling mixed_precision, but it did not help.
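
Since turning off mixed precision changed nothing, the other things I'm considering are PyTorch's anomaly detection (to see which op first produces the NaN) and gradient clipping. Again, this is just a sketch with a toy model to show where the calls would go; the clip value of 1.0 is my own guess, not something from the PL-BERT repo:

import torch
import torch.nn as nn

# Enable while debugging: backward() will raise at the op that first
# produces NaN/Inf instead of letting it propagate silently.
torch.autograd.set_detect_anomaly(True)

# Toy stand-in for the training loop, only to show placement; the real
# model/optimizer/loss come from the PL-BERT training script.
model = nn.Linear(10, 10)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

x = torch.randn(4, 10)
loss = model(x).pow(2).mean()
loss.backward()

# Clip before the optimizer step; max_norm=1.0 is an assumed value.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()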


Here's my config:

log_dir: "Checkpoint"
mixed_precision: "fp16"
data_folder: "/home/ubuntu/001_PLBERT_JA/PL-BERT/jpn_wiki"
batch_size: 72
save_interval: 5000
log_interval: 100
num_process: 1 # number of GPUs
num_steps: 1000000

dataset_params:
    tokenizer: "cl-tohoku/bert-base-japanese-v2"
    token_separator: " " # token used for phoneme separator (space)
    token_mask: "M" # token used for phoneme mask (M)
    word_separator: 14 # token used for word separator (<unused9>)
    token_maps: "token_maps.pkl" # token map path
    
    max_mel_length: 512 # max phoneme length
    
    word_mask_prob: 0.15 # probability to mask the entire word
    phoneme_mask_prob: 0.1 # probability to mask each phoneme
    replace_prob: 0.2 # probability to replace phonemes
    
model_params:
    vocab_size: 178
    hidden_size: 768
    num_attention_heads: 12
    intermediate_size: 2048
    max_position_embeddings: 512
    num_hidden_layers: 12
    dropout: 0.1
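
Before switching datasets entirely, I also plan to sanity-check the 10.5M-row data itself, since it's the only thing that changed: empty rows or phoneme/token misalignments are my first suspects, because a zero-length or fully masked sample could make a loss term come out as 0/0. A rough scan along these lines (the column names "phonemes" and "input_ids", and the expectation that they align per row, are assumptions about how my preprocessed dataset is laid out):

from datasets import load_from_disk

# Assumed schema: each row has "phonemes" and "input_ids" of matching length.
# Iterating 10.5M rows this way is slow; sampling a subset would also work.
ds = load_from_disk("/home/ubuntu/001_PLBERT_JA/PL-BERT/jpn_wiki")

bad_rows = []
for i, row in enumerate(ds):
    phonemes, input_ids = row["phonemes"], row["input_ids"]
    if len(phonemes) == 0 or len(input_ids) == 0:
        bad_rows.append((i, "empty"))
    elif len(phonemes) != len(input_ids):
        bad_rows.append((i, "phoneme/token length mismatch"))

print(f"{len(bad_rows)} suspicious rows out of {len(ds)}")
print(bad_rows[:20])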

I'm training on 2x V100s (32 GB each). Thank you very much.

SoshyHayami · Feb 10 '24 07:02