MaskedLM nan training loss
I have been trying to run pre-training on a fineweb subset with ModernBERT using HuggingFace transformers (I don't see a way to use this repo yet for pre-training).
First, I tokenize my dataset:
from transformers import PreTrainedTokenizerFast

hf_tokenizer = PreTrainedTokenizerFast.from_pretrained("answerdotai/ModernBERT-base")

def tokenize_function(examples):
    return hf_tokenizer(examples["text"], truncation=True)

# ds_select is the fineweb subset loaded earlier
tokenized_dataset = ds_select.map(
    tokenize_function,
    batched=True,
    batch_size=1000,
)
Then, I initialize a ModernBERT model:
from transformers import ModernBertConfig, ModernBertForMaskedLM

bert_config = ModernBertConfig(
    global_rope_theta=10000,
    pad_token_id=hf_tokenizer.pad_token_id,
    bos_token_id=hf_tokenizer.bos_token_id,
    eos_token_id=hf_tokenizer.eos_token_id,
    cls_token_id=hf_tokenizer.cls_token_id,
    sep_token_id=hf_tokenizer.sep_token_id,
)
model = ModernBertForMaskedLM(bert_config)
I set up a DataCollator with the recommended mlm_probability:
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(
    tokenizer=hf_tokenizer, mlm=True, mlm_probability=0.3
)
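A quick way to see what the collator produces for a couple of examples (a minimal sketch, not part of my training script):
# Inspect one collated batch: masked positions carry the original token id in
# `labels`; every other position is -100 and is ignored by the loss.
features = [{"input_ids": tokenized_dataset[i]["input_ids"]} for i in range(2)]
batch = data_collator(features)
print(batch["input_ids"].shape, batch["labels"].shape)
print(batch["input_ids"][0][:20])
print(batch["labels"][0][:20])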
and start the training:
# LoggingTrainer is my custom Trainer subclass that logs the per-example loss and
# the offending batch (the "Faulty inputs detected" output below)
trainer = LoggingTrainer(
    model=model,
    args=training_args,
    train_dataset=split_datasets["train"].shuffle(),
    eval_dataset=split_datasets["test"].shuffle(),
    data_collator=data_collator,
    processing_class=hf_tokenizer,
)
trainer.train()
Right on the first batch I get a NaN loss:
Loss: tensor([10.8572, nan], device='cuda:0', grad_fn=<GatherBackward>)
Faulty inputs detected:
input_ids: tensor([[50281, 510, 6146, ..., 7355, 50284, 50282],
[50281, 510, 34461, ..., 50283, 50283, 50283]], device='cuda:0')
attention_mask: tensor([[1, 1, 1, ..., 1, 1, 1],
[1, 1, 1, ..., 0, 0, 0]], device='cuda:0')
labels: tensor([[-100, -100, -100, ..., -100, 15, -100],
[-100, -100, -100, ..., -100, -100, -100]], device='cuda:0')
Loss: tensor([nan, nan], device='cuda:0', grad_fn=<GatherBackward>)
Faulty inputs detected:
input_ids: tensor([[50281, 25897, 13, ..., 50283, 50283, 50283],
[50281, 510, 941, ..., 50284, 15, 50282]], device='cuda:0')
attention_mask: tensor([[1, 1, 1, ..., 0, 0, 0],
[1, 1, 1, ..., 1, 1, 1]], device='cuda:0')
labels: tensor([[-100, -100, -100, ..., -100, -100, -100],
[-100, -100, -100, ..., 2774, -100, -100]], device='cuda:0')
Notice how the labels don't seem to be aligned (50284 vs. 15)? What am I doing wrong here? I have done pre-training with other models using the transformers library and haven't run into this kind of problem before. I would be grateful for any guidance.
I ran into a similar issue: my loss was always 0 with ModernBERT, but when I swapped in the BERT model, the loss decreased normally.
You can try FlashAttention2, for example:
import torch
from transformers import AutoModelForMaskedLM
from transformers.utils import is_torch_bf16_gpu_available

model = AutoModelForMaskedLM.from_pretrained(
    MODEL_PATH,
    trust_remote_code=True,
    ignore_mismatched_sizes=True,
    torch_dtype=torch.bfloat16 if is_torch_bf16_gpu_available() else torch.float16,
    attn_implementation=ATTN_IMPLEMENTATION,
).to("cuda" if torch.cuda.is_available() else "cpu")
I've encountered this problem. Is there a solution available?
Hi everyone. I confirm that I am also experiencing the same issue.
Hello, sorry for the delay.
I never tried running MLM with ModernBERT using HF. Maybe you can try opening an issue on HF just to make sure this isn't something related to the implementation there? It seems that some people had issues that went away after updating torch/flash-attention (or even transformers), so maybe that is the issue here as well. See this, this, and this, or the reports on the HF repo here, here, and here.
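As a first step, it can help to confirm which versions are actually in play; a quick check along these lines (flash_attn is optional and may not be installed):
# Print the library versions most relevant to the reported NaN issues.
import importlib.util

import torch
import transformers

print("torch:", torch.__version__)
print("transformers:", transformers.__version__)
if importlib.util.find_spec("flash_attn") is not None:
    import flash_attn
    print("flash_attn:", flash_attn.__version__)
else:
    print("flash_attn: not installed")
print("CUDA available:", torch.cuda.is_available())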
I can reproduce the NaN issue reliably by training an HF ModernBERT model in fp16 without AMP; with AMP I've seen no problems so far.
Training a HF ModernBERT model in pure fp16 produces NaN weights after the first optimizer step. The root cause is Adam's default eps=1e-8 rounding to zero in fp16, which causes a division by zero in the weight update. AMP avoids this by keeping the weights (and optimizer state) in fp32, where eps=1e-8 is still representable.
#!/usr/bin/env python3
"""Adam eps=1e-8 becomes 0 in fp16, causing division by zero."""
import torch
from transformers import ModernBertConfig, ModernBertModel

config = ModernBertConfig(vocab_size=256, pad_token_id=None, hidden_size=64,
                          num_hidden_layers=1, num_attention_heads=2)

def test(eps):
    model = ModernBertModel(config).half().train()
    optimizer = torch.optim.AdamW(model.parameters(), eps=eps)
    model(torch.randint(0, 256, (1, 16))).last_hidden_state.mean().backward()
    optimizer.step()
    return sum(torch.isnan(p).any().item() for p in model.parameters())

print(f"eps=1e-8 in fp16: {torch.tensor(1e-8).half().item()}")
print(f"NaN params: eps=1e-8 -> {test(1e-8)}, eps=1e-4 -> {test(1e-4)}")
It's interesting that this is such a prevalent issue for ModernBERT in particular, when in principle the same division by zero in Adam could be hit with any model trained in pure fp16. After reviewing modeling_modernbert.py from the transformers repo, Claude points to these three candidates as the most likely causes (a quick numeric illustration follows the list):
1. LayerNorm eps=1e-5 - at the edge of fp16 precision (smallest normal is ~6e-5)
2. GeGLU activation - the element-wise product act(x) * gate can produce very small values
3. Attention mask uses finfo.min - fp16 min is -65504, which is added to the attention scores before the softmax
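To put numbers on those limits, a small illustration (not a fix, just showing where fp16 runs out of headroom):
import torch

# fp16 limits relevant to the three candidates above.
print(torch.finfo(torch.float16).tiny)          # smallest normal ~6.1e-05, just above LayerNorm eps=1e-5
print(torch.finfo(torch.float16).min)           # -65504.0, the value added to masked attention scores
print(torch.tensor(1e-5, dtype=torch.float16))  # still representable (as a subnormal)
print(torch.tensor(1e-8, dtype=torch.float16))  # rounds to 0, so Adam's eps vanishes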
ModernBERT was trained with AMP-BF16, and fine-tuning in pure fp16 isn't an expected use case. Use AMP-BF16, or AMP-FP16, with PyTorch SDPA or preferably with Flash Attention installed.
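For example, a minimal sketch of the mixed-precision flags with the Trainer setup from the original post (output_dir and batch size are placeholders, and a CUDA GPU is assumed):
import torch
from transformers import TrainingArguments

# Prefer AMP-BF16 when the GPU supports it, otherwise fall back to AMP-FP16.
# Either way the weights stay in fp32; only the forward/backward pass runs in
# reduced precision, so Adam's eps=1e-8 is never rounded away.
use_bf16 = torch.cuda.is_available() and torch.cuda.is_bf16_supported()

training_args = TrainingArguments(
    output_dir="modernbert-mlm",        # placeholder
    per_device_train_batch_size=16,     # placeholder
    bf16=use_bf16,
    fp16=not use_bf16,
)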