ModernBERT

NaN rotary_emb weights in exported pretrained models

Open ahxxm opened this issue 9 months ago • 5 comments

I used pretraining_documentation/convert_to_hf.py to convert a pretrain checkpoint (.pt) to HF models, since some model variants don't need the extended context window, but I found that the output logits are NaN and rotary_emb is NaN in the local attention layers.

The fix is simple: just change the config.json content:

  • old: local_rope_theta: -1
  • fixed: local_rope_theta: 10000.0, the same as the global value and as used in training

Then reload the model from the directory again and the model outputs make sense (a sketch of the patch follows below).
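A minimal sketch of that config patch (assumptions: the exported directory path is hypothetical, and the global theta is stored under global_rope_theta; adjust paths and keys to your setup):

```python
import json
from pathlib import Path

# Hypothetical location of the exported HF model directory.
config_path = Path("exported_model") / "config.json"
config = json.loads(config_path.read_text())

# If the local RoPE theta was exported as -1, fall back to the global value
# (10000.0 here, matching training); otherwise the local attention layers
# produce NaN rotary embeddings and NaN logits.
if config.get("local_rope_theta", -1) == -1:
    config["local_rope_theta"] = config.get("global_rope_theta", 10000.0)

config_path.write_text(json.dumps(config, indent=2))
```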

I guess this is because the pretrain YAMLs don't have this value and FlexBERTConfig defaults to -1, but I'm not quite sure what the proper fix would be.

ahxxm avatar Mar 20 '25 13:03 ahxxm

Thank you, this indeed fixes the same problem I've been having with a checkpoint using a pretrain config!

Rijgersberg avatar Mar 21 '25 10:03 Rijgersberg

Good catch! I wrote the conversion script on top of a decayed model, so we had a local_rope_theta set. I guess this line of the conversion script should extend the check to the -1 value and copy the value from the global layers.

The issue arises from the fact that in the research repo we check for equality with -1, while in the transformers implementation they check for None. I guess we could also set it to None, but maybe it's cleaner to set it and not have None in the config?
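A rough sketch of what that check could look like at conversion time (the function and variable names are illustrative, not the actual code in convert_to_hf.py):

```python
# Illustrative helper, not the actual code in convert_to_hf.py: the research
# config marks an unset local RoPE theta with -1, while the transformers
# implementation expects either a real value or None, so the converter can
# copy the global theta whenever it sees -1.
def resolve_local_rope_theta(local_rope_theta: float, global_rope_theta: float) -> float:
    return global_rope_theta if local_rope_theta == -1 else local_rope_theta


# Example: a pretrain config with local_rope_theta = -1 and a global theta of 10000.0
print(resolve_local_rope_theta(-1, 10000.0))  # -> 10000.0
```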

NohTow avatar Mar 21 '25 11:03 NohTow

I'm ok with either!

ahxxm avatar Apr 10 '25 11:04 ahxxm

Just wanted to say thanks so much for the help, guys; this was exactly the issue I ran into after pretraining. Much love!

frammiie avatar May 10 '25 13:05 frammiie

FYI, I fixed it in this commit by copying the global value into the local one when exporting a model whose local value is set to -1. Sorry it took so long; the pre-training branch should be merged soon!

NohTow avatar Jun 19 '25 15:06 NohTow