ModernBERT

NaN rotary_emb weights in exported pretrained models

Open ahxxm opened this issue 9 months ago • 5 comments

I used pretraining_documentation/convert_to_hf.py to convert a pretrain checkpoint (.pt) to HF models, since some model variants don't need the extended context window, but I found that the output logits are NaN and rotary_emb is NaN in the local attention layers.

The fix is simple: just change the config.json content:

  • old: local_rope_theta: -1
  • fixed: local_rope_theta: 10000.0, the same as the global value and as used in training

Then reload the model from the directory again and the model outputs make sense (a sketch of the patch follows below).
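A minimal sketch of that config patch (assumptions: the exported directory path is hypothetical, and the global theta is stored under global_rope_theta; adjust paths and keys to your setup):

```python
import json
from pathlib import Path

# Hypothetical location of the exported HF model directory.
config_path = Path("exported_model") / "config.json"
config = json.loads(config_path.read_text())

# If the local RoPE theta was exported as -1, fall back to the global value
# (10000.0 here, matching training); otherwise the local attention layers
# produce NaN rotary embeddings and NaN logits.
if config.get("local_rope_theta", -1) == -1:
    config["local_rope_theta"] = config.get("global_rope_theta", 10000.0)

config_path.write_text(json.dumps(config, indent=2))
```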

I guess this is because the pretrain YAMLs don't have this value and FlexBERTConfig defaults to -1, but I'm not quite sure what the proper fix would be.

ahxxm avatar Mar 20 '25 13:03 ahxxm

Thank you, this indeed fixes the same problem I've been having with a checkpoint using a pretrain config!

Rijgersberg avatar Mar 21 '25 10:03 Rijgersberg

Good catch! I wrote the conversion script on top of a decayed model, so we had a local_rope_theta set. I guess this line of the conversion script should extend the check to the -1 value and copy the value from the global layers.

The issue arises from the fact that in the research repo we check for equality with -1, while in the transformers implementation they check for None. I guess we could also set it to None, but maybe it's cleaner to set it and not have None in the config?
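A rough sketch of what that check could look like at conversion time (the function and variable names are illustrative, not the actual code in convert_to_hf.py):

```python
# Illustrative helper, not the actual code in convert_to_hf.py: the research
# config marks an unset local RoPE theta with -1, while the transformers
# implementation expects either a real value or None, so the converter can
# copy the global theta whenever it sees -1.
def resolve_local_rope_theta(local_rope_theta: float, global_rope_theta: float) -> float:
    return global_rope_theta if local_rope_theta == -1 else local_rope_theta


# Example: a pretrain config with local_rope_theta = -1 and a global theta of 10000.0
print(resolve_local_rope_theta(-1, 10000.0))  # -> 10000.0
```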

NohTow avatar Mar 21 '25 11:03 NohTow

I'm ok with either!

ahxxm avatar Apr 10 '25 11:04 ahxxm

Just wanted to say thanks so much for the help, guys; this was exactly the issue I ran into after pretraining. Much love!

frammiie avatar May 10 '25 13:05 frammiie

FYI, I fixed it in this commit by copying the global value into the local one when exporting a model whose local value is set to -1. Sorry it took so long; the pre-training branch should be merged soon!

NohTow avatar Jun 19 '25 15:06 NohTow