NaN rotary_emb weights in exported pretrained models
I used pretraining_documentation/convert_to_hf.py to convert a pretraining checkpoint (.pt) to an HF model, since some model variants don't need the context-window extension stage. However, the output logits are NaN, and rotary_emb is NaN in the local attention layers.
The fix is simple: just change the config.json content.
- old: `local_rope_theta: -1`
- fixed: `local_rope_theta: 10000.0` (the same as the global value and as used in training)

After reloading the model from the directory, the outputs make sense again.
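In case anyone else hits this before the converter is patched, here is a minimal sketch of the workaround as a script. The `exported_model` path is a placeholder, the `global_rope_theta` key name is an assumption about the exported config, and 10000.0 is the fallback per the values above:

```python
import json
from pathlib import Path

# Placeholder path: point this at the directory written by convert_to_hf.py
config_path = Path("exported_model") / "config.json"
config = json.loads(config_path.read_text())

# The exported config carries the -1 sentinel, which the transformers
# implementation does not treat as "unset", so the local attention layers
# end up with NaN rotary embeddings. Fall back to the global value.
if config.get("local_rope_theta", -1) == -1:
    config["local_rope_theta"] = config.get("global_rope_theta", 10000.0)

config_path.write_text(json.dumps(config, indent=2))
```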
I guess this is because the pretrain YAMLs don't set this value and FlexBERTConfig defaults to -1, but I'm not sure what the proper fix would be.
Thank you, this indeed fixes the same problem I've been having with a checkpoint using a pretrain config!
Good catch! I wrote the conversion script on top of a decayed model, so we always had a local_rope_theta set. I guess this line of the converter should extend the check to the -1 value and copy the value from the global layers.
The issue arises from the fact that in the research repo we check for equality with -1, while the transformers implementation checks for None. I guess we could also set it to None, but maybe it's cleaner to set it explicitly and not have None in the config?
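For reference, a rough sketch of what extending that check could look like; the function and attribute names here are illustrative, not the actual converter's code:

```python
def resolve_local_rope_theta(flex_config, default_theta: float = 10000.0) -> float:
    """Resolve the research repo's -1 sentinel into a concrete rope theta.

    The research repo marks "use the global value" with -1, while the
    transformers implementation expects None, so the sentinel has to be
    resolved (or translated to None) when exporting to an HF config.
    Attribute names are assumptions for illustration only.
    """
    local_theta = getattr(flex_config, "local_rope_theta", -1)
    if local_theta is None or local_theta == -1:
        # Copy the value used by the global attention layers during training.
        return getattr(flex_config, "rope_theta", default_theta)
    return local_theta
```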
I'm ok with either!
Just wanted to say thanks a lot for the help, guys; this was exactly the issue I ran into after pretraining. Much love!
FYI, I fixed it in this commit by copying the global value into the local one when exporting a model that has the local value set to -1. Sorry it took so long; the pre-training branch should be merged soon!