openfold
NaN problem while training
While debugging the code, I found a problem with the data entering the model: some of the values whose names are prefixed with template are NaN.
Could you elaborate? Where are the nan values being prefixed for you?
openfold/model/model.py
t = build_template_pair_feat(
    single_template_feats,
    inf=self.config.template.inf,
    eps=self.config.template.eps,
    **self.config.template.distogram,
).to(z.dtype)
I printed t, and the output is as follows:
tensor([[[[[0., 0., 0., ..., nan, nan, 0.],
           [0., 0., 0., ..., nan, nan, 0.],
           [0., 0., 0., ..., nan, nan, 0.],
           ...,
           [0., 0., 0., ..., nan, nan, 0.],
           [0., 0., 0., ..., nan, nan, 0.],
           [0., 0., 0., ..., nan, nan, 0.]],
          [[0., 0., 0., ..., nan, nan, 0.],
           [0., 0., 0., ..., nan, nan, 0.],
           [0., 0., 0., ..., nan, nan, 0.],
           ...,
           [0., 0., 0., ..., nan, nan, 0.],
           [0., 0., 0., ..., nan, nan, 0.],
           [0., 0., 0., ..., nan, nan, 0.]],
          [[0., 0., 0., ..., nan, nan, 0.],
           [0., 0., 0., ..., nan, nan, 0.],
           [0., 0., 0., ..., nan, nan, 0.],
           ...,
           [0., 0., 0., ..., nan, nan, 0.],
           [0., 0., 0., ..., nan, nan, 0.],
           [0., 0., 0., ..., nan, nan, 0.]],
Do any of the inputs to that function contain nan? Would you mind pinpointing where the nans first occur for you?
The inputs do not contain NaN. The NaNs first occur at line 268 of model.py, z = z + template_embeds["template_pair_embedding"]. After that line z is NaN, so the problem is in template_embeds["template_pair_embedding"].
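For anyone repeating this kind of search, here is a minimal sketch of one way to pinpoint where NaNs first appear, assuming a plain PyTorch module and a dict of input tensors. The helper names find_nan_tensors and attach_nan_hooks are hypothetical and are not part of OpenFold.

```python
import torch
from torch import nn


def find_nan_tensors(tensors):
    """Return the names of floating-point tensors in a dict that contain NaN."""
    return [
        name
        for name, value in tensors.items()
        if torch.is_tensor(value)
        and value.is_floating_point()
        and bool(torch.isnan(value).any())
    ]


def attach_nan_hooks(model, hits):
    """Register forward hooks that append a module's name to `hits`
    whenever one of its tensor outputs contains NaN.

    Hooks fire as each module finishes its forward pass, so hits[0]
    is the first (innermost) module that produced a NaN output.
    """
    def make_hook(name):
        def hook(module, inputs, output):
            outs = output if isinstance(output, (tuple, list)) else (output,)
            if any(torch.is_tensor(o) and bool(torch.isnan(o).any()) for o in outs):
                hits.append(name)
        return hook

    # Keep the handles so the hooks can be removed with handle.remove().
    return [
        module.register_forward_hook(make_hook(name))
        for name, module in model.named_modules()
    ]
```

Calling find_nan_tensors on the input features first rules the data in or out. After that, running one forward pass with the hooks attached and reading hits[0] gives the earliest submodule with a NaN output. The root module is reported under the empty name "".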
I'm assuming this is FP16 training?
Yes, I use FP16. Is it due to data precision?
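As background for the precision question, the sketch below shows one generic way FP16 can turn a finite constant into NaN. It does not diagnose this issue: the value 1e9 is a hypothetical stand-in, not the actual template inf setting, and fp16_overflow_demo is an illustrative helper.

```python
import torch


def fp16_overflow_demo(large_value):
    """Cast a float32 constant to FP16, then multiply it by zero.

    Returns (is_inf_after_cast, is_nan_after_masking).
    """
    as_fp32 = torch.tensor([large_value], dtype=torch.float32)
    # Values above the largest finite FP16 number overflow to inf.
    as_fp16 = as_fp32.to(torch.float16)
    # Multiplying inf by a zero mask entry gives NaN rather than 0.
    masked = as_fp16 * torch.zeros(1, dtype=torch.float16)
    return bool(torch.isinf(as_fp16).item()), bool(torch.isnan(masked).item())


print(torch.finfo(torch.float16).max)  # largest finite FP16 value (65504.0)
print(fp16_overflow_demo(1e9))  # hypothetical large constant: overflows
print(fp16_overflow_demo(1e3))  # small constant: stays finite
```

If a constant used as "infinity" exceeds the FP16 range, casting it with .to(torch.float16) gives inf, and later arithmetic such as inf * 0 or inf - inf produces NaN even though the FP32 inputs were clean.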