openfold

NaN problem while training

Open liuxm117 opened this issue 3 years ago • 7 comments

While debugging the code, I found a problem with the data entering the model: some of the values whose names are prefixed with template are NaN.

liuxm117 avatar Mar 22 '22 08:03 liuxm117

Could you elaborate? Where are the nan values being prefixed for you?

gahdritz avatar Mar 22 '22 20:03 gahdritz

openfold/model/model.py

t = build_template_pair_feat(
    single_template_feats,
    inf=self.config.template.inf,
    eps=self.config.template.eps,
    **self.config.template.distogram,
).to(z.dtype)

I printed t; the output is as follows: Uploading 微信图片_20220323110033.png…

liuxm117 avatar Mar 23 '22 03:03 liuxm117

tensor([[[[[0., 0., 0.,  ..., nan, nan, 0.],
       [0., 0., 0.,  ..., nan, nan, 0.],
       [0., 0., 0.,  ..., nan, nan, 0.],
       ...,
       [0., 0., 0.,  ..., nan, nan, 0.],
       [0., 0., 0.,  ..., nan, nan, 0.],
       [0., 0., 0.,  ..., nan, nan, 0.]],

      [[0., 0., 0.,  ..., nan, nan, 0.],
       [0., 0., 0.,  ..., nan, nan, 0.],
       [0., 0., 0.,  ..., nan, nan, 0.],
       ...,
       [0., 0., 0.,  ..., nan, nan, 0.],
       [0., 0., 0.,  ..., nan, nan, 0.],
       [0., 0., 0.,  ..., nan, nan, 0.]],

      [[0., 0., 0.,  ..., nan, nan, 0.],
       [0., 0., 0.,  ..., nan, nan, 0.],
       [0., 0., 0.,  ..., nan, nan, 0.],
       ...,
       [0., 0., 0.,  ..., nan, nan, 0.],
       [0., 0., 0.,  ..., nan, nan, 0.],
       [0., 0., 0.,  ..., nan, nan, 0.]],

liuxm117 avatar Mar 23 '22 03:03 liuxm117

Do any of the inputs to that function contain nan? Would you mind pinpointing where the nans first occur for you?

gahdritz avatar Mar 25 '22 18:03 gahdritz
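
One generic way to pinpoint where non-finite values first appear in a PyTorch model is to attach forward hooks to every submodule and record the first one whose output contains NaN or inf. The following is only a minimal sketch: `find_first_nonfinite`, `DivideByZero`, and the toy model are hypothetical names used for illustration and are not part of openfold. It checks tensor, tuple, and list outputs only, so modules that return dicts would need extra handling.

```python
import torch
from torch import nn


def find_first_nonfinite(model, *inputs):
    """Run one forward pass and return the name of the first submodule
    whose output contains NaN or inf, or None if every output is finite."""
    first_bad = []
    handles = []

    def make_hook(name):
        def hook(module, args, output):
            if first_bad:  # an earlier module was already flagged
                return
            outputs = output if isinstance(output, (tuple, list)) else (output,)
            for t in outputs:
                if torch.is_tensor(t) and not torch.isfinite(t).all():
                    first_bad.append(name)
                    return
        return hook

    # Register a hook on every named submodule (the root has an empty name).
    for name, module in model.named_modules():
        if name:
            handles.append(module.register_forward_hook(make_hook(name)))
    try:
        with torch.no_grad():
            model(*inputs)
    finally:
        for handle in handles:
            handle.remove()  # always clean up the hooks
    return first_bad[0] if first_bad else None


class DivideByZero(nn.Module):
    """Toy layer that produces NaN for zero inputs (0 / 0)."""

    def forward(self, x):
        return x / (x - x)


# Hypothetical toy model: the middle layer is the one that breaks.
toy = nn.Sequential(nn.Identity(), DivideByZero(), nn.Identity())
culprit = find_first_nonfinite(toy, torch.zeros(2, 3))
print(culprit)
```

Because forward hooks fire as each module finishes, inner modules are checked before the modules that contain them, so the reported name is the innermost module that first produced a non-finite value.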

The inputs do not contain NaN. The NaN first occurs in model.py at line 268: ''' z = z + template_embeds["template_pair_embedding"] ''' After this line z is NaN, so the problem is in template_embeds["template_pair_embedding"].

liuxm117 avatar Mar 28 '22 06:03 liuxm117

I'm assuming this is FP16 training?

gahdritz avatar Mar 28 '22 18:03 gahdritz

Yes, I use FP16. Is it due to data precision?

liuxm117 avatar Mar 29 '22 01:03 liuxm117
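
The thread does not confirm the root cause, but one common way reduced precision produces NaN can be sketched as follows. A large masking constant that is finite in FP32 overflows to inf when cast to FP16, and inf multiplied by zero (or inf minus inf) is NaN. The value `1e9` below is a hypothetical stand-in and is not taken from the actual `config.template.inf` setting.

```python
import torch

# Largest finite value representable in half precision.
fp16_max = torch.finfo(torch.float16).max

# Hypothetical large "inf"-style constant; finite in FP32.
big = torch.tensor(1e9, dtype=torch.float32)

# Casting to FP16 overflows, because 1e9 is far above fp16_max.
big_half = big.to(torch.float16)

# Multiplying the overflowed value by a zero mask entry gives NaN.
zero_half = torch.tensor(0.0, dtype=torch.float16)
masked = big_half * zero_half

# Subtracting inf from inf also gives NaN.
diff = big_half - big_half

print(fp16_max)
print(torch.isfinite(big).item(), torch.isinf(big_half).item())
print(torch.isnan(masked).item(), torch.isnan(diff).item())
```

If this kind of overflow is the cause, printing intermediate tensors right after the cast to `z.dtype` should show inf values before the first NaN appears.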