
Frequent NaN Loss & Training Hangs

Open ZwormZ opened this issue 2 years ago • 17 comments

Thank you for sharing your code!

I am trying to train OpenFold, but the loss keeps becoming NaN, and the whole training run hangs whenever this happens.

I downloaded the code in early December and trained on 8 V100 cards with a training dataset of 1000 samples. When I reached the 26th sample of the 2nd epoch, many warnings reported a NaN loss and the training was interrupted. I applied your "replace training_step in train_openfold.py" workaround from Issue #19, and after the change, on the very first sample I got:

WARNING:root:loss is NaN. Returning 0 loss...

Training still hangs.
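For context, here is a minimal sketch of the kind of NaN guard that workaround describes, assuming a PyTorch Lightning-style `training_step`; the forward and loss calls below are placeholders, not OpenFold's actual code:

```python
import logging
import torch
import pytorch_lightning as pl

class ModelWrapper(pl.LightningModule):
    """Skeleton only; stands in for OpenFold's Lightning wrapper."""

    def training_step(self, batch, batch_idx):
        outputs = self(batch)              # placeholder forward pass
        loss = self.loss(outputs, batch)   # placeholder AlphaFoldLoss call
        if torch.isnan(loss) or torch.isinf(loss):
            logging.warning("loss is NaN. Returning 0 loss...")
            # a zero that still participates in autograd, so the step completes
            loss = loss.new_tensor(0.0, requires_grad=True)
        return loss
```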

I pulled a recent commit and retrained with the same dataset; the same problem occurred again, on the same sample, as shown in the screenshot below:

[screenshot]

I changed the way the mapping is generated in data_modules.py so that the dataset is loaded in a fixed order, and I inspected the samples where the loss becomes NaN but found no abnormalities.

This is very strange: with your first version of the code there have been no NaN losses so far, but with the version you committed after December this problem keeps occurring. Even changing my training dataset and the learning rate in the DeepSpeed config file does not improve the situation.

Is there a workaround for this situation?

ZwormZ avatar Jan 04 '22 16:01 ZwormZ

What do you mean by "the version you committed after December"? Are you referring to a specific commit?

BTW: I just spotted a mistake in the training_step workaround and fixed it. Try again with the new version.

gahdritz avatar Jan 04 '22 18:01 gahdritz

Thanks for your quick response!

I downloaded your code three times: the first was the original version, the second on December 6, and the third today, when I downloaded your latest commit. The initial version did not have the NaN loss problem, but both subsequent versions did.

I reran the new version and got the result shown in the attached screenshot. It is the same situation as with the version I downloaded on December 6, which produced the same results.

ZwormZ avatar Jan 04 '22 18:01 ZwormZ

I'll look into this. That the loss hits NaN and then stays that way is fairly common, but I'm very surprised to hear you didn't encounter the same issue using the initial release, which was worse for me in this regard. If you also have time to look into where the NaNs are happening, that would be great---the more datapoints, the better.

By the way: you shouldn't have to be waiting 160s per iteration. On my 2080 Tis, the default setting takes 16 seconds per iteration. Are you sure the model is running on your GPUs?
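A quick way to check this with plain PyTorch (illustrative only, nothing OpenFold-specific):

```python
import torch

def check_on_gpu(model: torch.nn.Module) -> None:
    """Illustrative sanity check: is CUDA visible, and do the model's
    parameters actually live on a GPU rather than the CPU?"""
    print("CUDA available:", torch.cuda.is_available(),
          "| device count:", torch.cuda.device_count())
    print("first parameter lives on:", next(model.parameters()).device)

# Example with a stand-in module; prints "cpu" until you call .cuda()/.to(device)
check_on_gpu(torch.nn.Linear(4, 4))
```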

gahdritz avatar Jan 04 '22 19:01 gahdritz

Thanks for your reply!

About the iteration time: the model does run on the GPUs, but according to my monitoring, when using DeepSpeed for multi-card training the GPUs are occupied while GPU memory is not always utilized; utilization only reaches about 60% after an interval of roughly 10 minutes, and training seems to be stuck during that time. This does not happen when training on a single card, where the average iteration time is about 40 s (the training set is 1000 proteins randomly selected from the pdb_mmcif files), so in my case training on one card is about twice as fast as training on 8 cards.

The details are shown in the attached monitoring screenshot.

I'm not sure whether this is the cause, since it has been like this since my first training run. Is this normal?

ZwormZ avatar Jan 05 '22 08:01 ZwormZ

Kind of seems like you might be bottlenecked by data processing. Maybe try increasing the number of DataLoader workers, or pinning GPU memory for the DataLoader workers?
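As an illustration, those knobs on a stock torch.utils.data.DataLoader look like the following (dummy dataset; the parameter names are standard PyTorch arguments, not OpenFold-specific flags, and the values are only a starting point):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy dataset standing in for the real feature pipeline.
dummy_dataset = TensorDataset(torch.randn(64, 8))

loader = DataLoader(
    dummy_dataset,
    batch_size=1,
    num_workers=16,            # more workers let feature processing overlap GPU compute
    pin_memory=True,           # page-locked host memory speeds host-to-GPU transfers
    persistent_workers=True,   # keep workers alive between epochs (needs num_workers > 0)
)
```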

gahdritz avatar Jan 05 '22 18:01 gahdritz

Thank you !

I tried moving the training dataset to local disk and setting the number of DataLoader workers to 16. The iteration time did get much shorter: it is now about 60 s/it with the default DeepSpeed config (the same one you provided) when training on 6 V100 cards, but that is still much longer than the 16 s/it you mentioned. I also trained on a single card and got about 40 s/it, so single-card training is still faster than multi-card training in my case.

I also want to test whether it is the data that is causing the iterations to be slow. It would be great if you could share the small toy dataset you used for the test in issue #34.

Also, I trained for about 26 epochs on the 1000-sample dataset with 6 V100 cards, using the initial release code, and plotted the loss curve (attached).

It looks like the loss plateaus around 3 and does not drop any further. Did the loss also stay around 3 when you were training? I would like to know whether this is normal.

ZwormZ avatar Jan 10 '22 07:01 ZwormZ

Could you send the breakdown of the loss? Go to the definition of AlphaFoldLoss in openfold/utils/loss.py and print out the component parts of the cumulative loss.
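A minimal sketch of what that logging could look like, assuming the constituent losses are available as a dict of scalar tensors (the names and values below are made up, not OpenFold's actual code):

```python
import logging
import torch

def log_loss_breakdown(loss_terms: dict) -> None:
    """Log each constituent loss so a single dominating term is easy to spot."""
    for name, value in loss_terms.items():
        logging.warning("%s: %.4f", name, float(value.detach()))

# Example with made-up values:
log_loss_breakdown({
    "fape": torch.tensor(1.2),
    "distogram": torch.tensor(2.3),
    "masked_msa": torch.tensor(0.9),
})
```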

gahdritz avatar Jan 10 '22 18:01 gahdritz

Sure, how can I send it to you conveniently? By email?

I will send you the original loss output from the 26-epoch training run. I am also training again while printing out the component parts of the cumulative loss, which may take some time.

ZwormZ avatar Jan 11 '22 04:01 ZwormZ

Could you just post it here?

gahdritz avatar Jan 11 '22 04:01 gahdritz

Sorry if I was unclear, but I meant printing out the values of each of the constituent losses in openfold/utils/loss.py (e.g. FAPE loss, distogram loss, etc.). I want to see if one of the losses is dominating the final total of ~4.

gahdritz avatar Jan 11 '22 05:01 gahdritz

Sorry for the late reply; I spent some time retraining on two V100 cards. Here is the loss breakdown.

loss_1000_2cards.log

For visualization, I plotted each constituent loss (distogram_loss, fape_loss, lddt_loss, masked_msa_loss, supervised_chi_loss, cumulative_loss, and the total loss); plots attached.

ZwormZ avatar Jan 11 '22 17:01 ZwormZ

Update: after ~4 epochs, plots of the same constituent losses (distogram, FAPE, lDDT, masked MSA, supervised chi, cumulative, and total) are attached.

ZwormZ avatar Jan 12 '22 04:01 ZwormZ

I'd like to test whether it is the data that makes the iteration times in my training so different from yours. It would be great if you could share the small toy dataset you used for the test in issue #34.

ZwormZ avatar Jan 12 '22 04:01 ZwormZ

I got NaN losses during training as well. I used the --precision 16 option, which might lead to overflow in this case; see https://github.com/pytorch/pytorch/issues/40497

A closer inspection suggests that in the TriangleAttentionStartingNode module, where inf is set to 1e9, `mask_bias = (self.inf * (mask - 1))[..., :, None, None, :]` yields -inf.

I tried converting mask to fp32 and the -inf disappears, but I had no luck converting it back to fp16.

BTW, the largest self.inf that fp16 can handle in my case is only about 6e4; I am not sure whether it is safe or accurate to use that.
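An illustrative reproduction of that overflow in plain PyTorch (the 3e4 value is just an example of a bias that stays within fp16 range):

```python
import torch

print(torch.finfo(torch.float16).max)      # 65504.0, the largest finite fp16 value

# A bias built from inf = 1e9 overflows when cast to half precision...
print(torch.tensor(-1e9).half())           # tensor(-inf, dtype=torch.float16)
# ...while a smaller "inf" such as 3e4 survives the cast.
print(torch.tensor(-3e4).half())           # tensor(-30000., dtype=torch.float16)

# With the overflowed value, the mask bias picks up -inf (and NaN where mask == 1).
mask = torch.tensor([1.0, 0.0], dtype=torch.float16)
inf_fp16 = torch.tensor(1e9, dtype=torch.float16)   # already inf after the cast
print(inf_fp16 * (mask - 1))               # tensor([nan, -inf], dtype=torch.float16)
```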

Edit: If I don't enable DeepSpeed, fp16 training works fine on a V100 GPU, but otherwise it gives NaN losses on an RTX 2080.

empyriumz avatar Mar 11 '22 18:03 empyriumz

1e9 is just the default; it should be overridden by the config file when you enable --precision 16. I've been focusing on bfloat16 training and implementing Multimer for the past couple of months, but I should have the bandwidth to look into the FP16 issues soon.

gahdritz avatar Mar 11 '22 20:03 gahdritz

@gahdritz Thanks for the reply! It looks like the overflow is a persistent bug in mixed-precision training, and it may be related to DeepSpeed as well. I'll let you know if I can figure out a solution.

empyriumz avatar Mar 11 '22 20:03 empyriumz