Frequent NaN loss & training hangs
Thank you for sharing your code!
I am trying to train OpenFold, but I keep running into NaN losses, and the whole training run hangs when this happens.
I downloaded the code in early December and trained on 8 V100 cards with a training dataset of 1000 samples. When training reached the 26th sample of the 2nd epoch, many warnings were printed, the loss was NaN, and the training was interrupted.
I read your suggestion in Issue #19 to replace `training_step` in `train_openfold.py`. After applying the change, training the very first sample gave me this:
WARNING:root:loss is NaN. Returning 0 loss...
Training still hangs.
I pulled a recent commit and retrained with the same dataset; the same problem occurred again, on the same sample. Like this:
I changed the way the mapping is generated in `data_modules.py` so that the dataset is loaded in a fixed order, and I inspected the sample where the loss becomes NaN but found nothing abnormal.
This is very strange: with your first version of the code I have seen no NaN loss so far, but with the versions committed after December the problem keeps occurring, even if I change my training dataset or the learning rate in the DeepSpeed config file.
Is there a workaround for this situation?
What do you mean by "the version you committed after December"? Are you referring to a specific commit?
BTW: I just spotted a mistake in the training_step workaround and fixed it. Try again with the new version.
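(For anyone following along: the workaround is roughly of the following shape. This is a minimal sketch rather than the exact code in `train_openfold.py`; the wrapper class name and the `self.loss(...)` call are placeholders, and only the NaN guard itself is the point.)

```python
import logging

import pytorch_lightning as pl
import torch


class OpenFoldWrapper(pl.LightningModule):  # placeholder wrapper, not the real one
    def training_step(self, batch, batch_idx):
        outputs = self(batch)
        loss = self.loss(outputs, batch)  # assumes a composite loss module on self

        # Guard against non-finite losses so one bad batch does not poison
        # the whole run with NaN gradients.
        if not torch.isfinite(loss):
            logging.warning("loss is NaN. Returning 0 loss...")
            # A detached zero skips any meaningful update for this batch.
            # Note: under DDP/DeepSpeed, ranks that take this branch do less
            # work than the others, which is one way a run can hang.
            loss = torch.tensor(0.0, device=loss.device, requires_grad=True)

        return loss
```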
Thanks for your quick response!
I downloaded your code three times: the original version, the version from December 6, and today's latest commit. The initial version did not have the NaN loss problem, but both later versions did.
I reran the new version and here is the result:
This is the same behavior as the version I downloaded on December 6, with the same results.
I'll look into this. That the loss hits NaN and then stays that way is fairly common, but I'm very surprised to hear you didn't encounter the same issue using the initial release, which was worse for me in this regard. If you also have time to look into where the NaNs are happening, that would be great---the more datapoints, the better.
By the way: you shouldn't have to be waiting 160s per iteration. On my 2080 Tis, the default setting takes 16 seconds per iteration. Are you sure the model is running on your GPUs?
Thanks for your reply!
About the iteration time: the model does run on the GPUs, but according to the monitoring, when using DeepSpeed for multi-card training the GPUs are occupied while GPU memory utilization stays low, only reaching about 60% after roughly a 10-minute interval, as if training were stuck during that time. This did not happen when training on a single card, where each iteration took about 40 s on average (1000 proteins randomly selected from the pdb_mmcif data as the training set). So in my case, training on one card was about twice as fast as training on 8 cards.
The details are shown in the figure below:
I'm not sure whether this is the cause; it has been like this since my first training run. Is this normal?
Kind of seems like you might be bottlenecked by data processing. Maybe try increasing the number of DataLoader workers, or pinning GPU memory for the DataLoader workers?
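(Not OpenFold-specific, but for concreteness: these are the standard `torch.utils.data.DataLoader` knobs being suggested. Where exactly they are wired through in OpenFold's data module may differ, so treat this as an illustrative sketch.)

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset standing in for OpenFold's feature pipeline; the knobs below are
# generic torch DataLoader arguments, not OpenFold-specific settings.
train_dataset = TensorDataset(torch.randn(1000, 32))

train_loader = DataLoader(
    train_dataset,
    batch_size=1,
    num_workers=16,           # overlap data processing with GPU compute
    pin_memory=True,          # page-locked host memory speeds up host-to-GPU copies
    persistent_workers=True,  # keep workers alive between epochs
)
```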
Thank you!
I tried moving the training dataset to local disk and setting the number of DataLoader workers to 16. The iteration time did get much shorter: it is now about 60 s/it with the default DeepSpeed config (the same one you provided) when training on 6 V100 cards, but that is still much longer than the 16 s/it you mentioned. I also trained on a single card and got 40 s/it, so single-card training is still faster than multi-card training in my case.
I also want to test whether it is the data that is slowing down the iterations. It would be great if you could share the small toy dataset you used for the test in issue #34.
Also, I trained for about 26 epochs on the 1000-sample dataset with 6 V100 cards, using the initial release code, and plotted the loss as follows:
It looks like the loss plateaus around 3 and does not drop any further. Did the loss also stay around 3 when you were training? I would like to know whether this is normal.
Could you send the breakdown of the loss? Go to the definition of `AlphaFoldLoss` in `openfold/utils/loss.py` and print out the component parts of the cumulative loss.
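(A sketch of what this instrumentation could look like, using a toy composite loss; the real `AlphaFoldLoss` has its own attribute names and weighting scheme, so adapt accordingly.)

```python
import logging

import torch


class CompositeLoss(torch.nn.Module):
    """Toy stand-in for a weighted sum of named loss terms (FAPE, distogram, ...)."""

    def __init__(self, loss_fns, weights):
        super().__init__()
        self.loss_fns = loss_fns  # dict: name -> callable(outputs, batch) -> scalar tensor
        self.weights = weights    # dict: name -> float

    def forward(self, outputs, batch):
        cum_loss = 0.0
        for name, loss_fn in self.loss_fns.items():
            term = loss_fn(outputs, batch)
            # Log every constituent loss so it is obvious which term dominates
            # the total and which one goes NaN first.
            logging.info("%s: %.4f (weight %.2f)", name, term.item(), self.weights[name])
            cum_loss = cum_loss + self.weights[name] * term
        return cum_loss
```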
Sure, how can I send it to you conveniently? By email?
I will send you the original loss output from the 26-epoch run, and I am training again with the component parts of the cumulative loss printed out, which may take some time.
Could you just post it here?
Sorry if I was unclear, but I meant printing out the values of each of the constituent losses in openfold/utils/loss.py
(e.g. FAPE loss, distogram loss, etc.). I want to see if one of the losses is dominating the final total of ~4.
Sorry for the late reply. I spent some time retraining on two V100 cards; here is the loss breakdown.
For visualization, I plotted each constituent loss, as follows:
Update: after ~4 epochs
I'd also like to test whether it's the data that causes the iteration times in my runs to differ so much from yours. It would be great if you could share the small toy dataset you used for the test in issue #34.
I got NaN losses in training as well. I used the `--precision 16` option, which might lead to overflow in this case; see https://github.com/pytorch/pytorch/issues/40497
A closer inspection suggests that in the `TriangleAttentionStartingNode` module, where `inf` is set to 1e9, `mask_bias = (self.inf * (mask - 1))[..., :, None, None, :]` yields `-inf`.
I tried converting `mask` to fp32 and the `-inf` disappears, but I had no luck converting it back to fp16.
BTW, the largest `self.inf` that fp16 can handle in my case is only about 6e4; I'm not sure whether it's safe or accurate to use that.
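(To illustrate the arithmetic: float16 tops out at 65504, so a bias constant of 1e9 cannot survive the cast to fp16. A quick standalone check:)

```python
import torch

# float16 cannot represent anything much larger than ~6.5e4.
print(torch.finfo(torch.float16).max)          # 65504.0
print(torch.tensor(1e9, dtype=torch.float16))  # tensor(inf, dtype=torch.float16)

# With inf = 1e9, masked-out positions (mask == 0) get a bias of -1e9,
# which overflows to -inf once stored in fp16.
inf = 1e9
mask = torch.tensor([1.0, 0.0])              # fp32 for clarity
print((inf * (mask - 1)).to(torch.float16))  # tensor([0., -inf], dtype=torch.float16)

# A constant within the fp16 range (or computing the bias in fp32) stays finite.
print((6e4 * (mask - 1)).to(torch.float16))  # tensor([0., -60000.], dtype=torch.float16)
```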
Edit: if I don't enable deepspeed, it works fine with fp16 training on a V100 GPU, but otherwise it gives NaN losses on an RTX 2080.
1e9 is just the default; it should be overridden by the config file when you enable `--precision 16`. I've been focusing on bfloat16 training and implementing Multimer for the past couple of months, but I should have the bandwidth to address the FP16 issues soon.
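(The gist of that override, as a hedged sketch; the actual key names and values in OpenFold's config file may differ.)

```python
def pick_inf(precision: int) -> float:
    """Choose a large-but-finite attention bias constant for the given precision.

    1e9 is fine in fp32, but overflows when cast to fp16, so it has to be
    capped below torch.finfo(torch.float16).max (= 65504) at half precision.
    The value 1e4 here is illustrative, not necessarily what the config uses.
    """
    if precision == 16:
        return 1e4
    return 1e9


print(pick_inf(precision=16))  # 10000.0
print(pick_inf(precision=32))  # 1000000000.0
```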
@gahdritz Thanks for the reply! It looks like the overflow is a persistent bug in mixed-precision training and may be related to deepspeed as well. I'll let you know if I figure out a solution.