ZwormZ comments

Results: 9 comments of ZwormZ

Training duration & NaNs during training

> Thanks for getting back to me!
>
> Regarding the NaNs: I haven't been able to investigate further (no access atm, will do once I fix the other issues)...

Frequent loss is NaN & Training Hangs

Thanks for your quick response! I downloaded your code three times: the first was the original version, the second was on December 6, and the third was today when...

Frequent loss is NaN & Training Hangs

Thanks for your reply! About the iteration time: the model does run on GPUs, but according to the monitoring, when using DeepSpeed for multi-card training, the GPUs are occupied but...

Frequent loss is NaN & Training Hangs

Thank you! I tried moving the training dataset to disk and setting the DataLoader workers to 16. The iteration time did get much shorter; it is now about 60s/it using...

Frequent loss is NaN & Training Hangs

Sure, how can I send it to you conveniently? By email? I will send you the original output loss detail after training 26 epochs, and I'm training again and print...

Frequent loss is NaN & Training Hangs

[loss_1000_6cards.log](https://github.com/aqlaboratory/openfold/files/7843646/loss_1000_6cards.log)

Frequent loss is NaN & Training Hangs

Sorry for the late reply. I spent some time retraining on two V100 cards; here is the loss breakdown. [loss_1000_2cards.log](https://github.com/aqlaboratory/openfold/files/7848852/loss_1000_2cards.log) For visualization, I drew the constituent losses diagram for...

Frequent loss is NaN & Training Hangs

Update: after ~4 epochs

![distogram_loss_4epoch](https://user-images.githubusercontent.com/52192467/149062880-4911432d-0ea6-4948-94a3-52639fab2497.png)
![fape_loss_4epoch](https://user-images.githubusercontent.com/52192467/149062882-694d3e98-822e-472c-b1c5-6edd81dd61a5.png)
![lddt_loss_4epoch](https://user-images.githubusercontent.com/52192467/149062884-73a62b47-37d5-46c1-9c5c-81dd22e0b9a1.png)
![masked_msa_loss_4epoch](https://user-images.githubusercontent.com/52192467/149062887-de22bd22-bbbb-4077-9c25-c6b217939151.png)
![supervised_chi_loss_4epoch](https://user-images.githubusercontent.com/52192467/149062890-3919fa33-81ee-48c3-9ab9-97e5f4d8b618.png)
![cumulative_loss_4eopch](https://user-images.githubusercontent.com/52192467/149062892-ea230408-9234-42c6-a0bd-02dc41b86b4f.png)
![loss_4epoch](https://user-images.githubusercontent.com/52192467/149062897-0b3f1d1b-718d-4cbd-a983-4ea17abd0dcf.png)

Frequent loss is NaN & Training Hangs

I'd like to test whether it's the data that makes the iteration times in my training process so different from yours. It would be great if...