ZwormZ

Results 9 comments of ZwormZ

> Thanks for getting back to me!
>
> Regarding the NaNs: I haven't been able to investigate further (no access atm, will do once I fix the other issues)...

Thanks for your quick response! I downloaded your code three times: the first was the original version, the second was on December 6, and the third was today when...

Thanks for your reply! About the iteration time: the model does run on GPUs, but according to the monitoring, when using DeepSpeed for multi-card training, the GPUs are occupied but...

Thank you! I tried moving the training dataset to disk and setting the DataLoader workers to 16. The iteration time did get much shorter; it is now about 60s/it using...

Sure, how can I conveniently send it to you? By email? I will send you the original detailed loss output after training for 26 epochs, and I'm training again and print...

[loss_1000_6cards.log](https://github.com/aqlaboratory/openfold/files/7843646/loss_1000_6cards.log)

Sorry for the late reply. I spent some time retraining on two V100 cards; here is the loss breakdown. [loss_1000_2cards.log](https://github.com/aqlaboratory/openfold/files/7848852/loss_1000_2cards.log) For visualization, I drew a diagram of the constituent losses for...

Update: after ~4 epochs ![distogram_loss_4epoch](https://user-images.githubusercontent.com/52192467/149062880-4911432d-0ea6-4948-94a3-52639fab2497.png) ![fape_loss_4epoch](https://user-images.githubusercontent.com/52192467/149062882-694d3e98-822e-472c-b1c5-6edd81dd61a5.png) ![lddt_loss_4epoch](https://user-images.githubusercontent.com/52192467/149062884-73a62b47-37d5-46c1-9c5c-81dd22e0b9a1.png) ![masked_msa_loss_4epoch](https://user-images.githubusercontent.com/52192467/149062887-de22bd22-bbbb-4077-9c25-c6b217939151.png) ![supervised_chi_loss_4epoch](https://user-images.githubusercontent.com/52192467/149062890-3919fa33-81ee-48c3-9ab9-97e5f4d8b618.png) ![cumulative_loss_4eopch](https://user-images.githubusercontent.com/52192467/149062892-ea230408-9234-42c6-a0bd-02dc41b86b4f.png) ![loss_4epoch](https://user-images.githubusercontent.com/52192467/149062897-0b3f1d1b-718d-4cbd-a983-4ea17abd0dcf.png)

I'd like to test whether it's the data that makes the iteration times in my training process so different from yours. It would be great if...