Jonathan Schmidt
Thank you very much. I will see how it goes.
As usual, other things got in the way, but I could finally test it. Running tests/integration/test_train_horovod.py worked. I also confirmed that normal training on a GPU worked (nequip-train configs/minimal.yaml). Now...
Thank you, that fixed it for one GPU: horovodrun -np 1 nequip-train configs/example.yaml --horovod works now. If I use two GPUs, we get an error message, as some tensors during...
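For reference, this is roughly the pattern I would expect each Horovod worker to follow; it is a minimal, generic Horovod/PyTorch sketch with a placeholder model and random data, not the actual nequip training code:

```python
# Minimal, generic Horovod + PyTorch sketch (placeholder model/data, not the nequip code).
import torch
import horovod.torch as hvd

hvd.init()                                  # horovodrun starts one process per GPU
torch.cuda.set_device(hvd.local_rank())     # pin each process to its own GPU

model = torch.nn.Linear(16, 1).cuda()       # placeholder model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Average gradients across all ranks on every optimizer step.
optimizer = hvd.DistributedOptimizer(optimizer, named_parameters=model.named_parameters())

# Start all ranks from identical weights and optimizer state.
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)

for step in range(10):
    x, y = torch.randn(8, 16).cuda(), torch.randn(8, 1).cuda()  # dummy per-rank batch
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()
    optimizer.step()
```

Launched with something like horovodrun -np 2 python train_sketch.py (the script name is just an example), each rank trains on its own batch and gradients are averaged across GPUs.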
That would be great. I will also try to find the time to look into it, but I think I will need some time to understand the whole codebase.
I thought reviving the issues might be more convenient than continuing by email, so here are some quick notes on issues I noticed when testing the ddp branch. - Every process...
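For context, this is the generic launch pattern I assume the ddp branch follows (placeholder model, hypothetical script; not the branch's actual code):

```python
# Generic torch.distributed / DDP sketch (assumptions only, not the ddp branch itself).
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK, LOCAL_RANK and WORLD_SIZE for every process it spawns.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(16, 1).cuda()          # placeholder model
    model = DDP(model, device_ids=[local_rank])    # gradients are synchronized on backward()

    # ... usual training loop; each rank should see its own shard of the data ...

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Launched with e.g. torchrun --nproc_per_node=2 train_sketch.py.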
Out-of-memory errors could make sense and might be connected to the last issue, since with the same batch size per GPU I did not produce OOM errors when...
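To make the suspected connection explicit, a back-of-the-envelope sketch (every number is invented for illustration): if extra full training processes end up on the same GPU, memory scales with the number of processes, so the same per-GPU batch size can suddenly run out of memory.

```python
# Hypothetical illustration only; every number here is made up.
per_process_model_mb = 1500   # weights + optimizer state + CUDA context per process
per_sample_mb = 200           # activation memory per sample
batch_per_gpu = 5

def gpu_memory_mb(processes_on_gpu: int) -> int:
    """Rough footprint if several full training processes share one GPU."""
    return processes_on_gpu * (per_process_model_mb + batch_per_gpu * per_sample_mb)

print(gpu_memory_mb(1))   # 2500 MB: fine on a 16 GB card
print(gpu_memory_mb(4))   # 10000 MB: same per-GPU batch size, much closer to OOM
```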
I spent some time debugging the issue, and it seems that the metrics.gather and loss.gather calls cause the extra processes to spawn. If I remove these calls, it's only one...
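To illustrate the kind of guard I mean, a sketch of a gather that becomes a no-op outside a distributed run; gather_tensor is a hypothetical helper, not the actual metrics.gather/loss.gather implementation, and I am only assuming torch.distributed-style collectives under the hood:

```python
# Hedged sketch of a guarded gather; `gather_tensor` is hypothetical, not the real code.
import torch
import torch.distributed as dist

def gather_tensor(t: torch.Tensor) -> torch.Tensor:
    """Concatenate `t` across ranks, but become a no-op in a single-process run."""
    if not (dist.is_available() and dist.is_initialized()) or dist.get_world_size() == 1:
        return t  # nothing to gather, and no process group gets touched
    gathered = [torch.empty_like(t) for _ in range(dist.get_world_size())]
    dist.all_gather(gathered, t)
    return torch.cat(gathered, dim=0)
```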
Just a small update: as I had access to a different cluster with Horovod, I tested the horovod branch again, and with the fixed runstats version and a few small...
The dataset processing only seems to happen in one process for me, so I only get the message once. Anyway, if that is causing problems for you, it might work to...
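One common pattern for this, sketched under the assumption that the preprocessing writes a cache other ranks can read afterwards (build_dataset is a hypothetical callable, not the actual nequip dataset code):

```python
# Generic "rank 0 preprocesses first" sketch; `build_dataset` is a hypothetical callable.
import torch.distributed as dist

def load_dataset_rank0_first(build_dataset):
    """Let rank 0 preprocess and cache the dataset; other ranks wait, then read the cache."""
    distributed = dist.is_available() and dist.is_initialized()
    if distributed and dist.get_rank() != 0:
        dist.barrier()           # wait until rank 0 has finished writing the cache
    dataset = build_dataset()    # rank 0 does the work; the other ranks hit cached files
    if distributed and dist.get_rank() == 0:
        dist.barrier()           # release the waiting ranks
    return dataset
```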
We are also interested in training on some larger datasets. What is the current state of distributed training?