Jonathan Schmidt
Thank you very much. I will see how it goes.
As usual, other things got in the way, but I could finally test it. Running tests/integration/test_train_horovod.py worked. I also confirmed that normal training on a GPU worked (nequip-train configs/minimal.yaml). Now...
Thank you, that fixed it for one GPU: horovodrun -np 1 nequip-train configs/example.yaml --horovod works now. If I use two GPUs, we get an error message, as some tensors during...
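For reference, this is roughly the pattern I would expect each Horovod worker to follow; it is a minimal, generic Horovod/PyTorch sketch with a placeholder model and random data, not the actual nequip training code:

```python
# Minimal, generic Horovod + PyTorch sketch (placeholder model/data, not the nequip code).
import torch
import horovod.torch as hvd

hvd.init()                                  # horovodrun starts one process per GPU
torch.cuda.set_device(hvd.local_rank())     # pin each process to its own GPU

model = torch.nn.Linear(16, 1).cuda()       # placeholder model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Average gradients across all ranks on every optimizer step.
optimizer = hvd.DistributedOptimizer(optimizer, named_parameters=model.named_parameters())

# Start all ranks from identical weights and optimizer state.
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)

for step in range(10):
    x, y = torch.randn(8, 16).cuda(), torch.randn(8, 1).cuda()  # dummy per-rank batch
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()
    optimizer.step()
```

Launched with something like horovodrun -np 2 python train_sketch.py (the script name is just an example), each rank trains on its own batch and gradients are averaged across GPUs.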
That would be great. I will also try to find the time to look into it, but I think I will need some time to understand the whole codebase.
I thought reviving the issues might be more convenient than continuing by email, so here are some quick notes on issues I noticed when testing the ddp branch. - Every process...
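For context, this is the generic launch pattern I assume the ddp branch follows (placeholder model, hypothetical script; not the branch's actual code):

```python
# Generic torch.distributed / DDP sketch (assumptions only, not the ddp branch itself).
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK, LOCAL_RANK and WORLD_SIZE for every process it spawns.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(16, 1).cuda()          # placeholder model
    model = DDP(model, device_ids=[local_rank])    # gradients are synchronized on backward()

    # ... usual training loop; each rank should see its own shard of the data ...

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Launched with e.g. torchrun --nproc_per_node=2 train_sketch.py.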
Out-of-memory errors could make sense and might be connected to the last issue, since with the same batch size per GPU I did not produce OOM errors when...
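To make the suspected connection explicit, a back-of-the-envelope sketch (every number is invented for illustration): if extra full training processes end up on the same GPU, memory scales with the number of processes, so the same per-GPU batch size can suddenly run out of memory.

```python
# Hypothetical illustration only; every number here is made up.
per_process_model_mb = 1500   # weights + optimizer state + CUDA context per process
per_sample_mb = 200           # activation memory per sample
batch_per_gpu = 5

def gpu_memory_mb(processes_on_gpu: int) -> int:
    """Rough footprint if several full training processes share one GPU."""
    return processes_on_gpu * (per_process_model_mb + batch_per_gpu * per_sample_mb)

print(gpu_memory_mb(1))   # 2500 MB: fine on a 16 GB card
print(gpu_memory_mb(4))   # 10000 MB: same per-GPU batch size, much closer to OOM
```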
I spent some time debugging the issue, and it seems that the metrics.gather and loss.gather calls cause the extra processes to spawn. If I remove these calls, it's only one...
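To illustrate the kind of guard I mean, a sketch of a gather that becomes a no-op outside a distributed run; gather_tensor is a hypothetical helper, not the actual metrics.gather/loss.gather implementation, and I am only assuming torch.distributed-style collectives under the hood:

```python
# Hedged sketch of a guarded gather; `gather_tensor` is hypothetical, not the real code.
import torch
import torch.distributed as dist

def gather_tensor(t: torch.Tensor) -> torch.Tensor:
    """Concatenate `t` across ranks, but become a no-op in a single-process run."""
    if not (dist.is_available() and dist.is_initialized()) or dist.get_world_size() == 1:
        return t  # nothing to gather, and no process group gets touched
    gathered = [torch.empty_like(t) for _ in range(dist.get_world_size())]
    dist.all_gather(gathered, t)
    return torch.cat(gathered, dim=0)
```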
Just a small update: as I had access to a different cluster with Horovod, I tested the horovod branch again, and with the fixed runstats version and a few small...
The dataset processing only seems to happen in one process for me, so I only get the message once. Anyway, if that is causing problems for you, it might work to...
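One common pattern for this, sketched under the assumption that the preprocessing writes a cache other ranks can read afterwards (build_dataset is a hypothetical callable, not the actual nequip dataset code):

```python
# Generic "rank 0 preprocesses first" sketch; `build_dataset` is a hypothetical callable.
import torch.distributed as dist

def load_dataset_rank0_first(build_dataset):
    """Let rank 0 preprocess and cache the dataset; other ranks wait, then read the cache."""
    distributed = dist.is_available() and dist.is_initialized()
    if distributed and dist.get_rank() != 0:
        dist.barrier()           # wait until rank 0 has finished writing the cache
    dataset = build_dataset()    # rank 0 does the work; the other ranks hit cached files
    if distributed and dist.get_rank() == 0:
        dist.barrier()           # release the waiting ranks
    return dataset
```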
We are also interested in training on some larger datasets. What is the current state of distributed training?