[QUESTION] Does multi-GPU support exist?
We are interested in training nequip potentials on large datasets of several million structures. Consequently, we wanted to know whether multi-GPU support exists, or whether someone knows if the networks can be integrated into PyTorch Lightning.
Best regards and thank you very much,
Jonathan
PS: this might be related to #126
Hi @JonathanSchmidt1 ,
Thanks for your interest in our code/method for your project! Sounds like an interesting application; please feel free to get in touch by email and let us know how it's going (we're always interested to hear about what people are working on using our methods).
Re multi-GPU training: I have a draft branch, horovod, using the Horovod distributed training framework. This is an in-progress draft and has only been successfully tested so far for a few epochs on multiple CPUs. The branch is also a little out of sync with the latest version, but I will try to merge that back in in the coming days. If you are interested, you are more than welcome to use this branch, with the understanding that you would be acting as a sort of "alpha tester." If you do use the branch, please carefully check any results you get for sanity and against results with Horovod disabled, and also please report any issues/suspicions here or by email. (One disclaimer is that the horovod branch is not a development priority for us this summer, and I will likely be slow to respond.) PRs are also welcome, though I appreciate people reaching out to discuss first if the PR involves major development or restructuring.
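For anyone curious what the Horovod approach involves, the general pattern looks roughly like the sketch below. This is illustrative only, not the contents of the branch; `build_model()` and `dataset` are placeholders rather than nequip functions.

```python
# Sketch of the generic Horovod training pattern (placeholder names, not nequip code).
import torch
import horovod.torch as hvd

hvd.init()
torch.cuda.set_device(hvd.local_rank())  # pin this worker process to one GPU

model = build_model().cuda()             # placeholder for the actual model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

# Average gradients across workers and start everyone from the same weights.
optimizer = hvd.DistributedOptimizer(
    optimizer, named_parameters=model.named_parameters()
)
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)

# Each worker trains on its own shard of the data.
sampler = torch.utils.data.distributed.DistributedSampler(
    dataset, num_replicas=hvd.size(), rank=hvd.rank()
)
```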
PyTorch Lightning is a lot more difficult to integrate with. Getting a simple training loop going would be easy, but it would use a different configuration file, and integrating it with the full set of important nequip features, such as correctly calculated and averaged metrics, careful data normalization, EMA, correct global numerical precision and JIT settings, etc., would be difficult and involve a lot of subtle stumbling blocks we have already dealt with in the nequip code. For this reason I would really recommend against this path unless you want to deal carefully with all of this. (If you do, of course, it would be great if you could share that work!)
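For context, a bare Lightning wrapper would be roughly as small as the hypothetical sketch below, which is exactly why it covers none of the features listed above. The "forces" dictionary keys and the loss used here are assumptions for illustration, not nequip API.

```python
import pytorch_lightning as pl
import torch

class LitNequIP(pl.LightningModule):
    """Hypothetical sketch; none of the nequip-specific handling
    (metric averaging, normalization, EMA, precision/JIT) lives here."""

    def __init__(self, model):
        super().__init__()
        self.model = model  # assumed to map a batch dict to an output dict

    def training_step(self, batch, batch_idx):
        out = self.model(batch)
        # Assumes both dicts carry a "forces" entry; placeholder loss.
        loss = torch.nn.functional.mse_loss(out["forces"], batch["forces"])
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-2)

# trainer = pl.Trainer(accelerator="gpu", devices=4, strategy="ddp")
# trainer.fit(LitNequIP(model), train_dataloader)
```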
Thanks!
OK, I've merged the latest develop into horovod; see https://github.com/mir-group/nequip/pull/211.
If you try this, please run the Horovod unit tests in tests/integration/test_train_horovod.py and confirm that they (1) are not skipped (i.e., Horovod is installed) and (2) pass.
Thank you very much. I will see how it goes.
As usual, other things got in the way, but I could finally test it. Running tests/integration/test_train_horovod.py worked. I also confirmed that normal training on a GPU works (nequip-train configs/minimal.yaml).
Now if I run with --horovod, training for the first epoch seems fine, but there is a problem with the metrics. I checked the torch_runstats library and could not find any get_state; are you maybe using a modified version?
Epoch batch loss loss_f f_mae f_rmse
0 1 1.06 1.06 24.3 32.5
Traceback (most recent call last):
File "/home/test_user/.conda/envs/nequip2/bin/nequip-train", line 33, in
Hi @JonathanSchmidt1 ,
Surprised that the tests run if the training won't... that sounds like a sign that the tests are broken 😄
Whoops, yes, I forgot to mention: I haven't merged the code I was writing to enable multi-GPU training in torch_runstats yet; you can find it on the branch https://github.com/mir-group/pytorch_runstats/tree/state-reduce.
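Roughly speaking, the idea behind that kind of state reduction is that each worker accumulates partial statistics locally, and the partial sums and counts are combined across workers before a mean is reported. The sketch below is illustrative only and is not the actual torch_runstats / state-reduce API.

```python
# Illustrative running-mean reduction across Horovod workers (not torch_runstats code).
import torch
import horovod.torch as hvd

class RunningMean:
    def __init__(self, dim: int):
        self._state = torch.zeros(dim)   # running sum of samples on this worker
        self._n = torch.zeros(1)         # number of samples seen on this worker

    def accumulate(self, batch: torch.Tensor):
        self._state += batch.sum(dim=0)
        self._n += batch.shape[0]

    def reduce(self) -> torch.Tensor:
        # Sum the partial states from all workers, then normalize.
        total_state = hvd.allreduce(self._state, op=hvd.Sum)
        total_n = hvd.allreduce(self._n, op=hvd.Sum)
        return total_state / total_n
```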
Thank you, that fixed it for one GPU:
horovodrun -np 1 nequip-train configs/example.yaml --horovod
works now.
If I use two GPUs, I get an error because some tensors are on the wrong devices during the metric evaluation.
File "/raid/scratch/testuser/nequip/nequip/train/trainer.py", line 993, in epoch_step
[1,0]
I checked, and "n" and "state" are on cuda:1 while "self._state" and "self._n" are on cuda:0. I'm not sure how it's supposed to be: are they all expected to be on cuda:0 for this step, or should each be on its own GPU?
Aha... here's that "this is very untested" 😁 I think PyTorch / Horovod may be too smart for its own good, reloading transmitted tensors onto different CUDA devices when they are all available to the same host... I will look into this when I get a chance.
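One defensive pattern for this kind of mismatch is simply to move whatever arrives from another worker onto the accumulator's device before combining. The sketch below is only an illustration of that idea; the attribute names follow the discussion above, but the class itself is hypothetical and not torch_runstats code.

```python
import torch

class GatheredState:
    """Hypothetical accumulator that tolerates gathered tensors on other devices."""

    def __init__(self, dim: int, device: torch.device):
        self._state = torch.zeros(dim, device=device)
        self._n = torch.zeros(1, device=device)

    def accumulate_gathered(self, state: torch.Tensor, n: torch.Tensor):
        # Tensors received from another rank may live on a different CUDA device.
        self._state += state.to(self._state.device)
        self._n += n.to(self._n.device)
```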
That would be great. I will also try to find the time to look into it, but I think I will need some time to understand the whole codebase.
I thought reviving the issue might be more convenient than continuing by email. So here are some quick notes on issues I noticed when testing the ddp branch.
- Every process seems to get its own wandb log. It's not possible to restart, because wandb finds an existing run in each process and then crashes.
- Sometimes there is a random crash after a few hundred epochs; I have no idea yet why, and it was not reproducible:
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 215968 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 215970 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 215971 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -15) local_rank: 1 (pid: 215969) of binary: /home/test_user/.conda/envs/nequip2/bin/python
Traceback (most recent call last):
  File "/home/test_user/.conda/envs/nequip2/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/test_user/.conda/envs/nequip2/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/home/test_user/.conda/envs/nequip2/lib/python3.9/site-packages/torch/distributed/run.py", line 719, in main
    run(args)
  File "/home/test_user/.conda/envs/nequip2/lib/python3.9/site-packages/torch/distributed/run.py", line 710, in run
    elastic_launch(
  File "/home/test_user/.conda/envs/nequip2/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/test_user/.conda/envs/nequip2/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
/home/test_user/.conda/envs/nequip2/bin/nequip-train FAILED
Failures:
  <NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
  time      : 2023-03-21_21:38:56
  host      : dgx2
  rank      : 1 (local_rank: 1)
  exitcode  : -15 (pid: 215969)
  error_file: <N/A>
  traceback : Signal 15 (SIGTERM) received by PID 215969
/home/test_user/.conda/envs/nequip2/lib/python3.9/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 6 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d ')
(this warning appeared four times)
- At the moment each process seems to load the network on each GPU, e.g. running with 8 GPUs I get this output from nvidia-smi:
|   0   N/A  N/A    804401      C   ...a/envs/nequip2/bin/python    18145MiB |
|   0   N/A  N/A    804402      C   ...a/envs/nequip2/bin/python     1499MiB |
|   0   N/A  N/A    804403      C   ...a/envs/nequip2/bin/python     1499MiB |
|   0   N/A  N/A    804404      C   ...a/envs/nequip2/bin/python     1499MiB |
|   0   N/A  N/A    804405      C   ...a/envs/nequip2/bin/python     1499MiB |
|   0   N/A  N/A    804406      C   ...a/envs/nequip2/bin/python     1499MiB |
|   0   N/A  N/A    804407      C   ...a/envs/nequip2/bin/python     1499MiB |
|   0   N/A  N/A    804408      C   ...a/envs/nequip2/bin/python     1499MiB |
|   1   N/A  N/A    804401      C   ...a/envs/nequip2/bin/python     1499MiB |
|   1   N/A  N/A    804402      C   ...a/envs/nequip2/bin/python    19101MiB |
|   1   N/A  N/A    804403      C   ...a/envs/nequip2/bin/python     1499MiB |
|   1   N/A  N/A    804404      C   ...a/envs/nequip2/bin/python     1499MiB |
|   1   N/A  N/A    804405      C   ...a/envs/nequip2/bin/python     1499MiB |
|   1   N/A  N/A    804406      C   ...a/envs/nequip2/bin/python     1499MiB |
|   1   N/A  N/A    804407      C   ...a/envs/nequip2/bin/python     1499MiB |
|   1   N/A  N/A    804408      C   ...a/envs/nequip2/bin/python     1499MiB |
|   2   N/A  N/A    804401      C   ...a/envs/nequip2/bin/python     1499MiB |
|   2   N/A  N/A    804402      C   ...a/envs/nequip2/bin/python     1499MiB |
|   2   N/A  N/A    804403      C   ...a/envs/nequip2/bin/python    17937MiB |
|   2   N/A  N/A    804404      C   ...a/envs/nequip2/bin/python     1499MiB |
|   2   N/A  N/A    804405      C   ...a/envs/nequip2/bin/python     1499MiB |
|   2   N/A  N/A    804406      C   ...a/envs/nequip2/bin/python     1499MiB |
|   2   N/A  N/A    804407      C   ...a/envs/nequip2/bin/python     1499MiB |
|   2   N/A  N/A    804408      C   ...a/envs/nequip2/bin/python     1499MiB |
|   3   N/A  N/A    804401      C   ...a/envs/nequip2/bin/python     1499MiB |
|   3   N/A  N/A    804402      C   ...a/envs/nequip2/bin/python     1499MiB |
|   3   N/A  N/A    804403      C   ...a/envs/nequip2/bin/python     1499MiB |
......
Hi @JonathanSchmidt1 ,
Thanks!
Every process seems to get its own wandb log. It's not possible to restart because wandb finds an existing run in each process and then crashes.
Hm, yes... this one will be a little nontrivial, since we need to not only prevent wandb init on the other ranks but probably also sync the wandb-updated config to the nonzero ranks.
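A rough sketch of what that could look like is below. This is not current nequip code: the project name is a placeholder, and it assumes the torch.distributed process group has already been initialized.

```python
# Sketch: only rank 0 talks to wandb; the (possibly wandb-updated) config
# dict is then broadcast to the other ranks.
import torch.distributed as dist
import wandb

def init_wandb_ddp(config: dict) -> dict:
    if dist.get_rank() == 0:
        run = wandb.init(project="nequip", config=config)  # project name is a placeholder
        config = run.config.as_dict()
    payload = [config]
    # broadcast_object_list sends the rank-0 object to all other ranks in place.
    dist.broadcast_object_list(payload, src=0)
    return payload[0]
```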
Sometimes random crash after a few 100 epochs have no idea yet why. Was also not reproducible.
Weird... usually when we see something like this it means out-of-memory, or that the cluster's scheduler went crazy.
At the moment each process seems to load the network on each gpu e.g. running with 8 gpus I get this output from nvidia-smi:
Not sure exactly what I'm looking at here, but yes, every GPU will get its own copy of the model, as hinted by the name "Distributed Data Parallel".
Out-of-memory errors could make sense and might be connected to the last issue, since with the same per-GPU batch size I did not get OOM errors when running on a single GPU.
The output basically says that each worker process uses memory (most likely a copy of the model) on every GPU; however, with DDP each worker is supposed to have a copy only on its own GPU, and gradient updates are then sent all-to-all. From previous experience with DDP, I would expect the output to look like this:
|   0   N/A  N/A    804401      C   ...a/envs/nequip2/bin/python    18145MiB |
|   1   N/A  N/A    804402      C   ...a/envs/nequip2/bin/python    19101MiB |
|   2   N/A  N/A    804403      C   ...a/envs/nequip2/bin/python    17937MiB |
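For reference, the usual way to keep each DDP worker confined to a single GPU (generic PyTorch practice, not a statement about what the ddp branch currently does) is to pin the device from LOCAL_RANK before anything touches CUDA and to avoid the bare "cuda" device afterwards:

```python
import os
import torch
import torch.distributed as dist

local_rank = int(os.environ["LOCAL_RANK"])   # set by torchrun
torch.cuda.set_device(local_rank)            # pin this process to one GPU
dist.init_process_group(backend="nccl")

device = torch.device("cuda", local_rank)
# model = model.to(device)
# model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])
```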
I'd also be very interested in this feature. I have access to a system with four A100s on each node. Being able to use all four would make training go a lot faster.
I spent some time debugging the issue, and it seems that the metrics.gather and loss.gather calls cause the extra processes to spawn. If I remove these calls, there is only one process per GPU and I can scale to 16 GPUs (before, it would run OOM because of the extra processes). However, continuing the training after stopping still somehow causes extra processes to spawn, but only on the zeroth GPU.
Hi all,
Any updates on this feature? I also have some rather large datasets.
Just a small update. As I had access to a different cluster with Horovod, I tested the horovod branch again, and with the fixed runstats version and a few small changes it ran without the issues of the ddp version. I also got decent speedups, despite using single-GPU nodes:
N_nodes (1 P100 per node): 1, 2, 4, 8, 16, 32
Speedup (rounded): 1.0, 1.63, 3.39, 6.64, 9.57, 17.38
PS: I have not yet confirmed whether the loss is the same for different node counts with Horovod.
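For what it's worth, the parallel efficiency implied by those rounded numbers is easy to compute:

```python
# Parallel efficiency = speedup / number of nodes, from the reported (rounded) values.
nodes = [1, 2, 4, 8, 16, 32]
speedup = [1.0, 1.63, 3.39, 6.64, 9.57, 17.38]
for n, s in zip(nodes, speedup):
    print(f"{n:3d} nodes: speedup {s:6.2f}, efficiency {s / n:.2f}")
```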
Hi @JonathanSchmidt1,
Did you also receive a message like this when using the horovod branch on 2 GPUs:
[1,0]<stderr>:Processing dataset...
[1,1]<stderr>:Processing dataset...
The dataset processing only seems to happen in one process for me, so I only get the message once. Anyway, if that is causing problems for you, it might work to process the dataset beforehand and then start the training.
PS: I have tested some of the models now, and the loss reported during training seems correct.
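A rough sketch of that "preprocess first" idea is below: let rank 0 build the cached dataset and make the other ranks wait before reading it. `build_dataset` is a placeholder, not a nequip function.

```python
import torch
import horovod.torch as hvd

hvd.init()

if hvd.rank() == 0:
    dataset = build_dataset()          # writes the processed cache to disk (placeholder)
# A cheap allreduce acts as a barrier, so ranks > 0 only read the cache
# after rank 0 has finished writing it.
hvd.allreduce(torch.zeros(1), name="dataset_barrier")
if hvd.rank() != 0:
    dataset = build_dataset()          # now just loads the existing cache
```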
Hi,
I am also quite interested in the multi-GPU training capability. I did some tests with the ddp branch using PyTorch 2.1.1 on up to 16 GPUs (4 V100 per node) on a dataset with ~5k configurations. In all my tests I achieved the same results as a single-GPU reference. I was wondering whether this feature is still under active development and whether there is any plan to merge it into the develop branch?
Hi @sklenard,
I am trying to use the multi-GPU feature, but I have some trouble with it.
I installed the ddp branch with PyTorch 2.1.1 by changing
"torch>=1.8,<=1.12,!=1.9.0", # torch.fx added in 1.8
to
"torch>=1.8,<=2.1.1,!=1.9.0", # torch.fx added in 1.8
in setup.py in the nequip folder.
This way, the ddp branch can be installed without any error. However, when I try to run nequip-train, I get this error:
[W init.cpp:842] Warning: Use _jit_set_fusion_strategy, bailout depth is deprecated. Setting to (STATIC, 2) (function operator())
Traceback (most recent call last):
File "/global/homes/.local/Miniconda3/envs/nequip-ddp/bin/nequip-train", line 8, in <module>
sys.exit(main())
File "/global/homes/.local/Miniconda3/envs/nequip-ddp/lib/python3.10/site-packages/nequip/scripts/train.py", line 76, in main
trainer = fresh_start(config)
File "/global/homes/.local/Miniconda3/envs/nequip-ddp/lib/python3.10/site-packages/nequip/scripts/train.py", line 189, in fresh_start
config = init_n_update(config)
File "/global/homes/.local/Miniconda3/envs/nequip-ddp/lib/python3.10/site-packages/nequip/utils/wandb.py", line 17, in init_n_update
wandb.init(
File "/global/homes/.local/Miniconda3/envs/nequip-ddp/lib/python3.10/site-packages/wandb/sdk/wandb_init.py", line 1200, in init
raise e
File "/global/homes/.local/Miniconda3/envs/nequip-ddp/lib/python3.10/site-packages/wandb/sdk/wandb_init.py", line 1177, in init
wi.setup(kwargs)
File "/global/homes/.local/Miniconda3/envs/nequip-ddp/lib/python3.10/site-packages/wandb/sdk/wandb_init.py", line 190, in setup
self._wl = wandb_setup.setup(settings=setup_settings)
File "/global/homes/.local/Miniconda3/envs/nequip-ddp/lib/python3.10/site-packages/wandb/sdk/wandb_setup.py", line 327, in setup
ret = _setup(settings=settings)
File "/global/homes/.local/Miniconda3/envs/nequip-ddp/lib/python3.10/site-packages/wandb/sdk/wandb_setup.py", line 320, in _setup
wl = _WandbSetup(settings=settings)
File "/global/homes/.local/Miniconda3/envs/nequip-ddp/lib/python3.10/site-packages/wandb/sdk/wandb_setup.py", line 303, in __init__
_WandbSetup._instance = _WandbSetup__WandbSetup(settings=settings, pid=pid)
File "/global/homes/.local/Miniconda3/envs/nequip-ddp/lib/python3.10/site-packages/wandb/sdk/wandb_setup.py", line 114, in __init__
self._setup()
File "/global/homes/.local/Miniconda3/envs/nequip-ddp/lib/python3.10/site-packages/wandb/sdk/wandb_setup.py", line 250, in _setup
self._setup_manager()
File "/global/homes/.local/Miniconda3/envs/nequip-ddp/lib/python3.10/site-packages/wandb/sdk/wandb_setup.py", line 277, in _setup_manager
self._manager = wandb_manager._Manager(settings=self._settings)
File "/global/homes/.local/Miniconda3/envs/nequip-ddp/lib/python3.10/site-packages/wandb/sdk/wandb_manager.py", line 139, in __init__
self._service.start()
File "/global/homes/.local/Miniconda3/envs/nequip-ddp/lib/python3.10/site-packages/wandb/sdk/service/service.py", line 250, in start
self._launch_server()
File "/global/homes/.local/Miniconda3/envs/nequip-ddp/lib/python3.10/site-packages/wandb/sdk/service/service.py", line 244, in _launch_server
_sentry.reraise(e)
File "/global/homes/.local/Miniconda3/envs/nequip-ddp/lib/python3.10/site-packages/wandb/analytics/sentry.py", line 154, in reraise
raise exc.with_traceback(sys.exc_info()[2])
File "/global/homes/.local/Miniconda3/envs/nequip-ddp/lib/python3.10/site-packages/wandb/sdk/service/service.py", line 242, in _launch_server
self._wait_for_ports(fname, proc=internal_proc)
File "/global/homes/.local/Miniconda3/envs/nequip-ddp/lib/python3.10/site-packages/wandb/sdk/service/service.py", line 132, in _wait_for_ports
raise ServiceStartTimeoutError(
wandb.sdk.service.service.ServiceStartTimeoutError: Timed out waiting for wandb service to start after 30.0 seconds. Try increasing the timeout with the `_service_wait` setting.
It seems that there is something wrong with wandb. I wonder how you installed this branch; maybe there is some difference between the version you installed and the one I installed, since more than two months have passed. It would be great if you could recall how you installed it, or share the version you installed. Thank you very much!
@beidouamg this looks like a network error unrelated to the ddp branch, but maybe there is a race condition. Have you tried to run without wandb enabled?
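Two quick ways to test that (hedged, since the exact config key may differ between nequip versions): turn wandb off in the training YAML (e.g. wandb: false, if your version exposes that key), or make the wandb client a no-op via its environment switch before launching:

```python
# Option 2: disable the wandb client entirely for this run. WANDB_MODE is a
# standard wandb environment variable; set it before nequip-train / wandb.init runs.
import os
os.environ["WANDB_MODE"] = "disabled"
```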