
Multi-GPU support exists❓ [QUESTION]

JonathanSchmidt1 opened this issue 2 years ago • 25 comments

We are interested in training nequip potentials on large datasets of several million structures. Consequently, we wanted to know whether multi-GPU support exists, or if someone knows whether the networks can be integrated into PyTorch Lightning.

Best regards and thank you very much,
Jonathan

PS: this might be related to #126

JonathanSchmidt1 avatar May 11 '22 14:05 JonathanSchmidt1

Hi @JonathanSchmidt1 ,

Thanks for your interest in our code/method for your project! Sounds like an interesting application; please feel free to get in touch by email and let us know how it's going (we're always interested to hear about what people are working on using our methods).

Re multi-GPU training: I have a draft branch horovod using the Horovod distributed training framework. This is an in-progress draft, and has only been successfully tested so far for a few epochs on multiple CPUs. The branch is also a little out-of-sync with the latest version, but I will try to merge that back in in the coming days. If you are interested, you are more than welcome to use this branch, just understanding that you would be acting as a sort of "alpha tester." If you do use the branch, please carefully check any results you get for sanity and against those with Horovod disabled, and also please report any issues/suspicions here or by email. (One disclaimer is that the horovod branch is not a development priority for us this summer and I will likely be slow to respond.) PRs are also welcome, though I appreciate people reaching out to discuss first if the PR involves major development or restructuring.
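For context, the general shape of Horovod data-parallel training in PyTorch looks roughly like the sketch below (a generic illustration, not the code on the horovod branch; the tiny model and dataset are dummy placeholders):

    import torch
    import horovod.torch as hvd

    hvd.init()  # one process per GPU (or CPU rank)
    if torch.cuda.is_available():
        torch.cuda.set_device(hvd.local_rank())  # pin this rank to its own device

    # Dummy stand-ins so the sketch is self-contained; the real model/dataset differ.
    model = torch.nn.Linear(8, 1)
    if torch.cuda.is_available():
        model = model.cuda()
    dataset = torch.utils.data.TensorDataset(torch.randn(64, 8), torch.randn(64, 1))

    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3 * hvd.size())

    # Average gradients across ranks each step, and start all ranks from identical state.
    optimizer = hvd.DistributedOptimizer(optimizer, named_parameters=model.named_parameters())
    hvd.broadcast_parameters(model.state_dict(), root_rank=0)
    hvd.broadcast_optimizer_state(optimizer, root_rank=0)

    # Each rank trains on a disjoint shard of the data.
    sampler = torch.utils.data.distributed.DistributedSampler(
        dataset, num_replicas=hvd.size(), rank=hvd.rank()
    )
    loader = torch.utils.data.DataLoader(dataset, batch_size=8, sampler=sampler)

    for epoch in range(2):
        sampler.set_epoch(epoch)
        for x, y in loader:
            if torch.cuda.is_available():
                x, y = x.cuda(), y.cuda()
            optimizer.zero_grad()
            loss = torch.nn.functional.mse_loss(model(x), y)
            loss.backward()
            optimizer.step()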

PyTorch Lightning is a lot more difficult to integrate with. Getting a simple training loop going would be easy, but it would use a different configuration file, and integrating it with the full set of important nequip features (correctly calculated and averaged metrics, careful data normalization, EMA, correct global numerical precision and JIT settings, and so on) would be difficult and would involve a lot of subtle stumbling blocks we have already dealt with in the nequip code. For this reason I would really recommend against this path unless you want to deal carefully with all of this. (If you do, of course, it would be great if you could share that work!)
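For concreteness, a bare-bones Lightning wrapper would look something like the sketch below. This is a generic illustration, not a supported nequip integration: the PotentialModule name and dict-style batch are assumptions, and the features listed above (proper metrics, normalization, EMA, precision/JIT handling) are deliberately missing.

    import torch
    import pytorch_lightning as pl

    class PotentialModule(pl.LightningModule):  # hypothetical wrapper, not part of nequip
        def __init__(self, model: torch.nn.Module):
            super().__init__()
            self.model = model

        def training_step(self, batch, batch_idx):
            # A real integration would go through nequip's data pipeline, loss
            # coefficients, metrics, EMA, etc.; this is just a bare force MSE.
            pred = self.model(batch)
            loss = torch.nn.functional.mse_loss(pred["forces"], batch["forces"])
            self.log("loss_f", loss)
            return loss

        def configure_optimizers(self):
            return torch.optim.Adam(self.parameters(), lr=1e-3)

    # Lightning would then provide multi-GPU training via its DDP strategy, e.g.:
    # trainer = pl.Trainer(accelerator="gpu", devices=4, strategy="ddp")
    # trainer.fit(PotentialModule(model), train_dataloaders=train_loader)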

Thanks!

Linux-cpp-lisp avatar May 12 '22 20:05 Linux-cpp-lisp

OK, I've merged the latest develop -> horovod, see https://github.com/mir-group/nequip/pull/211.

Linux-cpp-lisp avatar May 12 '22 21:05 Linux-cpp-lisp

If you try this, please run the Horovod unit tests tests/integration/test_train_horovod.py and confirm that they (1) are not skipped (i.e. horovod is installed) and (2) pass.
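For reference, pytest's -rs flag reports why tests were skipped, so an invocation like the following makes it obvious whether Horovod was actually picked up:

    pytest -rs -v tests/integration/test_train_horovod.py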

Linux-cpp-lisp avatar May 13 '22 01:05 Linux-cpp-lisp

Thank you very much. I will see how it goes.

JonathanSchmidt1 avatar May 16 '22 13:05 JonathanSchmidt1

As usual, other things got in the way, but I could finally test it. Running tests/integration/test_train_horovod.py worked. I also confirmed that normal training on GPU works (nequip-train configs/minimal.yaml).

Now if I run with --horovod, the training of the first epoch seems fine, but there is a problem with the metrics. I checked the torch_runstats lib and could not find any get_state; are you maybe using a modified version?

    Epoch batch         loss       loss_f        f_mae       f_rmse
        0     1         1.06         1.06         24.3         32.5
    Traceback (most recent call last):
      File "/home/test_user/.conda/envs/nequip2/bin/nequip-train", line 33, in <module>
        sys.exit(load_entry_point('nequip', 'console_scripts', 'nequip-train')())
      File "/raid/scratch/testuser/nequip/nequip/scripts/train.py", line 87, in main
        trainer.train()
      File "/raid/scratch/testuser/nequip/nequip/train/trainer.py", line 827, in train
        self.epoch_step()
      File "/raid/scratch/testuser/nequip/nequip/train/trainer.py", line 991, in epoch_step
        self.metrics.gather()
      File "/raid/scratch/testuser/nequip/nequip/train/metrics.py", line 274, in gather
        {
      File "/raid/scratch/testuser/nequip/nequip/train/metrics.py", line 275, in <dictcomp>
        k1: {k2: rs.get_state() for k2, rs in v1.items()}
      File "/raid/scratch/testuser/nequip/nequip/train/metrics.py", line 275, in <dictcomp>
        k1: {k2: rs.get_state() for k2, rs in v1.items()}
    AttributeError: 'RunningStats' object has no attribute 'get_state'

JonathanSchmidt1 avatar Jul 16 '22 15:07 JonathanSchmidt1

Hi @JonathanSchmidt1 ,

Surprised that the tests run if the training won't... that sounds like a sign that the tests are broken 😄

Whoops, yes, I forgot to mention: I haven't merged the code I was writing to enable multi-GPU training in torch_runstats yet; you can find it on the branch https://github.com/mir-group/pytorch_runstats/tree/state-reduce.

Linux-cpp-lisp avatar Jul 16 '22 16:07 Linux-cpp-lisp

Thank you, that fixed it for one GPU: horovodrun -np 1 nequip-train configs/example.yaml --horovod works now. If I use two GPUs we get an error message, as some tensors during the metric evaluation are on the wrong devices:

      File "/raid/scratch/testuser/nequip/nequip/train/trainer.py", line 993, in epoch_step
    [1,0]:    self.metrics.gather()
    [1,0]:  File "/raid/scratch/testuser/nequip/nequip/train/metrics.py", line 288, in gather
    [1,0]:    self.running_stats[k1][k2].accumulate_state(rs_state)
    [1,0]:  File "/home/test_user/.conda/envs/nequip2/lib/python3.9/site-packages/torch_runstats/_runstats.py", line 331, in accumulate_state
    [1,0]:    self._state += n * (state - self._state) / (self._n + n)
    [1,0]:RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0!

I checked, and "n" and "state" are on cuda:1 while "self._state" and "self._n" are on cuda:0. I am not sure how it is supposed to be: are they all expected to be on cuda:0 for this step, or each on their own GPU?

JonathanSchmidt1 avatar Jul 16 '22 22:07 JonathanSchmidt1

Aha... here's that "this is very untested" 😁 I think PyTorch / Horovod may be too smart for its own good and is reloading transmitted tensors onto different CUDA devices when they are all available to the same host... I will look into this when I get a chance.
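For what it's worth, the generic cure for this class of error is to move the incoming state onto the accumulator's device before combining. A minimal sketch of the idea (not the torch_runstats API; combine_running_means is a hypothetical helper):

    import torch

    def combine_running_means(state_a, n_a, state_b, n_b):
        # state_b may arrive on another rank's GPU (e.g. cuda:1); move it to
        # wherever the local accumulator lives before merging the running means.
        state_b = state_b.to(state_a.device)
        n_a = torch.as_tensor(n_a, device=state_a.device, dtype=state_a.dtype)
        n_b = torch.as_tensor(n_b, device=state_a.device, dtype=state_a.dtype)
        merged = state_a + n_b * (state_b - state_a) / (n_a + n_b)
        return merged, n_a + n_b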

Linux-cpp-lisp avatar Jul 17 '22 08:07 Linux-cpp-lisp

That would be great. I will also try to find the time to look into it, but I think I will need some time to understand the whole codebase.

JonathanSchmidt1 avatar Jul 17 '22 19:07 JonathanSchmidt1

I thought reviving the issue might be more convenient than continuing by email. So here are some quick notes about issues I noticed when testing the ddp branch:

  • Every process seems to get its own wandb log. It's not possible to restart because wandb finds an existing run in each process and then crashes.

  • Sometimes there is a random crash after a few hundred epochs; I have no idea why yet, and it was not reproducible:

    WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 215968 closing signal SIGTERM
    WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 215970 closing signal SIGTERM
    WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 215971 closing signal SIGTERM
    ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -15) local_rank: 1 (pid: 215969) of binary: /home/test_user/.conda/envs/nequip2/bin/python
    Traceback (most recent call last):
      File "/home/test_user/.conda/envs/nequip2/bin/torchrun", line 8, in <module>
        sys.exit(main())
      File "/home/test_user/.conda/envs/nequip2/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
        return f(*args, **kwargs)
      File "/home/test_user/.conda/envs/nequip2/lib/python3.9/site-packages/torch/distributed/run.py", line 719, in main
        run(args)
      File "/home/test_user/.conda/envs/nequip2/lib/python3.9/site-packages/torch/distributed/run.py", line 710, in run
        elastic_launch(
      File "/home/test_user/.conda/envs/nequip2/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
        return launch_agent(self._config, self._entrypoint, list(args))
      File "/home/test_user/.conda/envs/nequip2/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
        raise ChildFailedError(
    torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
    /home/test_user/.conda/envs/nequip2/bin/nequip-train FAILED
    Failures:
      <NO_OTHER_FAILURES>
    Root Cause (first observed failure):
    [0]:
      time       : 2023-03-21_21:38:56
      host       : dgx2
      rank       : 1 (local_rank: 1)
      exitcode   : -15 (pid: 215969)
      error_file : <N/A>
      traceback  : Signal 15 (SIGTERM) received by PID 215969
    /home/test_user/.conda/envs/nequip2/lib/python3.9/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 6 leaked semaphore objects to clean up at shutdown
      warnings.warn('resource_tracker: There appear to be %d '
    /home/test_user/.conda/envs/nequip2/lib/python3.9/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 6 leaked semaphore objects to clean up at shutdown
      warnings.warn('resource_tracker: There appear to be %d '
    /home/test_user/.conda/envs/nequip2/lib/python3.9/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 6 leaked semaphore objects to clean up at shutdown
      warnings.warn('resource_tracker: There appear to be %d '
    /home/test_user/.conda/envs/nequip2/lib/python3.9/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 6 leaked semaphore objects to clean up at shutdown
      warnings.warn('resource_tracker: There appear to be %d '

  • At the moment each process seems to load the network onto every GPU, e.g. running with 8 GPUs I get this output from nvidia-smi:

    |  0  N/A  N/A  804401  C  ...a/envs/nequip2/bin/python  18145MiB |
    |  0  N/A  N/A  804402  C  ...a/envs/nequip2/bin/python   1499MiB |
    |  0  N/A  N/A  804403  C  ...a/envs/nequip2/bin/python   1499MiB |
    |  0  N/A  N/A  804404  C  ...a/envs/nequip2/bin/python   1499MiB |
    |  0  N/A  N/A  804405  C  ...a/envs/nequip2/bin/python   1499MiB |
    |  0  N/A  N/A  804406  C  ...a/envs/nequip2/bin/python   1499MiB |
    |  0  N/A  N/A  804407  C  ...a/envs/nequip2/bin/python   1499MiB |
    |  0  N/A  N/A  804408  C  ...a/envs/nequip2/bin/python   1499MiB |
    |  1  N/A  N/A  804401  C  ...a/envs/nequip2/bin/python   1499MiB |
    |  1  N/A  N/A  804402  C  ...a/envs/nequip2/bin/python  19101MiB |
    |  1  N/A  N/A  804403  C  ...a/envs/nequip2/bin/python   1499MiB |
    |  1  N/A  N/A  804404  C  ...a/envs/nequip2/bin/python   1499MiB |
    |  1  N/A  N/A  804405  C  ...a/envs/nequip2/bin/python   1499MiB |
    |  1  N/A  N/A  804406  C  ...a/envs/nequip2/bin/python   1499MiB |
    |  1  N/A  N/A  804407  C  ...a/envs/nequip2/bin/python   1499MiB |
    |  1  N/A  N/A  804408  C  ...a/envs/nequip2/bin/python   1499MiB |
    |  2  N/A  N/A  804401  C  ...a/envs/nequip2/bin/python   1499MiB |
    |  2  N/A  N/A  804402  C  ...a/envs/nequip2/bin/python   1499MiB |
    |  2  N/A  N/A  804403  C  ...a/envs/nequip2/bin/python  17937MiB |
    |  2  N/A  N/A  804404  C  ...a/envs/nequip2/bin/python   1499MiB |
    |  2  N/A  N/A  804405  C  ...a/envs/nequip2/bin/python   1499MiB |
    |  2  N/A  N/A  804406  C  ...a/envs/nequip2/bin/python   1499MiB |
    |  2  N/A  N/A  804407  C  ...a/envs/nequip2/bin/python   1499MiB |
    |  2  N/A  N/A  804408  C  ...a/envs/nequip2/bin/python   1499MiB |
    |  3  N/A  N/A  804401  C  ...a/envs/nequip2/bin/python   1499MiB |
    |  3  N/A  N/A  804402  C  ...a/envs/nequip2/bin/python   1499MiB |
    |  3  N/A  N/A  804403  C  ...a/envs/nequip2/bin/python   1499MiB |
    ......

JonathanSchmidt1 avatar Mar 22 '23 11:03 JonathanSchmidt1

Hi @JonathanSchmidt1 ,

Thanks!

Every process seems to get its own wandb log. It's not possible to restart because wandb finds an existing run in each process and then crashes.

Hm, yes... this one will be a little nontrivial, since we need to not only prevent the wandb init on the nonzero ranks but probably also sync the wandb-updated config to those ranks.
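The usual pattern, sketched generically with torch.distributed below rather than as actual nequip code (init_wandb_rank_zero is a hypothetical helper), is to call wandb.init() only on rank 0 and then broadcast the possibly-updated config to the other ranks:

    import torch.distributed as dist
    import wandb

    def init_wandb_rank_zero(config: dict, project: str) -> dict:
        # Only rank 0 talks to the wandb service; the other ranks receive the
        # (possibly wandb-modified) config via a broadcast so that every rank
        # builds the model from identical settings.
        if dist.get_rank() == 0:
            run = wandb.init(project=project, config=config)
            payload = [run.config.as_dict()]
        else:
            payload = [None]
        dist.broadcast_object_list(payload, src=0)
        return payload[0]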

Sometimes there is a random crash after a few hundred epochs; I have no idea why yet, and it was not reproducible.

Weird... usually when we see something like this it means out-of-memory, or that the cluster's scheduler went crazy.

At the moment each process seems to load the network onto every GPU, e.g. running with 8 GPUs I get this output from nvidia-smi:

Not sure exactly what I'm looking at here, but yes, every GPU will get its own copy of the model, as hinted at by the name "Distributed Data Parallel".

Linux-cpp-lisp avatar Mar 28 '23 22:03 Linux-cpp-lisp

Out-of-memory errors could make sense, and they might be connected to the last issue, since with the same batch size per GPU I did not get OOM errors when running on a single GPU.

The output basically says that each worker process uses up memory (most likely a copy of the model) on every GPU; however, with DDP each worker is supposed to have a copy only on its own GPU, with gradient updates then exchanged all-to-all. From previous experience with DDP, I would basically expect the output to look like this:

    |  0  N/A  N/A  804401  C  ...a/envs/nequip2/bin/python  18145MiB |
    |  1  N/A  N/A  804402  C  ...a/envs/nequip2/bin/python  19101MiB |
    |  2  N/A  N/A  804403  C  ...a/envs/nequip2/bin/python  17937MiB |
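For reference, the standard torchrun + DDP setup that keeps each rank confined to its own GPU looks roughly like the sketch below (generic code, not the ddp branch itself); if a rank does CUDA work before pinning its device, it can easily end up creating extra CUDA contexts (like the ~1.5 GB entries above) on devices it should never touch:

    import os
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    def setup_ddp(model: torch.nn.Module) -> DDP:
        # torchrun sets LOCAL_RANK / RANK / WORLD_SIZE for each spawned process.
        local_rank = int(os.environ["LOCAL_RANK"])
        torch.cuda.set_device(local_rank)  # do this before any other CUDA calls
        dist.init_process_group(backend="nccl")
        model = model.to(local_rank)
        # device_ids pins this replica (and its NCCL communicator) to one GPU,
        # so the process should not allocate memory on the other devices.
        return DDP(model, device_ids=[local_rank], output_device=local_rank)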

JonathanSchmidt1 avatar Mar 29 '23 07:03 JonathanSchmidt1

I'd also be very interested in this feature. I have access to a system with four A100s on each node. Being able to use all four would make training go a lot faster.

peastman avatar Mar 29 '23 19:03 peastman

I spent some time debugging the issue, and it seems that the metrics.gather and loss.gather calls cause the extra processes to spawn. If I remove these calls there is only one process per GPU, and I can scale to 16 GPUs (before, it would run out of memory because of the extra processes). However, continuing the training after stopping still somehow causes extra processes to spawn, but just on the zeroth GPU.
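One way such gather steps can avoid touching other ranks' GPUs is to reduce fixed-size tensors that already live on the local device; a generic sketch of that idea (not the branch's metrics code; reduce_mean_metric is hypothetical):

    import torch
    import torch.distributed as dist

    def reduce_mean_metric(value_sum: torch.Tensor, count: torch.Tensor) -> torch.Tensor:
        # Both tensors must already sit on this rank's own GPU; NCCL then
        # reduces in place without allocating anything on other devices.
        packed = torch.stack([value_sum, count])
        dist.all_reduce(packed, op=dist.ReduceOp.SUM)
        return packed[0] / packed[1]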

JonathanSchmidt1 avatar Apr 05 '23 19:04 JonathanSchmidt1

Hi all,

Any updates on this feature? I also have some rather large datasets.

rschireman avatar Jul 26 '23 22:07 rschireman

Just a small update: as I had access to a different cluster with Horovod, I tested the horovod branch again, and with the fixed runstats version and a few small changes it ran without the issues of the ddp version. I also got decent speedups, despite using single-GPU nodes:

    N_nodes (1 P100 per node):  1     2     4     8     16    32
    Speedup:                    1.00  1.63  3.39  6.64  9.57  17.38

PS: I have not yet confirmed whether the loss is the same for different node counts with Horovod.
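For anyone trying to reproduce this, a multi-node launch looks roughly like the following (hypothetical host names; one slot per node since each node has a single P100):

    horovodrun -np 4 -H node1:1,node2:1,node3:1,node4:1 nequip-train configs/example.yaml --horovod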

JonathanSchmidt1 avatar Sep 08 '23 15:09 JonathanSchmidt1

Hi @JonathanSchmidt1,

Did you also receive a message like this when using the horovod branch on 2 gpus:

[1,0]<stderr>:Processing dataset...
[1,1]<stderr>:Processing dataset...

rschireman avatar Sep 29 '23 17:09 rschireman

The dataset processing only seems to happen in one process for me, so I only get the message once. Anyway, if that is causing problems for you, it might work to process the dataset beforehand and then start the training.

PS: I have now tested some of the models, and the loss reported during training seems correct.

JonathanSchmidt1 avatar Oct 27 '23 10:10 JonathanSchmidt1

Hi,

I am also quite interested in the multi-GPU training capability. I did some tests with the ddp branch using PyTorch 2.1.1 on up to 16 GPUs (4 V100 per node) with a dataset of ~5k configurations. In all my tests I achieved the same results as a single-GPU reference. I was wondering whether this feature is still under active development and if there is any plan to merge it into the develop branch?

sklenard avatar Feb 09 '24 18:02 sklenard

Hi @sklenard,

I am trying to use the multi-GPU feature, but I am having some trouble with it. I installed the ddp branch with PyTorch 2.1.1 by changing, in setup.py in the nequip folder,

    "torch>=1.8,<=1.12,!=1.9.0",  # torch.fx added in 1.8

to

    "torch>=1.8,<=2.1.1,!=1.9.0",  # torch.fx added in 1.8

In this way the ddp branch installs without any error. However, when I try to run nequip-train, I get this error:

[W init.cpp:842] Warning: Use _jit_set_fusion_strategy, bailout depth is deprecated. Setting to (STATIC, 2) (function operator())
Traceback (most recent call last):
  File "/global/homes/.local/Miniconda3/envs/nequip-ddp/bin/nequip-train", line 8, in <module>
    sys.exit(main())
  File "/global/homes/.local/Miniconda3/envs/nequip-ddp/lib/python3.10/site-packages/nequip/scripts/train.py", line 76, in main
    trainer = fresh_start(config)
  File "/global/homes/.local/Miniconda3/envs/nequip-ddp/lib/python3.10/site-packages/nequip/scripts/train.py", line 189, in fresh_start
    config = init_n_update(config)
  File "/global/homes/.local/Miniconda3/envs/nequip-ddp/lib/python3.10/site-packages/nequip/utils/wandb.py", line 17, in init_n_update
    wandb.init(
  File "/global/homes/.local/Miniconda3/envs/nequip-ddp/lib/python3.10/site-packages/wandb/sdk/wandb_init.py", line 1200, in init
    raise e
  File "/global/homes/.local/Miniconda3/envs/nequip-ddp/lib/python3.10/site-packages/wandb/sdk/wandb_init.py", line 1177, in init
    wi.setup(kwargs)
  File "/global/homes/.local/Miniconda3/envs/nequip-ddp/lib/python3.10/site-packages/wandb/sdk/wandb_init.py", line 190, in setup
    self._wl = wandb_setup.setup(settings=setup_settings)
  File "/global/homes/.local/Miniconda3/envs/nequip-ddp/lib/python3.10/site-packages/wandb/sdk/wandb_setup.py", line 327, in setup
    ret = _setup(settings=settings)
  File "/global/homes/.local/Miniconda3/envs/nequip-ddp/lib/python3.10/site-packages/wandb/sdk/wandb_setup.py", line 320, in _setup
    wl = _WandbSetup(settings=settings)
  File "/global/homes/.local/Miniconda3/envs/nequip-ddp/lib/python3.10/site-packages/wandb/sdk/wandb_setup.py", line 303, in __init__
    _WandbSetup._instance = _WandbSetup__WandbSetup(settings=settings, pid=pid)
  File "/global/homes/.local/Miniconda3/envs/nequip-ddp/lib/python3.10/site-packages/wandb/sdk/wandb_setup.py", line 114, in __init__
    self._setup()
  File "/global/homes/.local/Miniconda3/envs/nequip-ddp/lib/python3.10/site-packages/wandb/sdk/wandb_setup.py", line 250, in _setup
    self._setup_manager()
  File "/global/homes/.local/Miniconda3/envs/nequip-ddp/lib/python3.10/site-packages/wandb/sdk/wandb_setup.py", line 277, in _setup_manager
    self._manager = wandb_manager._Manager(settings=self._settings)
  File "/global/homes/.local/Miniconda3/envs/nequip-ddp/lib/python3.10/site-packages/wandb/sdk/wandb_manager.py", line 139, in __init__
    self._service.start()
  File "/global/homes/.local/Miniconda3/envs/nequip-ddp/lib/python3.10/site-packages/wandb/sdk/service/service.py", line 250, in start
    self._launch_server()
  File "/global/homes/.local/Miniconda3/envs/nequip-ddp/lib/python3.10/site-packages/wandb/sdk/service/service.py", line 244, in _launch_server
    _sentry.reraise(e)
  File "/global/homes/.local/Miniconda3/envs/nequip-ddp/lib/python3.10/site-packages/wandb/analytics/sentry.py", line 154, in reraise
    raise exc.with_traceback(sys.exc_info()[2])
  File "/global/homes/.local/Miniconda3/envs/nequip-ddp/lib/python3.10/site-packages/wandb/sdk/service/service.py", line 242, in _launch_server
    self._wait_for_ports(fname, proc=internal_proc)
  File "/global/homes/.local/Miniconda3/envs/nequip-ddp/lib/python3.10/site-packages/wandb/sdk/service/service.py", line 132, in _wait_for_ports
    raise ServiceStartTimeoutError(
wandb.sdk.service.service.ServiceStartTimeoutError: Timed out waiting for wandb service to start after 30.0 seconds. Try increasing the timeout with the `_service_wait` setting.

It seems that there is something wrong with wandb. I wonder how you installed this branch; maybe there is some difference between the version you installed and the one I installed, since more than two months have passed. It would be great if you could recall how you installed it, or share the version you installed. Thank you very much!

beidouamg avatar Apr 25 '24 03:04 beidouamg

@beidouamg this looks like a network error unrelated to the ddp branch, but maybe there is a race condition. Have you tried to run without wandb enabled?
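In case it helps while debugging, two things worth trying (assumptions on my part, not tested on this branch): turn off wandb in the YAML config if the branch supports it, or raise the service timeout that the error message itself points to, e.g.:

    import wandb

    # _service_wait is the setting named in the ServiceStartTimeoutError above
    wandb.setup(settings=wandb.Settings(_service_wait=300))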

Linux-cpp-lisp avatar May 01 '24 22:05 Linux-cpp-lisp