ControlNet
RuntimeError: Timed out initializing process group in store based barrier on rank: 0
I am unable to run training on 4 GPUs, although the same script runs fine on a single GPU with batch size = 1.
# Trainer params
batch_size = 1  # tried 4, but it did not work
n_gpus = 4
n_epochs = 2
strategy = "ddp"  # "dp" | "ddp" | "ddp2"
trainer = pl.Trainer(gpus=n_gpus, precision=32, callbacks=[logger], max_epochs=n_epochs, strategy=strategy)
This is the error I get:
RuntimeError: Timed out initializing process group in store based barrier on rank: 0, for key: store_based_barrier_key:1 (world_size=4, worker_count=2, timeout=0:30:00)
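To make the numbers in that message easier to read, here is a small sketch that pulls them apart. The helper name `parse_barrier_error` is my own and is not part of PyTorch or PyTorch Lightning. It assumes `worker_count` is the number of processes that checked in at the barrier, which would mean only 2 of the 4 expected processes joined before the timeout.

```python
import re

# Hypothetical helper (not a PyTorch API): extracts the counts from the
# store-based barrier timeout message quoted above.
BARRIER_RE = re.compile(
    r"rank: (?P<rank>\d+).*?world_size=(?P<world_size>\d+), "
    r"worker_count=(?P<worker_count>\d+)"
)


def parse_barrier_error(message: str) -> dict:
    """Return rank, world_size, worker_count and the number of missing workers."""
    match = BARRIER_RE.search(message)
    if match is None:
        raise ValueError("not a store-based barrier timeout message")
    info = {key: int(value) for key, value in match.groupdict().items()}
    # Assumption: worker_count counts the processes that reached the barrier.
    info["missing_workers"] = info["world_size"] - info["worker_count"]
    return info


msg = (
    "RuntimeError: Timed out initializing process group in store based "
    "barrier on rank: 0, for key: store_based_barrier_key:1 "
    "(world_size=4, worker_count=2, timeout=0:30:00)"
)
info = parse_barrier_error(msg)
print(info)
```

If that reading is right, two of the four DDP processes never reached process-group initialization, so that is probably where to look first.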
This is my GPU setup:
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla T4                        On | 00000000:00:04.0 Off |                    0 |
| N/A   36C    P8                9W /  70W|      2MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  Tesla T4                        On | 00000000:00:05.0 Off |                    0 |
| N/A   40C    P8               11W /  70W|      2MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  Tesla T4                        On | 00000000:00:06.0 Off |                    0 |
| N/A   38C    P8                9W /  70W|      2MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  Tesla T4                        On | 00000000:00:07.0 Off |                    0 |
| N/A   38C    P8                9W /  70W|      2MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+