RuntimeError: Timed out initializing process group in store based barrier on rank: 0

Open shravankumar147 opened this issue 2 years ago • 0 comments

I am unable to run training on 4 GPUs, although the same code runs on a single GPU with batch size = 1.

# Trainer params
batch_size = 1  # tried 4, but it did not work
n_gpus = 4
n_epochs = 2
strategy = "ddp"  # "dp" | "ddp" | "ddp2"

trainer = pl.Trainer(gpus=n_gpus, precision=32, callbacks=[logger], max_epochs=n_epochs, strategy=strategy)

Observing this error:

RuntimeError: Timed out initializing process group in store based barrier on rank: 0, for key: store_based_barrier_key:1 (world_size=4, worker_count=2, timeout=0:30:00)
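
For anyone reading the message above: as far as I understand the store-based barrier, `world_size` is the number of processes expected to join and `worker_count` is how many had registered when the timeout hit, so here 2 of the 4 processes apparently never reached the barrier. Below is a minimal sketch that pulls those fields out of the message. It assumes the message format shown above; the helper name `parse_barrier_error` is hypothetical and not part of PyTorch or PyTorch Lightning.

```python
import re

# The error text exactly as reported in this issue.
ERROR = (
    "RuntimeError: Timed out initializing process group in store based barrier "
    "on rank: 0, for key: store_based_barrier_key:1 "
    "(world_size=4, worker_count=2, timeout=0:30:00)"
)


def parse_barrier_error(message: str) -> dict:
    """Extract rank, key, world_size, worker_count and timeout from the message.

    Hypothetical helper for illustration only; it assumes the message layout
    shown in this issue and raises ValueError for anything else.
    """
    pattern = re.compile(
        r"on rank: (?P<rank>\d+), for key: (?P<key>\S+) "
        r"\(world_size=(?P<world_size>\d+), worker_count=(?P<worker_count>\d+), "
        r"timeout=(?P<timeout>[\d:]+)\)"
    )
    match = pattern.search(message)
    if match is None:
        raise ValueError("message does not look like a store-based barrier timeout")
    fields = match.groupdict()
    info = {
        "rank": int(fields["rank"]),
        "key": fields["key"],
        "world_size": int(fields["world_size"]),
        "worker_count": int(fields["worker_count"]),
        "timeout": fields["timeout"],
    }
    # Processes that were expected but had not registered at the barrier.
    info["missing_workers"] = info["world_size"] - info["worker_count"]
    return info


info = parse_barrier_error(ERROR)
print(info)
```

The `missing_workers` value is just the difference between the two counts. It says how many processes were absent at the barrier, not why they failed to start or connect.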

This is my GPU setup:

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla T4                        On | 00000000:00:04.0 Off |                    0 |
| N/A   36C    P8                9W /  70W|      2MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  Tesla T4                        On | 00000000:00:05.0 Off |                    0 |
| N/A   40C    P8               11W /  70W|      2MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  Tesla T4                        On | 00000000:00:06.0 Off |                    0 |
| N/A   38C    P8                9W /  70W|      2MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  Tesla T4                        On | 00000000:00:07.0 Off |                    0 |
| N/A   38C    P8                9W /  70W|      2MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

shravankumar147 • May 03 '23 12:05