
No output when I try DDP

Open YihuaXuCn opened this issue 1 year ago • 3 comments

Context

I am new to DDP, so I used the code in this repo directly. The code runs in my environment, but there is no output. Two observations confirm that the code is actually executing: first, I checked GPU usage; second, a "checkpoint.pt" file has been created.

  • Pytorch version: 2.0.1+cu117
  • Operating System and version: Ubuntu 20.04.4 LTS

Environment

  • GPU environment:
Thu May 25 03:15:52 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA RTX A6000    Off  | 00000000:01:00.0 Off |                  Off |
| 30%   44C    P2    96W / 300W |    827MiB / 49140MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA RTX A6000    Off  | 00000000:41:00.0 Off |                  Off |
| 30%   42C    P2    99W / 300W |    835MiB / 49140MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA RTX A6000    Off  | 00000000:81:00.0 Off |                  Off |
| 30%   49C    P2    97W / 300W |    835MiB / 49140MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA RTX A6000    Off  | 00000000:C1:00.0 Off |                  Off |
| 30%   43C    P2    87W / 300W |    795MiB / 49140MiB |    100%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A   3767550      C   .../envs/xxxxxxxx/bin/python      825MiB |
|    1   N/A  N/A   3767551      C   .../envs/xxxxxxxx/bin/python      833MiB |
|    2   N/A  N/A   3767552      C   .../envs/xxxxxxxx/bin/python      833MiB |
|    3   N/A  N/A   3767553      C   .../envs/xxxxxxxx/bin/python      793MiB |
+-----------------------------------------------------------------------------+
  • Which example are you using: I am working on the "main" branch. I made almost no changes to "examples/distributed/ddp-tutorial-series/multigpu.py" apart from adding some print statements. For example, in the main function:
   dataset, model, optimizer = load_train_objs()
   print("load train objs")
   train_data = prepare_dataloader(dataset, batch_size)
   print("load data")
   trainer = Trainer(model, train_data, optimizer, rank, save_every)
   print("create instance of Trainer")
   trainer.train(total_epochs)
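One thing worth ruling out first (my assumption, not something confirmed by the repo): when stdout is redirected to a file, Python block-buffers it, so prints from spawned worker processes may not reach the file until the process exits — and it never exits if DDP hangs. A minimal stdlib-only sketch of the fix:

```python
# Sketch (pure stdlib, no DDP): prints from child processes only reach a
# redirected file promptly when flushed explicitly.
import multiprocessing as mp

def worker(rank: int) -> None:
    # flush=True pushes the line out immediately, even when stdout is
    # block-buffered by a redirect like `python script.py > output.txt`.
    print(f"[GPU{rank}] load train objs", flush=True)

if __name__ == "__main__":
    procs = [mp.Process(target=worker, args=(r,)) for r in range(2)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```

Alternatively, running with `python -u` disables stdout buffering entirely, so you can see how far the script gets before it hangs.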

Expected Behavior

It should finish in seconds given my command python multigpu.py 10 5 > output.txt and print lines like [GPU0] ..., [GPU1] ..., [GPU2] ..., [GPU3] ... into the file output.txt, but instead the code runs for several hours with no output and no error.

YihuaXuCn avatar May 25 '23 04:05 YihuaXuCn

Hi, I came across the same issue when I ran multigpu.py (your demo sample code).

Please see the screenshot:

root:~/work/ $ python ./src/multigpu.py --batch_size=32 --total_epochs=30 --save_every=5

but nothing is output.

From the nvidia-smi output I can see the GPUs are running at 100% while memory usage is very low; it seems the GPUs are hanging, but nothing is printed to the terminal. I can also see 2 processes running (because I set 2 GPUs to use).

output of nvidia-smi:

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.12             Driver Version: 535.104.12   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla T4                       Off | 00000000:63:00.0 Off |                    0 |
| N/A   76C    P0              48W /  70W |    225MiB / 15360MiB |    100%   E. Process |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  Tesla T4                       Off | 00000000:C3:00.0 Off |                    0 |
| N/A   66C    P0              41W /  70W |    225MiB / 15360MiB |    100%   E. Process |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  Tesla T4                       Off | 00000000:E3:00.0 Off |                    0 |
| N/A   40C    P8              11W /  70W |      5MiB / 15360MiB |      0%   E. Process |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A     37920      C   .../.conda/envs/bin/python                  220MiB |
|    1   N/A  N/A     37921      C   .../.conda/envs/bin/python                  220MiB |
+---------------------------------------------------------------------------------------+

So what's wrong? I ran other scripts too; none of them executed, just like this one.
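For what it's worth, a hang at 100% utilization during the very first collective usually points at inter-GPU communication rather than the script itself. One way to gather evidence (these are standard NCCL/PyTorch environment variables, not something specific to this repo, and NCCL_P2P_DISABLE is only a workaround to test, not a fix):

```shell
# Print NCCL initialization and collective traces to the terminal
export NCCL_DEBUG=INFO
# Extra cross-rank shape/collective checking from torch.distributed
export TORCH_DISTRIBUTED_DEBUG=DETAIL
# Common workaround where peer-to-peer GPU transfers stall:
# fall back to staging transfers through host memory
export NCCL_P2P_DISABLE=1

python ./src/multigpu.py --batch_size=32 --total_epochs=30 --save_every=5
```

If the run completes with NCCL_P2P_DISABLE=1 but hangs without it, the problem is the P2P path between the GPUs (driver, IOMMU/ACS, or topology), not the DDP code.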

output of nvtop:

Device 0 [Tesla T4]: GPU 1590MHz, MEM 5000MHz, 76°C, POW 48 / 70 W, GPU util 100%, MEM 0.639Gi/15.000Gi, PCIe GEN 3@16x, RX/TX 0.000 KiB/s
Device 1 [Tesla T4]: GPU 1590MHz, MEM 5000MHz, 66°C, POW 40 / 70 W, GPU util 100%, MEM 0.639Gi/15.000Gi, PCIe GEN 3@16x, RX/TX 0.000 KiB/s
Device 2 [Tesla T4]: GPU 300MHz, MEM 405MHz, 40°C, POW 10 / 70 W, GPU util 0%, MEM 0.424Gi/15.000Gi, PCIe GEN 1@16x, RX/TX 0.000 KiB/s

(The nvtop graphs show GPU 0 and GPU 1 pinned at 100% utilization for the whole 28-second window, with PCIe traffic at 0 KiB/s, while GPU 2 stays idle at 0%.)

I tested many times. I think maybe the DDP API has some issue or bug. Could you find any reason for this? Thanks.

asimay avatar Dec 16 '23 08:12 asimay

And after a long time, the app crashes:

[E ProcessGroupNCCL.cpp:475] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLGATHER, NumelIn=1, NumelOut=2, Timeout(ms)=1800000) ran for 1800263 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:475] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLGATHER, NumelIn=1, NumelOut=2, Timeout(ms)=1800000) ran for 1800316 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:916] [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLGATHER, NumelIn=1, NumelOut=2, Timeout(ms)=1800000) ran for 1800316 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:916] [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLGATHER, NumelIn=1, NumelOut=2, Timeout(ms)=1800000) ran for 1800263 milliseconds before timing out.
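The 1800000 ms in that log is the default 30-minute collective timeout of the process group: the ALLGATHER at DDP construction never completed, and the watchdog killed the processes half an hour later. While debugging, you can make the failure surface in minutes instead, since init_process_group accepts a timeout argument. A minimal single-process sketch on the CPU "gloo" backend (the same argument applies to "nccl"):

```python
# Sketch: pass a shorter collective timeout to init_process_group so a stuck
# collective raises an error quickly instead of hanging for 30 minutes.
import os
from datetime import timedelta
import torch.distributed as dist

# Rendezvous address for the single-process demo group
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

dist.init_process_group(
    backend="gloo",                # swap for "nccl" on GPUs
    rank=0,
    world_size=1,
    timeout=timedelta(minutes=5),  # default is 30 min; shorter surfaces hangs sooner
)
print("initialized:", dist.is_initialized())
dist.destroy_process_group()
```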

asimay avatar Dec 16 '23 08:12 asimay

Traceback (most recent call last):
  File "/home/xxx/.local/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 74, in _wrap
    fn(i, *args)
  File "/home/xxx/work/xxx/xxx-main/src/multigpu.py", line 102, in main
    trainer = Trainer(model, train_data, optimizer, rank, save_every)
  File "/home/xxx/work/xxx/xxx-main/src/multigpu.py", line 50, in __init__
    self.model = DDP(model, device_ids=[gpu_id])
  File "/home/xxx/.local/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 795, in __init__
    _verify_param_shape_across_processes(self.process_group, parameters)
  File "/home/xxx/.local/lib/python3.9/site-packages/torch/distributed/utils.py", line 265, in _verify_param_shape_across_processes
    return dist._verify_params_across_processes(process_group, tensors, logger)
RuntimeError: DDP expects same model across all ranks, but Rank 0 has 2 params, while rank 1 has inconsistent 0 params.
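That RuntimeError means the two ranks reached the DDP constructor with different modules: DDP verifies parameter counts and shapes across all ranks at construction time, so every rank must build an identical model before wrapping it. Rank 1 reporting 0 params suggests its model construction failed or diverged (e.g. a rank-dependent branch), not a bug in DDP itself. A sketch of the invariant — Net here is a hypothetical placeholder, not from the repo:

```python
# Every rank must construct the same architecture before wrapping in DDP;
# DDP broadcasts and checks parameter layouts across ranks in its constructor.
import torch.nn as nn

def build_model() -> nn.Module:
    # Deterministic construction: no rank-dependent branches in here.
    return nn.Linear(20, 1)

# On each rank, this parameter layout must agree with every other rank:
shapes = [tuple(p.shape) for p in build_model().parameters()]
print(shapes)  # → [(1, 20), (1,)]
```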


No code was modified.

asimay avatar Dec 16 '23 08:12 asimay