PiPPy llama example hangs on V100

Open fblagojevic opened this issue 11 months ago • 1 comments

Llama example works fine when run with 2 GPUs:

$ torchrun --nproc-per-node 2 pippy_llama.py
['know', 'think', 'you', 'be', 'getting', 'great', 'favorite', 'right']

However, the example hangs when run with 4 or 8 GPUs:

$ torchrun --nproc-per-node 4 pippy_llama.py

I am running this on a node with 8xV100 GPUs and CUDA 12.3. When running with 4 processes, the nvidia-smi output shows GPU 1 at 100% utilization while the other GPUs are at 0%.

Any ideas? Thanks!

Mar 19 '24 18:03 fblagojevic

Sorry, I cannot reproduce the hang on my system (8xA100).

$ torchrun --standalone --nproc-per-node 4 pippy_llama.py
Downloading shards: 100%|█████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  7.94it/s]
Downloading shards: 100%|█████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  7.99it/s]
Downloading shards: 100%|█████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  4.77it/s]
Downloading shards: 100%|█████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  2.76it/s]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████| 2/2 [00:18<00:00,  9.36s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████| 2/2 [00:20<00:00, 10.28s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████| 2/2 [00:21<00:00, 10.54s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████| 2/2 [00:21<00:00, 10.54s/it]
NCCL version 2.19.3+cuda12.3
NCCL version 2.19.3+cuda12.3
NCCL version 2.19.3+cuda12.3
['make', 'think', 'you', 'be', 'getting', 'great', 'favorite', 'right']

$ torchrun --standalone --nproc-per-node 8 pippy_llama.py
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████| 2/2 [00:15<00:00,  7.92s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████| 2/2 [00:15<00:00,  7.92s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████| 2/2 [00:15<00:00,  7.92s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████| 2/2 [00:15<00:00,  7.92s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████| 2/2 [00:15<00:00,  7.92s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████| 2/2 [00:15<00:00,  7.92s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████| 2/2 [00:15<00:00,  7.93s/it]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████| 2/2 [00:15<00:00,  7.92s/it]
NCCL version 2.19.3+cuda12.3
NCCL version 2.19.3+cuda12.3
NCCL version 2.19.3+cuda12.3
NCCL version 2.19.3+cuda12.3
NCCL version 2.19.3+cuda12.3
NCCL version 2.19.3+cuda12.3
NCCL version 2.19.3+cuda12.3
['make', 'think', 'you', 'be', 'getting', 'great', 'favorite', 'right']

Mar 20 '24 14:03 kwen2501
