
Unspecified Launch Failure Error

Open · awsankur opened this issue on Nov 09, 2023 · 0 comments

System Info

PyTorch version: 2.1.1+cu121
CUDA used to build PyTorch: 12.1
GPUs: NVIDIA A100-SXM4-80GB
Nodes: 2
GPUs per node: 8
NCCL version: 2.18.6
Python: 3.10

Installed llama-recipes as follows:

pip3 install --extra-index-url https://download.pytorch.org/whl/test/cu121 llama-recipes

Information

  • [ ] The official example scripts
  • [ ] My own modified scripts

🐛 Describe the bug

The multi-node.slurm script used to submit the job:

#!/bin/bash

#SBATCH --job-name=Nano-2d-trainer-20b-8nodes

#SBATCH --ntasks=1
#SBATCH --nodes=1
#SBATCH --gpus-per-task=8

nodes=( $( scontrol show hostnames $SLURM_JOB_NODELIST ) )
nodes_array=($nodes)
head_node=${nodes_array[0]}
head_node_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-address)
# Enable for A100
export FI_PROVIDER="efa"

echo Node IP: $head_node_ip
export LOGLEVEL=INFO
# debugging flags (optional)
export NCCL_DEBUG=INFO
##export NCCL_DEBUG_SUBSYS=WARN
export PYTHONFAULTHANDLER=1
export LD_LIBRARY_PATH=/opt/amazon/efa/lib:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/usr/local/lib/:$LD_LIBRARY_PATH
export CUDA_LAUNCH_BLOCKING=0

export FI_EFA_FORK_SAFE=1
export FI_EFA_ENABLE_SHM_TRANSFER=1

srun  torchrun --nnodes 1 --nproc_per_node 8 --rdzv_id $RANDOM --rdzv_backend c10d --rdzv_endpoint $head_node_ip:29500 /apps/llama-recipes/llama-recipes/examples/finetuning.py \
  --model_name /fsx/llama-2-7b-hf \
  --output_dir /fsx/llama-2-7b-peft \
  --enable_fsdp \
  --use_peft \
  --peft_method lora \
  --pure_bf16
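
For reference, the script is pasted here with the single-node values (--nodes=1, --nnodes 1); for the 2-node runs that fail, the Slurm and torchrun node counts are raised together. A minimal sketch of that variant, assuming the same paths and rendezvous port:

#SBATCH --nodes=2
#SBATCH --ntasks=2            # one torchrun launcher per node
#SBATCH --gpus-per-task=8

srun torchrun --nnodes 2 --nproc_per_node 8 \
  --rdzv_id $RANDOM --rdzv_backend c10d --rdzv_endpoint $head_node_ip:29500 \
  /apps/llama-recipes/llama-recipes/examples/finetuning.py \
  --model_name /fsx/llama-2-7b-hf \
  --output_dir /fsx/llama-2-7b-peft \
  --enable_fsdp --use_peft --peft_method lora --pure_bf16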

Training fails randomly, at a different step and epoch each time, with the following error:

NCCL watchdog thread terminated with exception: CUDA error: unspecified launch failure
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

On my 2-node p4de cluster, training works when using just one node, and NCCL tests also run successfully. Training also completes correctly on a 2-node P5 cluster.
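
Not as a fix but for extra diagnostics on the next failing run: the error text itself suggests CUDA_LAUNCH_BLOCKING=1, and the NCCL logging commented out in the script can be turned up. A sketch of the flags I would add (note the async-error-handling variable name differs across PyTorch versions):

export CUDA_LAUNCH_BLOCKING=1          # serialize kernel launches so the failing op appears in the stack trace
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=INIT,COLL,NET # more detail than the commented-out SUBSYS line above
export NCCL_ASYNC_ERROR_HANDLING=1     # surface async NCCL failures; newer PyTorch spells this TORCH_NCCL_ASYNC_ERROR_HANDLING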

Error logs

Training Epoch: 2/3, step 13/24 completed (loss: 1.7051671743392944): : 91it [28:39, 11.34s/it]
Training Epoch: 2/3, ompleted (loss: 1.7000443935394287): : 78it [26:38, 12.51s/it]
Training Epoch: 2/3, step 12/24 completed (loss: 1.575779676437378): : 78it [26:38, 12.52s/it] 
Training Epoch: 2/3, step 12/24 completed (loss: 1.6573039293289185): : 91it [28:38, 11.33s/it]
Training Epoch: 2/3, step 12/24 completed (loss: 1.719298005104065): : 91it [28:38, 11.33s/it]
Training Epoch: 2/3, step 12/24 completed (loss: 1.6742596626281738): : 91it [28:38, 11.34s/it]
Training Epoch: 2/3, step 12/24 completed (loss: 1.575779676437378): : 91it [28:38, 11.34s/it]
Training Epoch: 2/3, step 12/24 completed (loss: 1.707750916481018): : 91it [28:38, 11.34s/it]
Training Epoch: 2/3, step 12/24 completed (loss: 1.7000443935394287): : 91it [28:39, 11.34s/it]
Training Epoch: 2/3, step 12/24 completed (loss: 1.6420879364013672): : 91it [28:38, 11.34s/it]
Training Epoch: 2/3, step 12/24 completed (loss: 1.641650915145874): : 91it [28:38, 11.34s/it]
Training Epoch: 2/3, step 13/24 completed (loss: 1.718770980834961): : 91it [28:39, 11.33s/it] 
Trainstep 13/24 completed (loss: 1.6281288862228394): : 91it [28:39, 11.34s/it]
Training Epoch: 2/3, step 13/24 completed (loss: 1.6770853996276855): : 91it [28:39, 11.34s/it]
Training Epoch: 2/3, step 13/24 completed (loss: 1.7179162502288818): : 91it [28:39, 11.33s/it]
Training Epoch: 2/3, step 13/24 completed (loss: 1.6623927354812622): : 91it [28:39, 11.33s/it]
Training Epoch: 2/3, step 13/24 completed (loss: 1.6608721017837524): : 91it [28:39, 11.33s/it]
Training Epoch: 2/3, step 13/24 completed (loss: 1.6101438999176025): : 91it [28:39, 11.34s/it]
Training Epoch: 2/3, step 13/24 completed (loss: 1.642388105392456): : 91it [28:39, 11.33s/it] [E ProcessGroupNCCL.cpp:916] [Rank 0] NCCL watchdog thread terminated with exception: CUDA error: unspecified launch failure
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fbc51fac617 in /apps/.conda/envs/llama-cu21/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7fbc51f6798d in /apps/.conda/envs/llama-cu21/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7fbc5205d118 in /apps/.conda/envs/llama-cu21/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x80 (0x7fbc5339ad40 in /apps/.conda/envs/llama-cu21/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x7fbc5339eb68 in /apps/.conda/envs/llama-cu21/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x250 (0x7fbc533b5400 in /apps/.conda/envs/llama-cu21/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x78 (0x7fbc533b5708 in /apps/.conda/envs/llama-cu21/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xd6df4 (0x7fbcb3e9cdf4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: <unknown function> + 0x8609 (0x7fbcde4ca609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #9: clone + 0x43 (0x7fbcde295133 in /lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 0] NCCL watchdog thread terminated with exception: CUDA error: unspecified launch failure
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fbc51fac617 in /apps/.conda/envs/llama-cu21/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7fbc51f6798d in /apps/.conda/envs/llama-cu21/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7fbc5205d118 in /apps/.conda/envs/llama-cu21/lib/python3.10/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x80 (0x7fbc5339ad40 in /apps/.conda/envs/llama-cu21/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x58 (0x7fbc5339eb68 in /apps/.conda/envs/llama-cu21/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x250 (0x7fbc533b5400 in /apps/.conda/envs/llama-cu21/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #6: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x78 (0x7fbc533b5708 in /apps/.conda/envs/llama-cu21/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0xd6df4 (0x7fbcb3e9cdf4 in /lib/x86_64-linux-gnu/libstdc++.so.6)
frame #8: <unknown function> + 0x8609 (0x7fbcde4ca609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #9: clone + 0x43 (0x7fbcde295133 in /lib/x86_64-linux-gnu/libc.so.6)

Fatal Python error: Aborted

Thread 0x00007fbbee94f700 (most recent call first):
  File "/apps/.conda/envs/llama-cu21/lib/python3.10/threading.py", line 320 in wait
  File "/apps/.conda/envs/llama-cu21/lib/python3.10/multiprocessing/queues.py", line 231 in _feed
  File "/apps/.conda/envs/llama-cu21/lib/python3.10/threading.py", line 953 in run
  File "/apps/.conda/envs/llama-cu21/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
  File "/apps/.conda/envs/llama-cu21/lib/python3.10/threading.py", line 973 in _bootstrap

Thread 0x00007fbbbafbd700 (most recent call first):
  File "/apps/.conda/envs/llama-cu21/lib/python3.10/selectors.py", line 416 in select
  File "/apps/.conda/envs/llama-cu21/lib/python3.10/multiprocessing/connection.py", line 931 in wait
  File "/apps/.conda/envs/llama-cu21/lib/python3.10/multiprocessing/connection.py", line 424 in _poll
  File "/apps/.conda/envs/llama-cu21/lib/python3.10/multiprocessing/connection.py", line 257 in poll
  File "/apps/.conda/envs/llama-cu21/lib/python3.10/multiprocessing/queues.py", line 113 in get
  File "/apps/.conda/envs/llama-cu21/lib/python3.10/site-packages/torch/utils/data/_utils/pin_memory.py", line 31 in do_one_step
  File "/apps/.conda/envs/llama-cu21/lib/python3.10/site-packages/torch/utils/data/_utils/pin_memory.py", line 54 in _pin_memory_loop
  File "/apps/.conda/envs/llama-cu21/lib/python3.10/threading.py", line 953 in run
  File "/apps/.conda/envs/llama-cu21/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
  File "/apps/.conda/envs/llama-cu21/lib/python3.10/threading.py", line 973 in _bootstrap

Thread 0x00007fbbbb7be700 (most recent call first):
  File "/apps/.conda/envs/llama-cu21/lib/python3.10/site-packages/psutil/_common.py", line 484 in wrapper
  File "/apps/.conda/envs/llama-cu21/lib/python3.10/site-packages/llama_recipes/utils/memory_utils.py", line 29 in cpu_mem_used
  File "/apps/.conda/envs/llama-cu21/lib/python3.10/site-packages/llama_recipes/utils/memory_utils.py", line 35 in peak_monitor_func
  File "/apps/.conda/envs/llama-cu21/lib/python3.10/threading.py", line 953 in run
  File "/apps/.conda/envs/llama-cu21/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
  File "/apps/.conda/envs/llama-cu21/lib/python3.10/threading.py", line 973 in _bootstrap

Thread 0x00007fbbbcfc1700 (most recent call first):
  <no Python frame>

Thread 0x00007fbbbd7c2700 (most recent call first):
  <no Python frame>

Thread 0x00007fbbbdfc3700 (most recent call first):
  <no Python frame>

Thread 0x00007fbbbe7c4700 (most recent call first):
  <no Python frame>

Thread 0x00007fbbbefc5700 (most recent call first):
  <no Python frame>

Thread 0x00007fbbbf7c6700 (most recent call first):
  <no Python frame>

Thread 0x00007fbbbffc7700 (most recent call first):
  <no Python frame>

Thread 0x00007fbbc2ecf700 (most recent call first):
  File "/apps/.conda/envs/llama-cu21/lib/python3.10/site-packages/torch/cuda/streams.py", line 221 in synchronize
  File "/apps/.conda/envs/llama-cu21/lib/python3.10/site-packages/torch/distributed/fsdp/_runtime_utils.py", line 345 in _unshard
  File "/apps/.conda/envs/llama-cu21/lib/python3.10/site-packages/torch/distributed/fsdp/_runtime_utils.py", line 464 in _pre_forward_unshard
  File "/apps/.conda/envs/llama-cu21/lib/python3.10/site-packages/torch/distributed/fsdp/_runtime_utils.py", line 429 in _pre_forward
  File "/apps/.conda/envs/llama-cu21/lib/python3.10/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 825 in forward
  File "/apps/.conda/envs/llama-cu21/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527 in _call_impl
  File "/apps/.conda/envs/llama-cu21/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518 in _wrapped_call_impl
  File "/apps/.conda/envs/llama-cu21/lib/python3.10/site-packages/peft/tuners/lora.py", line 908 in forward
  File "/apps/.conda/envs/llama-cu21/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527 in _call_impl
  File "/apps/.conda/envs/llama-cu21/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518 in _wrapped_call_impl
  File "/apps/.conda/envs/llama-cu21/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 366 in forward
  File "/apps/.conda/envs/llama-cu21/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527 in _call_impl
  File "/apps/.conda/envs/llama-cu21/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518 in _wrapped_call_impl
  File "/apps/.conda/envs/llama-cu21/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 672 in forward
  File "/apps/.conda/envs/llama-cu21/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527 in _call_impl
  File "/apps/.conda/envs/llama-cu21/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518 in _wrapped_call_impl
  File "/apps/.conda/envs/llama-cu21/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 1194 in recompute_fn
  File "/apps/.conda/envs/llama-cu21/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 1071 in unpack_hook

Thread 0x00007fbbef150700 (most recent call first):
  File "/apps/.conda/envs/llama-cu21/lib/python3.10/threading.py", line 324 in wait
  File "/apps/.conda/envs/llama-cu21/lib/python3.10/threading.py", line 607 in wait
  File "/apps/.conda/envs/llama-cu21/lib/python3.10/site-packages/tqdm/_monitor.py", line 60 in run
  File "/apps/.conda/envs/llama-cu21/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
  File "/apps/.conda/envs/llama-cu21/lib/python3.10/threading.py", line 973 in _bootstrap

Thread 0x00007fbbfced3700 (most recent call first):
  File "/apps/.conda/envs/llama-cu21/lib/python3.10/threading.py", line 324 in wait
  File "/apps/.conda/envs/llama-cu21/lib/python3.10/threading.py", line 607 in wait
  File "/apps/.conda/envs/llama-cu21/lib/python3.10/site-packages/tqdm/_monitor.py", line 60 in run
  File "/apps/.conda/envs/llama-cu21/lib/python3.10/threading.py", line 1016 in _bootstrap_inner
  File "/apps/.conda/envs/llama-cu21/lib/python3.10/threading.py", line 973 in _bootstrap

Thread 0x00007fbcde1754c0 (most recent call first):
  File "/apps/.conda/envs/llama-cu21/lib/python3.10/site-packages/torch/autograd/__init__.py", line 251 in backward
  File "/apps/.conda/envs/llama-cu21/lib/python3.10/site-packages/torch/_tensor.py", line 492 in backward
  File "/apps/.conda/envs/llama-cu21/lib/python3.10/site-packages/llama_recipes/utils/train_utils.py", line 92 in train
  File "/apps/.conda/envs/llama-cu21/lib/python3.10/site-packages/llama_recipes/finetuning.py", line 237 in main
  File "/apps/.conda/envs/llama-cu21/lib/python3.10/site-packages/fire/core.py", line 691 in _CallAndUpdateTrace
  File "/apps/.conda/envs/llama-cu21/lib/python3.10/site-packages/fire/core.py", line 475 in _Fire
  File "/apps/.conda/envs/llama-cu21/lib/python3.10/site-packages/fire/core.py", line 141 in Fire
  File "/apps/llama-recipes/llama-recipes/examples/finetuning.py", line 8 in <module>

Expected behavior

Training should complete successfully without errors. For reference, these are the numbers I got from a 2-node P5 training run:

Key: avg_train_prep, Value: 5.826728343963623
Key: avg_train_loss, Value: 1.7596477270126343
Key: avg_eval_prep, Value: 5.577054500579834
Key: avg_eval_loss, Value: 1.7178523540496826
Key: avg_epoch_time, Value: 3628.8078609586664
Key: avg_checkpoint_time, Value: 58.75846792899938
