llama-recipes
NCCL communicator error: Socket Timeout when fine-tuning the 70B model on 2 x (8x A100 80GB)
When fine-tuning the 70B model, I always run into an error while loading the model. Usually, after loading 4 to 10 of the 15 shards, the following error occurs (see Error Message below). I am using two nodes, and the memory usage on the first GPU of the first node is always noticeably lower, as shown under GPU Usage below.
Error Message:
Warning: unknown parameter local_rank
Clearing GPU cache for all ranks
--> Running with torch dist debug set to detail
[rank14]:[W socket.cpp:432] [c10d] While waitForInput, poolFD failed with (errno: 0 - Success).
Traceback (most recent call last):
File "examples/finetuning.py", line 8, in
GPU Usage:
Every time, the first GPU of the master node uses only about 3 MiB of memory until the job crashes.
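In case it helps others reproduce the observation above, per-GPU memory can be polled on each node with a standard nvidia-smi query while the shards load (purely a monitoring sketch, not part of the original report):

# Poll per-GPU memory usage every 5 seconds on each node while the model loads
nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv -l 5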
Env
cuda-python            11.7.0+0.g95a2041.dirty
cupy-cuda118           11.0.0
dask-cuda              22.10.0a0+23.g62a1ee8
nvidia-dali-cuda110    1.20.0
pytorch-quantization   2.1.2
pytorch-triton         2.1.0+6e4932cda8
torch                  2.2.0.dev20231116+cu118
torch-tensorrt         1.3.0a0
torchaudio             2.2.0.dev20231116+cu118
torchdata              0.6.1
torchtext              0.15.2+cpu
torchvision            0.17.0.dev20231116+cu118
transformers           4.35.0
Training script
export NCCL_IB_HCA=mlx5
export NCCL_IB_TC=136
export NCCL_IB_SL=5
export NCCL_IB_GID_INDEX=3
export NCCL_SOCKET_IFNAME=bond0
export NCCL_DEBUG=INFO
...
cd /llama-recipes
torchrun --nproc_per_node=${KUBERNETES_CONTAINER_RESOURCE_GPU} \
    --master_addr=${MASTER_ADDR} \
    --master_port=${MASTER_PORT} \
    --nnodes=${WORLD_SIZE} \
    --node_rank=${RANK} \
    examples/finetuning.py \
    --enable_fsdp \
    --low_cpu_fsdp \
    --fsdp_config.pure_bf16 \
    --model_name /airoboros-l2-70b-2.1 \
    --batch_size_training 1 \
    --dist_checkpoint_root_folder /checkpoints \
    --dist_checkpoint_folder fine-tuned \
    --dataset "alpaca_dataset" 2>&1 | tee t44_lr.log
Has anyone else encountered a similar problem? Do you know what might be causing this? Thanks.
@yguo33 I wonder if you run into the same issue with a slurm script as well?
I ran on a single node with 16x A100-40GB and hit the same issue:
torch.distributed.DistBackendError: [9] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Socket Timeout
@HamidShojanazeri, I run with Slurm on a compute cluster with 4 nodes (8x A100 each) and face the same issue. Note that it happens for the 70B model with low_cpu_fsdp; it does not happen for the smaller 7B and 13B models (with low_cpu_fsdp).
Apparently, I needed to export the following NCCL env variable in my Slurm submission script:
export NCCL_ASYNC_ERROR_HANDLING=1
This fixed the NCCL socket timeout issue in my case.
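For reference, a minimal sketch of where such an export would sit in a Slurm submission script; the node count matches the setup described above, but the job directives, port, and trailing arguments are hypothetical placeholders, not from the original report:

#!/bin/bash
#SBATCH --nodes=4                 # 4 nodes x 8x A100, matching the setup above
#SBATCH --gpus-per-node=8

# Export before the launch command so every rank inherits it
export NCCL_ASYNC_ERROR_HANDLING=1

# First node in the allocation acts as the rendezvous host
head_node=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)

srun torchrun \
    --nnodes=$SLURM_NNODES \
    --nproc_per_node=8 \
    --rdzv_backend=c10d \
    --rdzv_endpoint=${head_node}:29500 \
    examples/finetuning.py --enable_fsdp --low_cpu_fsdp ...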
@avanindra thanks for the update, @giaosudau @yguo33 does that work for you too?
@HamidShojanazeri, this still happens to me even after I use export NCCL_ASYNC_ERROR_HANDLING=1.
Hi @HamidShojanazeri
I am also seeing this issue. I have tried both export NCCL_ASYNC_ERROR_HANDLING=1
and export TORCH_NCCL_ASYNC_ERROR_HANDLING=1
but I still get the error:
torch.distributed.DistBackendError: [14] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Socket Timeout
Any thoughts?
@tginart can you please share a repro: your command, your env (some specifications), and the GPU type?
@yguo33 hello, I have the same issue. Has yours been solved?
set "CUDA_DEVICE_MAX_CONNECTIONS" to 32 maybe you need in environment. pls have a try @yguo33 @gonggaohan @tginart
set "CUDA_DEVICE_MAX_CONNECTIONS" to 32 maybe you need in environment. pls have a try @yguo33 @gonggaohan @tginart
When I set CUDA_DEVICE_MAX_CONNECTIONS to 32, it raises a new error: RuntimeError: Using sequence parallelism requires setting the environment variable CUDA_DEVICE_MAX_CONNECTIONS to 1
@tginart please let us know if you would still be interested in sharing some more details for us to repro. Thanks!