llama-recipes
NCCL communicator error: Socket Timeout when fine-tuning the 70B model on 2 x (8x A100 80GB)
When fine-tuning the 70B model, I always run into an error while loading the model. Usually, after loading 4 to 10 of the 15 shards, the following error occurs (see Error Message below). I am using two nodes, and the memory usage on the first GPU of the first node is always noticeably lower, as shown under GPU Usage below.
Error Message:
Warning: unknown parameter local_rank
Clearing GPU cache for all ranks
--> Running with torch dist debug set to detail
[rank14]:[W socket.cpp:432] [c10d] While waitForInput, poolFD failed with (errno: 0 - Success).
Traceback (most recent call last):
File "examples/finetuning.py", line 8, in
GPU Usage:
Every time, the first GPU of the master node uses only about 3 MiB of memory until the job crashes.
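In case it helps others reproduce the observation above, per-GPU memory can be polled on each node with a standard nvidia-smi query while the shards load (purely a monitoring sketch, not part of the original report):

# Poll per-GPU memory usage every 5 seconds on each node while the model loads
nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv -l 5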
Env
cuda-python            11.7.0+0.g95a2041.dirty
cupy-cuda118           11.0.0
dask-cuda              22.10.0a0+23.g62a1ee8
nvidia-dali-cuda110    1.20.0
pytorch-quantization   2.1.2
pytorch-triton         2.1.0+6e4932cda8
torch                  2.2.0.dev20231116+cu118
torch-tensorrt         1.3.0a0
torchaudio             2.2.0.dev20231116+cu118
torchdata              0.6.1
torchtext              0.15.2+cpu
torchvision            0.17.0.dev20231116+cu118
transformers           4.35.0
Training script
export NCCL_IB_HCA=mlx5
export NCCL_IB_TC=136
export NCCL_IB_SL=5
export NCCL_IB_GID_INDEX=3
export NCCL_SOCKET_IFNAME=bond0
export NCCL_DEBUG=INFO
...
cd /llama-recipes
torchrun --nproc_per_node=${KUBERNETES_CONTAINER_RESOURCE_GPU} \
    --master_addr=${MASTER_ADDR} \
    --master_port=${MASTER_PORT} \
    --nnodes=${WORLD_SIZE} \
    --node_rank=${RANK} \
    examples/finetuning.py \
    --enable_fsdp \
    --low_cpu_fsdp \
    --fsdp_config.pure_bf16 \
    --model_name /airoboros-l2-70b-2.1 \
    --batch_size_training 1 \
    --dist_checkpoint_root_folder /checkpoints \
    --dist_checkpoint_folder fine-tuned \
    --dataset "alpaca_dataset" 2>&1 | tee t44_lr.log
Has anyone else encountered a similar problem? Do you know what might be causing this? Thanks.
@yguo33 I wonder if you run into the same issue with a slurm script as well?
I ran on a single node with 16x A100-40GB and hit the same issue:
torch.distributed.DistBackendError: [9] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Socket Timeout
@HamidShojanazeri, I run with Slurm on a compute cluster with 4 nodes (8x A100 each) and face the same issue. Note that it happens for the 70B model with low_cpu_fsdp; it does not happen for the smaller 7B and 13B models (with low_cpu_fsdp).
Apparently, I needed to export the following NCCL env variable in my Slurm submission script:
export NCCL_ASYNC_ERROR_HANDLING=1
This fixed the NCCL socket timeout issue in my case.
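For reference, a minimal sketch of where such an export would sit in a Slurm submission script; the node count matches the setup described above, but the job directives, port, and trailing arguments are hypothetical placeholders, not from the original report:

#!/bin/bash
#SBATCH --nodes=4                 # 4 nodes x 8x A100, matching the setup above
#SBATCH --gpus-per-node=8

# Export before the launch command so every rank inherits it
export NCCL_ASYNC_ERROR_HANDLING=1

# First node in the allocation acts as the rendezvous host
head_node=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)

srun torchrun \
    --nnodes=$SLURM_NNODES \
    --nproc_per_node=8 \
    --rdzv_backend=c10d \
    --rdzv_endpoint=${head_node}:29500 \
    examples/finetuning.py --enable_fsdp --low_cpu_fsdp ...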
@avanindra thanks for the update, @giaosudau @yguo33 does that work for you too?
@HamidShojanazeri, this still happens to me even after I use export NCCL_ASYNC_ERROR_HANDLING=1.
Hi @HamidShojanazeri
I am also seeing this issue. I have tried both export NCCL_ASYNC_ERROR_HANDLING=1
and export TORCH_NCCL_ASYNC_ERROR_HANDLING=1
but I still get the error:
torch.distributed.DistBackendError: [14] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Socket Timeout
Any thoughts?
@tginart can you please share a repro: your command, your env (some specifications), and the GPU type?
@yguo33 hello, I have the same issue. Has yours been solved?
set "CUDA_DEVICE_MAX_CONNECTIONS" to 32 maybe you need in environment. pls have a try @yguo33 @gonggaohan @tginart
set "CUDA_DEVICE_MAX_CONNECTIONS" to 32 maybe you need in environment. pls have a try @yguo33 @gonggaohan @tginart
When I set CUDA_DEVICE_MAX_CONNECTIONS to 32, it raises a new error: RuntimeError: Using sequence parallelism requires setting the environment variable CUDA_DEVICE_MAX_CONNECTIONS to 1
@tginart please let us know if you would still be interested in sharing some more details for us to repro. Thanks!