Communication/NCCL failures training FSDP in multi-node environment with SLURM
System Info
Output of `accelerate env` (note: as shown below, this prints the DEFAULT accelerate config, not the exact config being used for this job):
- `Accelerate` version: 0.27.2
- Platform: Linux-5.15.0-1037-aws-x86_64-with-glibc2.17
- Python version: 3.8.18
- Numpy version: 1.24.4
- PyTorch version (GPU?): 2.2.0+cu121 (False)
- PyTorch XPU available: False
- PyTorch NPU available: False
- System RAM: 123.82 GB
- `Accelerate` default config:
- compute_environment: LOCAL_MACHINE
- distributed_type: FSDP
- mixed_precision: bf16
- use_cpu: False
- debug: False
- num_processes: 8
- machine_rank: 0
- num_machines: 2
- main_process_ip:
- main_process_port: 1234
- rdzv_backend: static
- same_network: True
- main_training_function: main
- fsdp_config: {'fsdp_auto_wrap_policy': 'TRANSFORMER_BASED_WRAP', 'fsdp_backward_prefetch': 'BACKWARD_PRE', 'fsdp_cpu_ram_efficient_loading': True, 'fsdp_forward_prefetch': False, 'fsdp_offload_params': False, 'fsdp_sharding_strategy': 'FULL_SHARD', 'fsdp_state_dict_type': 'FULL_STATE_DICT', 'fsdp_sync_module_states': True, 'fsdp_transformer_layer_cls_to_wrap': 'LlamaDecoderLayer', 'fsdp_use_orig_params': False}
- downcast_bf16: no
- tpu_use_cluster: False
- tpu_use_sudo: False
- tpu_env: []
Information
- [ ] The official example scripts
- [X] My own modified scripts
Tasks
- [ ] One of the scripts in the examples/ folder of Accelerate or an officially supported `no_trainer` script in the `examples` folder of the `transformers` repo (such as `run_no_trainer_glue.py`)
- [X] My own task or dataset (give details below)
Reproduction
I'm attempting to train a model with multi-node training using the SLURM scheduler, launching the job on 2 nodes with 8 GPUs each. My training script runs fine in a single-node environment with FSDP, and it also starts fine in the multi-node setting: I can see logging output from all 16 processes, the data is loaded, etc. However, once the script reaches the point where actual cross-node communication is required, the processes appear unable to communicate, and the run fails at accelerator.prepare().
Specifically, I see a stack trace containing these lines (the complete stack trace is below):
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1691, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.19.3
ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.
Note that it's possible I have misconfigured the accelerate config or the SLURM settings (task/node counts, etc.), but based on the example here, with the corresponding FSDP config here, this setup looks correct to me.
Any thoughts would be appreciated. I've tried lots of different configurations and tinkered with the environment to make sure the versions of PyTorch/NCCL/Accelerate are all compatible (a sketch of the NCCL-related settings in play is shown below).
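Since the "Last error" in the full trace below shows NCCL trying to connect over a `veth` interface, the settings that seem most relevant are NCCL's interface-selection knobs. A rough sketch of what I mean (not a confirmed fix; the interface names are placeholders and would need to match the NICs reported by `ip addr` on the compute nodes):

```bash
# NCCL debugging / interface-selection settings (sketch only, not a confirmed fix).
export NCCL_DEBUG=INFO              # log NCCL's interface and transport selection at init
export NCCL_DEBUG_SUBSYS=INIT,NET   # focus the logs on init and networking
# Keep NCCL off loopback/container interfaces; the leading ^ excludes the listed names.
# "docker0" and "veth" are placeholders for whatever virtual interfaces exist on the nodes.
export NCCL_SOCKET_IFNAME=^lo,docker0,veth
```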
Contents of fsdp_config_base.yaml I am using:
compute_environment: LOCAL_MACHINE
distributed_type: FSDP
downcast_bf16: 'no'
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch_policy: BACKWARD_PRE
  fsdp_forward_prefetch: false
  fsdp_offload_params: false
  fsdp_sharding_strategy: 1
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_sync_module_states: true
  fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer
  fsdp_use_orig_params: true
machine_rank: 0
main_training_function: main
num_machines: 1
num_processes: 8
mixed_precision: bf16
rdzv_backend: c10d
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
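To double-check which settings the launcher will actually read (as opposed to the default config that `accelerate env` printed in the System Info section above), I believe the environment report can be pointed at this file explicitly (a small sketch, using the path from the sbatch script below):

```bash
# Print the accelerate environment report against the job's config file
# instead of the default one (path taken from the sbatch script below).
accelerate env --config_file /admin/home-jpgard/rtfm/fsdp_config_base.yaml
```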
Relevant chunks of the sbatch script I am launching the job with:
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=10
#SBATCH --partition=a40x
#SBATCH --gpus-per-node=8
#SBATCH --exclusive
#SBATCH --time=04-00:00:00
#SBATCH --account=nextgends
#SBATCH --chdir=/admin/home-jpgard/rtfm
#SBATCH --output=/admin/home-jpgard/rtfm/slurm-out/%j.out
#SBATCH --err=/admin/home-jpgard/rtfm/slurm-out/%j.err
#SBATCH --exclude=ip-10-0-201-106,ip-10-0-202-154
################# code block adapted from https://gist.github.com/pacman100/1cb1f17b2f1b3139a63b764263e70b25
set -x -e
# force crashing on nccl issues like hanging broadcast
export NCCL_ASYNC_ERROR_HANDLING=1
export NCCL_DEBUG=INFO
GPUS_PER_NODE=8
NNODES=$SLURM_NNODES
NUM_PROCESSES=$(expr $NNODES \* $GPUS_PER_NODE)
# Function to parse and expand SLURM's 'bracketed' node notation in SLURM_JOB_NODELIST
expand_nodes() {
  # The input is something like "ip-10-0-231-[1,86]"
  local nodelist=$1
  # Strip the brackets so the prefix and the two numbers can be split apart
  local base=$(echo $nodelist | sed -E 's/\[([0-9]+),([0-9]+)\]/ \1 \2 /')
  # Read into array
  read -a parts <<< "$base"
  # Check if we have three parts: prefix, start, end
  if [ ${#parts[@]} -eq 3 ]; then
    local prefix=${parts[0]}
    local start=${parts[1]}
    local end=${parts[2]}
    # Generate sequence
    for i in $(seq $start $end); do
      echo "${prefix}${i}"
      return # Return after the first hostname to mimic head-node behavior
    done
  else
    # If the format does not include a range, just echo the input
    echo $nodelist
  fi
}
# Extract the first node name from SLURM_JOB_NODELIST
# This assumes the format "node-list: ip-10-0-209-157,ip-10-0-231-1" and extracts the first node name
echo "SLURM_JOB_NODELIST is $SLURM_JOB_NODELIST"
node_name=$(echo $SLURM_JOB_NODELIST | sed 's/node-list: //' | cut -d, -f1)
# Now, resolve this node name to an IP address
# Using getent ahosts (You can also use nslookup if getent does not work as expected)
MASTER_ADDR=$(getent ahosts $node_name | head -n 1 | awk '{print $1}')
# Check if we got an IP
if [ ! -z "$MASTER_ADDR" ]; then
echo "Head node IP: $MASTER_ADDR"
else
echo "Failed to resolve head node IP address"
# Extract the first node name from SLURM_JOB_NODELIST and expand if needed
node_name=$(expand_nodes $SLURM_JOB_NODELIST)
# Now, resolve this node name to an IP address using getent ahosts
MASTER_ADDR=$(getent ahosts $node_name | head -n 1 | awk '{print $1}')
echo "Head node IP after parsing: $MASTER_ADDR"
fi
MASTER_PORT=6999
# OTHER LAUNCHERS CAN BE USED HERE
export LAUNCHER="/admin/home-jpgard/miniconda3/envs/rtfm/bin/accelerate launch \
--config_file /admin/home-jpgard/rtfm/fsdp_config_base.yaml \
--num_processes $NUM_PROCESSES \
--main_process_ip $MASTER_ADDR \
--num_machines $NNODES \
--main_process_port $MASTER_PORT \
--machine_rank \$SLURM_PROCID \
"
echo "SLURM_JOB_ID is ${SLURM_JOB_ID}"
echo 'activating conda environment'
source /admin/home-jpgard/.bashrc
source /admin/home-jpgard/miniconda3/etc/profile.d/conda.sh
which conda
conda activate rtfm
which python
export PROGRAM="\
scripts/train.py \
--more-args-here
--bf16 True \
--use_amp \
"
export CMD="$LAUNCHER $PROGRAM"
echo "about to run ${CMD}"
/opt/slurm/bin/srun --jobid $SLURM_JOBID /usr/bin/bash -c "$CMD"
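For reference, SLURM can also expand the bracketed node list itself via scontrol, which would avoid the custom expand_nodes parsing above (a sketch under the same SLURM variables; this is not what the failing run used):

```bash
# Alternative head-node resolution letting SLURM expand the node list
# (sketch only; the failing run above used the expand_nodes function instead).
head_node=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
MASTER_ADDR=$(getent ahosts "$head_node" | head -n 1 | awk '{print $1}')
echo "Head node: $head_node ($MASTER_ADDR)"
```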
Full stack trace:
Traceback (most recent call last):
File "scripts/train.py", line 582, in <module>
main(
File "scripts/train.py", line 206, in main
model = accelerator.prepare(model)
File "/admin/home-jpgard/miniconda3/envs/rtfm/lib/python3.8/site-packages/accelerate/accelerator.py", line 1228, in prepare
result = tuple(
File "/admin/home-jpgard/miniconda3/envs/rtfm/lib/python3.8/site-packages/accelerate/accelerator.py", line 1229, in <genexpr>
self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
File "/admin/home-jpgard/miniconda3/envs/rtfm/lib/python3.8/site-packages/accelerate/accelerator.py", line 1105, in _prepare_one
return self.prepare_model(obj, device_placement=device_placement)
File "/admin/home-jpgard/miniconda3/envs/rtfm/lib/python3.8/site-packages/accelerate/accelerator.py", line 1387, in prepare_model
model = FSDP(model, **kwargs)
File "/admin/home-jpgard/miniconda3/envs/rtfm/lib/python3.8/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 477, in __init__
_auto_wrap(
File "/admin/home-jpgard/miniconda3/envs/rtfm/lib/python3.8/site-packages/torch/distributed/fsdp/_wrap_utils.py", line 101, in _auto_wrap
_recursive_wrap(**recursive_wrap_kwargs, **root_kwargs) # type: ignore[arg-type]
File "/admin/home-jpgard/miniconda3/envs/rtfm/lib/python3.8/site-packages/torch/distributed/fsdp/wrap.py", line 543, in _recursive_wrap
wrapped_child, num_wrapped_params = _recursive_wrap(
File "/admin/home-jpgard/miniconda3/envs/rtfm/lib/python3.8/site-packages/torch/distributed/fsdp/wrap.py", line 543, in _recursive_wrap
wrapped_child, num_wrapped_params = _recursive_wrap(
File "/admin/home-jpgard/miniconda3/envs/rtfm/lib/python3.8/site-packages/torch/distributed/fsdp/wrap.py", line 543, in _recursive_wrap
wrapped_child, num_wrapped_params = _recursive_wrap(
File "/admin/home-jpgard/miniconda3/envs/rtfm/lib/python3.8/site-packages/torch/distributed/fsdp/wrap.py", line 561, in _recursive_wrap
return _wrap(module, wrapper_cls, **kwargs), nonwrapped_numel
File "/admin/home-jpgard/miniconda3/envs/rtfm/lib/python3.8/site-packages/torch/distributed/fsdp/wrap.py", line 490, in _wrap
return wrapper_cls(module, **kwargs)
File "/admin/home-jpgard/miniconda3/envs/rtfm/lib/python3.8/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 503, in __init__
_init_param_handle_from_module(
File "/admin/home-jpgard/miniconda3/envs/rtfm/lib/python3.8/site-packages/torch/distributed/fsdp/_init_utils.py", line 587, in _init_param_handle_from_module
_sync_module_params_and_buffers(
File "/admin/home-jpgard/miniconda3/envs/rtfm/lib/python3.8/site-packages/torch/distributed/fsdp/_init_utils.py", line 1068, in _sync_module_params_and_buffers
_sync_params_and_buffers(
File "/admin/home-jpgard/miniconda3/envs/rtfm/lib/python3.8/site-packages/torch/distributed/utils.py", line 303, in _sync_params_and_buffers
dist._broadcast_coalesced(
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1691, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.19.3
ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.
Last error:
socketStartConnect: Connect to fe80::a849:bbff:fe73:19bd%veth8299cd8<54785> failed : Software caused connection abort
Traceback (most recent call last):
File "scripts/train.py", line 582, in <module>
main(
File "scripts/train.py", line 206, in main
model = accelerator.prepare(model)
File "/admin/home-jpgard/miniconda3/envs/rtfm/lib/python3.8/site-packages/accelerate/accelerator.py", line 1228, in prepare
result = tuple(
File "/admin/home-jpgard/miniconda3/envs/rtfm/lib/python3.8/site-packages/accelerate/accelerator.py", line 1229, in <genexpr>
self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
File "/admin/home-jpgard/miniconda3/envs/rtfm/lib/python3.8/site-packages/accelerate/accelerator.py", line 1105, in _prepare_one
return self.prepare_model(obj, device_placement=device_placement)
File "/admin/home-jpgard/miniconda3/envs/rtfm/lib/python3.8/site-packages/accelerate/accelerator.py", line 1387, in prepare_model
model = FSDP(model, **kwargs)
File "/admin/home-jpgard/miniconda3/envs/rtfm/lib/python3.8/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 477, in __init__
_auto_wrap(
File "/admin/home-jpgard/miniconda3/envs/rtfm/lib/python3.8/site-packages/torch/distributed/fsdp/_wrap_utils.py", line 101, in _auto_wrap
_recursive_wrap(**recursive_wrap_kwargs, **root_kwargs) # type: ignore[arg-type]
File "/admin/home-jpgard/miniconda3/envs/rtfm/lib/python3.8/site-packages/torch/distributed/fsdp/wrap.py", line 543, in _recursive_wrap
wrapped_child, num_wrapped_params = _recursive_wrap(
File "/admin/home-jpgard/miniconda3/envs/rtfm/lib/python3.8/site-packages/torch/distributed/fsdp/wrap.py", line 543, in _recursive_wrap
wrapped_child, num_wrapped_params = _recursive_wrap(
File "/admin/home-jpgard/miniconda3/envs/rtfm/lib/python3.8/site-packages/torch/distributed/fsdp/wrap.py", line 543, in _recursive_wrap
wrapped_child, num_wrapped_params = _recursive_wrap(
File "/admin/home-jpgard/miniconda3/envs/rtfm/lib/python3.8/site-packages/torch/distributed/fsdp/wrap.py", line 561, in _recursive_wrap
return _wrap(module, wrapper_cls, **kwargs), nonwrapped_numel
File "/admin/home-jpgard/miniconda3/envs/rtfm/lib/python3.8/site-packages/torch/distributed/fsdp/wrap.py", line 490, in _wrap
return wrapper_cls(module, **kwargs)
File "/admin/home-jpgard/miniconda3/envs/rtfm/lib/python3.8/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 503, in __init__
_init_param_handle_from_module(
File "/admin/home-jpgard/miniconda3/envs/rtfm/lib/python3.8/site-packages/torch/distributed/fsdp/_init_utils.py", line 587, in _init_param_handle_from_module
_sync_module_params_and_buffers(
File "/admin/home-jpgard/miniconda3/envs/rtfm/lib/python3.8/site-packages/torch/distributed/fsdp/_init_utils.py", line 1068, in _sync_module_params_and_buffers
_sync_params_and_buffers(
File "/admin/home-jpgard/miniconda3/envs/rtfm/lib/python3.8/site-packages/torch/distributed/utils.py", line 303, in _sync_params_and_buffers
dist._broadcast_coalesced(
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1691, unhandled system error (run with NCCL_DEBUG=INFO for details), NCCL version 2.19.3
ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error.
Last error:
/admin/home-jpgard/miniconda3/envs/rtfm/lib/python3.8/site-packages/accelerate/utils/launch.py:192: FutureWarning: `fsdp_backward_prefetch_policy` is deprecated and will be removed in version 0.27.0 of 🤗 Accelerate. Use `fsdp_backward_prefetch` instead
warnings.warn(
[2024-02-16 02:42:00,874] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 4025500 closing signal SIGTERM
[2024-02-16 02:42:00,875] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 4025502 closing signal SIGTERM
[2024-02-16 02:42:00,875] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 4025503 closing signal SIGTERM
[2024-02-16 02:42:00,875] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 4025504 closing signal SIGTERM
[2024-02-16 02:42:00,876] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 4025505 closing signal SIGTERM
[2024-02-16 02:42:00,876] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 4025506 closing signal SIGTERM
[2024-02-16 02:42:00,876] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 4025507 closing signal SIGTERM
/admin/home-jpgard/miniconda3/envs/rtfm/lib/python3.8/site-packages/accelerate/utils/launch.py:192: FutureWarning: `fsdp_backward_prefetch_policy` is deprecated and will be removed in version 0.27.0 of 🤗 Accelerate. Use `fsdp_backward_prefetch` instead
warnings.warn(
[2024-02-16 02:42:00,885] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1267868 closing signal SIGTERM
[2024-02-16 02:42:00,885] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1267870 closing signal SIGTERM
[2024-02-16 02:42:00,886] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1267872 closing signal SIGTERM
[2024-02-16 02:42:00,886] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1267873 closing signal SIGTERM
[2024-02-16 02:42:00,886] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1267874 closing signal SIGTERM
[2024-02-16 02:42:00,886] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1267875 closing signal SIGTERM
[2024-02-16 02:42:00,886] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1267876 closing signal SIGTERM
[2024-02-16 02:42:03,298] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 1 (pid: 1267869) of binary: /admin/home-jpgard/miniconda3/envs/rtfm/bin/python
Traceback (most recent call last):
File "/admin/home-jpgard/miniconda3/envs/rtfm/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/admin/home-jpgard/miniconda3/envs/rtfm/lib/python3.8/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
args.func(args)
File "/admin/home-jpgard/miniconda3/envs/rtfm/lib/python3.8/site-packages/accelerate/commands/launch.py", line 1010, in launch_command
multi_gpu_launcher(args)
File "/admin/home-jpgard/miniconda3/envs/rtfm/lib/python3.8/site-packages/accelerate/commands/launch.py", line 672, in multi_gpu_launcher
distrib_run.run(args)
File "/admin/home-jpgard/miniconda3/envs/rtfm/lib/python3.8/site-packages/torch/distributed/run.py", line 803, in run
elastic_launch(
File "/admin/home-jpgard/miniconda3/envs/rtfm/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/admin/home-jpgard/miniconda3/envs/rtfm/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
scripts/train.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-02-16_02:42:00
host : ip-10-0-209-157.us-west-2.compute.internal
rank : 9 (local_rank: 1)
exitcode : 1 (pid: 1267869)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
Expected behavior
I expect training to work in the distributed setting just as it does in the single-node setting.
A couple of additional comments:
- I've tried using torchrun as the launcher instead (with `export LAUNCHER="NCCL_DEBUG=INFO torchrun --nproc_per_node=$GPUS_PER_NODE --nnodes=$NNODES --node_rank=\$SLURM_PROCID --master_addr=$MASTER_ADDR --master_port=$MASTER_PORT"`), but it raises the same error.
- I am not 100% clear on whether NUM_PROCESSES should be set to 8 (the number of GPUs per node) or 16 (the total number of processes); I can't find this clearly documented for Accelerate either, though I may have missed something. The docs say "The total number of processes to be launched in parallel", but I suppose this could mean in parallel on one node, or across all nodes (see the expanded command sketch below).
What kind of GPUs are these?
They are 40GB A100s, in nodes of 8
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Hi, I have the same problem. Did you find any way to fix it?
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.