
NCCL error during saving checkpoint with ds zero3

rubickkcibur opened this issue 1 year ago (status: Open)

System Info

Copy-and-paste the text below in your GitHub issue

- `Accelerate` version: 0.30.1
- Platform: Linux-5.4.0-177-generic-x86_64-with-glibc2.31
- `accelerate` bash location: /home/rubickjiang/anaconda3/envs/deepspeed/bin/accelerate
- Python version: 3.10.14
- Numpy version: 1.26.4
- PyTorch version (GPU?): 2.3.1 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- PyTorch MLU available: False
- System RAM: 503.59 GB
- GPU type: NVIDIA GeForce RTX 4090
- `Accelerate` default config:
        - compute_environment: LOCAL_MACHINE
        - distributed_type: DEEPSPEED
        - use_cpu: False
        - debug: False
        - num_processes: 8
        - machine_rank: 0
        - num_machines: 1
        - rdzv_backend: static
        - same_network: True
        - main_training_function: main
        - enable_cpu_affinity: False
        - deepspeed_config: {'deepspeed_config_file': '/home/rubickjiang/test/ds_config.json', 'zero3_init_flag': True}
        - downcast_bf16: no
        - tpu_use_cluster: False
        - tpu_use_sudo: False
        - tpu_env: []

Information

  • [ ] The official example scripts
  • [X] My own modified scripts

Tasks

  • [ ] One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • [X] My own task or dataset (give details below)

Reproduction

I am using DeepSpeed ZeRO-3 to train my model. The accelerate config is default_config.txt and the DeepSpeed config is ds_config.json.

I build the trainer with

    trainer = MyTrainer(
        modelL,
        training_args,
        train_dataset=train_dataset,
    )

and call trainer.train(). I start the script with accelerate launch my_script.py and set the following environment variables:

    export NCCL_DEBUG=INFO
    export NCCL_P2P_DISABLE=1
    export NCCL_IB_DISABLE=1
    export TORCH_NCCL_ENABLE_MONITORING=0

Training runs fine until one epoch finishes, but it gets stuck when trying to save a checkpoint. After a long wait, I got the error message below:

[screenshot of the NCCL error message]
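For reference, here is a minimal sketch of how the pieces above fit together. MyTrainer is my own Trainer subclass, and modelL, training_args and train_dataset are built in my real script, so the model id, dataset and argument values below are placeholders only:

```python
# Minimal sketch of the setup described above (launched with: accelerate launch my_script.py).
# The checkpoint id, dataset and TrainingArguments values are placeholders, not the real ones.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments


class MyTrainer(Trainer):
    """My own subclass; the real version overrides loss computation, etc."""


modelL = AutoModelForCausalLM.from_pretrained("my-base-model")   # placeholder model id
train_dataset = load_dataset("my_dataset", split="train")        # placeholder dataset
training_args = TrainingArguments(
    output_dir="out",
    num_train_epochs=3,
    save_strategy="epoch",  # assumed; in my run the hang happens at the end-of-epoch save
)

trainer = MyTrainer(
    modelL,
    training_args,
    train_dataset=train_dataset,
)
trainer.train()
```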

Expected behavior

The checkpoint is saved successfully and training continues without hanging or NCCL timeouts.

rubickkcibur, Jul 04 '24

Is your model loaded on a single GPU? (I know that you are using DEEPSPEED stage 3, which shards model params across GPUs/nodes, but I just wanted to make sure.)

I would also suggest switching debug to true in the yaml config to increase your chances of pinpointing the problem.
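If editing the yaml is awkward, I believe the same debug mode can also be enabled from the environment (variable name taken from the Accelerate debugging docs, so please double-check it for your version):

```python
# Assumed equivalent of `debug: true` in the accelerate yaml config: enable
# Accelerate's operational debug mode via an environment variable before the
# Accelerator / process group is created. Verify the name for your version.
import os

os.environ["ACCELERATE_DEBUG_MODE"] = "1"
```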

NotTheStallion, Jul 04 '24

  1. No, I loaded my model on multiple GPUs.
  2. After switching debug to true, I did find more information:

[rank1]:[E ProcessGroupNCCL.cpp:1537] [PG 0 Rank 1] Timeout at NCCL work: 6075, last enqueued NCCL work: 6139, last completed NCCL work: 6074.
[rank1]:[E ProcessGroupNCCL.cpp:577] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E ProcessGroupNCCL.cpp:583] [Rank 1] To avoid data inconsistency, we are taking the entire process down.
[rank1]:[E ProcessGroupNCCL.cpp:1414] [PG 0 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=6075, OpType=_ALLGATHER_BASE, NumelIn=1366, NumelOut=4098, Timeout(ms)=1800000) ran for 2954391 milliseconds before timing out.
Exception raised from checkTimeout at /opt/conda/conda-bld/pytorch_1716905971132/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fc41835c897 in /home/rubickjiang/anaconda3/envs/deepspeed/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7fc419627d12 in /home/rubickjiang/anaconda3/envs/deepspeed/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7fc41962cb30 in /home/rubickjiang/anaconda3/envs/deepspeed/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fc41962de7c in /home/rubickjiang/anaconda3/envs/deepspeed/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xd3e95 (0x7fc479ba4e95 in /home/rubickjiang/anaconda3/envs/deepspeed/lib/python3.10/site-packages/torch/lib/../../../.././libstdc++.so.6)
frame #5: + 0x8609 (0x7fc48ddac609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7fc48db77353 in /lib/x86_64-linux-gnu/libc.so.6)
terminate called after throwing an instance of 'c10::DistBackendError'
  what(): [PG 0 Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=6075, OpType=_ALLGATHER_BASE, NumelIn=1366, NumelOut=4098, Timeout(ms)=1800000) ran for 2954391 milliseconds before timing out.
Exception raised from checkTimeout at /opt/conda/conda-bld/pytorch_1716905971132/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:565 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fc41835c897 in /home/rubickjiang/anaconda3/envs/deepspeed/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7fc419627d12 in /home/rubickjiang/anaconda3/envs/deepspeed/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x1a0 (0x7fc41962cb30 in /home/rubickjiang/anaconda3/envs/deepspeed/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fc41962de7c in /home/rubickjiang/anaconda3/envs/deepspeed/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xd3e95 (0x7fc479ba4e95 in /home/rubickjiang/anaconda3/envs/deepspeed/lib/python3.10/site-packages/torch/lib/../../../.././libstdc++.so.6)
frame #5: + 0x8609 (0x7fc48ddac609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7fc48db77353 in /lib/x86_64-linux-gnu/libc.so.6)
Exception raised from ncclCommWatchdog at /opt/conda/conda-bld/pytorch_1716905971132/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1418 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fc41835c897 in /home/rubickjiang/anaconda3/envs/deepspeed/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: + 0xe2bccb (0x7fc4192b1ccb in /home/rubickjiang/anaconda3/envs/deepspeed/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: + 0xd3e95 (0x7fc479ba4e95 in /home/rubickjiang/anaconda3/envs/deepspeed/lib/python3.10/site-packages/torch/lib/../../../.././libstdc++.so.6)
frame #3: + 0x8609 (0x7fc48ddac609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #4: clone + 0x43 (0x7fc48db77353 in /lib/x86_64-linux-gnu/libc.so.6)

But I still can't figure out why this happens.
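For reference, the collective that dies is an _ALLGATHER_BASE that hits the default 30-minute timeout (Timeout(ms)=1800000). One thing I may try is simply giving the checkpoint gather more time; a rough sketch, assuming the transformers ddp_timeout argument is the right knob here, with an arbitrary example value:

```python
# Rough sketch: raise the distributed collective timeout so a slow ZeRO-3
# checkpoint gather does not trip the 30-minute watchdog. ddp_timeout is
# forwarded to the process-group initialization; 7200 s is just an example.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="out",       # placeholder
    num_train_epochs=3,     # placeholder
    save_strategy="epoch",
    ddp_timeout=7200,       # seconds; the default 1800 matches Timeout(ms)=1800000 in the log
)
```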

rubickkcibur, Jul 08 '24

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot], Aug 03 '24

Did you solve this issue? I am loading my model on multiple GPUs with DeepSpeed ZeRO-3 and get the same error during checkpoint saving.

sasaadi, Nov 20 '24

I finally used FSDP and it works.

compute_environment: LOCAL_MACHINE
debug: false
distributed_type: FSDP
downcast_bf16: 'no'
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch: BACKWARD_PRE
  fsdp_cpu_ram_efficient_loading: true
  fsdp_forward_prefetch: false
  fsdp_offload_params: true
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_state_dict_type: SHARDED_STATE_DICT
  fsdp_sync_module_states: true
  fsdp_use_orig_params: false
machine_rank: $node_rank
main_process_ip: main-node
main_process_port: 5000
main_training_function: main
mixed_precision: 'no'
num_machines: 4
num_processes: 4
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

I used the configuration above. I can provide you with more details if needed.

NotTheStallion, Nov 21 '24