No effect from InitProcessGroupKwargs timeout
Follow-up from https://github.com/huggingface/accelerate/issues/2236#issuecomment-1984197082 cc @muellerzr
I'll copy the main text from there; there are some more details in the discussion.
System Info
- `Accelerate` version: 0.23.0
- Platform: Linux-6.2.0-37-generic-x86_64-with-glibc2.35
- Python version: 3.10.13
- Numpy version: 1.26.2
- PyTorch version (GPU?): 2.1.1+cu121 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- System RAM: 62.65 GB
- GPU type: NVIDIA RTX 6000 Ada Generation
- `Accelerate` default config: Not found
Reproduction
- Follow the instructions from https://github.com/huggingface/alignment-handbook/tree/main/scripts and install the environment to run LoRA SFT training.
- Change the timeout to 3 hours (a minimal sketch of this setup is shown after the traceback below):
accelerator = Accelerator(kwargs_handlers=[InitProcessGroupKwargs(timeout=timedelta(seconds=6 * 1800))])
and run the training.
- Get a crash due to timeout: https://wandb.ai/evgeniizh/huggingface/runs/pskgg48d
[E ProcessGroupNCCL.cpp:475] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1124292, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800584 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:489] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:495] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:916] [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1124292, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800584 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1124292, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800584 milliseconds before timing out.
[2023-12-09 08:46:08,664] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 54784 closing signal SIGTERM
[2023-12-09 08:46:11,834] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -6) local_rank: 1 (pid: 54785) of binary: /home/evgenii/.conda/envs/handbook/bin/python
Traceback (most recent call last):
File "/home/evgenii/.conda/envs/handbook/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/home/evgenii/.conda/envs/handbook/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
args.func(args)
File "/home/evgenii/.conda/envs/handbook/lib/python3.10/site-packages/accelerate/commands/launch.py", line 971, in launch_command
deepspeed_launcher(args)
File "/home/evgenii/.conda/envs/handbook/lib/python3.10/site-packages/accelerate/commands/launch.py", line 687, in deepspeed_launcher
distrib_run.run(args)
File "/home/evgenii/.conda/envs/handbook/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
elastic_launch(
File "/home/evgenii/.conda/envs/handbook/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/evgenii/.conda/envs/handbook/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
======================================================
scripts/run_sft.py FAILED
Note that the timeout is still 1800 seconds (see also https://github.com/huggingface/alignment-handbook/issues/59).
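For reference, here is a minimal sketch of the setup described above (assuming the SFT script itself constructs the Accelerator; variable names are illustrative):

```python
from datetime import timedelta

from accelerate import Accelerator
from accelerate.utils import InitProcessGroupKwargs

# Intended to raise the NCCL collective timeout from the default 1800 s to 3 hours.
timeout_kwargs = InitProcessGroupKwargs(timeout=timedelta(seconds=6 * 1800))  # 10800 s = 3 h
accelerator = Accelerator(kwargs_handlers=[timeout_kwargs])
```

InitProcessGroupKwargs is the documented way to customize how the distributed process group is initialized from Accelerate, so the expectation is that this 3 h value replaces the default 1800 s, yet the watchdog above still reports Timeout(ms)=1800000.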
Expected behavior
The timeout is increased and there is no crash.
Actually, @Randl, at what point in your code are you doing this?
accelerator = Accelerator(kwargs_handlers=[InitProcessGroupKwargs(timeout=timedelta(seconds=6 * 1800))])
And what is your full code?
(I still think it may be a TRL issue, but I need that information to be 100% sure.)
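To illustrate why the construction point matters (this is a guess at the failure mode, not a confirmed diagnosis): the timeout in InitProcessGroupKwargs can only take effect if the Accelerator is the thing that actually initializes the process group. A rough sketch of that check (build_accelerator is just an illustrative helper name):

```python
# Sketch (assumption, not a confirmed diagnosis): if the process group was already
# initialized earlier in the script or by another library with the default 1800 s,
# the kwargs passed below are effectively a no-op for the timeout.
from datetime import timedelta

import torch.distributed as dist
from accelerate import Accelerator
from accelerate.utils import InitProcessGroupKwargs


def build_accelerator() -> Accelerator:  # illustrative helper name
    if dist.is_available() and dist.is_initialized():
        # Too late: the existing process group keeps whatever timeout it was created with.
        print("Process group already initialized; the timeout kwarg will not apply.")
    return Accelerator(
        kwargs_handlers=[InitProcessGroupKwargs(timeout=timedelta(hours=3))]
    )
```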
I may have found the solution.
@Randl can you try again (I know it'll take a while to run), installing transformers via pip install git+https://github.com/huggingface/transformers@muellerzr-fix-timeout?
Finally narrowed it down.
I'll update you when I run it.
@Randl were you able to try it out? 🤗
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Wondering if this was addressed?
@thepowerfuldeez Wondering how this was addressed?