Problem in dpo_llama2.py? NCCL timeout?
I use the same code as examples/research_projects/stack_llama_2/scripts/dpo_llama2.py, but I always run into a problem while the dataset is being mapped in DPOTrainer:
Map: 17%|███████████████████████████████████████████ | 281398/1652614 [09:56<53:07, 430.16 examples/s][rank1]:[W socket.cpp:432] [c10d] While waitForInput, poolFD failed with (errno: 0 - Success).
Traceback (most recent call last):
File "/data1/djl/dpo/dpo_qwen2.py", line 245, in <module>
dpo_trainer = DPOTrainer(
^^^^^^^^^^^
File "/home/djl/.conda/envs/dpo/lib/python3.11/site-packages/huggingface_hub/utils/_deprecation.py", line 101, in inner_f
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/home/djl/.conda/envs/dpo/lib/python3.11/site-packages/trl/trainer/dpo_trainer.py", line 527, in __init__
with PartialState().local_main_process_first():
File "/home/djl/.conda/envs/dpo/lib/python3.11/contextlib.py", line 137, in __enter__
return next(self.gen)
^^^^^^^^^^^^^^
File "/home/djl/.conda/envs/dpo/lib/python3.11/site-packages/accelerate/state.py", line 523, in local_main_process_first
yield from self._goes_first(self.is_local_main_process)
File "/home/djl/.conda/envs/dpo/lib/python3.11/site-packages/accelerate/state.py", line 385, in _goes_first
self.wait_for_everyone()
File "/home/djl/.conda/envs/dpo/lib/python3.11/site-packages/accelerate/state.py", line 379, in wait_for_everyone
torch.distributed.barrier()
File "/home/djl/.conda/envs/dpo/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 72, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/djl/.conda/envs/dpo/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 3439, in barrier
work = default_pg.barrier(opts=opts)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.distributed.DistBackendError: [1] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: Socket Timeout
Exception raised from doWait at ../torch/csrc/distributed/c10d/TCPStore.cpp:550 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fed7d519d87 in /home/djl/.conda/envs/dpo/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x15c0e57 (0x7fedb11fbe57 in /home/djl/.conda/envs/dpo/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #2: c10d::TCPStore::doGet(std::string const&) + 0x32 (0x7fedb54cace2 in /home/djl/.conda/envs/dpo/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #3: c10d::TCPStore::get(std::string const&) + 0xa1 (0x7fedb54cbb11 in /home/djl/.conda/envs/dpo/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #4: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fedb5480f81 in /home/djl/.conda/envs/dpo/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #5: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fedb5480f81 in /home/djl/.conda/envs/dpo/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #6: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fedb5480f81 in /home/djl/.conda/envs/dpo/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #7: c10d::PrefixStore::get(std::string const&) + 0x31 (0x7fedb5480f81 in /home/djl/.conda/envs/dpo/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #8: c10d::ProcessGroupNCCL::broadcastUniqueNCCLID(ncclUniqueId*, bool, std::string const&, int) + 0xa9 (0x7fed7e6c1c69 in /home/djl/.conda/envs/dpo/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #9: c10d::ProcessGroupNCCL::getNCCLComm(std::string const&, std::vector<c10::Device, std::allocator<c10::Device> > const&, c10d::OpType, int, bool) + 0x22b (0x7fed7e6c8c5b in /home/djl/.conda/envs/dpo/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #10: <unknown function> + 0x10ad03d (0x7fed7e6d203d in /home/djl/.conda/envs/dpo/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #11: c10d::ProcessGroupNCCL::allreduce_impl(std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::AllreduceOptions const&) + 0x21 (0x7fed7e6d38e1 in /home/djl/.conda/envs/dpo/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #12: c10d::ProcessGroupNCCL::allreduce(std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::AllreduceOptions const&) + 0x3bf (0x7fed7e6d58ff in /home/djl/.conda/envs/dpo/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #13: c10d::ProcessGroupNCCL::barrier(c10d::BarrierOptions const&) + 0xb0e (0x7fed7e6e4d4e in /home/djl/.conda/envs/dpo/lib/python3.11/site-packages/torch/lib/libtorch_cuda.so)
frame #14: <unknown function> + 0x5838a22 (0x7fedb5473a22 in /home/djl/.conda/envs/dpo/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #15: <unknown function> + 0x5843740 (0x7fedb547e740 in /home/djl/.conda/envs/dpo/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #16: <unknown function> + 0x5843845 (0x7fedb547e845 in /home/djl/.conda/envs/dpo/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #17: <unknown function> + 0x4e893cc (0x7fedb4ac43cc in /home/djl/.conda/envs/dpo/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #18: <unknown function> + 0x1a08a88 (0x7fedb1643a88 in /home/djl/.conda/envs/dpo/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #19: <unknown function> + 0x584ce54 (0x7fedb5487e54 in /home/djl/.conda/envs/dpo/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #20: <unknown function> + 0x584dc05 (0x7fedb5488c05 in /home/djl/.conda/envs/dpo/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #21: <unknown function> + 0xc9e478 (0x7fedc7d37478 in /home/djl/.conda/envs/dpo/lib/python3.11/site-packages/torch/lib/libtorch_python.so)
frame #22: <unknown function> + 0x416234 (0x7fedc74af234 in /home/djl/.conda/envs/dpo/lib/python3.11/site-packages/torch/lib/libtorch_python.so)
frame #23: /home/djl/.conda/envs/dpo/bin/python() [0x528767]
frame #24: _PyObject_MakeTpCall + 0x26c (0x5041ac in /home/djl/.conda/envs/dpo/bin/python)
frame #25: _PyEval_EvalFrameDefault + 0x6a7 (0x5116e7 in /home/djl/.conda/envs/dpo/bin/python)
frame #26: _PyFunction_Vectorcall + 0x173 (0x538cc3 in /home/djl/.conda/envs/dpo/bin/python)
frame #27: _PyEval_EvalFrameDefault + 0x47a9 (0x5157e9 in /home/djl/.conda/envs/dpo/bin/python)
frame #28: /home/djl/.conda/envs/dpo/bin/python() [0x5e079a]
frame #29: _PyEval_EvalFrameDefault + 0x32c8 (0x514308 in /home/djl/.conda/envs/dpo/bin/python)
frame #30: /home/djl/.conda/envs/dpo/bin/python() [0x5a2c87]
frame #31: _PyEval_EvalFrameDefault + 0x146a (0x5124aa in /home/djl/.conda/envs/dpo/bin/python)
frame #32: /home/djl/.conda/envs/dpo/bin/python() [0x557caf]
frame #33: _PyEval_EvalFrameDefault + 0x4b32 (0x515b72 in /home/djl/.conda/envs/dpo/bin/python)
frame #34: _PyFunction_Vectorcall + 0x173 (0x538cc3 in /home/djl/.conda/envs/dpo/bin/python)
frame #35: PyObject_Call + 0x12c (0x542bec in /home/djl/.conda/envs/dpo/bin/python)
frame #36: _PyEval_EvalFrameDefault + 0x47a9 (0x5157e9 in /home/djl/.conda/envs/dpo/bin/python)
frame #37: _PyFunction_Vectorcall + 0x173 (0x538cc3 in /home/djl/.conda/envs/dpo/bin/python)
frame #38: /home/djl/.conda/envs/dpo/bin/python() [0x5400d2]
frame #39: _PyObject_MakeTpCall + 0x233 (0x504173 in /home/djl/.conda/envs/dpo/bin/python)
frame #40: _PyEval_EvalFrameDefault + 0x6a7 (0x5116e7 in /home/djl/.conda/envs/dpo/bin/python)
frame #41: /home/djl/.conda/envs/dpo/bin/python() [0x5cbeda]
frame #42: PyEval_EvalCode + 0x9f (0x5cb5af in /home/djl/.conda/envs/dpo/bin/python)
frame #43: /home/djl/.conda/envs/dpo/bin/python() [0x5ec6a7]
frame #44: /home/djl/.conda/envs/dpo/bin/python() [0x5e8240]
frame #45: /home/djl/.conda/envs/dpo/bin/python() [0x5fd192]
frame #46: _PyRun_SimpleFileObject + 0x19f (0x5fc55f in /home/djl/.conda/envs/dpo/bin/python)
frame #47: _PyRun_AnyFileObject + 0x43 (0x5fc283 in /home/djl/.conda/envs/dpo/bin/python)
frame #48: Py_RunMain + 0x2ee (0x5f6efe in /home/djl/.conda/envs/dpo/bin/python)
frame #49: Py_BytesMain + 0x39 (0x5bbc79 in /home/djl/.conda/envs/dpo/bin/python)
frame #50: <unknown function> + 0x2d210 (0x7fedc9334210 in /usr/lib64/libc.so.6)
frame #51: __libc_start_main + 0x7c (0x7fedc93342bc in /usr/lib64/libc.so.6)
frame #52: /home/djl/.conda/envs/dpo/bin/python() [0x5bbac3]
. This may indicate a possible application crash on rank 0 or a network set up issue.
Map: 17%|███████████████████████████████████████████▎ | 283195/1652614 [10:00<42:41, 534.67 examples/s][2024-08-06 10:15:05,584] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 340536 closing signal SIGTERM
[2024-08-06 10:15:08,253] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 1 (pid: 340537) of binary: /home/djl/.conda/envs/dpo/bin/python
Traceback (most recent call last):
File "/home/djl/.conda/envs/dpo/bin/accelerate", line 8, in <module>
sys.exit(main())
^^^^^^
File "/home/djl/.conda/envs/dpo/lib/python3.11/site-packages/accelerate/commands/accelerate_cli.py", line 46, in main
args.func(args)
File "/home/djl/.conda/envs/dpo/lib/python3.11/site-packages/accelerate/commands/launch.py", line 1066, in launch_command
multi_gpu_launcher(args)
File "/home/djl/.conda/envs/dpo/lib/python3.11/site-packages/accelerate/commands/launch.py", line 711, in multi_gpu_launcher
distrib_run.run(args)
File "/home/djl/.conda/envs/dpo/lib/python3.11/site-packages/torch/distributed/run.py", line 803, in run
elastic_launch(
File "/home/djl/.conda/envs/dpo/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/djl/.conda/envs/dpo/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
dpo_qwen2.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-08-06_10:15:05
host : a62
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 340537)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
I also have this problem. How can I solve it?
I have the same problem, and increasing the NCCL timeout threshold works for me. Call this once at the top of the script, before the DPOTrainer is constructed, so the longer timeout covers the barrier around dataset mapping:
import torch.distributed as dist
from datetime import timedelta
# Raise the default NCCL timeout to 2 hours so rank 0 can finish the slow
# dataset .map() while the other ranks wait at the barrier.
dist.init_process_group(backend='nccl', init_method='env://', timeout=timedelta(hours=2))
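If you would rather not call init_process_group yourself when launching with accelerate, the timeout can usually be raised through the training arguments instead. A minimal sketch, assuming a recent transformers version (TrainingArguments exposes ddp_timeout, in seconds) and placeholder values you would adapt to your own script:
from transformers import TrainingArguments
# ddp_timeout is in seconds; 7200 s matches the 2-hour timedelta above.
# Recent TRL versions' DPOConfig subclasses TrainingArguments, so the same
# field is available there.
training_args = TrainingArguments(
    output_dir="./dpo_output",  # placeholder output directory
    ddp_timeout=7200,
)
# Alternative when constructing an Accelerator directly:
# from datetime import timedelta
# from accelerate import Accelerator, InitProcessGroupKwargs
# accelerator = Accelerator(
#     kwargs_handlers=[InitProcessGroupKwargs(timeout=timedelta(hours=2))]
# )
Either way the longer timeout only hides the slow .map(); pre-tokenizing or caching the processed dataset once beforehand avoids the long barrier entirely.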
Thanks! Also works for me.