
ProcessGroupBabyNCCL - EOFError

Open btian opened this issue 1 month ago • 0 comments

I saw this error after replacing ProcessGroupNCCL with ProcessGroupBabyNCCL in train_ddp.py.

ProcessGroupNCCL and ProcessGroupGloo work fine. How can I debug this?

ERROR:torchft.manager:[<replica_id>/0 - step 0] got exception in future -- skipping remaining: Got the following error when running the callback: AssertionError: <EMPTY MESSAGE>
  | 14:21:16.477-07:00 |  
  | 14:21:16.477-07:00 | At:
  | 14:21:16.477-07:00 | /usr/local/lib/python3.12/dist-packages/torchft/manager.py(1183): callback
  | 14:21:16.477-07:00 | /usr/lib/python3.12/threading.py(1010): run
  | 14:21:16.477-07:00 | /usr/lib/python3.12/threading.py(1030): _bootstrap
  | 14:21:16.477-07:00 | File "/usr/local/lib/python3.12/dist-packages/torchft/manager.py", line 508, in callback
  | 14:21:16.477-07:00 | return fut.value()
  | 14:21:16.477-07:00 | ^^^^^^^^^^^
  | 14:21:16.477-07:00 | File "/usr/local/lib/python3.12/dist-packages/torch/futures/__init__.py", line 102, in value
  | 14:21:16.477-07:00 | return super().value()
  | 14:21:16.477-07:00 | ^^^^^^^^^^^^^^^
  | 14:21:16.477-07:00 | File "/usr/local/lib/python3.12/dist-packages/torch/futures/__init__.py", line 275, in raise_error
  | 14:21:16.477-07:00 | raise fut_result
  | 14:21:16.477-07:00 | timed_fut.set_result(fut.wait())
  | 14:21:16.477-07:00 | ^^^^^^^^^^
  | 14:21:16.477-07:00 | RuntimeError: Got the following error when running the callback: AssertionError: <EMPTY MESSAGE>
  | 14:21:16.477-07:00 |  
  | 14:21:16.477-07:00 | /usr/local/lib/python3.12/dist-packages/torchft/manager.py(1183): callback
  | 14:21:16.477-07:00 | /usr/local/lib/python3.12/dist-packages/torch/futures/__init__.py(249): set_result
  | 14:21:16.477-07:00 | /usr/local/lib/python3.12/dist-packages/torchft/process_group.py(1600): _future_handler
  | 14:21:16.477-07:00 | /usr/lib/python3.12/threading.py(1073): _bootstrap_inner
  | 14:21:16.477-07:00 | /usr/lib/python3.12/threading.py(1030): _bootstrap
  | 14:21:16.477-07:00 |  
  | 14:21:16.518-07:00 | Traceback (most recent call last):
  | 14:21:16.518-07:00 | File "/code/train_ddp.py", line 212, in <module>
  | 14:21:16.518-07:00 | main()
  | 14:21:16.518-07:00 | File "/usr/local/lib/python3.12/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
  | 14:21:16.518-07:00 | return f(*args, **kwargs)
  | 14:21:16.518-07:00 | ^^^^^^^^^^^^^^^^^^
  | 14:21:16.518-07:00 | File "/code/train_ddp.py", line 186, in main
  | 14:21:16.518-07:00 | loss.backward()
  | 14:21:16.518-07:00 | File "/usr/local/lib/python3.12/dist-packages/torch/_tensor.py", line 648, in backward
  | 14:21:16.519-07:00 | torch.autograd.backward(
  | 14:21:16.519-07:00 | File "/usr/local/lib/python3.12/dist-packages/torch/autograd/__init__.py", line 353, in backward
  | 14:21:16.519-07:00 | _engine_run_backward(
  | 14:21:16.519-07:00 | File "/usr/local/lib/python3.12/dist-packages/torch/autograd/graph.py", line 824, in _engine_run_backward
  | 14:21:16.519-07:00 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  | 14:21:16.519-07:00 | File "/usr/local/lib/python3.12/dist-packages/torchft/ddp.py", line 78, in _comm_hook
  | 14:21:16.519-07:00 | assert fut._fut
  | 14:21:16.541-07:00 | Traceback (most recent call last):
  | 14:21:16.541-07:00 | File "/usr/local/lib/python3.12/dist-packages/torchft/process_group.py", line 1577, in _future_handler
  | 14:21:16.541-07:00 | cmd = future_pipe.recv(timedelta(seconds=10))
  | 14:21:16.541-07:00 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  | 14:21:16.541-07:00 | File "/usr/local/lib/python3.12/dist-packages/torchft/multiprocessing.py", line 21, in recv
  | 14:21:16.541-07:00 | out = self._pipe.recv()
  | 14:21:16.541-07:00 | File "/usr/lib/python3.12/multiprocessing/connection.py", line 250, in recv
  | 14:21:16.541-07:00 | buf = self._recv_bytes()
  | 14:21:16.541-07:00 | File "/usr/lib/python3.12/multiprocessing/connection.py", line 430, in _recv_bytes
  | 14:21:16.541-07:00 | buf = self._recv(4)
  | 14:21:16.541-07:00 | ^^^^^^^^^^^^^
  | 14:21:16.541-07:00 | File "/usr/lib/python3.12/multiprocessing/connection.py", line 399, in _recv
  | 14:21:16.541-07:00 | raise EOFError
  |  
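
A note on the last traceback: `EOFError` is what `multiprocessing.connection.Connection.recv()` raises when there is nothing left to read and the other end of the pipe has been closed. That is consistent with the process on the far side of `future_pipe` having gone away, though the log above does not show why. The sketch below reproduces only that standard-library behaviour in a single process; it does not use torchft, and the function name `recv_after_peer_closed` is made up for illustration.

```python
import multiprocessing as mp


def recv_after_peer_closed():
    """Return the name of the exception raised by recv() once the peer end is closed."""
    parent_end, child_end = mp.Pipe()
    # Stand-in for the other side going away: its end of the pipe is closed
    # without anything having been sent.
    child_end.close()
    try:
        parent_end.recv()
    except EOFError as exc:
        return type(exc).__name__
    finally:
        parent_end.close()
    return "no error"


if __name__ == "__main__":
    print(recv_after_peer_closed())
```

If the same thing is happening here, the more useful information is probably whatever the other process printed before its end of the pipe closed, rather than the `EOFError` itself.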

btian · Oct 16 '25 21:10