
Fix inconsistent return types.

Open · zhengchenyu opened this issue 1 week ago · 0 comments

When self._pg.allreduce([tensor], opts) throws an exception, the call returns a _DummyWork instead of the _ManagedWork that is returned on the normal path. Because the two types expose different interfaces, the process exits with the following error.

[rank1]: Traceback (most recent call last):
[rank1]:   File "/xxx/cnn_train.py", line 361, in <module>
[rank1]:     train(args)
[rank1]:   File "/opt/conda/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 357, in wrapper
[rank1]:     return f(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^
[rank1]:   File "/xxx/cnn_train.py", line 246, in train
[rank1]:     train_epoch(
[rank1]:   File "/xxx/cnn_train.py", line 277, in train_epoch
[rank1]:     loss.backward()     # manager.allreduce  quorum
[rank1]:     ^^^^^^^^^^^^^^^
[rank1]:   File "/opt/conda/lib/python3.11/site-packages/torch/_tensor.py", line 647, in backward
[rank1]:     torch.autograd.backward(
[rank1]:   File "/opt/conda/lib/python3.11/site-packages/torch/autograd/__init__.py", line 354, in backward
[rank1]:     _engine_run_backward(
[rank1]:   File "/opt/conda/lib/python3.11/site-packages/torch/autograd/graph.py", line 829, in _engine_run_backward
[rank1]:     return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/opt/conda/lib/python3.11/site-packages/torchft/ddp.py", line 78, in _comm_hook
[rank1]:     assert fut._fut
[rank1]:            ^^^^^^^^
[rank1]: AttributeError: 'Future' object has no attribute '_fut'

The future returned by _DummyWork has no '_fut' attribute, so the assertion in _comm_hook raises an AttributeError.
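For reference, here is a minimal, self-contained sketch of the mismatch. The class names mirror those in the issue, but this is only an illustration of the pattern, not torchft's actual implementation: the normal path hands back a wrapper future carrying `_fut`, while the exception fallback hands back a plain `torch.futures.Future`, so a hook written against the normal path breaks.

```python
# Illustration only: class names mirror the issue, not torchft's real code.
import torch
import torch.futures


class _WrappedFuture:
    """Stand-in for the normal-path future type: it keeps the underlying
    future in `_fut`, which the comm hook later dereferences."""

    def __init__(self, fut: torch.futures.Future) -> None:
        self._fut = fut


class _ManagedWork:
    """Normal path: get_future() yields the wrapper that exposes `_fut`."""

    def __init__(self, tensors) -> None:
        inner = torch.futures.Future()
        inner.set_result(tensors)
        self._wrapped = _WrappedFuture(inner)

    def get_future(self) -> _WrappedFuture:
        return self._wrapped


class _DummyWork:
    """Exception-fallback path: get_future() yields a plain torch Future,
    which has no `_fut` attribute."""

    def __init__(self, tensors) -> None:
        self._tensors = tensors

    def get_future(self) -> torch.futures.Future:
        fut = torch.futures.Future()
        fut.set_result(self._tensors)
        return fut


def comm_hook_like(work) -> None:
    # Mirrors the failing assertion in the traceback: the hook assumes every
    # future it receives exposes `_fut`.
    fut = work.get_future()
    assert hasattr(fut, "_fut"), "plain Future has no _fut"


comm_hook_like(_ManagedWork([torch.ones(2)]))  # passes
comm_hook_like(_DummyWork([torch.ones(2)]))    # fails, matching the report
```

One way to make the return types consistent would be for the error path to return the same work/future types as the success path (or for the hook to stop relying on `_fut`), but which fix is preferred is up to the maintainers.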

zhengchenyu · Nov 27 '25 10:11