torchft
Example train_ddp.py breaks
Hi, I was following the guide in the README to run torchft locally:
# start lighthouse
RUST_BACKTRACE=1 torchft_lighthouse --min_replicas 1 --quorum_tick_ms 100 --join_timeout_ms 10000
# start a replica in another shell
export REPLICA_GROUP_ID=0
export NUM_REPLICA_GROUPS=2
CUDA_VISIBLE_DEVICES=0 TORCHFT_LIGHTHOUSE=http://localhost:29510 torchrun --master_port=29600 --nnodes=1 --nproc_per_node=1 -- train_ddp.py
# start another replica
export REPLICA_GROUP_ID=1
export NUM_REPLICA_GROUPS=2
CUDA_VISIBLE_DEVICES=0 TORCHFT_LIGHTHOUSE=http://localhost:29510 torchrun --master_port=29601 --nnodes=1 --nproc_per_node=1 -- train_ddp.py
After I ran the first replica command:
export REPLICA_GROUP_ID=0
export NUM_REPLICA_GROUPS=2
CUDA_VISIBLE_DEVICES=0 TORCHFT_LIGHTHOUSE=http://localhost:29510 torchrun --master_port=29600 --nnodes=1 --nproc_per_node=1 -- train_ddp.py
it failed immediately with the following error:
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-12-02_07:33:17
  host      : xxxxxxxxxxxxxxxxxxxx
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 47674)
  error_file: /mnt/tmp/torchelastic_4m_9eon3/none_ywq7poit/attempt_0/0/error.json
  traceback : Traceback (most recent call last):
    File "/miniforge/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 357, in wrapper
      return f(*args, **kwargs)
    File "/mnt/task_runtime/train_ddp.py", line 192, in main
      loss.backward()
    File "/miniforge/lib/python3.10/site-packages/torch/_tensor.py", line 625, in backward
      torch.autograd.backward(
    File "/miniforge/lib/python3.10/site-packages/torch/autograd/__init__.py", line 354, in backward
      _engine_run_backward(
    File "/miniforge/lib/python3.10/site-packages/torch/autograd/graph.py", line 841, in _engine_run_backward
      return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
    File "/mnt/task_runtime/torchft/ddp.py", line 78, in _comm_hook
      assert fut._fut
  AttributeError: 'Future' object has no attribute '_fut'
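One thing I noticed in the traceback: the failure is an AttributeError rather than an AssertionError, which suggests the Future object reaching `_comm_hook` does not define `_fut` at all (as opposed to defining it with a falsy value). Below is a minimal sketch of that behavior. The class and helper names are hypothetical stand-ins for illustration, not torchft's or PyTorch's actual types, and the defensive variant is just one possible way to surface the problem more clearly, not a proposed fix.

```python
class FutureWithoutPrivateAttr:
    """Hypothetical stand-in for a Future type that does not define `_fut`."""


def check_strict(fut):
    # Same pattern as the line in the traceback: the attribute lookup runs
    # before the assertion is evaluated, so a missing attribute raises
    # AttributeError instead of AssertionError.
    assert fut._fut
    return fut._fut


def check_defensive(fut):
    # getattr with a default lets us raise an explicit, descriptive error
    # when the attribute is missing or empty.
    inner = getattr(fut, "_fut", None)
    if inner is None:
        raise TypeError(f"{type(fut).__name__} has no usable '_fut' attribute")
    return inner


def error_name(fn, obj):
    # Helper: return the name of the exception type raised by fn(obj), if any.
    try:
        fn(obj)
    except Exception as exc:
        return type(exc).__name__
    return None


if __name__ == "__main__":
    fut = FutureWithoutPrivateAttr()
    print(error_name(check_strict, fut))
    print(error_name(check_defensive, fut))
```

So my guess is that the type of Future passed into the comm hook differs between the two commits, but I have not confirmed this.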
This seems to happen only on the latest main (024f850a21654afad0036cd3374d8acc3ce47935). After I reset to 8ef24c055ebb495caf39fb2acdbddb8ebcebdf19, I no longer see the issue.
Some extra info, in case it helps:
python -c "import torch; print(torch.__version__)"
2.9.1+cu128