Example train_ddp.py breaks

Open kasakun opened this issue 5 days ago • 0 comments

Hi, I was following the guide in the README to run torchft locally.

```sh
# start lighthouse
RUST_BACKTRACE=1 torchft_lighthouse --min_replicas 1 --quorum_tick_ms 100 --join_timeout_ms 10000

# start a replica in another shell
export REPLICA_GROUP_ID=0
export NUM_REPLICA_GROUPS=2

CUDA_VISIBLE_DEVICES=0 TORCHFT_LIGHTHOUSE=http://localhost:29510 torchrun --master_port=29600 --nnodes=1 --nproc_per_node=1 -- train_ddp.py

# start another replica
export REPLICA_GROUP_ID=1
export NUM_REPLICA_GROUPS=2

CUDA_VISIBLE_DEVICES=0 TORCHFT_LIGHTHOUSE=http://localhost:29510 torchrun --master_port=29601 --nnodes=1 --nproc_per_node=1 -- train_ddp.py
```

After I ran

```sh
export REPLICA_GROUP_ID=0
export NUM_REPLICA_GROUPS=2

CUDA_VISIBLE_DEVICES=0 TORCHFT_LIGHTHOUSE=http://localhost:29510 torchrun --master_port=29600 --nnodes=1 --nproc_per_node=1 -- train_ddp.py
```

It immediately failed and I saw

```
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-12-02_07:33:17
  host      : xxxxxxxxxxxxxxxxxxxx
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 47674)
  error_file: /mnt/tmp/torchelastic_4m_9eon3/none_ywq7poit/attempt_0/0/error.json
  traceback : Traceback (most recent call last):
    File "/miniforge/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 357, in wrapper
      return f(*args, **kwargs)
    File "/mnt/task_runtime/train_ddp.py", line 192, in main
      loss.backward()
    File "/miniforge/lib/python3.10/site-packages/torch/_tensor.py", line 625, in backward
      torch.autograd.backward(
    File "/miniforge/lib/python3.10/site-packages/torch/autograd/__init__.py", line 354, in backward
      _engine_run_backward(
    File "/miniforge/lib/python3.10/site-packages/torch/autograd/graph.py", line 841, in _engine_run_backward
      return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
    File "/mnt/task_runtime/torchft/ddp.py", line 78, in _comm_hook
      assert fut._fut
  AttributeError: 'Future' object has no attribute '_fut'
```
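
The traceback ends in an `AttributeError` rather than an `AssertionError`, which suggests the object reaching `_comm_hook` is a future type that never defines `_fut`, not a wrapper whose `_fut` is empty. The sketch below illustrates that distinction in plain Python. `PlainFuture`, `WrappedFuture`, `unwrap_strict`, and `unwrap_checked` are hypothetical stand-ins written for this illustration, not torchft or PyTorch code, and the checked variant is only one possible way to make such a mismatch easier to diagnose.

```python
class PlainFuture:
    """Hypothetical stand-in for a future type that has no `_fut` attribute."""

    def __init__(self, value):
        self.value = value


class WrappedFuture:
    """Hypothetical stand-in for a wrapper that keeps an inner future in `_fut`."""

    def __init__(self, inner):
        self._fut = inner


def unwrap_strict(fut):
    # Same shape as the failing line: the attribute lookup runs before the
    # assert can evaluate anything, so a missing attribute raises
    # AttributeError, not AssertionError.
    assert fut._fut
    return fut._fut


def unwrap_checked(fut):
    # Assumed alternative: look the attribute up defensively and report
    # which type actually arrived.
    inner = getattr(fut, "_fut", None)
    if inner is None:
        raise TypeError(
            f"expected a wrapper exposing `_fut`, got {type(fut).__name__}"
        )
    return inner


def error_name(fn, arg):
    """Return the name of the exception `fn(arg)` raises, or None if it succeeds."""
    try:
        fn(arg)
    except Exception as exc:
        return type(exc).__name__
    return None
```

Calling `unwrap_strict(PlainFuture(1))` fails the same way as the traceback, while `unwrap_checked` names the unexpected type in its message.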

This seems to happen only on the latest main (024f850a21654afad0036cd3374d8acc3ce47935). After I reset to 8ef24c055ebb495caf39fb2acdbddb8ebcebdf19, I no longer see the issue.

Some extra info, in case it helps:

```sh
python -c "import torch; print(torch.__version__)"
2.9.1+cu128
```
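
If more environment detail is useful, installed package versions can be gathered in one go. This is a minimal sketch using only the standard library. The helper name `collect_versions` is made up for this example, and the distribution names `torch` and `torchft` are assumptions (a nightly or source install may be registered under a different name, in which case it is reported as not installed).

```python
import platform
from importlib import metadata


def collect_versions(packages=("torch", "torchft")):
    """Map each distribution name to its installed version string."""
    info = {"python": platform.python_version()}
    for name in packages:
        try:
            info[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The distribution is absent or installed under another name.
            info[name] = "not installed"
    return info


if __name__ == "__main__":
    for key, value in collect_versions().items():
        print(f"{key}: {value}")
```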

kasakun, Dec 02 '25 07:12