Added example training scripts for LocalSGD, DiLoCo, Live Checkpoint Recovery, and proactive failure detection with DDP (#198)

Open WarrenZhu050413 opened this issue 6 months ago • 2 comments

TorchFT Examples

This PR adds a set of examples demonstrating TorchFT's fault tolerance features and distributed training approaches. Each example includes its own README with step-by-step instructions and sample outputs to help users understand these features and incorporate them into their training scripts. All the examples build on top of train_ddp.py.

The PR came out of my own experience learning the different features of TorchFT. At the beginning I found it hard to run anything beyond the provided train_ddp.py, which made it difficult to get a sense of the various features TorchFT offers.

@d4l3k provided useful feedback on how to structure the examples.

Examples Included:

  1. DDP with Proactive Failure Recovery (examples/ddp_proactive)

    • Demonstrates how to enable proactive detection and response to worker failures
    • Includes detailed explanation of recovery mechanism with annotated logs
    • Shows significant reduction in recovery time compared to timeout-based approaches
  2. DiLoCo (Distributed Low-Communication training) (examples/diloco)

    • Implements DiLoCo training methodology
    • Shows how to configure and optimize local convergence parameters
    • Documents performance characteristics and tradeoffs
  3. LocalSGD (examples/localsgd)

    • Demonstrates LocalSGD with periodic synchronization strategy
    • Provides guidance on setting appropriate synchronization frequency
    • Includes performance comparison considerations
  4. Live Checkpoint Recovery (examples/live_checkpoint_recovery)

    • Shows how to implement checkpoint-based recovery for fault tolerance
    • Documents the checkpoint storage and retrieval process
    • Includes recovery time analysis and optimization tips
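
For readers new to the periodic-synchronization idea behind the LocalSGD and DiLoCo examples, here is a minimal sketch in plain Python. It is not the TorchFT API and is not taken from the example scripts: the function name `periodic_sync_train`, its parameters, and the toy quadratic loss are all hypothetical and chosen for illustration. Each simulated worker takes local SGD steps on its own loss, and every `sync_every` steps the workers combine their weights. With `outer_lr=1.0` the sync is a plain average (LocalSGD); other values apply a DiLoCo-style outer step on the pseudo-gradient, with plain SGD swapped in here for the Nesterov-momentum outer optimizer described in the DiLoCo paper.

```python
from typing import List


def periodic_sync_train(
    targets: List[float],
    inner_lr: float,
    sync_every: int,
    steps: int,
    outer_lr: float = 1.0,
) -> List[float]:
    """Toy simulation (hypothetical helper, not part of torchft).

    Worker i minimizes (w - targets[i]) ** 2 with local SGD steps.
    Every `sync_every` steps the workers synchronize.
    """
    global_w = 0.0
    local = [global_w for _ in targets]
    for step in range(1, steps + 1):
        # Local step on each worker: d/dw (w - t)^2 = 2 * (w - t)
        local = [w - inner_lr * 2.0 * (w - t) for w, t in zip(local, targets)]
        if step % sync_every == 0:
            avg = sum(local) / len(local)
            # Pseudo-gradient: how far the averaged workers moved from
            # the last synchronized weights.
            pseudo_grad = global_w - avg
            # Outer step; outer_lr == 1.0 reduces to plain averaging.
            global_w = global_w - outer_lr * pseudo_grad
            # Every worker restarts from the synchronized weights.
            local = [global_w for _ in targets]
    return local


if __name__ == "__main__":
    # Two workers whose local optima are 1.0 and 3.0.
    print(periodic_sync_train([1.0, 3.0], inner_lr=0.1, sync_every=4, steps=200))
```

A larger `sync_every` means less communication but lets the workers drift further apart between syncs, which is the tradeoff the LocalSGD and DiLoCo READMEs discuss.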

WarrenZhu050413, May 22 '25 04:05