torchft
Added example training scripts for localsgd, DiLoCo, Live Checkpoint Recovery, and proactive failure detection with DDP (#198)
TorchFT Examples
This PR adds a set of examples demonstrating the fault tolerance features and distributed training approaches in TorchFT. Each example has its own README with step-by-step instructions and sample outputs to help users understand these features and incorporate them into their training scripts. All of the examples build on top of train_ddp.py.
This PR grew out of my own experience learning TorchFT's features. At first I found it hard to run anything beyond the provided train_ddp.py, which made it difficult to get a sense of what else TorchFT offers.
@d4l3k provided useful feedback on how to structure the examples.
Examples Included:
- DDP with Proactive Failure Recovery (examples/ddp_proactive)
  - Demonstrates how to enable proactive detection of, and response to, worker failures
  - Includes a detailed explanation of the recovery mechanism with annotated logs
  - Shows a significant reduction in recovery time compared to timeout-based approaches
- DiLoCo (Distributed Low-Communication training) (examples/diloco)
  - Implements the DiLoCo training methodology
  - Shows how to configure and optimize local convergence parameters
  - Documents performance characteristics and tradeoffs
- LocalSGD (examples/localsgd)
  - Demonstrates LocalSGD with a periodic synchronization strategy
  - Provides guidance on setting an appropriate synchronization frequency
  - Includes performance comparison considerations
- Live Checkpoint Recovery (examples/live_checkpoint_recovery)
  - Shows how to implement checkpoint-based recovery for fault tolerance
  - Documents the checkpoint storage and retrieval process
  - Includes recovery time analysis and optimization tips
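
For readers new to the LocalSGD and DiLoCo items in the list above, here is a minimal single-process sketch of the idea behind both. It does not use the TorchFT API and is not taken from the example scripts: the function names (`local_sgd`, `diloco`), the parameters (`sync_every`, `outer_lr`, `momentum`), and the toy quadratic loss are all assumptions made for illustration. Each simulated replica runs `sync_every` local SGD steps without communicating, and only then are the replicas combined. The DiLoCo variant here is simplified: the DiLoCo paper uses AdamW as the inner optimizer and Nesterov momentum as the outer optimizer, while this sketch uses plain SGD and heavy-ball momentum to stay short.

```python
"""Toy simulation of LocalSGD and a simplified DiLoCo outer step.

Assumption: nothing here is TorchFT API; every name is illustrative.
Each replica minimizes 0.5 * (w - target) ** 2 on its own target.
"""


def grad(w, target):
    # Gradient of 0.5 * (w - target) ** 2 with respect to w.
    return w - target


def local_steps(w, target, lr, steps):
    # Run `steps` plain SGD updates on one replica with no communication.
    for _ in range(steps):
        w -= lr * grad(w, target)
    return w


def local_sgd(w0, targets, lr, sync_every, rounds):
    # Each round: every replica starts from the shared weight, trains
    # locally for `sync_every` steps, then the replicas are averaged.
    # The averaging is the only communication step.
    w = w0
    for _ in range(rounds):
        replicas = [local_steps(w, t, lr, sync_every) for t in targets]
        w = sum(replicas) / len(replicas)
    return w


def diloco(w0, targets, lr, sync_every, rounds, outer_lr=0.7, momentum=0.9):
    # Simplified DiLoCo: the averaged pseudo-gradient (shared weight minus
    # locally trained weight) is fed to an outer SGD-with-momentum step.
    w, velocity = w0, 0.0
    for _ in range(rounds):
        replicas = [local_steps(w, t, lr, sync_every) for t in targets]
        pseudo_grad = sum(w - r for r in replicas) / len(replicas)
        velocity = momentum * velocity + pseudo_grad
        w -= outer_lr * velocity
    return w


if __name__ == "__main__":
    targets = [1.0, 3.0]  # two replicas; the average loss is minimized at 2.0
    print("LocalSGD:", local_sgd(0.0, targets, lr=0.1, sync_every=5, rounds=50))
    print("DiLoCo:  ", diloco(0.0, targets, lr=0.1, sync_every=5, rounds=300))
```

In this sketch, setting `outer_lr=1.0` and `momentum=0.0` makes the DiLoCo outer step reduce to plain parameter averaging, which is the LocalSGD update. A larger `sync_every` means fewer communication rounds per local step, which is the tradeoff the synchronization-frequency guidance in the LocalSGD example is about.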