torchft

Fault tolerance for PyTorch (HSDP, LocalSGD, DiLoCo, Streaming DiLoCo)

Results 50 torchft issues

I'm seeing the warnings below on program exit. This PR registers the `shutdown` method as an atexit handler and rewrites the shutdown logic to exit cleanly, which fixes the warnings. ```...

CLA Signed
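
To illustrate the pattern the PR above describes, here is a minimal sketch of registering a `shutdown` method with `atexit` so a background thread is stopped before the interpreter exits. The `Worker` class and its attributes are hypothetical and are not taken from torchft; only the `atexit` and `threading` calls are standard library.

```python
import atexit
import threading


class Worker:
    """Hypothetical background worker; NOT torchft's actual class."""

    def __init__(self) -> None:
        self._closed = False
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()
        # Run shutdown() automatically at interpreter exit.
        atexit.register(self.shutdown)

    def _run(self) -> None:
        # Stand-in for real work: block until asked to stop.
        self._stop.wait()

    def shutdown(self) -> None:
        # Idempotent, so an explicit call followed by the atexit call is safe.
        if self._closed:
            return
        self._closed = True
        self._stop.set()
        self._thread.join(timeout=5.0)
        # Drop the handler once the worker has been shut down explicitly.
        atexit.unregister(self.shutdown)
```

Making `shutdown` idempotent is a design choice for this sketch: callers can shut down early and the exit handler then does nothing.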

If we don't wait for the first quorum, the trainer will continue to run forward and may use incorrect weights if the trainer is healing.

CLA Signed
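
The ordering concern above can be sketched as follows: block on the first quorum before any forward pass and let later quorums run in the background. `QuorumGate`, `start_quorum`, and `wait_if_first` are hypothetical names for illustration and do not reflect torchft's manager API.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Any, Callable, Optional


class QuorumGate:
    """Hypothetical helper; NOT torchft's API."""

    def __init__(self) -> None:
        self._executor = ThreadPoolExecutor(max_workers=1)
        self._pending: Optional[Future] = None
        self._first_done = False

    def start_quorum(self, compute_quorum: Callable[[], Any]) -> None:
        # Kick off the quorum computation in the background.
        self._pending = self._executor.submit(compute_quorum)

    def wait_if_first(self, timeout: Optional[float] = None) -> None:
        # Only the first quorum blocks the caller, so a healing trainer
        # cannot run forward before its weights have been recovered.
        if not self._first_done and self._pending is not None:
            self._pending.result(timeout=timeout)
            self._first_done = True

    def close(self) -> None:
        self._executor.shutdown(wait=True)
```

A training loop under these assumptions would call `start_quorum(...)` and then `wait_if_first()` before its first forward pass.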

As titled, it goes fast. Test plan: testing w/ 12 GB of 64 MB tensors. Baseline: ``` took 30.493701454252005 seconds ``` With streaming transfer: ``` 0 chunks took 8.783997897058725 seconds...

CLA Signed

This is a tracking issue for adding LocalSGD support into torchft. There's been interest in LocalSGD support and it's something we'd like to be able to support. This should be...

enhancement

# Start lighthouse
RUST_BACKTRACE=1 torchft_lighthouse --min_replicas 1 --quorum_tick_ms 100 --join_timeout_ms 1000
# Start worker 0
REPLICA_GROUP_ID=0 CUDA_VISIBLE_DEVICES=2,3 TORCHFT_MANAGER_PORT=29512 TORCHFT_LIGHTHOUSE=http://localhost:29510 torchrun --nnodes 1 --nproc-per-node 2 train_fsdp.py
# Start worker 1
REPLICA_GROUP_ID=1...

CLA Signed

Test plan:
```
TORCHFT_OTEL_OTLP=http://localhost:4317 torchft_lighthouse --min_replicas 2 --join_timeout_ms 10000 --quorum_tick_ms 2000
TORCHFT_OTEL_STDOUT=1 torchft_lighthouse --min_replicas 2 --join_timeout_ms 10000
```

CLA Signed

The test currently fails because of the lockstep behavior when the join timeout is too short. When the join timeout is long, it completes but not with the tensors one would...

CLA Signed

This is a tracking issue for dataloader improvements. The current support is very basic and we likely need to make some bigger changes to make this more efficient - [...

enhancement
data

The CheckpointServer currently uses torch.save/torch.load which requires allocating the entire buffer into memory. We want to instead use streaming transfers so we minimize the amount of CPU memory required. It...

enhancement
good first issue
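
As a rough illustration of the streaming idea in the issue above, the sketch below copies a byte stream in fixed-size chunks so that only one chunk is held in memory at a time. This is a generic standard-library example and an assumption about the approach, not the CheckpointServer's actual protocol; `iter_chunks`, `stream_copy`, and `CHUNK_SIZE` are hypothetical names.

```python
import io
from typing import BinaryIO, Iterator

CHUNK_SIZE = 1 << 20  # 1 MiB; an arbitrary choice for this sketch


def iter_chunks(src: BinaryIO, chunk_size: int = CHUNK_SIZE) -> Iterator[bytes]:
    """Yield successive chunks from src until it is exhausted."""
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            return
        yield chunk


def stream_copy(src: BinaryIO, dst: BinaryIO, chunk_size: int = CHUNK_SIZE) -> int:
    """Copy src to dst chunk by chunk and return the number of bytes copied."""
    total = 0
    for chunk in iter_chunks(src, chunk_size):
        dst.write(chunk)
        total += len(chunk)
    return total
```

With a socket or HTTP response body as `src`, peak memory use in this sketch is bounded by `chunk_size` rather than by the size of the whole checkpoint.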