torchft
Fault tolerance for PyTorch (HSDP, LocalSGD, DiLoCo, Streaming DiLoCo)
I'm seeing the warnings below on program exit. This PR registers the `shutdown` method with an atexit handler and rewrites the shutdown logic to exit cleanly, which fixes the warnings. ```...
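For context, a minimal sketch of the atexit approach, using a hypothetical `Manager` class (the real torchft Manager API may differ):

```python
import atexit


class Manager:
    """Hypothetical manager object; illustrative only, not torchft's actual class."""

    def __init__(self) -> None:
        self._shutdown = False
        # Register shutdown so background resources are torn down cleanly
        # even if the user never calls shutdown() explicitly.
        atexit.register(self.shutdown)

    def shutdown(self) -> None:
        # Idempotent: safe to call both explicitly and from the atexit hook.
        if self._shutdown:
            return
        self._shutdown = True
        # ... stop heartbeat threads, close process groups, etc.
```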
If we don't wait for the first quorum, the trainer will continue to run the forward pass and may use incorrect weights if it is healing.
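A minimal sketch of blocking on the first quorum before the forward pass; the method names (`start_quorum`, `wait_quorum`) are assumptions for illustration and may not match the actual torchft Manager API:

```python
def train_step(manager, model, optimizer, batch):
    manager.start_quorum()
    # Block until the first quorum completes so a healing replica has
    # received up-to-date weights before any forward pass runs.
    manager.wait_quorum()

    optimizer.zero_grad()
    loss = model(batch).sum()
    loss.backward()
    optimizer.step()
```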
As titled, it goes fast. Test plan: testing with 12 GB of 64 MB tensors. Baseline: ``` took 30.493701454252005 seconds ``` With streaming transfer: ``` 0 chunks took 8.783997897058725 seconds...
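Roughly, the chunked transfer splits each large tensor into fixed-size pieces so the send can begin before the whole buffer is materialized; a hypothetical helper (names are illustrative, not torchft's API) might look like:

```python
import torch


def iter_chunks(tensor: torch.Tensor, chunk_bytes: int = 64 * 1024 * 1024):
    """Hypothetical helper: yield views of a flattened tensor in fixed-size
    chunks so a transfer can start before the whole buffer is serialized."""
    flat = tensor.contiguous().view(-1)
    elems_per_chunk = max(1, chunk_bytes // flat.element_size())
    for start in range(0, flat.numel(), elems_per_chunk):
        yield flat[start:start + elems_per_chunk]
```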
This is a tracking issue for adding LocalSGD support to torchft. There's been interest in LocalSGD and it's something we'd like to support. This should be...
```
# Start lighthouse
RUST_BACKTRACE=1 torchft_lighthouse --min_replicas 1 --quorum_tick_ms 100 --join_timeout_ms 1000

# Start worker 0
REPLICA_GROUP_ID=0 CUDA_VISIBLE_DEVICES=2,3 TORCHFT_MANAGER_PORT=29512 TORCHFT_LIGHTHOUSE=http://localhost:29510 torchrun --nnodes 1 --nproc-per-node 2 train_fsdp.py

# Start worker 1:
REPLICA_GROUP_ID=1...
```
Test plan:
```
TORCHFT_OTEL_OTLP=http://localhost:4317 torchft_lighthouse --min_replicas 2 --join_timeout_ms 10000 --quorum_tick_ms 2000
TORCHFT_OTEL_STDOUT=1 torchft_lighthouse --min_replicas 2 --join_timeout_ms 10000
```
The test currently fails because of the lock-step behavior when the join timeout is too short. When the join timeout is long, it completes, but not with the tensors one would...
This is a tracking issue for dataloader improvements. The current support is very basic and we likely need to make some bigger changes to make this more efficient - [...
The CheckpointServer currently uses torch.save/torch.load, which requires allocating the entire buffer in memory. We want to use streaming transfers instead so we minimize the amount of CPU memory required. It...
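As a rough illustration of the streaming idea (not the actual CheckpointServer implementation), one could serialize the state dict one entry at a time and push fixed-size chunks through a writer callback, so peak CPU memory is bounded by the largest single entry rather than the whole checkpoint:

```python
import io

import torch

CHUNK_SIZE = 4 * 1024 * 1024  # 4 MiB per write; illustrative value


def stream_state_dict(state_dict, write):
    """Hypothetical streaming sender: serialize one entry at a time and send
    it through `write` in fixed-size chunks."""
    for key, value in state_dict.items():
        buf = io.BytesIO()
        torch.save({key: value}, buf)
        data = buf.getvalue()
        for off in range(0, len(data), CHUNK_SIZE):
            write(data[off:off + CHUNK_SIZE])
```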