torchft
Fault tolerance for PyTorch (HSDP, LocalSGD, DiLoCo, Streaming DiLoCo)
Hi, I was following the guide in the README to run torchft locally.

```
# start lighthouse
RUST_BACKTRACE=1 torchft_lighthouse --min_replicas 1 --quorum_tick_ms 100 --join_timeout_ms 10000

# start a replica in another...
```
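For reference, a minimal sketch of what the second, truncated step might look like; the lighthouse URL, port, and torchrun flags below are assumptions for illustration, not the README's exact command:

```
# Assumed replica launch (hypothetical flags): point the replica at the lighthouse started above
export TORCHFT_LIGHTHOUSE=http://localhost:29510
torchrun --nnodes=1 --nproc_per_node=1 train_ddp.py
```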
# Maintain Constant Global Batch Size Upon Failure

With the current implementation of `DistributedSampler`, the `global_batch_size` is `group_batch_size * num_replica_group`. It may be preferable if the `DistributedSampler` were implemented...
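To make the relationship concrete, here is a minimal Python sketch of the alternative the issue hints at: fix the global batch size and derive the per-group batch size from the number of live replica groups. The function and variable names are hypothetical, not torchft API:

```python
# Hypothetical helper: keep the global batch size (approximately) constant by
# recomputing the per-group batch size whenever replica groups join or fail.
def group_batch_size(global_batch_size: int, num_replica_group: int) -> int:
    assert num_replica_group > 0
    # Today: global = group_batch_size * num_replica_group, so the global batch
    # shrinks when a replica group fails. Dividing instead keeps it steady.
    return global_batch_size // num_replica_group

# e.g. global batch 512: 3 groups -> 170 per group; after one failure, 2 groups -> 256
print(group_batch_size(512, 3), group_batch_size(512, 2))
```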
The current `train_ddp.py` has two problems:

* It cannot guarantee the sequential reading of each sample. For example, the replica group world size is 3, but only 2 replicas are...
By default, the current training generates too many logs. Here, some less important logs are downgraded to the debug level. If necessary, we can re-enable them by setting `RUST_LOG=DEBUG` and adjusting...
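For example, reusing the lighthouse command shown earlier, the debug-level messages can be surfaced again like this:

```
# Re-enable the verbose (debug-level) logs from the Rust components while troubleshooting
RUST_LOG=DEBUG RUST_BACKTRACE=1 torchft_lighthouse --min_replicas 1 --quorum_tick_ms 100 --join_timeout_ms 10000
```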
When `self._pg.allreduce([tensor], opts)` throws an exception, it returns `_DummyWork`, which is different from the normally returned `_ManagedWork`. This will cause the process to exit due to the following error.

```
...
```
Hi folks, not sure if I'm doing anything wrong. I saw a problem where the final models across ranks are different when training is interrupted. To reproduce: Use the following...
Summary: as title

Differential Revision: D86800665
I saw this error when replacing ProcessGroupNCCL with ProcessGroupBabyNCCL in train_ddp.py. ProcessGroupNCCL and ProcessGroupGloo work fine. How can I debug this?

```
ERROR:torchft.manager:[/0 - step 0] got exception in future...
```
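For context, the swap being described would look roughly like the snippet below; this is only a sketch of the selection logic, assuming the constructors need no extra arguments, and not the exact code in train_ddp.py:

```python
import torch
from torchft import ProcessGroupBabyNCCL, ProcessGroupGloo

# Sketch of the swap above: ProcessGroupBabyNCCL runs NCCL in a helper
# subprocess so that a wedged or aborted collective cannot take down the
# trainer process, whereas ProcessGroupGloo/ProcessGroupNCCL run in-process.
pg = ProcessGroupBabyNCCL() if torch.cuda.is_available() else ProcessGroupGloo()
```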
When I use `torch.distributed.checkpoint.state_dict.get_optimizer_state_dict`, e.g.,

```
optimizer_state_dict = get_optimizer_state_dict(
    model=self._model,
    optimizers=self._optimizer,
    options=StateDictOptions(
        full_state_dict=True,
        cpu_offload=True,
    ),
)
```

instead of `optimizer_state_dict = self._optimizer.state_dict()`, TorchFT gets stuck at the manager.py `should_commit()` method. Why...
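One possible explanation, offered as an assumption rather than a confirmed diagnosis: with `full_state_dict=True`, `get_optimizer_state_dict` gathers sharded optimizer state with collective communication, so every rank has to reach that call, and reach it at the same point relative to the collectives performed inside `should_commit()`; if the two interleave differently across ranks, the job hangs. A hedged sketch of an ordering that keeps the collectives aligned (the surrounding training-loop names are hypothetical):

```python
from torch.distributed.checkpoint.state_dict import (
    StateDictOptions,
    get_optimizer_state_dict,
)

def gather_optimizer_state(model, optimizer):
    # Collective gather of the (possibly sharded) optimizer state; every rank
    # must call this at the same point in the step.
    return get_optimizer_state_dict(
        model=model,
        optimizers=optimizer,
        options=StateDictOptions(full_state_dict=True, cpu_offload=True),
    )

# Hypothetical placement inside the training step, after optimizer.step():
#   osd = gather_optimizer_state(model, optimizer)  # all ranks, same point
#   if manager.should_commit():                     # manager collectives come after
#       save_checkpoint(osd)                        # hypothetical helper
```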