
Fault tolerance for PyTorch (HSDP, LocalSGD, DiLoCo, Streaming DiLoCo)

50 torchft issues

Hi, I was following the guide in the README to run torchft locally.

```
# start lighthouse
RUST_BACKTRACE=1 torchft_lighthouse --min_replicas 1 --quorum_tick_ms 100 --join_timeout_ms 10000

# start a replica in another...
```

# Maintain Constant Global Batch Size Upon Failure
With the current implementation of `DistributedSampler`, the `global_batch_size` is `group_batch_size * num_replica_group`. It may be preferable if the `DistributedSampler` is implemented...
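A minimal sketch of this constant-global-batch-size idea, assuming a fixed global batch size is chosen up front and the number of currently participating replica groups is known each step (for example from the manager's participant count); the helper below is hypothetical, not part of torchft:

```
GLOBAL_BATCH_SIZE = 1024  # fixed target, independent of how many groups survive

def per_group_batch_size(num_live_groups: int) -> int:
    # Recompute each group's share so group_batch_size * num_live_groups
    # stays (approximately) equal to GLOBAL_BATCH_SIZE after a failure.
    assert num_live_groups > 0
    return GLOBAL_BATCH_SIZE // num_live_groups

# e.g. 3 groups -> 341 samples per group; after one failure, 2 groups -> 512.
```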

The current `train_ddp.py` has two problems:
* It cannot guarantee sequential reading of each sample. For example, the replica group world size is 3, but only 2 replicas are...
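A toy reproduction of the first problem under stated assumptions (a 9-sample dataset, the sampler configured for 3 replica groups, but only ranks 0 and 1 ever running); `DistributedSampler` here is the stock PyTorch sampler, not a torchft class:

```
import torch
from torch.utils.data import DistributedSampler, TensorDataset

dataset = TensorDataset(torch.arange(9))
for rank in (0, 1):  # rank 2 never joins
    sampler = DistributedSampler(dataset, num_replicas=3, rank=rank, shuffle=False)
    print(rank, list(sampler))
# rank 0 -> [0, 3, 6], rank 1 -> [1, 4, 7]; samples 2, 5, 8 are never read.
```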

CLA Signed

By default, the current training generates too many logs. This change downgrades some less important logs to the debug level. If needed, these logs can be re-enabled by setting `RUST_LOG=DEBUG` and adjusting...

CLA Signed

When `self._pg.allreduce([tensor], opts)` throws an exception, it returns a `_DummyWork`, which differs from the `_ManagedWork` returned on the normal path. This causes the process to exit with the following error. ```...

CLA Signed

Hi folks, not sure if I'm doing anything wrong. I saw a problem where the final models across ranks are different when training is interrupted. To reproduce: Use the following...

Summary: as title
Differential Revision: D86800665

CLA Signed
fb-exported
meta-exported

Differential Revision: D86343575

CLA Signed
fb-exported
meta-exported

I saw this error when replacing ProcessGroupNCCL with ProcessGroupBabyNCCL in train_ddp.py; ProcessGroupNCCL and ProcessGroupGloo work fine. How can I debug this?

```
ERROR:torchft.manager:[/0 - step 0] got exception in future...
```
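For context, a sketch of the swap the report describes, assuming the process group wrappers live in `torchft.process_group` (the exact construction site in train_ddp.py may differ):

```
import torch
from torchft.process_group import ProcessGroupBabyNCCL, ProcessGroupGloo

# Replace the NCCL process group with the "baby" (subprocess-isolated) variant;
# this mirrors the change described in the report, not a verified patch.
pg = ProcessGroupBabyNCCL() if torch.cuda.is_available() else ProcessGroupGloo()
```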

When I use `torch.distributed.checkpoint.state_dict.get_optimizer_state_dict`, e.g.,

```
optimizer_state_dict = get_optimizer_state_dict(
    model=self._model,
    optimizers=self._optimizer,
    options=StateDictOptions(
        full_state_dict=True,
        cpu_offload=True,
    ),
)
```

instead of `optimizer_state_dict = self._optimizer.state_dict()`, TorchFT gets stuck in the `should_commit()` method in manager.py. Why...