torchft

Fault tolerance for PyTorch (HSDP, LocalSGD, DiLoCo, Streaming DiLoCo)

Results 50 torchft issues

Currently PGTransport allocates new tensors and copies them to CPU -- this is memory-inefficient and slow, as we have to limit the number of tensors transferred at once and...

checkpoint
python
process_group

Currently ProcessGroupBaby doesn't support any profiling, as the `record_function` pieces run in the subprocess. We either need to figure out some way to forward that profiling information from the...

python
process_group

## Abstract We propose extending TorchFT to support [Intel XPU](https://docs.pytorch.org/docs/stable/notes/get_start_xpu.html) devices by adapting ProcessGroups to use the "xccl" backend. This extension will enable Intel GPU users to benefit from the...

It would be helpful to have further documentation on deployment recommendations in a production setting. For example: - Should the parameter server / lighthouse server be colocated for performance? Is...

If collective timeouts differ, e.g. in `gloo`, the Python code will be allowed to continue because from its perspective the future has completed. But the underlying future in...

There are a lot of explanations about quorum in the design doc and in the code comments. But I did not find a place where it is explained "quorum of...

documentation

**Description:** The current failure model in TorchFT handles node failures by zeroing out all accumulated gradients and recalculating them in the subsequent forward/backward pass. This approach, while ensuring correctness, leads...

Creating a small script to quickly hack on the implementation for streaming DiLoCo. Run with (start the lighthouse first; see the command in README.md): `cd streaming_diloco_prototype` `torchx run` ## Issues...

CLA Signed

GH Issue: https://github.com/pytorch/torchft/issues/173 Extends Lighthouse to support multiple independent quorums on a single server by tagging each gRPC call with a `room-id` metadata header and feeding requests through a...

CLA Signed