torchft issues

PGTransport in-place transfers

Currently PGTransport will allocate new tensors and copy them to CPU -- this is memory inefficient and slow as we have to limit the number of tensors transferred at once and...

d4l3k

checkpoint

python

process_group

add profiling to ProcessGroupBaby

1

Currently ProcessGroupBaby doesn't support any profiling as the `record_function` pieces will run in the subprocess. We either need to figure out some way to forward that profiling information from the...

d4l3k

python

process_group

[RFC] [Intel GPU] Extending TorchFT to Support Intel GPU with XCCL Backend

4

## Abstract We propose extending TorchFT to support [Intel XPU](https://docs.pytorch.org/docs/stable/notes/get_start_xpu.html) devices by adapting ProcessGroups to use the "xccl" backend. This extension will enable Intel GPU users to benefit from the...

siju-samuel

Clarification on deployment configuration

It would be helpful to have further documentation on deployment recommendations in a production setting. For example: - Should the parameter server / lighthouse server be colocated for performance? Is...

tonyf

Fixing the issue with indentation on the landing page

svekars

CLA Signed

Pass timeout on python futures to collective libraries

If collective timeouts differ, e.g. in `gloo`, the Python code will be allowed to continue because from its perspective the future has completed. But the underlying future in...

tushar00jain

Explain quorum

14

There are a lot of explanations about quorum in the design doc and in the code comments. But I did not find a place where it is explained "quorum of...

rualark

documentation

Reuse Valid Accumulated Gradients Upon Failure

4

**Description:** The current failure model in TorchFT handles node failures by zeroing out all accumulated gradients and recalculating them in the subsequent forward/backward pass. This approach, while ensuring correctness, leads...

WarrenZhu050413

[WIP] Streaming DiLoCo prototype

Creating a small script to quickly hack on the implementation for streaming DiLoCo. Run with (start lighthouse first by looking at command in README.md): `cd streaming_diloco_prototype` `torchx run` ## Issues...

H-Huang

CLA Signed

Support multiple quorums on a single LighthouseServer using gRPC metadata-based room assignment

1

GH Issue: https://github.com/pytorch/torchft/issues/173 Extends Lighthouse to support multiple independent quorums on a single server by tagging each gRPC call with a `room-id` metadata header and feeding requests through a...

MattKotzbauer

CLA Signed
