torchft
Fault tolerance for PyTorch (HSDP, LocalSGD, DiLoCo, Streaming DiLoCo)
I'm seeing the warnings below on program exit. This PR registers the `shutdown` method with an atexit handler and rewrites the shutdown logic to exit cleanly, which fixes the warnings. ```...
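For context, a minimal sketch of the atexit approach, using a hypothetical `Manager` class (the real torchft Manager API may differ):

```python
import atexit


class Manager:
    """Hypothetical manager object; illustrative only, not torchft's actual class."""

    def __init__(self) -> None:
        self._shutdown = False
        # Register shutdown so background resources are torn down cleanly
        # even if the user never calls shutdown() explicitly.
        atexit.register(self.shutdown)

    def shutdown(self) -> None:
        # Idempotent: safe to call both explicitly and from the atexit hook.
        if self._shutdown:
            return
        self._shutdown = True
        # ... stop heartbeat threads, close process groups, etc.
```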
If we don't wait for the first quorum, the trainer will continue to run the forward pass and may use incorrect weights if it is healing.
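A minimal sketch of blocking on the first quorum before the forward pass; the method names (`start_quorum`, `wait_quorum`) are assumptions for illustration and may not match the actual torchft Manager API:

```python
def train_step(manager, model, optimizer, batch):
    manager.start_quorum()
    # Block until the first quorum completes so a healing replica has
    # received up-to-date weights before any forward pass runs.
    manager.wait_quorum()

    optimizer.zero_grad()
    loss = model(batch).sum()
    loss.backward()
    optimizer.step()
```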
As titled, it goes fast. Test plan: testing with 12 GB of 64 MB tensors. Baseline: ``` took 30.493701454252005 seconds ``` With streaming transfer: ``` 0 chunks took 8.783997897058725 seconds...
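Roughly, the chunked transfer splits each large tensor into fixed-size pieces so the send can begin before the whole buffer is materialized; a hypothetical helper (names are illustrative, not torchft's API) might look like:

```python
import torch


def iter_chunks(tensor: torch.Tensor, chunk_bytes: int = 64 * 1024 * 1024):
    """Hypothetical helper: yield views of a flattened tensor in fixed-size
    chunks so a transfer can start before the whole buffer is serialized."""
    flat = tensor.contiguous().view(-1)
    elems_per_chunk = max(1, chunk_bytes // flat.element_size())
    for start in range(0, flat.numel(), elems_per_chunk):
        yield flat[start:start + elems_per_chunk]
```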
This is a tracking issue for adding LocalSGD support to torchft. There's been interest in LocalSGD and it's something we'd like to support. This should be...
```
# Start lighthouse
RUST_BACKTRACE=1 torchft_lighthouse --min_replicas 1 --quorum_tick_ms 100 --join_timeout_ms 1000

# Start worker 0
REPLICA_GROUP_ID=0 CUDA_VISIBLE_DEVICES=2,3 TORCHFT_MANAGER_PORT=29512 TORCHFT_LIGHTHOUSE=http://localhost:29510 torchrun --nnodes 1 --nproc-per-node 2 train_fsdp.py

# Start worker 1:
REPLICA_GROUP_ID=1...
```
Test plan:
```
TORCHFT_OTEL_OTLP=http://localhost:4317 torchft_lighthouse --min_replicas 2 --join_timeout_ms 10000 --quorum_tick_ms 2000
TORCHFT_OTEL_STDOUT=1 torchft_lighthouse --min_replicas 2 --join_timeout_ms 10000
```
The test currently fails because of the lock-step behavior when the join timeout is too short. When the join timeout is long, it completes, but not with the tensors one would...
This is a tracking issue for dataloader improvements. The current support is very basic and we likely need to make some bigger changes to make this more efficient - [...
The CheckpointServer currently uses torch.save/torch.load, which requires allocating the entire buffer in memory. We want to use streaming transfers instead so we minimize the amount of CPU memory required. It...
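As a rough illustration of the streaming idea (not the actual CheckpointServer implementation), one could serialize the state dict one entry at a time and push fixed-size chunks through a writer callback, so peak CPU memory is bounded by the largest single entry rather than the whole checkpoint:

```python
import io

import torch

CHUNK_SIZE = 4 * 1024 * 1024  # 4 MiB per write; illustrative value


def stream_state_dict(state_dict, write):
    """Hypothetical streaming sender: serialize one entry at a time and send
    it through `write` in fixed-size chunks."""
    for key, value in state_dict.items():
        buf = io.BytesIO()
        torch.save({key: value}, buf)
        data = buf.getvalue()
        for off in range(0, len(data), CHUNK_SIZE):
            write(data[off:off + CHUNK_SIZE])
```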