torchft
Fault tolerance for PyTorch (HSDP, LocalSGD, DiLoCo, Streaming DiLoCo)
Hi, I was following the guide in the README to run torchft locally.

```
# start lighthouse
RUST_BACKTRACE=1 torchft_lighthouse --min_replicas 1 --quorum_tick_ms 100 --join_timeout_ms 10000

# start a replica in another...
```
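For reference, a minimal sketch of what the second, truncated step might look like; the lighthouse URL, port, and torchrun flags below are assumptions for illustration, not the README's exact command:

```
# Assumed replica launch (hypothetical flags): point the replica at the lighthouse started above
export TORCHFT_LIGHTHOUSE=http://localhost:29510
torchrun --nnodes=1 --nproc_per_node=1 train_ddp.py
```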
# Maintain Constant Global Batch Size Upon Failure

With the current implementation of `DistributedSampler`, the `global_batch_size` is `group_batch_size * num_replica_group`. It may be preferable if the `DistributedSampler` were implemented...
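To make the relationship concrete, here is a minimal Python sketch of the alternative the issue hints at: fix the global batch size and derive the per-group batch size from the number of live replica groups. The function and variable names are hypothetical, not torchft API:

```python
# Hypothetical helper: keep the global batch size (approximately) constant by
# recomputing the per-group batch size whenever replica groups join or fail.
def group_batch_size(global_batch_size: int, num_replica_group: int) -> int:
    assert num_replica_group > 0
    # Today: global = group_batch_size * num_replica_group, so the global batch
    # shrinks when a replica group fails. Dividing instead keeps it steady.
    return global_batch_size // num_replica_group

# e.g. global batch 512: 3 groups -> 170 per group; after one failure, 2 groups -> 256
print(group_batch_size(512, 3), group_batch_size(512, 2))
```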
The current `train_ddp.py` has two problems:

* It cannot guarantee the sequential reading of each sample. For example, the replica group world size is 3, but only 2 replicas are...
By default, the current training generates too many logs. Here, some less important logs are downgraded to the debug level. If necessary, we can re-enable them by setting `RUST_LOG=DEBUG` and adjusting...
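For example, reusing the lighthouse command shown earlier, the debug-level messages can be surfaced again like this:

```
# Re-enable the verbose (debug-level) logs from the Rust components while troubleshooting
RUST_LOG=DEBUG RUST_BACKTRACE=1 torchft_lighthouse --min_replicas 1 --quorum_tick_ms 100 --join_timeout_ms 10000
```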
When `self._pg.allreduce([tensor], opts)` throws an exception, it returns `_DummyWork`, which is different from the normally returned `_ManagedWork`. This will cause the process to exit due to the following error.

```
...
```
Hi folks, not sure if I'm doing anything wrong. I saw a problem where the final models across ranks are different when training is interrupted. To reproduce: Use the following...
Summary: as title

Differential Revision: D86800665
I saw this error when replacing ProcessGroupNCCL with ProcessGroupBabyNCCL in train_ddp.py. ProcessGroupNCCL and ProcessGroupGloo work fine. How can I debug this?

```
ERROR:torchft.manager:[/0 - step 0] got exception in future...
```
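For context, the swap being described would look roughly like the snippet below; this is only a sketch of the selection logic, assuming the constructors need no extra arguments, and not the exact code in train_ddp.py:

```python
import torch
from torchft import ProcessGroupBabyNCCL, ProcessGroupGloo

# Sketch of the swap above: ProcessGroupBabyNCCL runs NCCL in a helper
# subprocess so that a wedged or aborted collective cannot take down the
# trainer process, whereas ProcessGroupGloo/ProcessGroupNCCL run in-process.
pg = ProcessGroupBabyNCCL() if torch.cuda.is_available() else ProcessGroupGloo()
```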
When I use `torch.distributed.checkpoint.state_dict.get_optimizer_state_dict`, e.g.,

```
optimizer_state_dict = get_optimizer_state_dict(
    model=self._model,
    optimizers=self._optimizer,
    options=StateDictOptions(
        full_state_dict=True,
        cpu_offload=True,
    ),
)
```

instead of `optimizer_state_dict = self._optimizer.state_dict()`, TorchFT gets stuck at the manager.py `should_commit()` method. Why...
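One possible explanation, offered as an assumption rather than a confirmed diagnosis: with `full_state_dict=True`, `get_optimizer_state_dict` gathers sharded optimizer state with collective communication, so every rank has to reach that call, and reach it at the same point relative to the collectives performed inside `should_commit()`; if the two interleave differently across ranks, the job hangs. A hedged sketch of an ordering that keeps the collectives aligned (the surrounding training-loop names are hypothetical):

```python
from torch.distributed.checkpoint.state_dict import (
    StateDictOptions,
    get_optimizer_state_dict,
)

def gather_optimizer_state(model, optimizer):
    # Collective gather of the (possibly sharded) optimizer state; every rank
    # must call this at the same point in the step.
    return get_optimizer_state_dict(
        model=model,
        optimizers=optimizer,
        options=StateDictOptions(full_state_dict=True, cpu_offload=True),
    )

# Hypothetical placement inside the training step, after optimizer.step():
#   osd = gather_optimizer_state(model, optimizer)  # all ranks, same point
#   if manager.should_commit():                     # manager collectives come after
#       save_checkpoint(osd)                        # hypothetical helper
```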