
Support DeepSpeed AutoTP

Open · casper-hansen opened this issue 8 months ago · 3 comments

⚠️ Please check that this feature request hasn't been suggested before.

  • [x] I searched previous Ideas in Discussions and didn't find any similar feature requests.
  • [x] I searched previous Issues and didn't find any similar feature requests.

🔖 Feature description

AutoTP was added to DeepSpeed a few weeks ago in 0.16.4, with claimed speedups of up to 4x.

> While ZeRO3 offers superior memory efficiency, it incurs significant communication costs. ZeRO (1/2) has lower communication overhead, but in the case of very large models, it cannot be used directly due to memory limitations. Therefore, combining TP with ZeRO (1/2) offers more balanced options for memory and performance. Moreover, through TP, we can alleviate the batch scaling limitations imposed by ZeRO/FSDP.

https://github.com/deepspeedai/DeepSpeed/blob/master/blogs/huggingface-tp/README.md

✔️ Solution

~~When enabling AutoTP on a Mistral model with ZeRO 2, an error is triggered right at the beginning of training: "dataset inconsistency error between DP and TP".~~ This is solved; it requires `accelerate>=1.6.0`.
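
Since the fix depends on the accelerate version, a hedged sanity check (an illustration only, not part of axolotl) can catch an outdated install before training starts:

```python
# Illustrative only: the data-inconsistency fix mentioned above requires
# accelerate >= 1.6.0, so fail fast if an older version is installed.
from packaging.version import Version

import accelerate

assert Version(accelerate.__version__) >= Version("1.6.0"), (
    f"accelerate {accelerate.__version__} is too old for DeepSpeed AutoTP; upgrade to >=1.6.0"
)
```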

This DeepSpeed config is tested on 8x H100:

```json
{
  "zero_optimization": {
    "stage": 2,
    "contiguous_gradients": true,
    "overlap_comm": true
  },
  "tensor_parallel": {
    "autotp_size": 8
  },
  "bf16": {
    "enabled": true
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "wall_clock_breakdown": false
}
```
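
For orientation (a sketch, not axolotl code): the data-parallel degree implied by the config above is `world_size // autotp_size`, which on this 8-GPU node works out to 1. The filename below is hypothetical.

```python
# Minimal sketch of how the TP/DP layout falls out of the config above.
import json

with open("ds_autotp_zero2.json") as f:  # hypothetical filename for the config shown
    cfg = json.load(f)

world_size = 8  # 8x H100, as in the report above
tp = cfg["tensor_parallel"]["autotp_size"]
assert world_size % tp == 0, "world size must be divisible by autotp_size"
dp = world_size // tp
print(f"tensor-parallel degree: {tp}, data-parallel degree: {dp}")  # -> 8, 1
```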

❓ Alternatives

No response

📝 Additional Context

@winglian asked me to open an issue.

Acknowledgements

  • [x] My issue title is concise, descriptive, and in title casing.
  • [x] I have searched the existing issues to make sure this feature has not been requested yet.
  • [x] I have provided enough information for the maintainers to understand and evaluate this request.

casper-hansen avatar Apr 05 '25 15:04 casper-hansen

A new example is out for this:

https://github.com/deepspeedai/DeepSpeedExamples/blob/592d28fa45c12613f39ed388e043be760707237c/training/tensor_parallel/train.py
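
For readers who don't want to click through, the rough shape of that example is: build a HF model, hand it to `deepspeed.initialize` together with a config carrying `tensor_parallel.autotp_size`, and train on the returned engine. The sketch below is a hedged approximation with placeholder model name and hyperparameters, not a copy of the linked script.

```python
# Hedged sketch of AutoTP + ZeRO-2 training setup (placeholders throughout).
import torch
import deepspeed
from transformers import AutoModelForCausalLM

ds_config = {
    "zero_optimization": {"stage": 2},
    "tensor_parallel": {"autotp_size": 8},
    "bf16": {"enabled": True},
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 1,
}

model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")  # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

engine, optimizer, _, _ = deepspeed.initialize(
    model=model, optimizer=optimizer, config=ds_config
)

# Training step: engine.backward()/engine.step() replace loss.backward()/optimizer.step().
# for batch in dataloader:
#     loss = engine(**batch).loss
#     engine.backward(loss)
#     engine.step()
```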

casper-hansen avatar Apr 10 '25 15:04 casper-hansen

@casper-hansen when you tried it, did you run into:

```
[rank0]:   File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1772, in inner
[rank0]:     args_kwargs_result = hook(self, args, kwargs)  # type: ignore[misc]
[rank0]:                          ^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 490, in check_dataloader_inputs_same_across_ranks
[rank0]:     broadcast_and_check(kwargs, bcast_rank, bcast_group)
[rank0]:   File "/root/miniconda3/envs/py3.11/lib/python3.11/site-packages/deepspeed/runtime/engine.py", line 479, in broadcast_and_check
[rank0]:     assert torch.equal(
[rank0]:            ^^^^^^^^^^^^
[rank0]: AssertionError: Data inconsistency within the TP group. Please check the Dataloader implementation to ensure consistency.
```
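
For context, that assertion is DeepSpeed broadcasting the first batch inside each tensor-parallel group and requiring every rank in the group to see identical data. Below is a hedged illustration (not axolotl's or accelerate's actual code) of a sampler arrangement that satisfies the check, by keying the sampler on the data-parallel rank rather than the global rank; it assumes an initialized process group and consecutive ranks forming each TP group.

```python
# Illustrative only: ranks that share a TP group must receive the same batches,
# so the sampler is built from data-parallel coordinates, not the global rank.
import torch.distributed as dist
from torch.utils.data import DistributedSampler

train_dataset = list(range(1024))  # placeholder dataset

tp_size = 8                         # matches autotp_size above
global_rank = dist.get_rank()
world_size = dist.get_world_size()

dp_size = world_size // tp_size     # number of distinct data streams
dp_rank = global_rank // tp_size    # same value for all ranks in one TP group

sampler = DistributedSampler(train_dataset, num_replicas=dp_size, rank=dp_rank)
```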

winglian avatar Apr 28 '25 05:04 winglian

@winglian yes, but I seem to remember that upgrading to the latest accelerate fixed it.

casper-hansen avatar Apr 28 '25 05:04 casper-hansen

@casper-hansen this should be working now in latest main.

winglian avatar Jul 15 '25 01:07 winglian