torchtitan
DeepSeek V3 Support
@tianyu-l Support for DeepSeek-V3 would be excellent given their top-tier performance.
Main parallelism components (a rough device-mesh sketch follows this list):
- 64-way expert parallelism
- 16-way pipeline parallelism
- with ZeRO-1 data parallelism
- Note: they do not apply TP.
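For orientation, here is a minimal sketch of how such a ZeRO-1 DP + PP + EP layout (with no TP) could be expressed with PyTorch's `DeviceMesh`. The degrees and dimension names are purely illustrative, not the DeepSeek or torchtitan configuration:

```python
# Illustrative sketch only: a 3-D device mesh with data-, pipeline-, and
# expert-parallel dimensions and no tensor parallelism. The degrees are scaled
# down for readability (the report uses 64-way EP and 16-way PP), and this
# assumes torch.distributed is already initialized with dp * pp * ep ranks.
from torch.distributed.device_mesh import init_device_mesh

dp_degree, pp_degree, ep_degree = 2, 2, 4   # toy degrees, not DeepSeek's

mesh = init_device_mesh(
    "cuda",
    (dp_degree, pp_degree, ep_degree),
    mesh_dim_names=("dp", "pp", "ep"),
)

# Each parallelism style then operates on its own 1-D sub-mesh:
dp_mesh = mesh["dp"]   # ZeRO-1: gradient all-reduce + sharded optimizer states
pp_mesh = mesh["pp"]   # point-to-point activation sends between stages
ep_mesh = mesh["ep"]   # all-to-all token dispatch/combine across experts
```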
Other main modeling components:
- multi-head latent attention (MLA)
- multi-token prediction with their MTP modules
- mixed-precision training (mix of FP8, BF16, FP32)
Model weights: https://huggingface.co/deepseek-ai/DeepSeek-V3
Paper link: https://github.com/deepseek-ai/DeepSeek-V3/blob/main/DeepSeek_V3.pdf
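As a rough illustration of the mixed-precision point above, here is a minimal, hypothetical sketch of the BF16-compute / FP32-master-weights part; the FP8 GEMMs described in the report would additionally need something like torchao's float8 support and are not shown:

```python
# Hypothetical sketch of BF16/FP32 mixed precision: parameters and optimizer
# state stay in FP32 while forward/backward compute runs under BF16 autocast.
# FP8 GEMMs (as in the DeepSeek-V3 report) are omitted here.
import torch
import torch.nn as nn

model = nn.Linear(1024, 1024, device="cuda")            # FP32 master weights
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

x = torch.randn(8, 1024, device="cuda")
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    loss = model(x).float().pow(2).mean()               # toy loss, BF16 matmul

loss.backward()        # gradients land in FP32 on the FP32 parameters
optimizer.step()
optimizer.zero_grad()
```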
@tianyu-l Given the performance of this model and the recent boom in activity, can we reasonably expect TorchTitan to support it?
I understand this model was not created by Meta, but I (along with others) would value a contribution enabling efficient training of it in TorchTitan.
I agree we probably should prioritize supporting this model.
However, I feel that supporting all the training optimizations mentioned in the technical report could be heavy and/or may not be fully aligned with the purpose of torchtitan. Would it still be interesting if we supported the model and trained it "in our own way", e.g. using parallelisms / optimizations similar to what we do for Llama?
@tianyu-l I am mainly interested in a model architecture implementation. The remaining pieces, like FP8 training and the various forms of parallelism, are already implemented in TorchTitan and should be reused.
So it's mainly the following components that I am asking for (a simplified MLA sketch follows this list):
- MoE
- multi-head latent attention (MLA)
- multi-token prediction heads
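To make the MLA item concrete, here is a simplified, hypothetical sketch of the core idea (keys/values reconstructed from a small shared latent). It is not the actual DeepSeek-V3 module; the decoupled RoPE branch and query compression are omitted, and all dimensions are illustrative:

```python
# Simplified, hypothetical sketch of multi-head latent attention (MLA):
# keys/values are rebuilt from a compact shared latent vector (which is what
# would be cached at inference) instead of full per-head KV projections.
# The decoupled RoPE branch and query compression of DeepSeek-V3 are omitted.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimplifiedMLA(nn.Module):
    def __init__(self, dim=2048, n_heads=16, head_dim=128, kv_latent_dim=512):
        super().__init__()
        self.n_heads, self.head_dim = n_heads, head_dim
        self.wq = nn.Linear(dim, n_heads * head_dim, bias=False)
        self.w_dkv = nn.Linear(dim, kv_latent_dim, bias=False)   # KV down-projection
        self.w_uk = nn.Linear(kv_latent_dim, n_heads * head_dim, bias=False)
        self.w_uv = nn.Linear(kv_latent_dim, n_heads * head_dim, bias=False)
        self.wo = nn.Linear(n_heads * head_dim, dim, bias=False)

    def forward(self, x):                       # x: (batch, seq, dim)
        b, s, _ = x.shape
        q = self.wq(x).view(b, s, self.n_heads, self.head_dim).transpose(1, 2)
        c_kv = self.w_dkv(x)                    # compact latent: (batch, seq, kv_latent_dim)
        k = self.w_uk(c_kv).view(b, s, self.n_heads, self.head_dim).transpose(1, 2)
        v = self.w_uv(c_kv).view(b, s, self.n_heads, self.head_dim).transpose(1, 2)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.wo(out.transpose(1, 2).reshape(b, s, -1))

# y = SimplifiedMLA()(torch.randn(2, 16, 2048))  # example usage
```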
A good starting point would be the ability to convert the weights provided on Hugging Face to TorchTitan and continue training from them.
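As a rough illustration of that conversion step, here is a hypothetical sketch that loads HF safetensors shards and remaps parameter names into a torchtitan-style state dict. The rename rules are invented for illustration; the real mapping would have to be derived from the actual module names on both sides:

```python
# Hypothetical sketch: load DeepSeek-V3 weights from a local Hugging Face
# checkpoint directory (safetensors shards) and remap parameter names into a
# torchtitan-style state dict. The rename rules below are illustrative only.
import glob
from safetensors.torch import load_file

def convert_hf_to_torchtitan(hf_checkpoint_dir: str) -> dict:
    hf_state = {}
    for shard in sorted(glob.glob(f"{hf_checkpoint_dir}/*.safetensors")):
        hf_state.update(load_file(shard))

    # Invented prefix mapping; the real one must match the actual HF and
    # torchtitan parameter names (including the MLA/MoE submodules).
    rename_rules = {
        "model.embed_tokens.": "tok_embeddings.",
        "model.layers.": "layers.",
        "lm_head.": "output.",
    }

    titan_state = {}
    for name, tensor in hf_state.items():
        new_name = name
        for old, new in rename_rules.items():
            if name.startswith(old):
                new_name = new + name[len(old):]
                break
        titan_state[new_name] = tensor
    return titan_state

# Usage (after downloading the checkpoint locally):
# state_dict = convert_hf_to_torchtitan("path/to/DeepSeek-V3")
# model.load_state_dict(state_dict, strict=False)
```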
Is the MoE architecture already supported in this branch? https://github.com/pytorch/torchtitan/pull/730
Hi @lxww302 - that branch is working for tp2ep. I hit some issues re: dp2ep, so I would not use that yet, but tp2ep was working perfectly in my brief spin with it.
Hey folks, super interested in DeepSeek support updates.
See an experiment config in https://github.com/pytorch/torchtitan/blob/7d5f3cc698853d2227cf5433776406d0e0345424/torchtitan/experiments/deepseek_v3/
Does Titan support V3 training now?
@Opdoop we are working on a clean version in https://github.com/pytorch/torchtitan/tree/deepseek-v3/torchtitan/models/deepseek_v3
It will land in the main branch's torchtitan/models folder soon.
Hi @tianyu-l , what's the status of deepseek_v3 support? Is it complete? If so, is there an example config with stats on step time, MFU, or loss convergence? Thanks.
Hi @ruomingp, thanks for asking. Happy to share progress.
Currently we have the model + FSDP + TP + basic PP + basic EP (+ CP under debugging) + torch.compile (+ experimental rowwise FP8).
In terms of model correctness, we verified that the loss converges with FSDP+TP. We are doing final parity checks against the HF implementation. Currently we don't have MTP, but that can be added.
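For reference, a toy sketch of what a multi-token prediction head could look like; this is not the DeepSeek-V3 MTP module (which chains full transformer blocks per prediction depth and shares the embedding/output layers), and all names and shapes are illustrative:

```python
# Toy, hypothetical sketch of multi-token prediction (MTP): besides the usual
# next-token logits, an extra lightweight head predicts the token two positions
# ahead, and its loss is added with a small weight.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMTPHead(nn.Module):
    def __init__(self, dim=1024, vocab_size=32000, mtp_loss_weight=0.3):
        super().__init__()
        self.next_token_head = nn.Linear(dim, vocab_size, bias=False)
        self.mtp_proj = nn.Linear(dim, dim, bias=False)      # extra MTP branch
        self.mtp_head = nn.Linear(dim, vocab_size, bias=False)
        self.mtp_loss_weight = mtp_loss_weight

    def forward(self, hidden, tokens):
        # hidden: (batch, seq, dim) final hidden states; tokens: (batch, seq)
        logits_1 = self.next_token_head(hidden[:, :-1])           # predict t+1
        logits_2 = self.mtp_head(self.mtp_proj(hidden[:, :-2]))   # predict t+2
        loss_1 = F.cross_entropy(logits_1.flatten(0, 1), tokens[:, 1:].flatten())
        loss_2 = F.cross_entropy(logits_2.flatten(0, 1), tokens[:, 2:].flatten())
        return loss_1 + self.mtp_loss_weight * loss_2
```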
The throughput won't be very high today as some important features in EP are still work in progress, e.g.
- those in DeepEP, such as hierarchical all-to-all for cross-node comm dedup and NVSHMEM-based all-to-all to avoid D2H syncs (this might be worked around by directly using DeepEP)
- those to hide all-to-all comms, e.g. shared-expert overlapping (straightforward to add but model-intrusive) or DualPipe (a toy overlap sketch follows this list)
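To make the "hide all-to-all comms" point concrete, here is a toy sketch of overlapping the shared-expert computation with an asynchronous dispatch all-to-all. This is not how DeepEP or DualPipe work; it only shows the basic async pattern, and it assumes a NCCL process group is already initialized with equal send/receive splits:

```python
# Toy sketch: overlap the shared-expert GEMM with the routed-token dispatch
# all-to-all by launching the collective asynchronously and waiting on it only
# when the dispatched tokens are needed. Shapes/splits are illustrative.
import torch
import torch.distributed as dist

def moe_layer_with_overlap(tokens, shared_expert, routed_experts_fn):
    # tokens: (num_tokens, dim), already permuted into per-rank send order
    recv_buf = torch.empty_like(tokens)

    # 1. Kick off the dispatch all-to-all without blocking.
    work = dist.all_to_all_single(recv_buf, tokens, async_op=True)

    # 2. While the collective runs on NCCL's internal stream, compute the
    #    shared expert on the local (pre-dispatch) tokens.
    shared_out = shared_expert(tokens)

    # 3. Wait for the dispatched tokens and run the routed experts. A real MoE
    #    layer would then do a combine all-to-all before summing with
    #    shared_out; that step is omitted here.
    work.wait()
    routed_out = routed_experts_fn(recv_buf)
    return shared_out, routed_out
```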
Let me know if you'd like to learn more details.