torchtitan
DeepSeek V3 Support
@tianyu-l Support for DeepSeek-V3 would be excellent given their top-tier performance.
Main parallelism components (a rough device-mesh sketch follows this list):
- 64-way expert parallelism
- 16-way pipeline parallelism
- with ZeRO-1 data parallelism
- Note: they do not apply TP.
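For orientation, here is a minimal sketch of how such a ZeRO-1 DP + PP + EP layout (with no TP) could be expressed with PyTorch's `DeviceMesh`. The degrees and dimension names are purely illustrative, not the DeepSeek or torchtitan configuration:

```python
# Illustrative sketch only: a 3-D device mesh with data-, pipeline-, and
# expert-parallel dimensions and no tensor parallelism. The degrees are scaled
# down for readability (the report uses 64-way EP and 16-way PP), and this
# assumes torch.distributed is already initialized with dp * pp * ep ranks.
from torch.distributed.device_mesh import init_device_mesh

dp_degree, pp_degree, ep_degree = 2, 2, 4   # toy degrees, not DeepSeek's

mesh = init_device_mesh(
    "cuda",
    (dp_degree, pp_degree, ep_degree),
    mesh_dim_names=("dp", "pp", "ep"),
)

# Each parallelism style then operates on its own 1-D sub-mesh:
dp_mesh = mesh["dp"]   # ZeRO-1: gradient all-reduce + sharded optimizer states
pp_mesh = mesh["pp"]   # point-to-point activation sends between stages
ep_mesh = mesh["ep"]   # all-to-all token dispatch/combine across experts
```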
Other main modeling components:
- multi-head latent attention (MLA)
- multi-token prediction with their MTP modules
- mixed-precision training (mix of FP8, BF16, FP32)
Model weights: https://huggingface.co/deepseek-ai/DeepSeek-V3
Paper link: https://github.com/deepseek-ai/DeepSeek-V3/blob/main/DeepSeek_V3.pdf
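As a rough illustration of the mixed-precision point above, here is a minimal, hypothetical sketch of the BF16-compute / FP32-master-weights part; the FP8 GEMMs described in the report would additionally need something like torchao's float8 support and are not shown:

```python
# Hypothetical sketch of BF16/FP32 mixed precision: parameters and optimizer
# state stay in FP32 while forward/backward compute runs under BF16 autocast.
# FP8 GEMMs (as in the DeepSeek-V3 report) are omitted here.
import torch
import torch.nn as nn

model = nn.Linear(1024, 1024, device="cuda")            # FP32 master weights
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

x = torch.randn(8, 1024, device="cuda")
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    loss = model(x).float().pow(2).mean()               # toy loss, BF16 matmul

loss.backward()        # gradients land in FP32 on the FP32 parameters
optimizer.step()
optimizer.zero_grad()
```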
@tianyu-l Given the performance of this model and the recent boom in activity, can we reasonably expect TorchTitan to support it?
I understand this model was not created by Meta, but I (along with others) would value a contribution enabling efficient training of it in TorchTitan.
I agree we probably should prioritize supporting this model.
However, I feel that supporting all the training optimizations mentioned in the technical report could be heavy and/or may not be fully aligned with the purpose of torchtitan. Would it still be interesting if we supported the model and trained it "in our own way", e.g. using parallelisms / optimizations similar to what we do for Llama?
@tianyu-l I am mainly interested in a model architecture implementation. The remaining pieces, like FP8 training and the various forms of parallelism, are already implemented in TorchTitan and should be reused.
So it's mainly the following components that I am asking for (a simplified MLA sketch follows this list):
- MoE
- multi-head latent attention (MLA)
- multi-token prediction heads
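To make the MLA item concrete, here is a simplified, hypothetical sketch of the core idea (keys/values reconstructed from a small shared latent). It is not the actual DeepSeek-V3 module; the decoupled RoPE branch and query compression are omitted, and all dimensions are illustrative:

```python
# Simplified, hypothetical sketch of multi-head latent attention (MLA):
# keys/values are rebuilt from a compact shared latent vector (which is what
# would be cached at inference) instead of full per-head KV projections.
# The decoupled RoPE branch and query compression of DeepSeek-V3 are omitted.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimplifiedMLA(nn.Module):
    def __init__(self, dim=2048, n_heads=16, head_dim=128, kv_latent_dim=512):
        super().__init__()
        self.n_heads, self.head_dim = n_heads, head_dim
        self.wq = nn.Linear(dim, n_heads * head_dim, bias=False)
        self.w_dkv = nn.Linear(dim, kv_latent_dim, bias=False)   # KV down-projection
        self.w_uk = nn.Linear(kv_latent_dim, n_heads * head_dim, bias=False)
        self.w_uv = nn.Linear(kv_latent_dim, n_heads * head_dim, bias=False)
        self.wo = nn.Linear(n_heads * head_dim, dim, bias=False)

    def forward(self, x):                       # x: (batch, seq, dim)
        b, s, _ = x.shape
        q = self.wq(x).view(b, s, self.n_heads, self.head_dim).transpose(1, 2)
        c_kv = self.w_dkv(x)                    # compact latent: (batch, seq, kv_latent_dim)
        k = self.w_uk(c_kv).view(b, s, self.n_heads, self.head_dim).transpose(1, 2)
        v = self.w_uv(c_kv).view(b, s, self.n_heads, self.head_dim).transpose(1, 2)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.wo(out.transpose(1, 2).reshape(b, s, -1))

# y = SimplifiedMLA()(torch.randn(2, 16, 2048))  # example usage
```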
A good starting point would be the ability to convert the weights provided on Hugging Face to TorchTitan and continue training from them.
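As a rough illustration of that conversion step, here is a hypothetical sketch that loads HF safetensors shards and remaps parameter names into a torchtitan-style state dict. The rename rules are invented for illustration; the real mapping would have to be derived from the actual module names on both sides:

```python
# Hypothetical sketch: load DeepSeek-V3 weights from a local Hugging Face
# checkpoint directory (safetensors shards) and remap parameter names into a
# torchtitan-style state dict. The rename rules below are illustrative only.
import glob
from safetensors.torch import load_file

def convert_hf_to_torchtitan(hf_checkpoint_dir: str) -> dict:
    hf_state = {}
    for shard in sorted(glob.glob(f"{hf_checkpoint_dir}/*.safetensors")):
        hf_state.update(load_file(shard))

    # Invented prefix mapping; the real one must match the actual HF and
    # torchtitan parameter names (including the MLA/MoE submodules).
    rename_rules = {
        "model.embed_tokens.": "tok_embeddings.",
        "model.layers.": "layers.",
        "lm_head.": "output.",
    }

    titan_state = {}
    for name, tensor in hf_state.items():
        new_name = name
        for old, new in rename_rules.items():
            if name.startswith(old):
                new_name = new + name[len(old):]
                break
        titan_state[new_name] = tensor
    return titan_state

# Usage (after downloading the checkpoint locally):
# state_dict = convert_hf_to_torchtitan("path/to/DeepSeek-V3")
# model.load_state_dict(state_dict, strict=False)
```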
Is the MoE architecture already supported in this branch? https://github.com/pytorch/torchtitan/pull/730
Hi @lxww302 - that branch is working for tp2ep. I hit some issues re: dp2ep, so I would not use that yet, but tp2ep was working perfectly in my brief spin with it.
Hey folks, super interested in DeepSeek support updates.
See an experiment config in https://github.com/pytorch/torchtitan/blob/7d5f3cc698853d2227cf5433776406d0e0345424/torchtitan/experiments/deepseek_v3/
Does Titan support V3 training now?
@Opdoop we are working on a clean version in https://github.com/pytorch/torchtitan/tree/deepseek-v3/torchtitan/models/deepseek_v3
It will land in the main branch's torchtitan/models folder soon.
Hi @tianyu-l , what's the status of deepseek_v3 support? Is it complete? If so, is there an example config with stats on step time, MFU, or loss convergence? Thanks.
Hi @ruomingp, thanks for asking. Happy to share progress.
Currently we have the model + FSDP + TP + basic PP + basic EP (+ CP under debugging) + torch.compile (+ experimental rowwise FP8).
In terms of model correctness, we verified that the loss converges with FSDP+TP. We are doing final parity checks against the HF implementation. Currently we don't have MTP, but that can be added.
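For reference, a toy sketch of what a multi-token prediction head could look like; this is not the DeepSeek-V3 MTP module (which chains full transformer blocks per prediction depth and shares the embedding/output layers), and all names and shapes are illustrative:

```python
# Toy, hypothetical sketch of multi-token prediction (MTP): besides the usual
# next-token logits, an extra lightweight head predicts the token two positions
# ahead, and its loss is added with a small weight.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMTPHead(nn.Module):
    def __init__(self, dim=1024, vocab_size=32000, mtp_loss_weight=0.3):
        super().__init__()
        self.next_token_head = nn.Linear(dim, vocab_size, bias=False)
        self.mtp_proj = nn.Linear(dim, dim, bias=False)      # extra MTP branch
        self.mtp_head = nn.Linear(dim, vocab_size, bias=False)
        self.mtp_loss_weight = mtp_loss_weight

    def forward(self, hidden, tokens):
        # hidden: (batch, seq, dim) final hidden states; tokens: (batch, seq)
        logits_1 = self.next_token_head(hidden[:, :-1])           # predict t+1
        logits_2 = self.mtp_head(self.mtp_proj(hidden[:, :-2]))   # predict t+2
        loss_1 = F.cross_entropy(logits_1.flatten(0, 1), tokens[:, 1:].flatten())
        loss_2 = F.cross_entropy(logits_2.flatten(0, 1), tokens[:, 2:].flatten())
        return loss_1 + self.mtp_loss_weight * loss_2
```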
The throughput won't be very high today as some important features in EP are still work in progress, e.g.
- those in DeepEP, such as hierarchical all-to-all for cross-node comm dedup and NVSHMEM-based all-to-all to avoid D2H syncs (this might be worked around by directly using DeepEP)
- those to hide all-to-all comms, e.g. shared-expert overlapping (straightforward to add but model-intrusive) or DualPipe (a toy overlap sketch follows this list)
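To make the "hide all-to-all comms" point concrete, here is a toy sketch of overlapping the shared-expert computation with an asynchronous dispatch all-to-all. This is not how DeepEP or DualPipe work; it only shows the basic async pattern, and it assumes a NCCL process group is already initialized with equal send/receive splits:

```python
# Toy sketch: overlap the shared-expert GEMM with the routed-token dispatch
# all-to-all by launching the collective asynchronously and waiting on it only
# when the dispatched tokens are needed. Shapes/splits are illustrative.
import torch
import torch.distributed as dist

def moe_layer_with_overlap(tokens, shared_expert, routed_experts_fn):
    # tokens: (num_tokens, dim), already permuted into per-rank send order
    recv_buf = torch.empty_like(tokens)

    # 1. Kick off the dispatch all-to-all without blocking.
    work = dist.all_to_all_single(recv_buf, tokens, async_op=True)

    # 2. While the collective runs on NCCL's internal stream, compute the
    #    shared expert on the local (pre-dispatch) tokens.
    shared_out = shared_expert(tokens)

    # 3. Wait for the dispatched tokens and run the routed experts. A real MoE
    #    layer would then do a combine all-to-all before summing with
    #    shared_out; that step is omitted here.
    work.wait()
    routed_out = routed_experts_fn(recv_buf)
    return shared_out, routed_out
```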
Let me know if you'd like to learn more details.