[ENHANCEMENT] Multi-token Prediction (MTP) support
Is there any plan to support MTP as in DeepSeek V3? It seems to accelerate prediction.
We are actively working on it. We have implemented one version and are testing it to make sure it works as expected.
@Victarry Very glad to see the NVIDIA folks working on it. Is there a definite release plan? We are all looking forward to trying MTP.
@lk137095576 You can view the code prior to the official release.
@Victarry Could you please share the progress of MTP in Megatron-LM? We are looking forward to trying this amazing feature.
@zhaoyang-star The implementation of MTP is almost ready, and we have verified its convergence.
One remaining issue is deciding the best design so that MTP works well with future performance optimizations such as computation-communication overlapping in pipeline parallelism (PP). The current target release is Core 0.12.
@lk137095576 Hey, when you say DeepSeek V3, do you mean that Megatron-LM / Megatron-Core is adding support for pretraining the DeepSeek V3 model?
If so, could you please share a link to the relevant resource?
Hi all, I'm glad to announce that MTP support in MCore has been merged into the main branch: https://github.com/NVIDIA/Megatron-LM/commit/dc385c76f3ced50f5b05597cbe09ab4ab5192b7d
Feel free to give it a try and raise issues if you run into any problems. Thanks for your support and feedback!
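If you want to try it from Python, a rough sketch of enabling MTP through MCore's `TransformerConfig` might look like the following. The `mtp_*` field names are assumptions written from memory of the merged support; please verify them against the commit linked above.

```python
# Sketch only: the mtp_* field names are assumptions and should be checked
# against the merged MTP support in Megatron-Core.
from megatron.core.transformer.transformer_config import TransformerConfig

config = TransformerConfig(
    num_layers=4,
    hidden_size=1024,
    num_attention_heads=8,
    # MTP-specific knobs (assumed names):
    mtp_num_layers=1,              # number of additional MTP prediction depths
    mtp_loss_scaling_factor=0.1,   # weight of the MTP loss added to the main LM loss
)
```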
@Victarry Thanks for your effort! Do we have a target date for the 0.12.0 release?
cc @ko3n1g for the core_0.12.0 public release date.
@Victarry Hi! I've noticed a potential issue: when the output layer is shared between the main module and the MTP module, the backward hook on its weight should be triggered twice. In theory, the second trigger should hit the assertion `assert param not in self.params_with_grad, 'Cannot set grad twice'`, because the first call to `register_grad_ready` has already added this parameter to `self.params_with_grad`.
Have you encountered this error? If not, what might be the reason? Is there a special mechanism handling gradient accumulation for shared parameters? Thanks!
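For anyone who wants to check this locally, here is a minimal standalone sketch (plain PyTorch, not Megatron code) that counts how many times the gradient-accumulation hook fires for a weight used twice in one forward pass. The `expand_as` trick mirrors the way Megatron attaches its hooks; the module and variable names are purely illustrative.

```python
import torch
import torch.nn as nn

# Stand-in for an output layer shared by the main module and the MTP module.
shared = nn.Linear(8, 8, bias=False)
x = torch.randn(4, 8)

fire_count = {"n": 0}

# Megatron-style trick: fetch the AccumulateGrad node of the shared weight and
# attach a hook to it (the same kind of hook the register_grad_ready path relies on).
grad_acc = shared.weight.expand_as(shared.weight).grad_fn.next_functions[0][0]
grad_acc.register_hook(lambda *args: fire_count.__setitem__("n", fire_count["n"] + 1))

# Use the same weight twice in one forward pass, as the main head and the MTP head would.
loss = shared(x).sum() + shared(x).sum()
loss.backward()

# Observe whether the hook fires once or twice per backward pass.
print("grad-accumulation hook fired", fire_count["n"], "time(s)")
```

My understanding is that autograd buffers the gradients arriving from both uses and executes the AccumulateGrad node only once per backward pass, which would explain why the assertion is never hit; the sketch above makes that easy to verify.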
Hi, @Infi-zc, thanks for your feedback. Can you provide the hyperparameters used to reproduce the issue? I will first try to replicate the problem and then proceed with debugging and analysis.
The feature request for MTP support has been implemented and merged in commit dc385c7. We're closing this issue.
@Infi-zc Please open a new issue for the backward-hook concern you've identified, including a repro. This will help us track and address it.
Thanks for the reply! The earlier worry was unnecessary; we can disregard this now.