[ENHANCEMENT] Multi-token Prediction (MTP) support
Is there any plan to support MTP as in DeepSeek V3? It seems to accelerate prediction.
We are actively working on it. We have implemented one version and are testing it to make sure it works as expected.
@Victarry Very glad to see the NVIDIA folks working on it. Is there a definite release plan? We are all looking forward to trying MTP.
@lk137095576 You can view the code prior to the official release.
@Victarry Could you please share the progress of MTP in Megatron-LM? We are looking forward to trying this amazing feature.
@zhaoyang-star The implementation of MTP is almost ready, and we have verified its convergence.
One remaining issue is deciding the best design so that MTP works well with future performance optimizations such as computation-communication overlapping in pipeline parallelism (PP). The current target release is Core 0.12.
@lk137095576 Hey, when you say DeepSeek V3, do you mean that Megatron-LM / Megatron-Core is adding support for pretraining the DeepSeek V3 model?
If so, could you please share a link to the relevant resource?
Hi all, I'm glad to announce that MTP support in MCore has been merged into the main branch: https://github.com/NVIDIA/Megatron-LM/commit/dc385c76f3ced50f5b05597cbe09ab4ab5192b7d
Feel free to give it a try and raise issues if you run into any problems. Thanks for your support and feedback!
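If you want to try it from Python, a rough sketch of enabling MTP through MCore's `TransformerConfig` might look like the following. The `mtp_*` field names are assumptions written from memory of the merged support; please verify them against the commit linked above.

```python
# Sketch only: the mtp_* field names are assumptions and should be checked
# against the merged MTP support in Megatron-Core.
from megatron.core.transformer.transformer_config import TransformerConfig

config = TransformerConfig(
    num_layers=4,
    hidden_size=1024,
    num_attention_heads=8,
    # MTP-specific knobs (assumed names):
    mtp_num_layers=1,              # number of additional MTP prediction depths
    mtp_loss_scaling_factor=0.1,   # weight of the MTP loss added to the main LM loss
)
```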
@Victarry Thanks for your effort! Do we have a target date for the 0.12.0 release?
cc @ko3n1g for the core_0.12.0 public release date.
@Victarry Hi! I've noticed a potential issue: when the output layer is shared between the main module and the MTP module, the backward hook on its weight should be triggered twice. In theory, the second trigger should hit the assertion `assert param not in self.params_with_grad, 'Cannot set grad twice'`, because the first call to `register_grad_ready` has already added this parameter to `self.params_with_grad`.
Have you encountered this error? If not, what might be the reason? Is there a special mechanism handling gradient accumulation for shared parameters? Thanks!
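For anyone who wants to check this locally, here is a minimal standalone sketch (plain PyTorch, not Megatron code) that counts how many times the gradient-accumulation hook fires for a weight used twice in one forward pass. The `expand_as` trick mirrors the way Megatron attaches its hooks; the module and variable names are purely illustrative.

```python
import torch
import torch.nn as nn

# Stand-in for an output layer shared by the main module and the MTP module.
shared = nn.Linear(8, 8, bias=False)
x = torch.randn(4, 8)

fire_count = {"n": 0}

# Megatron-style trick: fetch the AccumulateGrad node of the shared weight and
# attach a hook to it (the same kind of hook the register_grad_ready path relies on).
grad_acc = shared.weight.expand_as(shared.weight).grad_fn.next_functions[0][0]
grad_acc.register_hook(lambda *args: fire_count.__setitem__("n", fire_count["n"] + 1))

# Use the same weight twice in one forward pass, as the main head and the MTP head would.
loss = shared(x).sum() + shared(x).sum()
loss.backward()

# Observe whether the hook fires once or twice per backward pass.
print("grad-accumulation hook fired", fire_count["n"], "time(s)")
```

My understanding is that autograd buffers the gradients arriving from both uses and executes the AccumulateGrad node only once per backward pass, which would explain why the assertion is never hit; the sketch above makes that easy to verify.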
Hi, @Infi-zc, thanks for your feedback. Can you provide the hyperparameters used to reproduce the issue? I will first try to replicate the problem and then proceed with debugging and analysis.
The feature request for MTP support has been implemented and merged in commit dc385c7. We're closing this issue.
@Infi-zc Please open a new issue for the backward-hook concern you've identified, including a repro. This will help us track and address it.
Thanks for the reply! The earlier worry was unnecessary; we can disregard this now.