lightllm
fix: MTP in chunked prefill mode
In chunked prefill mode, when a long sequence is split into multiple chunks, the next_token_ids used to fill the draft model's KV cache may be incorrect. This PR adds the first token id of the next chunk to ModelInput to support MTP inference.
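To illustrate the fix, here is a minimal sketch (the function name and signature are illustrative, not lightllm's actual API): when a prompt is split into chunks, the correct "next token" for a non-final chunk is the first token of the *next* chunk of the prompt, not a sampled token.

```python
def get_chunk_and_next_token(prompt_ids, chunk_start, chunk_size):
    """Return the current chunk and the first token id of the next chunk.

    For the last chunk there is no next-chunk token, so None is returned;
    in that case the sampled next_token_ids are the right thing to use.
    """
    chunk = prompt_ids[chunk_start:chunk_start + chunk_size]
    next_start = chunk_start + chunk_size
    next_token = prompt_ids[next_start] if next_start < len(prompt_ids) else None
    return chunk, next_token


prompt = [101, 7, 42, 9, 55, 88, 13]
chunk, nxt = get_chunk_and_next_token(prompt, chunk_start=0, chunk_size=4)
# chunk is [101, 7, 42, 9]; nxt is 55, the first id of the next chunk,
# which is what the draft model's KV cache should be filled with.
```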
Summary of Changes
Hello @sufubao, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request addresses a correctness issue in multi-token prediction (MTP) inference under chunked prefill mode. When a long sequence is split into multiple chunks, the next_token_ids used to fill the draft model's KV cache may be incorrect. By introducing a new field in ModelInput that stores the first token id of the next chunk, and adjusting the MTP inference logic accordingly, the draft model's KV cache is filled correctly in chunked prefill scenarios, improving inference accuracy.
Highlights
- New ModelInput field: adds a `b_chunked_prefill_next_token_ids_cpu` field to the `ModelInput` class to store the first token id of the next chunk in chunked prefill mode, supporting MTP inference.
- Changed `get_chuncked_input_token_ids` return value: this method in `infer_batch.py` now returns both the current chunk's token ids and the first token id of the next chunk, providing the information needed for the MTP fix.
- MTP inference fix: in the `prefill_mtp` methods of the `chunked_prefill` and `dp_backend` backends, the draft model's KV cache is now filled with either the regular `next_token_ids` or the newly introduced `b_chunked_prefill_next_token_ids_cpu`, depending on whether the request's prompt has been fully prefilled, fixing MTP inference in chunked prefill mode.
- Preprocessing updates: the preprocessing logic in `generic_padded_pre_process.py` and `generic_pre_process.py` now collects the first token id of the next chunk and passes it to `ModelInput`.
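The per-request selection described above can be sketched as follows (a minimal illustration assuming simple Python lists; the function name, the `is_prefill_finished` flags, and the use of a sentinel for finished requests are hypothetical, not lightllm's actual code):

```python
def pick_draft_fill_ids(is_prefill_finished, sampled_next_token_ids,
                        chunked_prefill_next_token_ids):
    """Choose which token ids fill the draft model's KV cache.

    For a request whose prompt is fully prefilled, the sampled token is
    the true next token; for a request still mid-prompt, the first token
    of its next chunk is the correct continuation.
    """
    return [
        sampled if done else chunk_next
        for done, sampled, chunk_next in zip(
            is_prefill_finished,
            sampled_next_token_ids,
            chunked_prefill_next_token_ids,
        )
    ]


# Request 0 finished its prompt (use the sampled id 11);
# request 1 still has chunks left (use the next chunk's first id 99).
fill_ids = pick_draft_fill_ids([True, False], [11, 12], [-1, 99])
# fill_ids is [11, 99]
```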