lightllm
fix: MTP in chunked prefill mode
In chunked prefill mode, when a long sequence is split into multiple chunks, the next_token_ids used to fill the draft model's KV cache may be incorrect. This PR adds the first token id of the next chunk to ModelInput to support MTP inference.
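To illustrate the fix, here is a minimal sketch (the function name and signature are illustrative, not lightllm's actual API): when a prompt is split into chunks, the correct "next token" for a non-final chunk is the first token of the *next* chunk of the prompt, not a sampled token.

```python
def get_chunk_and_next_token(prompt_ids, chunk_start, chunk_size):
    """Return the current chunk and the first token id of the next chunk.

    For the last chunk there is no next-chunk token, so None is returned;
    in that case the sampled next_token_ids are the right thing to use.
    """
    chunk = prompt_ids[chunk_start:chunk_start + chunk_size]
    next_start = chunk_start + chunk_size
    next_token = prompt_ids[next_start] if next_start < len(prompt_ids) else None
    return chunk, next_token


prompt = [101, 7, 42, 9, 55, 88, 13]
chunk, nxt = get_chunk_and_next_token(prompt, chunk_start=0, chunk_size=4)
# chunk is [101, 7, 42, 9]; nxt is 55, the first id of the next chunk,
# which is what the draft model's KV cache should be filled with.
```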
Summary of Changes
Hello @sufubao, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request addresses a correctness issue in multi-token prediction (MTP) inference under chunked prefill mode. When a long sequence is split into multiple chunks, the next_token_ids used to fill the draft model's KV cache may be incorrect. By introducing a new field in ModelInput that stores the first token id of the next chunk, and adjusting the MTP inference logic accordingly, the draft model's KV cache is filled correctly in chunked prefill scenarios, improving inference accuracy.
Highlights
- New ModelInput field: adds a `b_chunked_prefill_next_token_ids_cpu` field to the `ModelInput` class to store the first token id of the next chunk in chunked prefill mode, supporting MTP inference.
- Changed `get_chuncked_input_token_ids` return value: this method in `infer_batch.py` now returns both the current chunk's token ids and the first token id of the next chunk, providing the information needed for the MTP fix.
- MTP inference fix: in the `prefill_mtp` methods of the `chunked_prefill` and `dp_backend` backends, the draft model's KV cache is now filled with either the regular `next_token_ids` or the newly introduced `b_chunked_prefill_next_token_ids_cpu`, depending on whether the request's prompt has been fully prefilled, fixing MTP inference in chunked prefill mode.
- Preprocessing updates: the preprocessing logic in `generic_padded_pre_process.py` and `generic_pre_process.py` now collects the first token id of the next chunk and passes it to `ModelInput`.
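The per-request selection described above can be sketched as follows (a minimal illustration assuming simple Python lists; the function name, the `is_prefill_finished` flags, and the use of a sentinel for finished requests are hypothetical, not lightllm's actual code):

```python
def pick_draft_fill_ids(is_prefill_finished, sampled_next_token_ids,
                        chunked_prefill_next_token_ids):
    """Choose which token ids fill the draft model's KV cache.

    For a request whose prompt is fully prefilled, the sampled token is
    the true next token; for a request still mid-prompt, the first token
    of its next chunk is the correct continuation.
    """
    return [
        sampled if done else chunk_next
        for done, sampled, chunk_next in zip(
            is_prefill_finished,
            sampled_next_token_ids,
            chunked_prefill_next_token_ids,
        )
    ]


# Request 0 finished its prompt (use the sampled id 11);
# request 1 still has chunks left (use the next chunk's first id 99).
fill_ids = pick_draft_fill_ids([True, False], [11, 12], [-1, 99])
# fill_ids is [11, 99]
```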