[Cherry-Pick][Speculative Decoding][BugFix] Fix attention bug in spec decoding (#5460)

Open Copilot opened this issue 1 month ago • 2 comments

Motivation

Cherry-pick PR #5460 to the release/2.4 branch to fix a KV partitioning bug in speculative decoding.

Modifications

fastdeploy/engine/engine.py

  • Removed hardcoded FLAGS_max_partition_size=1024 for multimodal models

fastdeploy/model_executor/layers/attention/append_attn_backend.py

  • Disabled KV partitioning when speculative decoding is active by setting max_partition_size = max_seq_len
  • This prevents the split-KV bug that corrupts attention in speculative decoding
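
The effect of the change above can be illustrated with a minimal sketch. It is not the code from append_attn_backend.py: the helper names `resolve_max_partition_size` and `num_kv_partitions`, the `configured_size` parameter, and its default of 1024 are assumptions made for illustration. The point is only that once the partition size equals max_seq_len, every sequence fits in a single partition, so the KV cache is never split.

```python
import math


def resolve_max_partition_size(
    max_seq_len: int,
    speculative_decoding: bool,
    configured_size: int = 1024,  # assumed default, for illustration only
) -> int:
    """Hypothetical helper: choose the KV partition size for attention."""
    if speculative_decoding:
        # A partition as long as the longest sequence means KV is never split.
        return max_seq_len
    # Otherwise keep whatever partition size was configured.
    return configured_size


def num_kv_partitions(seq_len: int, max_partition_size: int) -> int:
    """Hypothetical helper: number of KV chunks a sequence is split into."""
    return math.ceil(seq_len / max_partition_size)


if __name__ == "__main__":
    max_seq_len = 8192
    for spec in (False, True):
        size = resolve_max_partition_size(max_seq_len, spec)
        parts = num_kv_partitions(max_seq_len, size)
        print(f"speculative_decoding={spec}: partition_size={size}, partitions={parts}")
```

Under these assumptions, a full-length sequence is split into several partitions in the normal path and into exactly one when speculative decoding is enabled.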

Usage or Command

No usage changes. The fix is applied automatically when speculative decoding is enabled.

Accuracy Tests

Accuracy testing was performed in the original PR #5460.

Checklist

  • [x] Add at least one tag in the PR title.
    • Tag list: [FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]
    • You can add new tags based on the PR content, but the semantics must be clear.
  • [x] Format your code, run pre-commit before commit.
  • [x] Add unit tests. Please write the reason in this PR if no unit tests.
  • [x] Provide accuracy results.
  • [x] If the current PR is submitting to the release branch, make sure the PR has been submitted to the develop branch, then cherry-pick it to the release branch with the [Cherry-Pick] PR tag.
Original prompt

This section details the original issue you should resolve

<issue_title>Please also submit the changes from https://github.com/PaddlePaddle/FastDeploy/pull/5460 to release/2.4</issue_title> <issue_description></issue_description>

<agent_instructions>cherry-pick PR</agent_instructions>

Comments on the Issue (you are @copilot in this section)

  • Fixes PaddlePaddle/FastDeploy#5478

Copilot avatar Dec 09 '25 17:12 Copilot

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

CLAassistant avatar Dec 09 '25 17:12 CLAassistant

Thanks for your contribution!

paddle-bot[bot] avatar Dec 09 '25 17:12 paddle-bot[bot]