[Bugfix, New Features] Ensure penalty sampler to be the first one in mixed samplers, accelerate TikToken, introduce pd disaggregation and separate acceleration on CPU backend

Open huangzhengxiang opened this issue 10 months ago • 0 comments

  1. Ensure the penalty sampler runs first in mixed samplers, so that the penalty is applied to the original logits.
  2. Accelerate TikToken by using a Trie data structure.
| prompt length | original, O(n^3) | Trie, O(n) |
|---|---|---|
| 868 | 1.4s | 1ms |
| 24429 | 1h | 34ms |

(data on Snapdragon 8 Gen 3)
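
To illustrate why a Trie helps in item 2, here is a minimal sketch of greedy longest-match tokenization over a byte-level vocabulary. It is not MNN's implementation: the names `TrieNode`, `build_trie`, and `encode`, the toy vocabulary, and the greedy longest-match rule itself are assumptions made for illustration. Each input position is visited while walking down the Trie, so the cost grows roughly linearly with the prompt length (times the longest token length) instead of repeatedly rescanning substrings.

```python
class TrieNode:
    """One node per byte prefix; token_id is set when the prefix is a full token."""
    __slots__ = ("children", "token_id")

    def __init__(self):
        self.children = {}
        self.token_id = None


def build_trie(vocab):
    """Build a Trie from a {bytes: int} vocabulary (hypothetical format)."""
    root = TrieNode()
    for token, token_id in vocab.items():
        node = root
        for byte in token:  # iterating over bytes yields ints
            node = node.children.setdefault(byte, TrieNode())
        node.token_id = token_id
    return root


def encode(root, data):
    """Greedy longest-match encoding of `data` (bytes) into token ids."""
    ids = []
    pos = 0
    while pos < len(data):
        node = root
        best_id, best_end = None, pos
        i = pos
        # Walk down the Trie, remembering the longest token seen so far.
        while i < len(data):
            node = node.children.get(data[i])
            if node is None:
                break
            i += 1
            if node.token_id is not None:
                best_id, best_end = node.token_id, i
        if best_id is None:
            raise ValueError(f"no token matches input at byte offset {pos}")
        ids.append(best_id)
        pos = best_end
    return ids


# Toy vocabulary, for illustration only.
toy_vocab = {b"a": 0, b"b": 1, b"ab": 2, b"abc": 3, b"c": 4}
toy_trie = build_trie(toy_vocab)
print(encode(toy_trie, b"abcab"))  # [3, 2]
```

The inner walk never goes deeper than the longest token in the vocabulary, which is what bounds the work per position.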

  3. Introduce prefill/decode (pd) disaggregation and separate acceleration on the CPU backend.
  • Prefill and decode can now have separate configurations.
  • CPU prefill becomes 15%~100% faster across 9 devices.

To name a few:

| device | SoC | original | new | speedup |
|---|---|---|---|---|
| Huawei Mate 40 Pro | Kirin 9000 | 102 | 117 | 15% |
| Huawei Magic 6 | Snapdragon 8 Gen 3 | 148 | 285 | 92% |
| Xiaomi 15 Pro | Snapdragon 8 Elite | 258 | 368 | 42% |

(speed unit: token/s, tested on Qwen2.5-1.5B int4 quant, block=128)
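
Returning to item 1 of the list above, the following sketch shows why the sampler order matters. It is an illustration under assumptions, not MNN's code: the function names are hypothetical, and the penalty rule used here (divide positive logits by the penalty, multiply negative ones) is one common formulation of repetition penalty that may differ from the one MNN applies. If top-k filtering ran first, tokens outside the top k would already be discarded before the penalty could demote repeated tokens and let them in.

```python
import math


def apply_repetition_penalty(logits, history, penalty):
    """Demote tokens that already appear in `history` (assumed penalty rule)."""
    out = list(logits)
    for tok in set(history):
        out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out


def top_k_filter(logits, k):
    """Keep the k largest logits and mask the rest with -inf."""
    keep = set(sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k])
    return [x if i in keep else -math.inf for i, x in enumerate(logits)]


def penalty_first(logits, history, penalty, k):
    """Penalty sees the original logits, then top-k filters the result."""
    return top_k_filter(apply_repetition_penalty(logits, history, penalty), k)


def penalty_last(logits, history, penalty, k):
    """Reversed order, shown only for comparison."""
    return apply_repetition_penalty(top_k_filter(logits, k), history, penalty)


logits = [4.0, 3.5, 3.0, 1.0]
history = [0, 1]  # tokens 0 and 1 were already generated

print(penalty_first(logits, history, 2.0, 2))  # [2.0, -inf, 3.0, -inf]
print(penalty_last(logits, history, 2.0, 2))   # [2.0, 1.75, -inf, -inf]
```

With the penalty applied first, the unrepeated token 2 becomes the top candidate. With the reversed order, token 2 is masked out before the penalty runs, so only the two repeated tokens remain.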

huangzhengxiang, May 21 '25 15:05