MNN
[Bugfix, New Features] Run the penalty sampler first in mixed samplers, accelerate TikToken, introduce PD disaggregation and separate acceleration on the CPU backend
- Ensure the penalty sampler runs first in mixed samplers, so that the penalty applies to the original logits.
- Accelerate TikToken by leveraging a Trie data structure.
| prompt length | original O(n^3) | Trie O(n) |
|---|---|---|
| 868 | 1.4s | 1ms |
| 24429 | 1h | 34ms |
(data on Snapdragon 8 Gen 3)
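To illustrate why a Trie helps here, the following is a minimal sketch of greedy longest-match tokenization over a byte-level Trie. It is an assumption about the general technique, not MNN's actual implementation: the names `TrieNode`, `build_trie`, and `encode` and the toy vocabulary are hypothetical. Each input position walks at most one Trie path, so the cost is bounded by the input length times the longest token length, with no repeated substring lookups.

```python
class TrieNode:
    __slots__ = ("children", "token_id")

    def __init__(self):
        self.children = {}    # byte value -> TrieNode
        self.token_id = None  # set when a vocabulary entry ends at this node


def build_trie(vocab):
    """Build a Trie from a mapping of token bytes -> token id (hypothetical vocab)."""
    root = TrieNode()
    for token, token_id in vocab.items():
        node = root
        for b in token:
            node = node.children.setdefault(b, TrieNode())
        node.token_id = token_id
    return root


def encode(text, root):
    """Greedy longest-match encoding: at each offset, take the longest token in the Trie."""
    data = text.encode("utf-8")
    ids = []
    pos = 0
    while pos < len(data):
        node = root
        best_id, best_end = None, pos
        i = pos
        while i < len(data):
            node = node.children.get(data[i])
            if node is None:
                break  # no vocabulary entry continues with this byte
            i += 1
            if node.token_id is not None:
                best_id, best_end = node.token_id, i  # remember the longest match so far
        if best_id is None:
            raise ValueError(f"no token covers byte at offset {pos}")
        ids.append(best_id)
        pos = best_end
    return ids


# Toy vocabulary, for illustration only.
vocab = {b"h": 0, b"e": 1, b"l": 2, b"o": 3, b"he": 4, b"ll": 5, b"hello": 6, b" ": 7}
root = build_trie(vocab)
ids = encode("hello hell", root)  # [6, 7, 4, 5]
```

Here `"hello"` matches as a single token, while the trailing `"hell"` falls back to `"he"` + `"ll"` because no longer entry ends inside it.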
- Introduce prefill/decode (PD) disaggregation and separate acceleration on the CPU backend.
  - Prefill and decode can now have separate configurations.
  - CPU prefill becomes 15% to 100% faster across 9 devices.
A few examples:
| device | SoC | original | new | speedup |
|---|---|---|---|---|
| Huawei Mate 40 Pro | Kirin 9000 | 102 | 117 | 15% |
| Huawei Magic 6 | Snapdragon 8 Gen 3 | 148 | 285 | 92% |
| Xiaomi 15 Pro | Snapdragon 8 Elite | 258 | 368 | 42% |
(speed unit: tokens/s; tested on Qwen2.5-1.5B with int4 quantization, block=128)
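The sampler-ordering fix in the first item can be illustrated with a small sketch of why the penalty should see the original logits. This is an illustration under assumptions, not MNN's code: the function names, the divide-or-multiply repetition penalty, and the `penalty_first` switch are all hypothetical. If top-k runs before the penalty, the repeated token may already be the only surviving candidate, so penalizing it changes nothing.

```python
def penalty(candidates, history, factor=2.0):
    """Repetition penalty (assumed form): shrink scores of already generated tokens."""
    out = dict(candidates)
    for tok in set(history):
        if tok in out:
            out[tok] = out[tok] / factor if out[tok] > 0 else out[tok] * factor
    return out


def top_k(candidates, k=1):
    """Keep only the k highest-scoring candidates."""
    kept = sorted(candidates, key=candidates.get, reverse=True)[:k]
    return {tok: candidates[tok] for tok in kept}


SAMPLERS = {"penalty": penalty, "top_k": top_k}


def run_chain(logits, history, chain, penalty_first=True):
    """Run a mixed sampler chain; optionally move 'penalty' to the front."""
    if penalty_first:
        # Stable sort: 'penalty' moves first, the other samplers keep their order.
        chain = sorted(chain, key=lambda name: name != "penalty")
    candidates = dict(enumerate(logits))
    for name in chain:
        if name == "penalty":
            candidates = penalty(candidates, history)
        else:
            candidates = SAMPLERS[name](candidates)
    return candidates


logits = [5.0, 4.9, 1.0]
history = [0]                 # token 0 was already generated
chain = ["top_k", "penalty"]  # user-configured order

naive = run_chain(logits, history, chain, penalty_first=False)  # {0: 2.5}
fixed = run_chain(logits, history, chain, penalty_first=True)   # {1: 4.9}
```

In the naive order, token 0 is selected by top-k before it is penalized, so it remains the only candidate. With the penalty applied first, token 1 overtakes it.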