
[Bug] Why AMX speed down?

Open mrgaolei opened this issue 3 months ago • 3 comments

Checklist

  • [x] 1. I have searched related issues but cannot get the expected help.
  • [x] 2. The bug has not been fixed in the latest version.
  • [x] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
  • [x] 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/kvcache-ai/ktransformers/discussions. Otherwise, it will be closed.
  • [x] 5. To help the community, I will use Chinese/English or attach a Chinese/English translation if using another language. Non-Chinese/English content without translation may be closed.

Describe the bug

I tried to deploy Qwen3-235B in AMX mode. Before that, I ran it in Q4_K_M mode at about 10 tokens/s.

Reproduction

But after I downloaded the BF16 GGUF and used the AMX rule YAML, the speed dropped to 4 tokens/s.

Environment

Hardware: Intel 8470Q, 768 GB DDR5-4800, RTX 3090

mrgaolei · Sep 29 '25 02:09

AMX only takes effect during the prefill phase, and only when the batch size is large enough; it does not participate in the decode phase in your case.

Also, while BF16 provides higher precision, the model is roughly 4× larger in memory than Q4_K_M, so the computation consumes correspondingly more CPU time and memory bandwidth. The slowdown is expected.
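The decode slowdown can be sanity-checked with back-of-envelope arithmetic: decode is memory-bandwidth bound, since each generated token must stream the active weights from RAM once. A minimal sketch, where the effective bandwidth figure (~250 GB/s) and the bits-per-weight values are illustrative assumptions, not measured numbers:

```python
# Rough decode-speed upper bound for a bandwidth-bound MoE model.
# All figures below are illustrative assumptions, not measurements.

def decode_tokens_per_sec(active_params_b: float,
                          bytes_per_param: float,
                          bandwidth_gbs: float) -> float:
    """Each decoded token streams the active weights from RAM once."""
    bytes_per_token = active_params_b * 1e9 * bytes_per_param
    return bandwidth_gbs * 1e9 / bytes_per_token

# Qwen3-235B-A22B activates ~22B parameters per token.
# Assume ~250 GB/s effective CPU memory bandwidth (hypothetical).
q4 = decode_tokens_per_sec(22, 0.56, 250)   # Q4_K_M ~4.5 bits/weight ~0.56 B
bf16 = decode_tokens_per_sec(22, 2.0, 250)  # BF16 = 2 bytes/weight
print(f"Q4_K_M ceiling ~{q4:.1f} tok/s, BF16 ceiling ~{bf16:.1f} tok/s")
```

Under these assumed numbers the ratio between the two ceilings is about 3.6×, which is in the same ballpark as the observed drop from ~10 to ~4 tokens/s.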

It would be interesting to compare Q4_K_M prefill with BF16_AMX prefill with a long input.

aubreyli · Sep 29 '25 03:09

First, AMX only accelerates the prefill phase. Second, under the same configuration, if BF16 generated tokens faster than Q4, you could claim a Turing Award.

wqshmzh · Sep 29 '25 15:09

> First, AMX only accelerates the prefill phase. Second, under the same configuration, if BF16 generated tokens faster than Q4, you could claim a Turing Award.

So with a Q4 model, not even the prefill phase is accelerated, is that right?

mrgaolei · Oct 28 '25 03:10