[Bug] Why does AMX slow down generation?
Checklist
- [x] 1. I have searched related issues but cannot get the expected help.
- [x] 2. The bug has not been fixed in the latest version.
- [x] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
- [x] 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/kvcache-ai/ktransformers/discussions. Otherwise, it will be closed.
- [x] 5. To help the community, I will use Chinese/English or attach an Chinese/English translation if using another language. Non-Chinese/English content without translation may be closed.
Describe the bug
I am trying to deploy Qwen3-235B in AMX mode. Before that, I tried the Q4_K_M model and got about 10 tokens/s.
Reproduction
After I downloaded the BF16 GGUF and switched to the AMX rule YAML, the speed dropped to 4 tokens/s.
Environment
Hardware: Intel 8470Q, 768 GB DDR5-4800, RTX 3090
AMX only takes effect during the prefill phase, and only if the batch size is large enough; it does not participate in the decode phase in your case.
Also, while BF16 provides higher precision, the model is roughly 4× larger in memory than Q4_K_M, so decoding consumes far more CPU time and memory bandwidth. The slowdown is therefore expected.
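The bandwidth argument above can be sketched with a back-of-envelope calculation: decode is memory-bound, so tokens/s is roughly usable bandwidth divided by the bytes streamed per token. The numbers below (≈22e9 active parameters for this MoE model, ≈300 GB/s usable bandwidth, ≈4.5 bits/param for Q4_K_M) are rough assumptions for illustration, not measurements from this system.

```python
# Back-of-envelope estimate of bandwidth-bound decode speed.
# All constants are rough assumptions, not measured values.

def decode_tokens_per_sec(active_params: float, bytes_per_param: float,
                          mem_bandwidth_gbs: float) -> float:
    """Decode streams the active weights from RAM once per token,
    so tokens/s ~= bandwidth / bytes-per-token."""
    bytes_per_token = active_params * bytes_per_param
    return mem_bandwidth_gbs * 1e9 / bytes_per_token

ACTIVE = 22e9   # assumed active params/token (Qwen3-235B is MoE)
BW = 300.0      # assumed usable DDR5 bandwidth, GB/s

bf16 = decode_tokens_per_sec(ACTIVE, 2.0, BW)    # BF16: 2 bytes/param
q4   = decode_tokens_per_sec(ACTIVE, 0.56, BW)   # Q4_K_M: ~4.5 bits/param

print(f"BF16 ~= {bf16:.1f} tok/s, Q4_K_M ~= {q4:.1f} tok/s")
```

Under these assumptions BF16 decode is several times slower than Q4_K_M, which matches the direction of the observed 10 → 4 tokens/s drop.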
It would be interesting to compare Q4_K_M prefill against BF16+AMX prefill on a long input.
First, AMX only accelerates the prefill phase. Second, under the same configuration, if BF16 generation were faster than Q4 you could claim a Turing Award.
So with a Q4 model, there is no acceleration even during the prefill phase?