
[Bug] Qwen3-235B-A22B prefill phase is extremely slow

Open kuku-jiusgan opened this issue 7 months ago • 3 comments

Checklist

  • [x] 1. I have searched related issues but cannot get the expected help.
  • [x] 2. The bug has not been fixed in the latest version.
  • [x] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
  • [x] 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/kvcache-ai/ktransformers/discussions. Otherwise, it will be closed.
  • [x] 5. To help the community, I will use Chinese/English or attach a Chinese/English translation if using another language. Non-Chinese/English content without translation may be closed.

Describe the bug

Running Qwen3-235B-A22B on v0.3, the prefill stage is extremely slow, only about 0.1 tokens/s, while decode speed is fairly normal. What could be the cause? For example:

prefill_batch_i: 254, padded_batch_size 57 capture_padded_batch_size 57
Model execution time (GPU): 7737.844 ms, 0.129 tokens/s
2025-05-17 12:28:55,283 - INFO - flashinfer.jit: Loading JIT ops: sampling
2025-05-17 12:28:55,324 - INFO - flashinfer.jit: Finished loading JIT ops: sampling
81454
prefill_batch_i: 254, padded_batch_size 57 capture_padded_batch_size 57
Model execution time (GPU): 7813.713 ms, 0.128 tokens/s
198
prefill_batch_i: 254, padded_batch_size 57 capture_padded_batch_size 57
Model execution time (GPU): 7429.448 ms, 0.135 tokens/s
101086
prefill_batch_i: 254, padded_batch_size 57 capture_padded_batch_size 57
Model execution time (GPU): 7682.772 ms, 0.130 tokens/s
99212
prefill_batch_i: 254, padded_batch_size 57 capture_padded_batch_size 57
Model execution time (GPU): 7267.888 ms, 0.138 tokens/s
17
prefill_batch_i: 254, padded_batch_size 57 capture_padded_batch_size 57
Model execution time (GPU): 7359.110 ms, 0.136 tokens/s

decode_batch_i: 1, padded_batch_size 57 capture_padded_batch_size 57
Model execution time (GPU): 153.188 ms, 6.528 tokens/s
11319
decode_batch_i: 1, padded_batch_size 57 capture_padded_batch_size 57
Model execution time (GPU): 154.211 ms, 6.485 tokens/s
151645
decode_batch_i: 1, padded_batch_size 57 capture_padded_batch_size 57
Model execution time (GPU): 154.571 ms, 6.470 tokens/s
198
decode_batch_i: 1,
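For reference, in both phases the printed tokens/s is exactly 1000 divided by the execution time in milliseconds (e.g. 1000 / 7737.844 ≈ 0.129 and 1000 / 153.188 ≈ 6.528), i.e. it counts one step per measurement, which may mean the prefill counter is reporting chunks rather than tokens (an inference from the numbers above, not confirmed against the source). A minimal check, with the log format assumed from this excerpt:

```python
import re

# Log format assumed from the excerpt in this issue.
LINE = re.compile(r"Model execution time \(GPU\): ([\d.]+) ms, ([\d.]+) tokens/s")

def check(line: str) -> None:
    m = LINE.search(line)
    if not m:
        return
    ms, reported = float(m.group(1)), float(m.group(2))
    # The reported tokens/s matches 1000 / ms, i.e. one "token"
    # per forward pass regardless of how many tokens the pass handled.
    print(f"{ms:9.3f} ms  reported={reported:.3f}  1000/ms={1000.0 / ms:.3f}")

for sample in [
    "Model execution time (GPU): 7737.844 ms, 0.129 tokens/s",
    "Model execution time (GPU): 153.188 ms, 6.528 tokens/s",
]:
    check(sample)
```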

Reproduction

python /home/zoutengda/KT-conda/ktransformers/ktransformers/server/main.py \
  --architectures Qwen3MoeForCausalLM \
  --model_path /home/zoutengda/qwen3_gguf \
  --gguf_path /home/zoutengda/qwen3_models \
  --optimize_config_path /home/zoutengda/KT-conda/ktransformers/ktransformers/optimize/optimize_rules/Qwen3Moe-serve.yaml \
  --port 10002 --cpu_infer 38 --chunk_size 256 --max_new_tokens 4096 \
  --max_batch_size 4 --cache_lens 32768 --backend_type balance_serve
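If each prefill step actually consumes one full --chunk_size chunk (an assumption, not verified against the ktransformers source), the effective prefill throughput implied by the logs would be roughly 256 tokens per ~7.7 s step, i.e. about 33 tokens/s — still slow, but far from the reported 0.13. A back-of-envelope check:

```python
# Effective prefill throughput if each step consumes one full
# --chunk_size chunk (assumption, not verified against the source).
chunk_size = 256          # from the reproduction command
step_ms = 7737.844        # measured GPU time per prefill step (from the log)
print(f"{chunk_size * 1000.0 / step_ms:.1f} tokens/s")  # ~33.1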

Environment

CPU: dual-socket Intel Xeon Gold 6138; GPU: RTX 4090D; RAM: 512 GB (8 × 64 GB, DDR4-2666)
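For context, decode on this box is plausibly CPU memory-bandwidth bound. A rough ceiling estimate, where every figure is an assumption for illustration (8 populated DDR4-2666 channels across two sockets, ~4.5 bits/weight average for a Q4 GGUF, ~22B active parameters per token for A22B):

```python
# Rough decode ceiling from memory bandwidth (all numbers are
# assumptions for illustration, not measured on this machine).
channels = 8                            # 8 x 64 GB DIMMs populated
bw_gbs = channels * 2666e6 * 8 / 1e9    # ~170.6 GB/s peak DDR4-2666
active_params = 22e9                    # A22B: ~22B active params/token
bytes_per_param = 4.5 / 8               # ~Q4 GGUF average
gb_per_token = active_params * bytes_per_param / 1e9  # ~12.4 GB/token
print(f"decode ceiling ~ {bw_gbs / gb_per_token:.1f} tokens/s")  # ~13.8
```

The observed decode speed (~6.5 tokens/s) is within about 2× of this ceiling, which suggests decode is roughly where it should be and prefill is the anomaly.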

kuku-jiusgan · May 17 '25 05:05

same issue

Jotakak-yu · May 24 '25 15:05

Same question here. My DeepSeek R1 deployment has the same problem.

gitCky · Jul 01 '25 08:07

Same here, same problem with DeepSeek q4_k_m.

abxis · Jul 24 '25 06:07