
[Bug] Qwen3-235B-A22B prefill phase is extremely slow

Open kuku-jiusgan opened this issue 7 months ago • 3 comments

Checklist

  • [x] 1. I have searched related issues but cannot get the expected help.
  • [x] 2. The bug has not been fixed in the latest version.
  • [x] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
  • [x] 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/kvcache-ai/ktransformers/discussions. Otherwise, it will be closed.
  • [x] 5. To help the community, I will use Chinese/English or attach a Chinese/English translation if using another language. Non-Chinese/English content without translation may be closed.

Describe the bug

Running Qwen3-235B-A22B on v0.3, the prefill stage is extremely slow, only about 0.1 tokens/s, while decode speed is fairly normal. What could be the cause? For example:

prefill_batch_i: 254, padded_batch_size 57 capture_padded_batch_size 57
Model execution time (GPU): 7737.844 ms, 0.129 tokens/s
2025-05-17 12:28:55,283 - INFO - flashinfer.jit: Loading JIT ops: sampling
2025-05-17 12:28:55,324 - INFO - flashinfer.jit: Finished loading JIT ops: sampling
81454
prefill_batch_i: 254, padded_batch_size 57 capture_padded_batch_size 57
Model execution time (GPU): 7813.713 ms, 0.128 tokens/s
198
prefill_batch_i: 254, padded_batch_size 57 capture_padded_batch_size 57
Model execution time (GPU): 7429.448 ms, 0.135 tokens/s
101086
prefill_batch_i: 254, padded_batch_size 57 capture_padded_batch_size 57
Model execution time (GPU): 7682.772 ms, 0.130 tokens/s
99212
prefill_batch_i: 254, padded_batch_size 57 capture_padded_batch_size 57
Model execution time (GPU): 7267.888 ms, 0.138 tokens/s
17
prefill_batch_i: 254, padded_batch_size 57 capture_padded_batch_size 57
Model execution time (GPU): 7359.110 ms, 0.136 tokens/s

decode_batch_i: 1, padded_batch_size 57 capture_padded_batch_size 57
Model execution time (GPU): 153.188 ms, 6.528 tokens/s
11319
decode_batch_i: 1, padded_batch_size 57 capture_padded_batch_size 57
Model execution time (GPU): 154.211 ms, 6.485 tokens/s
151645
decode_batch_i: 1, padded_batch_size 57 capture_padded_batch_size 57
Model execution time (GPU): 154.571 ms, 6.470 tokens/s
198
decode_batch_i: 1,
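For reference, in both phases the printed tokens/s is exactly 1000 divided by the execution time in milliseconds (e.g. 1000 / 7737.844 ≈ 0.129 and 1000 / 153.188 ≈ 6.528), i.e. it counts one step per measurement, which may mean the prefill counter is reporting chunks rather than tokens (an inference from the numbers above, not confirmed against the source). A minimal check, with the log format assumed from this excerpt:

```python
import re

# Log format assumed from the excerpt in this issue.
LINE = re.compile(r"Model execution time \(GPU\): ([\d.]+) ms, ([\d.]+) tokens/s")

def check(line: str) -> None:
    m = LINE.search(line)
    if not m:
        return
    ms, reported = float(m.group(1)), float(m.group(2))
    # The reported tokens/s matches 1000 / ms, i.e. one "token"
    # per forward pass regardless of how many tokens the pass handled.
    print(f"{ms:9.3f} ms  reported={reported:.3f}  1000/ms={1000.0 / ms:.3f}")

for sample in [
    "Model execution time (GPU): 7737.844 ms, 0.129 tokens/s",
    "Model execution time (GPU): 153.188 ms, 6.528 tokens/s",
]:
    check(sample)
```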

Reproduction

python /home/zoutengda/KT-conda/ktransformers/ktransformers/server/main.py \
  --architectures Qwen3MoeForCausalLM \
  --model_path /home/zoutengda/qwen3_gguf \
  --gguf_path /home/zoutengda/qwen3_models \
  --optimize_config_path /home/zoutengda/KT-conda/ktransformers/ktransformers/optimize/optimize_rules/Qwen3Moe-serve.yaml \
  --port 10002 --cpu_infer 38 --chunk_size 256 --max_new_tokens 4096 \
  --max_batch_size 4 --cache_lens 32768 --backend_type balance_serve
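If each prefill step actually consumes one full --chunk_size chunk (an assumption, not verified against the ktransformers source), the effective prefill throughput implied by the logs would be roughly 256 tokens per ~7.7 s step, i.e. about 33 tokens/s — still slow, but far from the reported 0.13. A back-of-envelope check:

```python
# Effective prefill throughput if each step consumes one full
# --chunk_size chunk (assumption, not verified against the source).
chunk_size = 256          # from the reproduction command
step_ms = 7737.844        # measured GPU time per prefill step (from the log)
print(f"{chunk_size * 1000.0 / step_ms:.1f} tokens/s")  # ~33.1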

Environment

CPU: dual-socket Intel Xeon Gold 6138; GPU: RTX 4090D; RAM: 512 GB (8 × 64 GB, DDR4-2666)
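For context, decode on this box is plausibly CPU memory-bandwidth bound. A rough ceiling estimate, where every figure is an assumption for illustration (8 populated DDR4-2666 channels across two sockets, ~4.5 bits/weight average for a Q4 GGUF, ~22B active parameters per token for A22B):

```python
# Rough decode ceiling from memory bandwidth (all numbers are
# assumptions for illustration, not measured on this machine).
channels = 8                            # 8 x 64 GB DIMMs populated
bw_gbs = channels * 2666e6 * 8 / 1e9    # ~170.6 GB/s peak DDR4-2666
active_params = 22e9                    # A22B: ~22B active params/token
bytes_per_param = 4.5 / 8               # ~Q4 GGUF average
gb_per_token = active_params * bytes_per_param / 1e9  # ~12.4 GB/token
print(f"decode ceiling ~ {bw_gbs / gb_per_token:.1f} tokens/s")  # ~13.8
```

The observed decode speed (~6.5 tokens/s) is within about 2× of this ceiling, which suggests decode is roughly where it should be and prefill is the anomaly.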

kuku-jiusgan · May 17 '25 05:05

same issue

Jotakak-yu · May 24 '25 15:05

Same question here. My DeepSeek R1 deployment has the same problem.

gitCky · Jul 01 '25 08:07

Same here, same problem with DeepSeek q4_k_m.

abxis · Jul 24 '25 06:07