
[Bug] qwen3-14b_q4f16 on GPU cause 100% of CPU when input request become more then 1500 tokens. Inference cause to become forever. (tested on CUDA,ROCM)

Open · delphiRo opened this issue 7 months ago · 3 comments

🐛 Bug

To Reproduce

I have a very strange situation with MLC LLM, tested on qwen3-14b_q4f16. When I increase the input request length to 1500-2500 tokens, MLC loads one CPU core (the one running the mlc_llm process) to 100%. Token throughput drops dramatically, and prediction starts to take a very long time.

Expected behavior

mlc_llm serve HF://mlc-ai/Qwen3-14B-q4f16_1-MLC --port 8081 --overrides "tensor_parallel_shards=1;max_total_seq_length=32768;context_window_size=32768" --mode interactive

Environment

  • CUDA 12.4 or ROCm 6.2.4
  • latest version of MLC LLM for CUDA 12.4
  • Tesla P100

Additional context

  • Dual-CPU server with 16 cores per CPU, Intel Broadwell-EP

delphiRo · Aug 25 '25 14:08

I also checked on an AMD Instinct MI50 with ROCm 6.2.4. The same problem occurs there too when a 32-40K context is enabled.

delphiRo · Aug 27 '25 13:08

@simonw @jeethu @Sing-Li @philippgille

delphiRo · Aug 29 '25 09:08

The bug also reproduces via mlc_llm chat, and when an OpenAI-API request sends a large prompt.

delphiRo · Aug 29 '25 09:08
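For anyone trying to reproduce the OpenAI-API path mentioned above: the sketch below builds a request payload with a prompt well past the ~1500-token threshold, aimed at the OpenAI-compatible chat-completions endpoint that mlc_llm serve exposes on the port from the serve command. This is a minimal reproduction sketch, assuming the server is started as shown above; the repeated filler text used to inflate the prompt is my own device, not from the report.

```python
import json

# Repeat a 9-word sentence to push the prompt far past ~1500 tokens
# (the report says the slowdown starts around 1500-2500 input tokens).
filler = "The quick brown fox jumps over the lazy dog. " * 400

# Standard OpenAI chat-completions payload; model name matches the
# HF://mlc-ai/Qwen3-14B-q4f16_1-MLC artifact from the serve command.
payload = {
    "model": "Qwen3-14B-q4f16_1-MLC",
    "messages": [
        {"role": "user", "content": filler + "\nSummarize the text above."}
    ],
    "stream": False,
}

body = json.dumps(payload)
print(len(filler.split()))  # word count; ~3600 words, comfortably over the threshold

# To actually fire the request (server from the serve command must be running):
#   curl -s http://localhost:8081/v1/chat/completions \
#        -H "Content-Type: application/json" \
#        -d @payload.json
```

While the request is in flight, watching the host with top/htop should show one core of the mlc_llm process pinned at 100% if the bug triggers.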