[Bug] qwen3-14b_q4f16 on GPU pegs one CPU core at 100% when the input request exceeds ~1500 tokens; inference effectively stalls (tested on CUDA and ROCm)
🐛 Bug
To Reproduce
I am hitting a strange issue with MLC LLM, tested with qwen3-14b_q4f16. When I increase the input prompt to roughly 1500-2500 tokens, mlc_llm loads one CPU core to 100% (the core where the mlc_llm process runs), token throughput drops dramatically, and prediction starts to take a very long time.
Expected behavior
Prompt processing should stay on the GPU; a long prompt should not saturate a CPU core or stall generation. Command used to start the server:
mlc_llm serve HF://mlc-ai/Qwen3-14B-q4f16_1-MLC --port 8081 --overrides "tensor_parallel_shards=1;max_total_seq_length=32768;context_window_size=32768" --mode interactive
Environment
- CUDA 12.4 or ROCm 6.2.4
- latest version of MLC LLM for CUDA 12.4
- Tesla P100
Additional context
- Dual-CPU server with 16 cores per CPU, Intel Broadwell-EP
- I also tested on an AMD Instinct MI50 with ROCm 6.2.4. The same problem occurs there as well when the 32-40K context is enabled.
@simonw @jeethu @Sing-Li @philippgille
The bug also reproduces via mlc_llm chat, and when an OpenAI API request sends a large prompt.
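For reference, a minimal stdlib-only repro sketch for the OpenAI-compatible API path: it builds a prompt of roughly 2000 tokens (crudely approximated as one token per word, with arbitrary filler text) and posts it to the server started by the serve command above. The port (8081) comes from that command; the `/v1/chat/completions` path and the model name in the payload are my assumptions about the served endpoint.

```python
import json
import urllib.request

def make_long_prompt(approx_tokens: int) -> str:
    # Crude estimate: one short word ~= one token. Repeat a filler
    # sentence until the prompt reaches the target size.
    filler = "The quick brown fox jumps over the lazy dog. "
    reps = approx_tokens // len(filler.split()) + 1
    return (filler * reps) + "\nSummarize the text above in one sentence."

def build_payload(prompt: str) -> dict:
    # Standard OpenAI chat-completions payload; the model name is an
    # assumption matching the model loaded by `mlc_llm serve`.
    return {
        "model": "Qwen3-14B-q4f16_1-MLC",
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }

if __name__ == "__main__":
    prompt = make_long_prompt(2000)  # ~2000 tokens triggers the slowdown
    req = urllib.request.Request(
        "http://localhost:8081/v1/chat/completions",
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    # With a short prompt this returns quickly; with ~1500+ tokens the
    # server pins one CPU core and the response takes a very long time.
    with urllib.request.urlopen(req, timeout=600) as resp:
        print(resp.status)
```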