[Speed] MLC is much slower than Ollama when running Qwen Coder 30b
🏎️ Speed Report
-
The model code: Qwen Coder 30B, using q4f16_1 quantization.
-
The model configuration (e.g. quantization mode, running data type, etc.): q4f16_1, q4f16_ft.
-
Device (e.g. MacBook Pro M2, PC+RTX 3080): orin 64GB
-
OS (if applicable):
-
Encode speed (Token/s):
-
Decode speed (Token/s): approximately 6 with MLC, versus about 20 with Ollama; the prompt length is 1024*16 (16384) tokens.
-
Memory usage (if applicable): the actual memory used is less than estimated.
I have a very strange situation with MLC LLM (tested on qwen3-14b_q4f16). When I increase the input length to 1500-2500 tokens, MLC loads one CPU core to 100% (the core where the mlc_llm process runs). Token throughput drops dramatically, and prediction starts to take a very long time.
I forget whether I hit this problem before, but when the prompt length is much larger than the prefill chunk size, generation becomes very slow.
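To make the comment above concrete, here is a minimal sketch (not MLC LLM's actual scheduler, just an illustration of the arithmetic): when the prompt is much longer than the prefill chunk size, prefill has to run as many sequential passes, one per chunk, so prefill time grows with the chunk count even before any tokens are decoded.

```python
def prefill_chunks(prompt_len: int, prefill_chunk_size: int) -> int:
    """Number of sequential prefill passes needed for a prompt.

    Illustrative helper, not part of MLC LLM's API.
    """
    if prefill_chunk_size <= 0:
        raise ValueError("prefill_chunk_size must be positive")
    # Ceiling division: each pass ingests at most one chunk of the prompt.
    return -(-prompt_len // prefill_chunk_size)

# A 16K prompt (1024 * 16 tokens, as in the report above) against a few
# illustrative chunk sizes:
for chunk in (1024, 4096, 16384):
    print(chunk, prefill_chunks(1024 * 16, chunk))
```

If prefill turns out to be the bottleneck, the model's `mlc-chat-config.json` exposes a `prefill_chunk_size` field that can be raised (memory permitting); whether that helps on a given device is worth verifying, since a larger chunk also increases per-pass memory use.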