
[Speed] MLC is much slower than Ollama when running Qwen Coder 30b

Open capyun opened this issue 8 months ago • 2 comments

🏎️ Speed Report

  • The model code: Qwen Coder 30b, with q4f16_1 quantization.

  • The model configuration (e.g. quantization mode, running data type, etc.): q4f16_1 (also tried q4f16_ft).

  • Device (e.g. MacBook Pro M2, PC+RTX 3080): Jetson Orin 64GB

  • OS (if applicable):

  • Encode speed (Token/s):

  • Decode speed (Token/s): approximately 6, while Ollama reaches about 20; the prompt length is 1024*16 (16K tokens).

  • Memory usage (if applicable): The actual memory used is less than estimated.

capyun avatar Aug 21 '25 03:08 capyun

I have a very strange situation with MLC LLM, tested on qwen3-14b_q4f16. When I increase the input length to 1500-2500 tokens, MLC loads the single CPU core where the mlc_llm process runs to 100%. Token throughput drops dramatically and prediction starts to take a very long time.

delphiRo avatar Aug 25 '25 14:08 delphiRo

> I have a very strange situation with MLC LLM, tested on qwen3-14b_q4f16. When I increase the input length to 1500-2500 tokens, MLC loads the single CPU core where the mlc_llm process runs to 100%. Token throughput drops dramatically and prediction starts to take a very long time.

I forget whether I hit this problem before, but when the prompt length is much larger than the prefill chunk size, it becomes very slow.
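To see how far your prompt overshoots the chunk size, you can check the `prefill_chunk_size` field in the compiled model's `mlc-chat-config.json` and do the arithmetic. A minimal sketch (the config path and the 2048 default are assumptions; check your own model directory):

```python
import json
from pathlib import Path

def prefill_chunks(prompt_len: int, chunk_size: int) -> int:
    """Number of prefill passes needed for a prompt of prompt_len tokens
    when the engine prefills in chunks of chunk_size tokens."""
    return -(-prompt_len // chunk_size)  # ceiling division

# Hypothetical path to the compiled model's config; adjust to your setup.
cfg_path = Path("dist/model-q4f16_1-MLC/mlc-chat-config.json")
if cfg_path.exists():
    chunk = json.loads(cfg_path.read_text()).get("prefill_chunk_size", 2048)
else:
    chunk = 2048  # assumed common default

# A 16K-token prompt against a 2048-token chunk size needs 8 prefill passes.
print(prefill_chunks(16 * 1024, chunk))
```

With `prefill_chunk_size = 2048`, a 16K prompt is prefilled in 8 separate passes, so any per-chunk overhead is paid 8 times; regenerating the config with a larger chunk size (if memory allows) should reduce that.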

capyun avatar Sep 04 '25 10:09 capyun