[Speed] MLC is much slower than Ollama when running Qwen Coder 30b
🏎️ Speed Report
-
The model code: Qwen Coder 30B, using q4f16_1 quantization.
-
The model configuration (e.g. quantization mode, running data type, etc.): q4f16_1, q4f16_ft.
-
Device (e.g. MacBook Pro M2, PC+RTX 3080): orin 64GB
-
OS (if applicable):
-
Encode speed (Token/s):
-
Decode speed (Token/s): approximately 6 with MLC, versus about 20 with Ollama; the prompt length is 1024*16 (16384) tokens.
-
Memory usage (if applicable): the actual memory used is less than estimated.
I have a very strange situation with MLC LLM (tested on qwen3-14b_q4f16). When I increase the input length to 1500-2500 tokens, MLC loads one CPU core to 100% (the core where the mlc_llm process runs). Token throughput drops dramatically, and prediction starts to take a very long time.
I forget whether I hit this problem before, but when the prompt length is much larger than the prefill chunk size, generation becomes very slow.
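To make the comment above concrete, here is a minimal sketch (not MLC LLM's actual scheduler, just an illustration of the arithmetic): when the prompt is much longer than the prefill chunk size, prefill has to run as many sequential passes, one per chunk, so prefill time grows with the chunk count even before any tokens are decoded.

```python
def prefill_chunks(prompt_len: int, prefill_chunk_size: int) -> int:
    """Number of sequential prefill passes needed for a prompt.

    Illustrative helper, not part of MLC LLM's API.
    """
    if prefill_chunk_size <= 0:
        raise ValueError("prefill_chunk_size must be positive")
    # Ceiling division: each pass ingests at most one chunk of the prompt.
    return -(-prompt_len // prefill_chunk_size)

# A 16K prompt (1024 * 16 tokens, as in the report above) against a few
# illustrative chunk sizes:
for chunk in (1024, 4096, 16384):
    print(chunk, prefill_chunks(1024 * 16, chunk))
```

If prefill turns out to be the bottleneck, the model's `mlc-chat-config.json` exposes a `prefill_chunk_size` field that can be raised (memory permitting); whether that helps on a given device is worth verifying, since a larger chunk also increases per-pass memory use.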