[Bug] Qwen3-0.6B-q4f16_ft-MLC quantization results in an unstable model?
🐛 Bug
I've spent a fair amount of time quantizing the different SLM versions of Qwen3 using q4f16_ft and can confirm that it results in an unstable model, at least for Qwen3-0.6B and Qwen3-1.7B. Quantizing with q4f16_1 does result in a stable model.
To Reproduce
Steps to reproduce the behavior:
git clone "https://huggingface.co/Qwen/Qwen3-0.6B"
mlc_llm gen_config Qwen3-0.6B --quantization q4f16_ft --conv-template chatml -o qwen3-0.6b-q4f16_ft-MLC
mlc_llm convert_weight Qwen3-0.6B --quantization q4f16_ft -o qwen3-0.6b-q4f16_ft-MLC --device cuda
mlc_llm compile qwen3-0.6b-q4f16_ft-MLC -o qwen3-0.6b-q4f16_ft-MLC/lib.so --device cuda
mlc_llm chat qwen3-0.6b-q4f16_ft-MLC --model-lib qwen3-0.6b-q4f16_ft-MLC/lib.so --device cuda
"write a haiku about my third favorite mini slinkie"
This results in gibberish with the 0.6B and 1.7B models (sample output below; see the q4f16_1 comparison after it). Larger Qwen3 models seem to fare better. All model sizes behave as expected with q4f16_1 conversion.
onders
оград
оград
оград
оград
оград
оград
оград
оград
оград
оград
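For comparison, here is the identical pipeline with q4f16_1, which yields coherent output on the same device (this is a sketch mirroring the steps above; only the quantization flag and output paths differ):

```sh
# Same steps as above, but with q4f16_1 instead of q4f16_ft
mlc_llm gen_config Qwen3-0.6B --quantization q4f16_1 --conv-template chatml -o qwen3-0.6b-q4f16_1-MLC
mlc_llm convert_weight Qwen3-0.6B --quantization q4f16_1 -o qwen3-0.6b-q4f16_1-MLC --device cuda
mlc_llm compile qwen3-0.6b-q4f16_1-MLC -o qwen3-0.6b-q4f16_1-MLC/lib.so --device cuda
mlc_llm chat qwen3-0.6b-q4f16_1-MLC --model-lib qwen3-0.6b-q4f16_1-MLC/lib.so --device cuda
```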
Expected behavior
<think>
Okay, the user wants a haiku about their third favorite mini slinkie. First, I need to recall what a haiku is. It's a traditional Japanese form with three lines, syllable structure, and a specific syllable count. The user probably wants this to be in English, so I should make sure to translate their request accordingly. ...
Environment
- Platform (e.g. WebGPU/Vulkan/IOS/Android/CUDA): CUDA
- Operating system (e.g. Ubuntu/Windows/MacOS/...): Ubuntu 22.04
- Device (e.g. iPhone 12 Pro, PC+RTX 3090, ...): Jetson Orin NX 16GB
- How you installed MLC-LLM (conda, source): jetson-containers run dustynv/mlc:0.20.0-r36.4.0
- How you installed TVM-Unity (pip, source): jetson-containers run dustynv/mlc:0.20.0-r36.4.0
- Python version (e.g. 3.10): 3.10
- GPU driver version (if applicable):
- CUDA/cuDNN version (if applicable): 12.6
- TVM Unity Hash Tag (python -c "import tvm; print('\n'.join(f'{k}: {v}' for k, v in tvm.support.libinfo().items()))", applicable if you compile models):
- Any other relevant information:
Additional context
jtop 4.3.2 - (c) 2024, Raffaello Bonghi
Website: https://rnext.it/jetson_stats
Platform
Machine: aarch64
System: Linux
Distribution: Ubuntu 22.04 Jammy Jellyfish
Release: 5.15.136-tegra
Python: 3.10.12
Libraries
CUDA: 12.6.68
cuDNN: 8.9.4.25
TensorRT: 10.3.0.30
VPI: 3.2.4
Vulkan: 1.3.204
OpenCV: 4.8.0 with CUDA: NO
Serial Number: [hidden]
Hardware
Model: NVIDIA Jetson Orin NX Engineering Reference Developer Kit
699-Level Part Number: 699-13767-0000-301 G.1
P-Number: p3767-0000
Module: NVIDIA Jetson Orin NX (16GB ram)
SoC: tegra234
CUDA Arch BIN: 8.7
L4T: 36.3.0
Jetpack: 6.0
Hostname: alfiebot
Interfaces
wlan0: 192.168.50.177
docker0: 172.17.0.1
I've also tried a handful of different gen_config variants (below); none results in a usable model, which suggests the conversation template isn't the cause and points at the q4f16_ft weight conversion itself.
mlc_llm gen_config Qwen3-0.6B --quantization q4f16_ft --conv-template chatml -o qwen3-0.6b-q4f16_ft-MLC
mlc_llm gen_config Qwen3-0.6B --quantization q4f16_ft --conv-template qwen2 -o qwen3-0.6b-q4f16_ft-MLC
mlc_llm gen_config Qwen3-0.6B --quantization q4f16_ft --conv-template llama-2 -o qwen3-0.6b-q4f16_ft-MLC
mlc_llm gen_config Qwen3-0.6B --quantization q4f16_ft --conv-template llama-2 --context-window-size 32768 --prefill-chunk-size 4096 -o qwen3-0.6b-q4f16_ft-MLC
mlc_llm gen_config Qwen3-0.6B --quantization q4f16_ft --conv-template chatml --context-window-size 32768 --prefill-chunk-size 4096 -o qwen3-0.6b-q4f16_ft-MLC
mlc_llm gen_config Qwen3-0.6B --quantization q4f16_ft --conv-template chatml --context-window-size 32768 --prefill-chunk-size 1024 -o qwen3-0.6b-q4f16_ft-MLC
mlc_llm gen_config Qwen3-0.6B --quantization q4f16_ft --conv-template chatml --context-window-size 32768 --max-batch-size 1 --prefill-chunk-size 1024 -o qwen3-0.6b-q4f16_ft-MLC
Hello, my device is an NVIDIA Jetson Orin. I also use jetson-containers to run mlc_llm, and I deployed Qwen3-8B on it, but my model cannot use function calling.
I'd like to know: does your Qwen model support function calling?
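To make the question concrete, this is roughly the kind of request I'm sending against the OpenAI-compatible endpoint that mlc_llm serve exposes (the model path, tool name, and schema here are placeholders for illustration; whether the model actually emits tool calls is exactly what I'm unsure about):

```sh
# Illustrative tools request against mlc_llm serve's OpenAI-compatible API
# (default endpoint 127.0.0.1:8000; model name/paths are placeholders)
mlc_llm serve qwen3-8b-q4f16_1-MLC --device cuda &

curl -s http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-8b-q4f16_1-MLC",
    "messages": [{"role": "user", "content": "What is the weather in Hanoi?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Return current weather for a city",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }]
  }'
```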
Can I ask you something about the setup? I'm currently setting up the environment to build and run on Android, but the build complains that libmlc_llm_module.dylib is missing. From what I've read, I'm supposed to generate it myself (create a build folder and build it). Do I really need to do that? The docs say that installing via conda and pip from the linked instructions should be enough, but I did exactly that and the log still reports the missing file.
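For reference, these are the from-source build steps I found in the MLC-LLM docs (a sketch; I'm assuming this is what produces the libmlc_llm_module.* library, and that flags differ per platform):

```sh
# From-source build per the MLC-LLM docs (sketch; adjust for your platform)
git clone --recursive https://github.com/mlc-ai/mlc-llm.git
cd mlc-llm
mkdir -p build && cd build
python ../cmake/gen_cmake_config.py   # interactive: select TVM source, CUDA/Metal/Vulkan, etc.
cmake .. && cmake --build . --parallel
# on success, the build directory should contain libmlc_llm.* and libmlc_llm_module.*
```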