[Bug] Qwen3-0.6B-q4f16_ft-MLC quantization results in an unstable model?
🐛 Bug
I've spent a fair amount of time quantizing the different SLM versions of Qwen3 using q4f16_ft and can confirm that it results in an unstable model, at least for Qwen3-0.6B and Qwen3-1.7B. Quantizing with q4f16_1 does result in a stable model.
To Reproduce
Steps to reproduce the behavior:
git clone "https://huggingface.co/Qwen/Qwen3-0.6B"
mlc_llm gen_config Qwen3-0.6B --quantization q4f16_ft --conv-template chatml -o qwen3-0.6b-q4f16_ft-MLC
mlc_llm convert_weight Qwen3-0.6B --quantization q4f16_ft -o qwen3-0.6b-q4f16_ft-MLC --device cuda
mlc_llm compile qwen3-0.6b-q4f16_ft-MLC -o qwen3-0.6b-q4f16_ft-MLC/lib.so --device cuda
mlc_llm chat qwen3-0.6b-q4f16_ft-MLC --model-lib qwen3-0.6b-q4f16_ft-MLC/lib.so --device cuda
"write a haiku about my third favorite mini slinkie"
This results in gibberish with the 0.6B and 1.7B models (sample output below; see the q4f16_1 comparison after it). Larger Qwen3 models seem to fare better. All model sizes behave as expected with q4f16_1 conversion.
onders
оград
оград
оград
оград
оград
оград
оград
оград
оград
оград
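For comparison, here is the identical pipeline with q4f16_1, which yields coherent output on the same device (this is a sketch mirroring the steps above; only the quantization flag and output paths differ):

```sh
# Same steps as above, but with q4f16_1 instead of q4f16_ft
mlc_llm gen_config Qwen3-0.6B --quantization q4f16_1 --conv-template chatml -o qwen3-0.6b-q4f16_1-MLC
mlc_llm convert_weight Qwen3-0.6B --quantization q4f16_1 -o qwen3-0.6b-q4f16_1-MLC --device cuda
mlc_llm compile qwen3-0.6b-q4f16_1-MLC -o qwen3-0.6b-q4f16_1-MLC/lib.so --device cuda
mlc_llm chat qwen3-0.6b-q4f16_1-MLC --model-lib qwen3-0.6b-q4f16_1-MLC/lib.so --device cuda
```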
Expected behavior
<think>
Okay, the user wants a haiku about their third favorite mini slinkie. First, I need to recall what a haiku is. It's a traditional Japanese form with three lines, syllable structure, and a specific syllable count. The user probably wants this to be in English, so I should make sure to translate their request accordingly. ...
Environment
- Platform (e.g. WebGPU/Vulkan/IOS/Android/CUDA): CUDA
- Operating system (e.g. Ubuntu/Windows/MacOS/...): Ubuntu 22.04
- Device (e.g. iPhone 12 Pro, PC+RTX 3090, ...): Jetson Orin NX 16GB
- How you installed MLC-LLM (conda, source): jetson-containers run dustynv/mlc:0.20.0-r36.4.0
- How you installed TVM-Unity (pip, source): jetson-containers run dustynv/mlc:0.20.0-r36.4.0
- Python version (e.g. 3.10): 3.10
- GPU driver version (if applicable):
- CUDA/cuDNN version (if applicable): 12.6
- TVM Unity Hash Tag (python -c "import tvm; print('\n'.join(f'{k}: {v}' for k, v in tvm.support.libinfo().items()))", applicable if you compile models):
- Any other relevant information:
Additional context
jtop 4.3.2 - (c) 2024, Raffaello Bonghi
Website: https://rnext.it/jetson_stats
Platform
Machine: aarch64
System: Linux
Distribution: Ubuntu 22.04 Jammy Jellyfish
Release: 5.15.136-tegra
Python: 3.10.12
Libraries
CUDA: 12.6.68
cuDNN: 8.9.4.25
TensorRT: 10.3.0.30
VPI: 3.2.4
Vulkan: 1.3.204
OpenCV: 4.8.0 with CUDA: NO
Serial Number: [hidden]
Hardware
Model: NVIDIA Jetson Orin NX Engineering Reference Developer Kit
699-Level Part Number: 699-13767-0000-301 G.1
P-Number: p3767-0000
Module: NVIDIA Jetson Orin NX (16GB ram)
SoC: tegra234
CUDA Arch BIN: 8.7
L4T: 36.3.0
Jetpack: 6.0
Hostname: alfiebot
Interfaces
wlan0: 192.168.50.177
docker0: 172.17.0.1
I've also tried a handful of different gen_config variants (below); none results in a usable model, which suggests the conversation template isn't the cause and points at the q4f16_ft weight conversion itself.
mlc_llm gen_config Qwen3-0.6B --quantization q4f16_ft --conv-template chatml -o qwen3-0.6b-q4f16_ft-MLC
mlc_llm gen_config Qwen3-0.6B --quantization q4f16_ft --conv-template qwen2 -o qwen3-0.6b-q4f16_ft-MLC
mlc_llm gen_config Qwen3-0.6B --quantization q4f16_ft --conv-template llama-2 -o qwen3-0.6b-q4f16_ft-MLC
mlc_llm gen_config Qwen3-0.6B --quantization q4f16_ft --conv-template llama-2 --context-window-size 32768 --prefill-chunk-size 4096 -o qwen3-0.6b-q4f16_ft-MLC
mlc_llm gen_config Qwen3-0.6B --quantization q4f16_ft --conv-template chatml --context-window-size 32768 --prefill-chunk-size 4096 -o qwen3-0.6b-q4f16_ft-MLC
mlc_llm gen_config Qwen3-0.6B --quantization q4f16_ft --conv-template chatml --context-window-size 32768 --prefill-chunk-size 1024 -o qwen3-0.6b-q4f16_ft-MLC
mlc_llm gen_config Qwen3-0.6B --quantization q4f16_ft --conv-template chatml --context-window-size 32768 --max-batch-size 1 --prefill-chunk-size 1024 -o qwen3-0.6b-q4f16_ft-MLC
Hello, my device is an NVIDIA Jetson Orin. I also use jetson-containers to run mlc_llm, and I deployed Qwen3-8B on it, but my model cannot use function calling.
I'd like to know: does your Qwen model support function calling?
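To make the question concrete, this is roughly the kind of request I'm sending against the OpenAI-compatible endpoint that mlc_llm serve exposes (the model path, tool name, and schema here are placeholders for illustration; whether the model actually emits tool calls is exactly what I'm unsure about):

```sh
# Illustrative tools request against mlc_llm serve's OpenAI-compatible API
# (default endpoint 127.0.0.1:8000; model name/paths are placeholders)
mlc_llm serve qwen3-8b-q4f16_1-MLC --device cuda &

curl -s http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-8b-q4f16_1-MLC",
    "messages": [{"role": "user", "content": "What is the weather in Hanoi?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Return current weather for a city",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }]
  }'
```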
Can I ask you something about the setup? I'm currently setting up the environment to build and run on Android, but the build complains that libmlc_llm_module.dylib is missing. From what I've read, I'm supposed to generate it myself (create a build folder and build it). Do I really need to do that? The docs say that installing via conda and pip from the linked instructions should be enough, but I did exactly that and the log still reports the missing file.
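For reference, these are the from-source build steps I found in the MLC-LLM docs (a sketch; I'm assuming this is what produces the libmlc_llm_module.* library, and that flags differ per platform):

```sh
# From-source build per the MLC-LLM docs (sketch; adjust for your platform)
git clone --recursive https://github.com/mlc-ai/mlc-llm.git
cd mlc-llm
mkdir -p build && cd build
python ../cmake/gen_cmake_config.py   # interactive: select TVM source, CUDA/Metal/Vulkan, etc.
cmake .. && cmake --build . --parallel
# on success, the build directory should contain libmlc_llm.* and libmlc_llm_module.*
```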