[Bug] Trouble running `mlc_llm chat` with Gemma 3 models
🐛 Bug
I am having trouble running mlc-llm with Gemma 3 models on my M3 Pro MacBook (details below). The error is the same in every failing case:
libc++abi: terminating due to uncaught exception of type tvm::runtime::InternalError: [17:09:18] {path/to/mlc-llm}/cpp/serve/sampler/cpu_sampler.cc:80: InternalError: Check failed: (false) is false: Possibly prob distribution contains NAN.
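For context on what that check means, here is a tiny NumPy illustration of my own (not mlc-llm code) showing how a single non-finite logit turns the whole softmax distribution into NaNs, which appears to be exactly the situation the sampler check guards against:

```python
# Illustration only (not mlc-llm code): one garbage logit poisons the whole
# softmax output, so every probability the sampler sees becomes NaN.
import numpy as np

logits = np.array([1.0, 2.0, np.nan], dtype=np.float32)  # one non-finite logit
shifted = logits - logits.max()                           # NaN propagates through the max
probs = np.exp(shifted) / np.exp(shifted).sum()
print(probs)                                              # [nan nan nan]
print(np.isnan(probs).any())                              # True -> the check in cpu_sampler.cc would fire
```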
I tried to compile and run the following Gemma 3 models on my M3 Pro MacBook:
- https://huggingface.co/google/gemma-3-12b-it (X; fails)
- https://huggingface.co/google/gemma-3-4b-it (X; fails)
- https://huggingface.co/google/gemma-3-1b-it (O; runs fine)
First, I suspected my own compilation, so I tried running `mlc_llm chat` with the pre-compiled MLC models on Hugging Face:
- https://huggingface.co/mlc-ai/gemma-3-12b-it-q4f16_1-MLC (X; same error)
- https://huggingface.co/mlc-ai/gemma-3-4b-it-q4f16_1-MLC (X; same error)
- https://huggingface.co/mlc-ai/gemma-3-1b-it-q4f16_1-MLC (O)
I got the same errors as with the models I compiled myself, so it looks like an internal problem in mlc-llm.
My machine has enough memory to run gemma-2-9b-it, so this doesn't seem to be a free-memory issue (and it should certainly be enough for gemma-3-4b-it).
Also, "gemma3_instruction" is missing from the conversation-template list in gen_config.py (https://github.com/mlc-ai/mlc-llm/blob/main/python/mlc_llm/interface/gen_config.py#L264), even though the Gemma 3 template itself has been added (https://github.com/mlc-ai/mlc-llm/blob/main/python/mlc_llm/conversation_template/gemma.py#L23-L37).
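To double-check that last point, a quick check along these lines should show the template registered but absent from the gen_config allow-list. The identifier names (`CONV_TEMPLATES`, `ConvTemplateRegistry`, `get_conv_template`) are my assumptions based on the two files linked above and may differ per checkout:

```python
# Quick sanity check; identifier names are assumptions taken from the linked files.
from mlc_llm.interface.gen_config import CONV_TEMPLATES         # allow-list referenced at gen_config.py#L264
from mlc_llm.conversation_template import ConvTemplateRegistry  # registry that gemma.py#L23-L37 registers into

print("gemma3_instruction" in CONV_TEMPLATES)                        # expected: False (missing from the allow-list)
print(ConvTemplateRegistry.get_conv_template("gemma3_instruction"))  # expected: the registered Conversation template
```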
To Reproduce
Steps to reproduce the behavior:
- Run `mlc_llm chat HF://mlc-ai/gemma-3-12b-it-q4f16_1-MLC` on an M3 Pro MacBook, or
- Run `mlc_llm chat HF://mlc-ai/gemma-3-4b-it-q4f16_1-MLC` on an M3 Pro MacBook.

(A Python-API equivalent of this repro is sketched below.)
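For completeness, here is a Python-API equivalent of the repro, adapted from the mlc-llm quick-start. I only reproduced the crash through the CLI, so treat this as a sketch of the same path rather than a verified repro:

```python
# Sketch of the same repro via the Python API (adapted from the mlc-llm quick-start).
from mlc_llm import MLCEngine

model = "HF://mlc-ai/gemma-3-4b-it-q4f16_1-MLC"  # or the 12B variant
engine = MLCEngine(model)

# Stream a single prompt; the CLI crash happens during sampling, so this should hit the same code path.
for response in engine.chat.completions.create(
    messages=[{"role": "user", "content": "Hello"}],
    model=model,
    stream=True,
):
    for choice in response.choices:
        print(choice.delta.content, end="", flush=True)

engine.terminate()
```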
Expected behavior
The chat starts, as it does for the 1B model (output copied from running `mlc_llm chat HF://mlc-ai/gemma-3-1b-it-q4f16_1-MLC`):
You can use the following special commands:
/help print the special commands
/exit quit the cli
/stats print out stats of last request (token/sec)
/metrics print out full engine metrics
/reset restart a fresh chat
/set [overrides] override settings in the generation config. For example,
`/set temperature=0.5;top_p=0.8;seed=23;max_tokens=100;stop=str1,str2`
Note: Separate stop words in the `stop` option with commas (,).
Multi-line input: Use escape+enter to start a new line.
>>>
Environment
- Platform (e.g. WebGPU/Vulkan/IOS/Android/CUDA): Metal
- Operating system (e.g. Ubuntu/Windows/MacOS/...): MacOS
- Device (e.g. iPhone 12 Pro, PC+RTX 3090, ...): M3 Pro Macbook 14
- How you installed MLC-LLM (conda, source): source
- How you installed TVM-Unity (pip, source): source (3rdparty/tvm)
- Python version (e.g. 3.10): 3.11.10
- GPU driver version (if applicable):
- CUDA/cuDNN version (if applicable):
- TVM Unity Hash Tag (`python -c "import tvm; print('\n'.join(f'{k}: {v}' for k, v in tvm.support.libinfo().items()))"`, applicable if you compile models): 2b78e8b16073fb74c9e250eb50a898f4421ae3bc
- Any other relevant information: