
[Bug] Trouble running `mlc_llm chat` with Gemma 3 models

Open · grf53 opened this issue 6 months ago · 3 comments

🐛 Bug

I am having trouble running mlc-llm with Gemma 3 models on an M3 Pro MacBook (details below). The error is the same in every failing case:

libc++abi: terminating due to uncaught exception of type tvm::runtime::InternalError: [17:09:18] {path/to/mlc-llm}/cpp/serve/sampler/cpu_sampler.cc:80: InternalError: Check failed: (false) is false: Possibly prob distribution contains NAN.

I tried to compile and run the following Gemma 3 models on my M3 Pro MacBook (the rough commands I used are sketched after this list):

  • https://huggingface.co/google/gemma-3-12b-it (X; fails)
  • https://huggingface.co/google/gemma-3-4b-it (X; fails)
  • https://huggingface.co/google/gemma-3-1b-it (O; runs fine)
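
For reference, this is roughly the compile-and-run sequence I used for the local builds. It is only a sketch: the local paths, output names, and the --conv-template value reflect my setup, I am assuming a .dylib model library on macOS, and the exact flags may differ.

  # convert the HF weights, generate the chat config, compile the model library, then chat
  mlc_llm convert_weight ./gemma-3-4b-it --quantization q4f16_1 -o ./gemma-3-4b-it-q4f16_1-MLC
  mlc_llm gen_config ./gemma-3-4b-it --quantization q4f16_1 --conv-template gemma_instruction -o ./gemma-3-4b-it-q4f16_1-MLC
  mlc_llm compile ./gemma-3-4b-it-q4f16_1-MLC/mlc-chat-config.json --device metal -o ./gemma-3-4b-it-q4f16_1-MLC/gemma-3-4b-it-q4f16_1-metal.dylib
  mlc_llm chat ./gemma-3-4b-it-q4f16_1-MLC --model-lib ./gemma-3-4b-it-q4f16_1-MLC/gemma-3-4b-it-q4f16_1-metal.dylib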

At first I suspected my own compilation, so I tried running mlc_llm chat with the pre-compiled MLC models from Hugging Face:

  • https://huggingface.co/mlc-ai/gemma-3-12b-it-q4f16_1-MLC (X; same error)
  • https://huggingface.co/mlc-ai/gemma-3-4b-it-q4f16_1-MLC (X; same error)
  • https://huggingface.co/mlc-ai/gemma-3-1b-it-q4f16_1-MLC (O)

I got the same errors as with the models I compiled myself, so this looks like an internal problem in mlc-llm. My machine has enough memory to run gemma-2-9b-it, so it does not appear to be a free-memory issue (and it should certainly be enough for gemma-3-4b-it).


Also, "gemma3_instruction" is missing from the accepted template list in gen_config.py (https://github.com/mlc-ai/mlc-llm/blob/main/python/mlc_llm/interface/gen_config.py#L264), even though the Gemma 3 conversation template has been added (https://github.com/mlc-ai/mlc-llm/blob/main/python/mlc_llm/conversation_template/gemma.py#L23-L37).
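
For illustration, this is a minimal sketch of the entry I expected to find in gen_config.py; the set name and the surrounding entries are my reading of the current code, not a verified patch.

  # Sketch of python/mlc_llm/interface/gen_config.py: CONV_TEMPLATES lists the
  # --conv-template values that gen_config accepts.
  CONV_TEMPLATES = {
      "gemma_instruction",   # existing Gemma / Gemma 2 entry
      "gemma3_instruction",  # missing today, although conversation_template/gemma.py registers it
      # ... other template names unchanged ...
  }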

To Reproduce

Steps to reproduce the behavior:

  1. Run mlc_llm chat HF://mlc-ai/gemma-3-12b-it-q4f16_1-MLC on an M3 Pro MacBook,
  2. or run mlc_llm chat HF://mlc-ai/gemma-3-4b-it-q4f16_1-MLC on an M3 Pro MacBook.

Expected behavior

Chat starts as follows (copied from running mlc_llm chat HF://mlc-ai/gemma-3-1b-it-q4f16_1-MLC):

You can use the following special commands:
  /help               print the special commands
  /exit               quit the cli
  /stats              print out stats of last request (token/sec)
  /metrics            print out full engine metrics
  /reset              restart a fresh chat
  /set [overrides]    override settings in the generation config. For example,
                      `/set temperature=0.5;top_p=0.8;seed=23;max_tokens=100;stop=str1,str2`
                      Note: Separate stop words in the `stop` option with commas (,).
  Multi-line input: Use escape+enter to start a new line.

>>> 

Environment

  • Platform (e.g. WebGPU/Vulkan/IOS/Android/CUDA): Metal
  • Operating system (e.g. Ubuntu/Windows/MacOS/...): MacOS
  • Device (e.g. iPhone 12 Pro, PC+RTX 3090, ...): M3 Pro MacBook 14
  • How you installed MLC-LLM (conda, source): source
  • How you installed TVM-Unity (pip, source): source (3rdparty/tvm)
  • Python version (e.g. 3.10): 3.11.10
  • GPU driver version (if applicable):
  • CUDA/cuDNN version (if applicable):
  • TVM Unity Hash Tag (python -c "import tvm; print('\n'.join(f'{k}: {v}' for k, v in tvm.support.libinfo().items()))", applicable if you compile models): 2b78e8b16073fb74c9e250eb50a898f4421ae3bc
  • Any other relevant information:

Additional context

grf53 · Apr 16 '25 08:04