[BUG] RuntimeError: QuantizedMatmul has no CUDA implementation
Describe the bug: RuntimeError: QuantizedMatmul has no CUDA implementation
To Reproduce
import mlx.core as mx
import mlx_lm
model, tokenizer = mlx_lm.load("mlx-community/Qwen3-30B-A3B-Thinking-2507-4bit")
logits = model(mx.array([[1]]))
mx.eval(logits)  # fails with: RuntimeError: QuantizedMatmul has no CUDA implementation
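A smaller probe that hits the same primitive without downloading a model (just a sketch using mx.quantize / mx.quantized_matmul; I assume these lower to the same QuantizedMatmul op):

import mlx.core as mx

# Quantize a small weight matrix and attempt a quantized matmul on the default device.
w = mx.random.normal((64, 64))
wq, scales, biases = mx.quantize(w, group_size=64, bits=4)
x = mx.random.normal((1, 64))
out = mx.quantized_matmul(x, wq, scales, biases, transpose=True, group_size=64, bits=4)
mx.eval(out)  # on the CUDA backend this should raise the same RuntimeError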
Expected behavior: The model should load successfully and return a tensor of logits without errors. The mx.eval() call should complete, and the logits variable should contain the model's output predictions for the input token.
Desktop (please complete the following information):
- OS Version: Ubuntu 24.04.3 LTS
- Versions:
  - mlx==0.28.0
  - mlx-cuda==0.28.0
  - mlx-lm==0.26.3
Additional context: I was trying to run mlx-bench.py with the command:
$ python ./mlx-bench.py -m mlx-community/Qwen3-30B-A3B-Thinking-2507-4bit
Full stack trace:
Test configuration:
Prompt tokens: 512 (random)
Generation tokens: 128 (random)
Repetitions: 5
Warmup: True
Running warmup...
Warmup prompt processing...
Error: QuantizedMatmul has no CUDA implementation.
Traceback (most recent call last):
File "/home/eric/benchmarks/./mlx-bench.py", line 232, in main
results = run_benchmark(
^^^^^^^^^^^^^^
File "/home/eric/benchmarks/./mlx-bench.py", line 117, in run_benchmark
_ = test_prompt(model, tokenizer, min(32, n_prompt))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/eric/benchmarks/./mlx-bench.py", line 47, in test_prompt
mx.eval(logits) # Force evaluation
^^^^^^^^^^^^^^^
RuntimeError: QuantizedMatmul has no CUDA implementation.
Quantized matmuls are WIP for the CUDA back-end. It's probably the top priority; hopefully they will land in an upcoming release 🤞
In the meantime, you should be able to run benchmarks for half-precision (or fp32) models (e.g. mlx-community/Meta-Llama-3.1-8B-Instruct-bf16).
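For example, the repro above should run if you just swap in an unquantized checkpoint (a minimal sketch, assuming the bf16 model fits in GPU memory):

import mlx.core as mx
import mlx_lm

# Same steps as the repro, but with a bf16 model so no quantized kernels are involved.
model, tokenizer = mlx_lm.load("mlx-community/Meta-Llama-3.1-8B-Instruct-bf16")
logits = model(mx.array([[1]]))
mx.eval(logits)  # should evaluate on the CUDA backend without hitting QuantizedMatmul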
Many thanks for your fantastic work! I've conducted initial tests on the qmm branch. The quantized model runs, but the output is garbled. Do you know what might be causing this problem?
Could you please give a rough timeline for when support for quantized matrix multiplication, as well as gather_qmm and KV-cache quantization, will be available?
Same issue encountered.
Environment:
- OS: Ubuntu 22.04
- GPU: L20
- CUDA: 12.8
- Versions:
  - mlx 0.30.0
  - mlx-cuda 0.30.0
  - mlx-lm 0.28.3
Issue:
(mlxlm) test@node1:~# mlx_lm.generate --model mlx-community/Qwen3-0.6B-8bit --prompt "hello"
special_tokens_map.json: 100%|████████████████████████████████████████████████████████████████████████████████████████| 613/613 [00:00<00:00, 2.37MB/s]
config.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 937/937 [00:00<00:00, 3.80MB/s]
added_tokens.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 707/707 [00:00<00:00, 3.10MB/s]
tokenizer_config.json: 10.2kB [00:00, 18.4MB/s]
model.safetensors.index.json: 49.7kB [00:00, 51.5MB/s]
merges.txt: 1.67MB [00:00, 8.28MB/s]
tokenizer.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████| 11.4M/11.4M [00:01<00:00, 8.81MB/s]
vocab.json: 2.78MB [00:00, 11.9MB/s]
model.safetensors: 100%|████████████████████████████████████████████████████████████████████████████████████████████| 633M/633M [00:12<00:00, 52.6MB/s]
Fetching 9 files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 9/9 [00:12<00:00, 1.42s/it]
==========
Traceback (most recent call last):
File "/home/test/anaconda3/envs/mlxlm/bin/mlx_lm.generate", line 7, in <module>
sys.exit(main())
^^^^^^
File "/home/test/anaconda3/envs/mlxlm/lib/python3.12/site-packages/mlx_lm/generate.py", line 1227, in main
response = generate(
^^^^^^^^^
File "/home/test/anaconda3/envs/mlxlm/lib/python3.12/site-packages/mlx_lm/generate.py", line 759, in generate
for response in stream_generate(model, tokenizer, prompt, **kwargs):
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/test/anaconda3/envs/mlxlm/lib/python3.12/site-packages/mlx_lm/generate.py", line 696, in stream_generate
for n, (token, logprobs, from_draft) in enumerate(token_generator):
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/test/anaconda3/envs/mlxlm/lib/python3.12/site-packages/mlx_lm/generate.py", line 686, in <genexpr>
(token, logprobs, False) for token, logprobs in token_generator
^^^^^^^^^^^^^^^
File "/home/test/anaconda3/envs/mlxlm/lib/python3.12/site-packages/mlx_lm/generate.py", line 431, in generate_step
mx.eval([c.state for c in prompt_cache])
RuntimeError: QuantizedMatmul has no CUDA implementation.
Tried running the bf16 version instead, and it worked well:
(mlxlm) test@node1:~# mlx_lm.generate --model mlx-community/Qwen3-0.6B-bf16 --prompt "hello"
config.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 784/784 [00:00<00:00, 3.05MB/s]
added_tokens.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 707/707 [00:00<00:00, 2.81MB/s]
special_tokens_map.json: 100%|████████████████████████████████████████████████████████████████████████████████████████| 613/613 [00:00<00:00, 2.67MB/s]
model.safetensors.index.json: 22.1kB [00:00, 38.3MB/s]
tokenizer_config.json: 9.71kB [00:00, 18.3MB/s]
merges.txt: 1.67MB [00:00, 20.0MB/s]
tokenizer.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████| 11.4M/11.4M [00:01<00:00, 9.02MB/s]
vocab.json: 2.78MB [00:00, 15.5MB/s]
model.safetensors: 100%|██████████████████████████████████████████████████████████████████████████████████████████| 1.19G/1.19G [00:15<00:00, 76.2MB/s]
Fetching 9 files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 9/9 [00:16<00:00, 1.82s/it]
==========
<think>
Okay, the user just said "hello". I need to respond appropriately. Since they didn't ask a specific question, I should acknowledge their greeting. Maybe say "Hello!" and offer help. Let me check if there's any context I'm missing, but the message seems straightforward. I'll keep it friendly and open-ended.
</think>
Hello! How can I assist you today? 😊
==========
Prompt: 9 tokens, 2.304 tokens-per-sec
Generation: 82 tokens, 400.587 tokens-per-sec
Peak memory: 1.223 GB