
[BUG] RuntimeError: QuantizedMatmul has no CUDA implementation

Open · ehartford opened this issue 4 months ago · 2 comments

Describe the bug
RuntimeError: QuantizedMatmul has no CUDA implementation

To Reproduce

import mlx.core as mx
import mlx_lm
model, tokenizer = mlx_lm.load("mlx-community/Qwen3-30B-A3B-Thinking-2507-4bit")
logits = model(mx.array([[1]]))
mx.eval(logits)

Expected behavior
The model should load successfully and return a tensor of logits without errors. The mx.eval() call should complete execution and the logits variable should contain the model's output predictions for the input token.
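
As a restatement of that expectation in code, a minimal sketch (the shape comment is illustrative; the exact vocabulary size is the model's and is not confirmed here):

import mlx.core as mx
import mlx_lm

# Same repro as above; on a back-end that implements QuantizedMatmul,
# evaluation should complete without raising.
model, tokenizer = mlx_lm.load("mlx-community/Qwen3-30B-A3B-Thinking-2507-4bit")
logits = model(mx.array([[1]]))
mx.eval(logits)
print(logits.shape)  # expected: (1, 1, vocab_size) for the single input token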

Desktop (please complete the following information):

  • OS Version: Ubuntu 24.04.3 LTS
  • Version:
    • mlx==0.28.0
    • mlx-cuda==0.28.0
    • mlx-lm==0.26.3

Additional context
I was trying to run mlx-bench.py with the command:

$ python ./mlx-bench.py -m mlx-community/Qwen3-30B-A3B-Thinking-2507-4bit

Full stack trace:

Test configuration:
  Prompt tokens: 512 (random)
  Generation tokens: 128 (random)
  Repetitions: 5
  Warmup: True

Running warmup...
  Warmup prompt processing...
Error: QuantizedMatmul has no CUDA implementation.
Traceback (most recent call last):
  File "/home/eric/benchmarks/./mlx-bench.py", line 232, in main
    results = run_benchmark(
              ^^^^^^^^^^^^^^
  File "/home/eric/benchmarks/./mlx-bench.py", line 117, in run_benchmark
    _ = test_prompt(model, tokenizer, min(32, n_prompt))
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/eric/benchmarks/./mlx-bench.py", line 47, in test_prompt
    mx.eval(logits)  # Force evaluation
    ^^^^^^^^^^^^^^^
RuntimeError: QuantizedMatmul has no CUDA implementation.

ehartford · Aug 25 '25 00:08

Quantized matmuls are a work in progress for the CUDA back-end. It's probably the top priority; hopefully they will land in an upcoming release 🤞

You should be able to run benchmarks for half-precision (or fp32) models (e.g. mlx-community/Meta-Llama-3.1-8B-Instruct-bf16)
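
As an illustration of that suggestion via the Python API (a minimal sketch; max_tokens is an assumed keyword passed through to the generator, and the model name is the one mentioned above):

import mlx_lm

# Load an unquantized (bf16) model, which avoids QuantizedMatmul entirely,
# and run a short generation to confirm the back-end works.
model, tokenizer = mlx_lm.load("mlx-community/Meta-Llama-3.1-8B-Instruct-bf16")
text = mlx_lm.generate(model, tokenizer, prompt="hello", max_tokens=32)
print(text)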

awni · Aug 25 '25 00:08

Many thanks for your fantastic work! I've conducted initial tests on the qmm branch. The quantized model runs, but the output is garbled. Do you know what might be causing this problem?

Could you please provide an approximate date for when quantized matrix multiplication, as well as Gather-gmm and KV-cache quantization, will be supported?

yanghaojin · Oct 24 '25 08:10

Same issue encountered.

Environment

  • OS: Ubuntu 22.04
  • GPU: L20
  • CUDA: 12.8
  • Versions:
    • mlx 0.30.0
    • mlx-cuda 0.30.0
    • mlx-lm 0.28.3
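
For completeness, a quick way to confirm which back-end MLX selected on this setup (not from the original report; the exact device repr may differ by version):

import mlx.core as mx

print(mx.__version__)        # should report 0.30.0 here
print(mx.default_device())   # typically Device(gpu, 0) when a GPU back-end is active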

issue

(mlxlm) test@node1:~# mlx_lm.generate --model mlx-community/Qwen3-0.6B-8bit --prompt "hello"
(Hugging Face download progress output elided)
Fetching 9 files: 100%|██████████| 9/9 [00:12<00:00,  1.42s/it]
==========
Traceback (most recent call last):
  File "/home/test/anaconda3/envs/mlxlm/bin/mlx_lm.generate", line 7, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/test/anaconda3/envs/mlxlm/lib/python3.12/site-packages/mlx_lm/generate.py", line 1227, in main
    response = generate(
               ^^^^^^^^^
  File "/home/test/anaconda3/envs/mlxlm/lib/python3.12/site-packages/mlx_lm/generate.py", line 759, in generate
    for response in stream_generate(model, tokenizer, prompt, **kwargs):
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/test/anaconda3/envs/mlxlm/lib/python3.12/site-packages/mlx_lm/generate.py", line 696, in stream_generate
    for n, (token, logprobs, from_draft) in enumerate(token_generator):
                                            ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/test/anaconda3/envs/mlxlm/lib/python3.12/site-packages/mlx_lm/generate.py", line 686, in <genexpr>
    (token, logprobs, False) for token, logprobs in token_generator
                                                    ^^^^^^^^^^^^^^^
  File "/home/test/anaconda3/envs/mlxlm/lib/python3.12/site-packages/mlx_lm/generate.py", line 431, in generate_step
    mx.eval([c.state for c in prompt_cache])
RuntimeError: QuantizedMatmul has no CUDA implementation.

I tried the bf16 version of the model and it worked fine:

(mlxlm) test@node1:~# mlx_lm.generate --model mlx-community/Qwen3-0.6B-bf16 --prompt "hello"
(Hugging Face download progress output elided)
Fetching 9 files: 100%|██████████| 9/9 [00:16<00:00,  1.82s/it]
==========
<think>
Okay, the user just said "hello". I need to respond appropriately. Since they didn't ask a specific question, I should acknowledge their greeting. Maybe say "Hello!" and offer help. Let me check if there's any context I'm missing, but the message seems straightforward. I'll keep it friendly and open-ended.
</think>

Hello! How can I assist you today? 😊
==========
Prompt: 9 tokens, 2.304 tokens-per-sec
Generation: 82 tokens, 400.587 tokens-per-sec
Peak memory: 1.223 GB

jiyzhang · Nov 28 '25 02:11