*Major T/s improvement* Use the Metal qmatmul MM kernels
This PR adds the automatic usage of Metal GGML quantized mat-mat kernels instead of always using the mat-vec kernels and upstreams a few related/necessary changes.
Before this change, Candle's Metal decoding performance was on par with MLX and llama.cpp, but prompt-processing performance lagged well behind. After this change, prompt performance on the benchmark improved by a factor of almost 6x, making it roughly 2.5x faster than MLX and within 10% of llama.cpp.
This PR switches to using the MV kernels only when the size of dimension D::Minus2 of the xs input tensor is 1, mirroring the logic in GGML.
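For illustration, here is a minimal sketch of that dispatch rule using candle's public `Tensor` API; the helper name is hypothetical and is not the actual code in this PR:

```rust
use candle_core::{D, DType, Device, Result, Tensor};

/// Hypothetical helper mirroring the rule described above: use the Metal
/// mat-vec kernel only when the second-to-last dimension of `xs` is 1
/// (a single-token decode step); otherwise take the mat-mat path.
fn prefer_mat_vec(xs: &Tensor) -> Result<bool> {
    Ok(xs.dim(D::Minus2)? == 1)
}

fn main() -> Result<()> {
    let dev = Device::Cpu;
    // Decode step: one token row per forward -> mat-vec kernel.
    let decode_xs = Tensor::zeros((1, 1, 4096), DType::F32, &dev)?;
    // Prompt processing: many token rows at once -> mat-mat kernel.
    let prompt_xs = Tensor::zeros((1, 128, 4096), DType::F32, &dev)?;
    assert!(prefer_mat_vec(&decode_xs)?);
    assert!(!prefer_mat_vec(&prompt_xs)?);
    Ok(())
}
```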
Besides utilizing the MM kernels, this PR also upstreams some required changes:
- Adds GGUF bf16 support (originally)
- Updates quantized Metal kernels to support bf16 (originally)
- Sync GGML <> Candle Metal kernels (originally)
@LaurentMazare if you could review, that would be great!
More benchmarks with some smaller models can be found here: https://github.com/EricLBuehler/mistral.rs/issues/903#issuecomment-2477442513
Why is this still not closed/merged?
Without merging this PR, is Candle still slower than llama.cpp/GGML right now? Or has this improvement already landed through other PRs?
@null-define without this, Candle's Metal prompt performance is significantly reduced, because we aren't using the specialized matrix-matrix kernels and instead run the matrix-vector kernels repeatedly, which is slower.
Wondering why this isn't merged into main? Is candle not being well maintained anymore?
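To make that concrete, here is a rough usage sketch (the shapes, dtype, and CPU device are chosen purely for illustration): prompt processing pushes many token rows through a single `QMatMul` call, which is where the MM kernels apply, while decoding pushes one row at a time and stays on the MV path.

```rust
use candle_core::quantized::{GgmlDType, QMatMul, QTensor};
use candle_core::{DType, Device, Module, Result, Tensor};

fn main() -> Result<()> {
    // CPU for the sketch; with the `metal` feature, Device::new_metal(0)?
    // exercises the kernels discussed in this PR.
    let dev = Device::Cpu;

    // Illustrative 4096x4096 linear weight, quantized to Q4_0.
    let weight = Tensor::zeros((4096, 4096), DType::F32, &dev)?;
    let qweight = QTensor::quantize(&weight, GgmlDType::Q4_0)?;
    let qmatmul = QMatMul::from_qtensor(qweight)?;

    // Prompt processing: 128 token rows in a single call -> the MM path.
    let prompt_xs = Tensor::zeros((1, 128, 4096), DType::F32, &dev)?;
    let _prompt_out = qmatmul.forward(&prompt_xs)?;

    // Decoding: one token row per call -> the MV path.
    let decode_xs = Tensor::zeros((1, 1, 4096), DType::F32, &dev)?;
    let _decode_out = qmatmul.forward(&decode_xs)?;
    Ok(())
}
```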
This is an amazing improvement!
After testing across 11 GGUF LLMs, the new code is 73% faster than the current version, exceeding llama.cpp speeds on my M3 Max.
Data
- Candle (CPU): 2.58 avg tokens/sec
- Candle (Metal): 27.07 avg tokens/sec
- MLX: 62.96 avg tokens/sec
- Llama.cpp: 82.78 avg tokens/sec
- Candle (Metal) + PR 2615: 100.60 avg tokens/sec
Computer Specs
- M3 Max
- 36GB RAM
- macOS 15.3.2 (24D81)
@LaurentMazare What would it take to get this merged?
@meg-huggingface Please consider merging it.
wow, this sounds amazing. we could sure use any speed boost we can get on Metal. The original PR is from almost 5 months ago. Why is there no discussion about why it's not merged yet?
@LaurentMazare would this help with inference speed of Metal for Flux and SD3 image generation?
Has the candle team abandoned this lib?
just checking in again on this hanging PR. is there anyone out there who can review? do we need to do it differently or fix something? thanks
@LaurentMazare sorry to bug you, but is there someone else we can ping to get this approved, or at least some comment on why it's still sitting here after so many months? thanks
I think HuggingFace abandoned the Candle project.
is that real?
I think it is now mainly community driven, and the core developers are slow to merge new features and don't even add support for new ones, such as many low-level ONNX ops. I couldn't get any response or support for those.
maybe they should add some more admins who have merge authority. it seems like there are many people ready to work.
LGTM