
Optimized DeepSeek V2/V3 implementation (MLA)

Open fairydreaming opened this issue 10 months ago • 105 comments

This PR introduces various optimizations for the DeepSeek V2/V3 implementation:

  • caching latent representations instead of full key/value vectors
  • replacing the "naive" attention implementation with one based on intermediate representations (https://github.com/deepseek-ai/DeepSeek-V3/blob/main/inference/model.py)
  • splitting the kv_b model tensors into k_b and v_b to avoid doing it during inference

Note that you need to reconvert the model to use this implementation.
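
For readers who want a mental model of what "caching latent representations" means here, below is a rough numpy sketch of attention over a latent KV cache. The shapes loosely follow DeepSeek V2 Lite, the names (c_kv, k_b, v_b, q_nope, q_pe) are illustrative, and the actual ggml graph in this PR may order the multiplications differently.

import numpy as np

# Illustrative dimensions, loosely based on DeepSeek V2 Lite (assumed, not taken from this PR):
n_head, d_nope, d_rope, d_v, r = 16, 128, 64, 128, 512   # r = kv_lora_rank
rng = np.random.default_rng(0)
T = 8                                                     # tokens already in the cache

# What the cache holds per token: the compressed latent plus the shared RoPE'd key part.
c_kv = rng.standard_normal((T, r))
k_pe = rng.standard_normal((T, d_rope))

# kv_b split offline into k_b and v_b (the "split kv_b" bullet above):
k_b = rng.standard_normal((n_head, d_nope, r))            # latent -> per-head K_nope
v_b = rng.standard_normal((n_head, d_v, r))               # latent -> per-head V

# One query token, already projected per head:
q_nope = rng.standard_normal((n_head, d_nope))
q_pe   = rng.standard_normal((n_head, d_rope))

# Scores: project the query into the latent space once, then dot with the cached latents.
q_lat  = np.einsum('hdr,hd->hr', k_b, q_nope)             # (n_head, r)
scores = (q_lat @ c_kv.T + q_pe @ k_pe.T) / np.sqrt(d_nope + d_rope)
probs  = np.exp(scores - scores.max(-1, keepdims=True))
probs /= probs.sum(-1, keepdims=True)

# Output: accumulate in latent space, then expand once per query token with v_b.
o_lat  = probs @ c_kv                                     # (n_head, r)
out    = np.einsum('hdr,hr->hd', v_b, o_lat)              # (n_head, d_v)
print(out.shape)                                          # (16, 128)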

Performance compared to the previous "naive" implementation:

[Benchmark plots: deepseek-mla, deepseek-lite-mla-pp, deepseek-r1-mla, deepseek-mla-pp]

CUDA performance is worse for short context lengths, but the curve is flatter:

[Benchmark plots: deepseek-lite-mla, deepseek-lite-cuda-mla-pp]

TODO:

  • remove unused kv_b tensor from the model
  • maybe add support for old model files (compute k_b and v_b during inference with reduced performance)
  • wait for completion of: #11213
  • implement MLA KV cache
  • ~address regressions in prompt processing performance (different permutations of tensors?)~ - I don't think it's possible, as this implementation is more compute-intensive than the regular attention implementation

fairydreaming avatar Jan 27 '25 09:01 fairydreaming

@fairydreaming do you have a converted model available or instructions for replicating your setup? I would like to run some benchmarks on these changes.

wronkiew avatar Jan 29 '25 17:01 wronkiew

@fairydreaming do you have a converted model available or instructions for replicating your setup? I would like to run some benchmarks on these changes.

@wronkiew What model would you like to test?

fairydreaming avatar Jan 29 '25 18:01 fairydreaming

@fairydreaming do you have a converted model available or instructions for replicating your setup? I would like to run some benchmarks on these changes.

@wronkiew What model would you like to test?

V3/R1, Q4_K_S.

wronkiew avatar Jan 29 '25 19:01 wronkiew

@fairydreaming do you have a converted model available or instructions for replicating your setup? I would like to run some benchmarks on these changes.

@wronkiew What model would you like to test?

V3/R1, Q4_K_S.

@wronkiew I don't have the model uploaded (my upload bandwidth is too low); you have to download the original model, convert it to bf16, convert that to GGUF, and quantize it yourself (or download one that is already converted to bf16, which will save you one step).

fairydreaming avatar Jan 29 '25 21:01 fairydreaming

I spent some time investigating this hint from the DeepSeek V2 paper:

Fortunately, due to the associative law of matrix multiplication, we can absorb $𝑊^{𝑈𝐾}$ into $𝑊^{𝑈𝑄}$ , and $𝑊^{𝑈𝑉}$ into $𝑊^𝑂$

At first glance it looks reasonable: each absorbed matrix replaces two matrix multiplications with a single one, thus reducing the number of operations.

However, when we take a look at the dimensions of these matrices, this stops being reasonable. For example, in DeepSeek V2 Lite:

  • $𝑊^{𝑈𝑄}$ tensor has shape [2048, 2048], that is [16, 2048, 128] after reshaping to 3d and permutation
  • $𝑊^{𝑈𝐾}$ tensor has shape [128, 8192], that is [16, 512, 128] after reshaping to 3d and permutation
  • combined "absorbed" tensor has shape [16, 512, 2048]

So (ignoring the head dimension) this replaces two multiplications, by a [2048, 128] matrix and a [512, 128] matrix, with a single multiplication by a [512, 2048] matrix. The combined matrix has over 3x the elements of the two individual matrices, so it will take more memory and will actually be slower to multiply than the two smaller matrices.
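
A quick standalone size check of the above (just arithmetic, not code from this PR):

# Element counts for the W^UQ / W^UK example above (DeepSeek V2 Lite shapes):
n_head, d_head, d_model, kv_lora_rank = 16, 128, 2048, 512

separate = n_head * (d_model * d_head + kv_lora_rank * d_head)   # W^UQ and W^UK kept apart
absorbed = n_head * (kv_lora_rank * d_model)                     # combined [16, 512, 2048] tensor

print(separate, absorbed, absorbed / separate)                   # 5242880 16777216 3.2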

With $𝑊^{𝑈𝑉}$ and $𝑊^𝑂$ it's the same story:

  • $𝑊^{𝑈𝑉}$ tensor has shape [2048, 512], that is [16, 512, 128] after reshaping to 3d and permutation
  • $𝑊^𝑂$ tensor has shape [2048, 2048], that is [16, 2048, 128] after reshaping to 3d and permutation
  • combined "absorbed" tensor has shape [16, 512, 2048]

I also found this blog post: https://github.com/xjdr-alt/mla_blog_translation where they mention:

Compared to performing projection with these particularly large low-rank matrices, it is obviously more advantageous to multiply them successively according to the low-rank decomposition form. Therefore, we believe that this optimization step is not very necessary.

So it looks like a dead end; it won't give us any speed gains.

fairydreaming avatar Jan 30 '25 11:01 fairydreaming

I ran into an issue with DeepSeek-R1-UD-Q2_K_XL from unsloth/DeepSeek-R1-GGUF

llama_model_load: error loading model: missing tensor 'blk.0.attn_k_b.weight'
llama_model_load_from_file_impl: failed to load model

divine-taco avatar Jan 31 '25 11:01 divine-taco

I ran into an issue with DeepSeek-R1-UD-Q2_K_XL from unsloth/DeepSeek-R1-GGUF

llama_model_load: error loading model: missing tensor 'blk.0.attn_k_b.weight'
llama_model_load_from_file_impl: failed to load model

As I wrote in the PR:

Note that you need to reconvert the model to use this implementation.

Existing GGUFs won't work, you have to convert and quantize one with the code from this PR.

fairydreaming avatar Jan 31 '25 12:01 fairydreaming

Ohh hmm should I re-quantize the ones in https://huggingface.co/unsloth/DeepSeek-R1-GGUF?

danielhanchen avatar Feb 02 '25 02:02 danielhanchen

Ohh hmm should I re-quantize the ones in https://huggingface.co/unsloth/DeepSeek-R1-GGUF?

I think it's best to wait a bit until this is stable and merged; it's possible that there will be some changes that would cause them to stop working, and you'd have to repeat the conversion again.

fairydreaming avatar Feb 02 '25 07:02 fairydreaming

I updated the token generation performance plots in the PR post and added some new ones showing prompt processing performance. The optimized implementation generally performs WORSE in prompt processing - DeepSeek R1 671B Q4_K_S running on CPU performs only a little worse (~10% with a 4k prompt), but DeepSeek V2 Lite Q8_0 running on an RTX 4090 performs MUCH WORSE (~30% with a 16k prompt), and in both cases the gap widens as the prompt length increases. So it's not all sunshine and rainbows.

Considering all these performance regressions I think the best course of action would be to put the optimized implementation into a separate model architecture (LLM_ARCH_DEEPSEEK2_MLA or something like this). This will prevent issues with existing GGUFs - they would keep working with the existing architecture. I guess in this case the convert script would have to allow selection of the target model architecture with some option, but that shouldn't be difficult to add. @ggerganov what do you think?

fairydreaming avatar Feb 02 '25 11:02 fairydreaming

Considering all these performance regressions I think the best course of action would be to put the optimized implementation into a separate model architecture (LLM_ARCH_DEEPSEEK2_MLA or something like this). This will prevent issues with existing GGUFs - they would keep working with the existing architecture. I guess in this case the convert script would have to allow selection of the target model architecture with some option, but that shouldn't be difficult to add. @ggerganov what do you think?

While this is possible to do, I think it has a lot of cons. It will make it difficult for everyone to know which model variation on which hardware to use for better performance. Ideally, we want to have a single implementation that is optimal in all use cases, which can be deprecated at some point for a better alternative. But having 2 alternatives neither of which is optimal is not great.

Also, I'm not sure how this implementation fits with multiple parallel sequences and it introduces extra KV cache logic, specific to this type of arch.

I know there is a lot of interest in the DeepSeek arch right now and such optimizations are really important for people. But I think that we have to keep this work in a PR for a while. It is much more important to fix the software architecture in libllama after which such changes should become easier.

ggerganov avatar Feb 02 '25 11:02 ggerganov

Considering all these performance regressions I think the best course of action would be to put the optimized implementation into a separate model architecture (LLM_ARCH_DEEPSEEK2_MLA or something like this). This will prevent issues with existing GGUFs - they would keep working with the existing architecture. I guess in this case the convert script would have to allow selection of the target model architecture with some option, but that shouldn't be difficult to add. @ggerganov what do you think?

While this is possible to do, I think it has a lot of cons. It will make it difficult for everyone to know which model variation on which hardware to use for better performance. Ideally, we want to have a single implementation that is optimal in all use cases, which can be deprecated at some point for a better alternative. But having 2 alternatives neither of which is optimal is not great.

That may not be possible - IMHO an MLA attention implementation that caches "compressed" latent KV representations introduces unavoidable computational overhead due to the need to "decompress" these representations in order to calculate the attention scores and attention output. So a "naive" attention implementation that caches full K/V vectors will always use less compute but more memory bandwidth, while caching latent representations uses more compute but less memory bandwidth. So there can't be a single implementation optimal in all use cases. I'd be happy to be proven wrong about this, though.
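
To put rough numbers on that trade-off, here is a back-of-the-envelope comparison for DeepSeek V2 Lite-like shapes, counting only the per-token attention inner loop and assuming an f16 cache (all figures are my own estimates, not measurements from this PR):

# Assumed DeepSeek V2 Lite-like shapes:
n_head, d_nope, d_rope, d_v, r = 16, 128, 64, 128, 512   # r = kv_lora_rank

# Bytes read from the cache per attended (past) token, per layer (f16 = 2 bytes/element):
naive_bytes = 2 * n_head * (d_nope + d_rope + d_v)       # full per-head K and V vectors = 10240 B
mla_bytes   = 2 * (r + d_rope)                           # latent + shared RoPE key      = 1152 B

# Multiply-adds per attended token, per head (score dot product + output accumulation;
# the per-query-token projections through k_b/v_b are amortized and ignored here):
naive_macs = (d_nope + d_rope) + d_v                     # 320
mla_macs   = (r + d_rope) + r                            # 1088

print(naive_bytes / mla_bytes, mla_macs / naive_macs)    # ~8.9x less bandwidth, ~3.4x more compute

Under these assumptions, which side wins depends on whether the run is bandwidth-bound (typically CPU token generation) or compute-bound (typically GPU prompt processing), which is consistent with the benchmarks above.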

Also, I'm not sure how this implementation fits with multiple parallel sequences and it introduces extra KV cache logic, specific to this type of arch.

I think there shouldn't be any problems with this, as there is a straightforward direct mapping between the cached representations and full K/V vectors.

I know there is a lot of interest in the DeepSeek arch right now and such optimizations are really important for people. But I think that we have to keep this work in a PR for a while. It is much more important to fix the software architecture in libllama after which such changes should become easier.

That's fine with me. I'm taking a break from this anyway, got bored with tensor shuffling looking for 0.1 t/s more performance. 😉

fairydreaming avatar Feb 02 '25 13:02 fairydreaming

@fairydreaming Is there any reason this should cause issues with RPC? Encountered:

ggml_cuda_compute_forward: cannot compute kqv-31: src0->ne[3] = 1, src1->ne[3] = 2 - fallback to CPU
evaluate_and_capture_cuda_graph: op not supported kqv-31 (MUL_MAT)
[...]\llama.cpp\ggml\src\ggml-cuda\ggml-cuda.cu:2660: GGML_ASSERT(ok) failed

I don't have a quant on hand that I can test without this branch, but this branch does give me a nice performance boost for TG at longer contexts; however, RPC to CUDA does not work.

saood06 avatar Feb 02 '25 14:02 saood06

What is the O() for memory and compute if we were to decompose existing non-MLA models using SVD?

Is this idea a waste of time or could there be any merit to using it to trade compute for memory use, etc?

If so then instead of having multiple architectures, perhaps it could be a more general option.

jukofyork avatar Feb 02 '25 14:02 jukofyork

@fairydreaming Is there any reason this should cause issues with RPC? Encountered:

ggml_cuda_compute_forward: cannot compute kqv-31: src0->ne[3] = 1, src1->ne[3] = 2 - fallback to CPU
evaluate_and_capture_cuda_graph: op not supported kqv-31 (MUL_MAT)
[...]\llama.cpp\ggml\src\ggml-cuda\ggml-cuda.cu:2660: GGML_ASSERT(ok) failed

I don't have a quant on hand that I can test without this branch, but this branch does give me a nice performance boost for TG at longer contexts; however, RPC to CUDA does not work.

@saood06 Sorry, I have no idea. I can run my code just fine with a single RTX 4090 and -ngl 1, so it may be an RPC-related problem, since direct CUDA usage works fine. Unfortunately I'm not familiar with llama.cpp RPC.

fairydreaming avatar Feb 02 '25 16:02 fairydreaming

ggml_cuda_compute_forward: cannot compute kqv-31: src0->ne[3] = 1, src1->ne[3] = 2 - fallback to CPU
evaluate_and_capture_cuda_graph: op not supported kqv-31 (MUL_MAT)
[...]\llama.cpp\ggml\src\ggml-cuda\ggml-cuda.cu:2660: GGML_ASSERT(ok) failed

This happens because the RPC backend does not forward the supports_op calls to the RPC server; it assumes that it can run all the operations and does not allow falling back to the CPU for unsupported ops. And this matrix multiplication is not supported by the CUDA backend.

slaren avatar Feb 02 '25 16:02 slaren

Sorry for what is likely a silly question, but does this have an impact on KV cache size when using full offload with CUDA? Because that would be very appealing at the moment given the current KV cache size without FA.

davidsyoung avatar Feb 02 '25 16:02 davidsyoung

Sorry for what is likely a silly question, but does this have an impact on KV cache size when using full offload with CUDA? Because that would be very appealing at the moment given the current KV cache size without FA.

I'm not sure. The old KV cache tensors are still there in the implementation and are still allocated, although not used in the computation graph at all.

fairydreaming avatar Feb 02 '25 16:02 fairydreaming

I was under the impression that MLA was specifically used because it uses way less RAM; ~500+ GB of RAM for a 160k context size is not really viable.

"The core idea behind MLA (Multi-Head latent attention) is the low-rank joint compression for key and value to reduce the KV cache."

Should I reserve time to look into it, or can I rely on this PR and your work, @fairydreaming? I'm in the process of building an inference cluster for our CZ hacker community, although it's on a tight budget using 2nd-hand HW. I'd rather outsource via a bounty (would 500 EUR work?); it should take us all less time if it gets attention from someone who already has all the context in their head :)

(huge respect for taking this on)

snajpa avatar Feb 02 '25 17:02 snajpa

I was under the impression that MLA was specifically used because it uses way less RAM; ~500+ GB of RAM for a 160k context size is not really viable.

"The core idea behind MLA (Multi-Head latent attention) is the low-rank joint compression for key and value to reduce the KV cache."

Should I reserve time to look into it, or can I rely on this PR and your work, @fairydreaming? I'm in the process of building an inference cluster for our CZ hacker community, although it's on a tight budget using 2nd-hand HW. I'd rather outsource via a bounty (would 500 EUR work?); it should take us all less time if it gets attention from someone who already has all the context in their head :)

(huge respect for taking this on)

@snajpa Yes, that's how it will work eventually. There is ongoing work in https://github.com/ggerganov/llama.cpp/pull/11213; when it's finished, it will be possible to implement custom KV caches for models. Implementing a custom KV cache for DeepSeek will reduce the KV cache memory usage to the expected amounts. Until this work is finished you can use my experimental hacky workarounds like allocating the KV cache with mmap: #11580 (unfortunately it's CPU-only). So the best course of action is to do nothing and be patient.

fairydreaming avatar Feb 02 '25 17:02 fairydreaming

ggml_cuda_compute_forward: cannot compute kqv-31: src0->ne[3] = 1, src1->ne[3] = 2 - fallback to CPU
evaluate_and_capture_cuda_graph: op not supported kqv-31 (MUL_MAT)
[...]\llama.cpp\ggml\src\ggml-cuda\ggml-cuda.cu:2660: GGML_ASSERT(ok) failed

This happens because the RPC backend does not forward the supports_op calls to the RPC server; it assumes that it can run all the operations and does not allow falling back to the CPU for unsupported ops. And this matrix multiplication is not supported by the CUDA backend.

@slaren Thanks for the hint. I think I see it now:

   2.sup]:           kr_cache-4 (   4K) [CUDA0    1.dst]               q_pe-4 (   8K) [CUDA0    2.sup]
   2.sup]:       kq_nope_perm-4 (   4K) [CUDA0   4.vsrc]         kq_pe_perm-4 (   4K) [CUDA0   4.vsrc]
   2.sup]:                 kq-4 (   4K) [CUDA0    2.sup]      CUDA0#KQ_mask#0 (   4K) [ NULL    4.cpy]
   2.sup]:     kv_cache_trans-4 (1022K) [CUDA0    1.dst] kq_soft_max_ext_perm (   4K) [CUDA0   4.vsrc]

ed_perm-4 (  64K)] 
  3.best]:         CPU#wv_b-4#0 (   1M) [ NULL    4.cpy] CPU#kqv_compressed_p (  64K) [ NULL    4.cpy]


  1.wgt0]: blk.4.attn_output.we (   4M) [CUDA0    1.dst]     CUDA0#kqv_2d-4#0 (  16K) [ NULL    4.cpy]
   2.sup]:            kqv_out-4 (  16K) [CUDA0   1.wgt0]              l_out-3 (  16K) [CUDA0    2.sup]
     usr]:            ffn_inp-4 (  16K) [CUDA0    2.sup]
  1.wgt1]:               norm-4 (  16K) [CUDA0      usr] blk.4.ffn_norm.weigh (   8K) [CUDA0    1.dst]
  1.wgt0]: blk.4.ffn_gate_inp.w ( 512K) [CUDA0    1.dst]           ffn_norm-4 (  16K) [CUDA0   1.wgt1]

This matmul is for example something like this:

ggml_debug: kqv-0 = (f32) MUL_MAT(wv_b-0{512, 128, 128, 1}, kqv_compressed_perm-0{512, 2, 128, 1}}) = {128, 2, 128, 1}

Could you save me some time and give some hints about how to best fix it?

  • wv_b is {latent_dim, v_head_dim, n_heads, 1},
  • kqv_compressed_perm is {latent_dim, 1, n_heads, n_tokens}

Edit: found it, it's because n_tokens is at ne[3].

fairydreaming avatar Feb 02 '25 19:02 fairydreaming

The problem is that the CUDA backend does not support broadcasting on the 4th dimension. The error says:

src0->ne[3] = 1, src1->ne[3] = 2

This does not match the dimensions that you mentioned, which implies that ne[3] is 1 for both src0 and src1:

  • wv_b is {latent_dim, v_head_dim, n_heads, 1},
  • kqv_compressed_perm is {latent_dim, n_tokens, n_heads, 1}

Anyhow, I don't think that it would be very hard to support this in the CUDA backend, it's probably that way simply because it was never needed before. cc @JohannesGaessler
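
For anyone unfamiliar with what "broadcasting on the 4th dimension" means here: it is the usual batched-matmul broadcast where one operand has a batch of 1 in the outermost axis and the other has a larger batch. A minimal numpy illustration (ggml's ne[] ordering is reversed relative to numpy shapes, so this is only an analogy, not the actual ggml tensors from the log):

import numpy as np

a = np.random.rand(1, 16, 128, 512)   # outermost batch of 1  (ggml: ne[3] = 1)
b = np.random.rand(2, 16, 512, 32)    # outermost batch of 2  (ggml: ne[3] = 2)

# numpy broadcasts a's size-1 leading axis over b's batch of 2;
# this is the kind of broadcast the CUDA MUL_MAT path did not support at the time.
c = a @ b
print(c.shape)                         # (2, 16, 128, 32)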

slaren avatar Feb 02 '25 19:02 slaren

The problem is that the CUDA backend does not support broadcasting on the 4th dimension. The error says:

src0->ne[3] = 1, src1->ne[3] = 2

This does not match the dimensions that you mentioned, which implies that ne[3] is 1 for both src0 and src1:

  • wv_b is {latent_dim, v_head_dim, n_heads, 1},
  • kqv_compressed_perm is {latent_dim, n_tokens, n_heads, 1}

Anyhow, I don't think that it would be very hard to support this in the CUDA backend, it's probably that way simply because it was never needed before. cc @JohannesGaessler

That's it, thanks!

fairydreaming avatar Feb 02 '25 19:02 fairydreaming

Anyhow, I don't think that it would be very hard to support this in the CUDA backend, it's probably that way simply because it was never needed before.

Yes, it should be fairly straightforward to add support for dimension 3 != 1 to CUDA, it just wasn't a priority. If someone writes a test case in test-backend-ops for me I can make the necessary changes (I don't yet know how much time I'll have next week but I should be able to get it done by the weekend at the latest).

JohannesGaessler avatar Feb 02 '25 19:02 JohannesGaessler

@fairydreaming Is there any reason this should cause issues with RPC? Encountered:

ggml_cuda_compute_forward: cannot compute kqv-31: src0->ne[3] = 1, src1->ne[3] = 2 - fallback to CPU
evaluate_and_capture_cuda_graph: op not supported kqv-31 (MUL_MAT)
[...]\llama.cpp\ggml\src\ggml-cuda\ggml-cuda.cu:2660: GGML_ASSERT(ok) failed

I don't have a quant on hand that I can test without this branch, but this branch does give me a nice performance boost for TG at longer contexts; however, RPC to CUDA does not work.

@saood06 Try these changes in src/llama.cpp:

diff --git a/src/llama.cpp b/src/llama.cpp
index 1a3d1d0b..cccd046e 100644
--- a/src/llama.cpp
+++ b/src/llama.cpp
@@ -6580,7 +6581,7 @@ struct llm_build_context {
                 cb(kqv_compressed, "kqv_compressed", il);
 
                 if (!pp_opt) {
-                    kqv_compressed = ggml_permute(ctx0, kqv_compressed, 0, 2, 3, 1);
+                    kqv_compressed = ggml_permute(ctx0, kqv_compressed, 0, 2, 1, 3);
                     cb(kqv_compressed, "kqv_compressed_perm", il);
                 }
 
@@ -6590,10 +6591,10 @@ struct llm_build_context {
                 struct ggml_tensor * kqv = ggml_mul_mat(ctx0, wv_b, kqv_compressed);
                 cb(kqv, "kqv", il);
 
-                if (pp_opt) {
+//                if (pp_opt) {
                     kqv = ggml_cont(ctx0, ggml_permute(ctx0, kqv, 0, 2, 1, 3));
                     cb(kqv, "kqv_perm", il);
-                }
+//                }
 
                 cur = ggml_view_2d(ctx0, kqv, n_embd_head_v*n_head, n_tokens, ggml_row_size(kqv->type, n_embd_head_v*n_head), 0);
                 cb(cur, "kqv_2d", il);

It will add one extra ggml_cont() and ggml_permute() during inference, so performance may be a little worse.

fairydreaming avatar Feb 02 '25 19:02 fairydreaming

If someone writes a test case in test-backend-ops for me I can make the necessary changes

Tests should already exist in the test cases with nr = {1,2}, such as this one:

MUL_MAT(type_a=f32,type_b=f32,m=16,n=1,k=256,bs=[1,1],nr=[1,2],per=[0,1,2,3]): not supported [CUDA0]

slaren avatar Feb 02 '25 19:02 slaren

@fairydreaming

Thank you for your excellent work. I'm getting almost the same performance (around 7 tokens/s) on a 9654 machine. I tried increasing --threads from 32 to 64 or higher, but there was virtually no speed improvement, sometimes even a slight decrease. Is there still room to gain by using more cores for inference, or is this strictly limited by memory bandwidth?

evshiron avatar Feb 03 '25 06:02 evshiron

@fairydreaming

Thank you for your excellent work. I'm getting almost the same performance (around 7 tokens/s) on a 9654 machine. I tried increasing --threads from 32 to 64 or higher, but there was virtually no speed improvement, sometimes even a slight decrease. Is there still room to gain by using more cores for inference, or is this strictly limited by memory bandwidth?

@evshiron One thing you can try is to set NUMA per Socket to NPS4 and enable ACPI SRAT L3 as NUMA domain in BIOS and run llama.cpp with --numa distribute, then try to find the optimal number of threads. (make sure to disable numa balancing first)

fairydreaming avatar Feb 03 '25 07:02 fairydreaming

Btw, there are 3 ggml_cont ops in the implementation on master that are only there because of CUDA not supporting non-contiguous norm and rope (the latter should be fixed, but it's not tested). We should probably first improve the baseline by avoiding these conts.

ggerganov avatar Feb 03 '25 13:02 ggerganov

@evshiron One thing you can try is to set NUMA per Socket to NPS4 and enable ACPI SRAT L3 as NUMA domain in BIOS and run llama.cpp with --numa distribute, then try to find the optimal number of threads. (make sure to disable numa balancing first)

@fairydreaming

Thanks for your reply. Unfortunately, I only have remote access to the machine.

@fairydreaming do you have a converted model available or instructions for replicating your setup? I would like to run some benchmarks on these changes.

@wronkiew

So I made a converted Q3_K_M model for DeepSeek R1 and it worked just fine. Here are the links if you are interested:

  • https://huggingface.co/daydream-org/DeepSeek-R1-GGUF-11446 (Q3_K_M for 360GB+ RAM)
  • https://huggingface.co/daydream-org/DeepSeek-V2.5-GGUF-11446 (Q4_K_M for 180GB+ RAM)

I came up with a change to convert_hf_to_gguf.py that allows conversion from fp8 HF checkpoints to GGUF without the need to restore an intermediate bf16 HF checkpoint, which should be helpful if storage is limited.

EDIT: With triton-cpu it can dequantize without a GPU. The missing parts 26 and 28 have been re-uploaded.

EDIT2: A Q4_K_M model for DeepSeek-V2.5-1210 has been uploaded for testing. As TG is bandwidth-bound, using smaller quants or activating fewer parameters should improve performance.
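
For a rough sense of that bandwidth bound (a back-of-the-envelope estimate only; the activated-parameter count, bits per weight, and memory bandwidth below are assumptions, not measurements):

active_params = 37e9        # DeepSeek R1 activated parameters per token (assumed)
bits_per_w    = 4.5         # rough average for a Q4_K-class quant (assumed)
bandwidth     = 460.8e9     # theoretical peak of 12-channel DDR5-4800 on an EPYC 9654, B/s (assumed)

bytes_per_token = active_params * bits_per_w / 8
print(bandwidth / bytes_per_token)    # ~22 t/s upper bound; measured TG lands well below it

Since every generated token has to stream the activated weights from memory at least once, smaller quants or fewer activated parameters lower bytes_per_token directly, which is why they help TG.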

evshiron avatar Feb 03 '25 13:02 evshiron