Feature Request: DeepSeek V3.2-Exp support

Open createthis opened this issue 2 months ago • 54 comments

Prerequisites

  • [x] I am running the latest code. Mention the version if possible as well.
  • [x] I carefully followed the README.md.
  • [x] I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • [x] I reviewed the Discussions, and have a new and useful enhancement to share.

Feature Description

Support DeepSeek V3.2-Exp: https://huggingface.co/deepseek-ai/DeepSeek-V3.2-Exp

Motivation

Because we can.

Possible Implementation

VLLM code is here for reference: https://github.com/vllm-project/vllm/pull/25896

createthis avatar Sep 29 '25 12:09 createthis

I am not an expert on LLMs. I fed flashmla_sparse.py to Qwen-Max and it shows:

[image]

lovedheart avatar Sep 29 '25 13:09 lovedheart

It's really just the sparse attention that's going to be a problem:

[image]

(Ignore the scale_inv as they are for the quantisation)

jukofyork avatar Sep 29 '25 13:09 jukofyork

This is the important code in VLLM:

https://github.com/vllm-project/vllm/blob/88ef733ff5125b1b29e7aaae13063b2a109d5e8f/vllm/model_executor/models/deepseek_v2.py#L768

https://github.com/vllm-project/vllm/blob/88ef733ff5125b1b29e7aaae13063b2a109d5e8f/vllm/model_executor/models/deepseek_v2.py#L584

jukofyork avatar Sep 29 '25 14:09 jukofyork

https://github.com/deepseek-ai/DeepSeek-V3.2-Exp/blob/main/DeepSeek_V3_2.pdf

[image]

So it looks like it will need an extra cache to store a 128-element k_s value for each token at each layer, but other than this the actual calculation is fairly straightforward (assuming GGML has the needed primitive operation for the sparse multiplication).
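
A minimal NumPy sketch of that idea (names, shapes, and cache layout are mine and purely illustrative, and it handles a single layer only): cache one 128-element indexer key per token, then score the current query token against every cached key with the weighted-ReLU sum.

```python
import numpy as np

D_IDX = 128  # one 128-element indexer key k_s per cached token (per layer)

class IndexerKeyCache:
    """Toy append-only cache of per-token indexer keys for a single layer."""
    def __init__(self, max_tokens: int):
        self.keys = np.zeros((max_tokens, D_IDX), dtype=np.float32)
        self.n_tokens = 0

    def append(self, k_s: np.ndarray):
        self.keys[self.n_tokens] = k_s             # cache the newest token's indexer key
        self.n_tokens += 1

    def index_scores(self, q_heads: np.ndarray, w_heads: np.ndarray) -> np.ndarray:
        """Score the current query token against every cached token: sum_j w_j * ReLU(q_j . k_s)."""
        dots = self.keys[:self.n_tokens] @ q_heads.T   # (n_tokens, n_indexer_heads)
        return np.maximum(dots, 0.0) @ w_heads         # (n_tokens,)

# usage: append one key per decoded token, then keep only the top-k scored tokens
rng = np.random.default_rng(0)
cache = IndexerKeyCache(max_tokens=4096)
for _ in range(10):
    cache.append(rng.normal(size=D_IDX).astype(np.float32))
scores = cache.index_scores(rng.normal(size=(4, D_IDX)).astype(np.float32),
                            rng.normal(size=4).astype(np.float32))
top_idx = np.argsort(scores)[-4:]                      # tokens sparse attention would keep
```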

jukofyork avatar Sep 29 '25 14:09 jukofyork

@slaren is ggml_mul_mat_id() likely to work for this:

  • Much larger k (2048) and n (~128k) than the MoE selection logic uses.
  • It's going to be multiplying a lot of 1xm tensors (specifically lots of 512- or 576-element vectors) instead of the nxm tensors the MoE logic uses.

jukofyork avatar Sep 29 '25 14:09 jukofyork

Alternatively, we could just mask out the non-top-k tokens with zeros and do a normal ggml_mul_mat, but this seems a bit crap and doesn't really get any of the gains in speed from the sparsity...
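
For what it's worth, here's a small NumPy illustration (not ggml code) of why the two routes give the same answer; the masked route still pays for the full n_ctx × d_v multiply, while the gather route only touches the k selected rows:

```python
import numpy as np

rng = np.random.default_rng(0)
n_ctx, top_k, d_v = 4096, 64, 512

scores = rng.normal(size=n_ctx).astype(np.float32)     # per-token attention scores for one query
values = rng.normal(size=(n_ctx, d_v)).astype(np.float32)

idx = np.argpartition(scores, -top_k)[-top_k:]         # indices of the top-k tokens

# (a) masked route: zero out non-top-k scores, then a normal (n_ctx x d_v) matmul
masked = np.zeros_like(scores)
masked[idx] = scores[idx]
out_masked = masked @ values

# (b) sparse route: gather the k selected rows first, then a much smaller (k x d_v) matmul
out_sparse = scores[idx] @ values[idx]

assert np.allclose(out_masked, out_sparse, atol=1e-4)
```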

jukofyork avatar Sep 29 '25 14:09 jukofyork

I think it is very likely that it will at least be faster than a masked matrix multiplication, but ultimately it will depend on how well optimized the backends are for this case. You should be able to test the performance by using test-backend-ops in perf mode, and adding the relevant test cases to make_test_cases_perf.

slaren avatar Sep 29 '25 22:09 slaren

I've uploaded the bf16 here, if anyone needs it: https://huggingface.co/createthis/DeepSeek-V3.2-Exp-bf16/tree/main

createthis avatar Sep 30 '25 23:09 createthis

(/data/conda-envs/llama_cpp_cu128) jesse@larry:~/llama.cpp$ python     convert_hf_to_gguf.py     --outtype f16     --split-max-size 50G     --outfile /data2/DeepSeek-V3.2-Exp-GGUF/bf16     /data2/DeepSeek-V3.2-Exp-bf16/
INFO:hf-to-gguf:Loading model: DeepSeek-V3.2-Exp-bf16
WARNING:hf-to-gguf:Failed to load model config from /data2/DeepSeek-V3.2-Exp-bf16: The checkpoint you are trying to load has model type `deepseek_v32` but Transformers does not recognize this architecture. This could be because of an issue with the checkpoint, or because your version of Transformers is out of date.
WARNING:hf-to-gguf:Trying to load config.json instead
INFO:hf-to-gguf:Model architecture: DeepseekV32ForCausalLM
ERROR:hf-to-gguf:Model DeepseekV32ForCausalLM is not supported
(/data/conda-envs/llama_cpp_cu128) jesse@larry:~/llama.cpp$

Looks like we're blocked on https://github.com/huggingface/transformers/issues/41196 upstream.

It can be hacked around with:

pip install git+https://github.com/hnyls2002/transformers@2a8a2a9

Then we're back to just needing to write code, it seems, so I started a branch:

https://github.com/createthis/llama.cpp/pull/9/files

(/data/conda-envs/llama_cpp_cu128) jesse@larry:~/llama.cpp$ python     convert_hf_to_gguf.py     --outtype f16     --split-max-size 50G     --outfile /data2/DeepSeek-V3.2-Exp-GGUF/bf16     /data2/DeepSeek-V3.2-Exp-bf16/
INFO:hf-to-gguf:Loading model: DeepSeek-V3.2-Exp-bf16
INFO:hf-to-gguf:Model architecture: DeepseekV32ForCausalLM
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:gguf: loading model weight map from 'model.safetensors.index.json'
INFO:hf-to-gguf:gguf: loading model part 'model-00001-of-000163.safetensors'
INFO:hf-to-gguf:token_embd.weight,            torch.bfloat16 --> F16, shape = {7168, 129280}
INFO:hf-to-gguf:blk.0.attn_norm.weight,       torch.float32 --> F32, shape = {7168}
INFO:hf-to-gguf:blk.0.ffn_down.weight,        torch.bfloat16 --> F16, shape = {18432, 7168}
INFO:hf-to-gguf:blk.0.ffn_gate.weight,        torch.bfloat16 --> F16, shape = {7168, 18432}
INFO:hf-to-gguf:blk.0.ffn_up.weight,          torch.bfloat16 --> F16, shape = {7168, 18432}
INFO:hf-to-gguf:blk.0.ffn_norm.weight,        torch.float32 --> F32, shape = {7168}
Traceback (most recent call last):
  File "/home/jesse/llama.cpp/convert_hf_to_gguf.py", line 9366, in <module>
    main()
  File "/home/jesse/llama.cpp/convert_hf_to_gguf.py", line 9360, in main
    model_instance.write()
  File "/home/jesse/llama.cpp/convert_hf_to_gguf.py", line 429, in write
    self.prepare_tensors()
  File "/home/jesse/llama.cpp/convert_hf_to_gguf.py", line 6687, in prepare_tensors
    super().prepare_tensors()
  File "/home/jesse/llama.cpp/convert_hf_to_gguf.py", line 300, in prepare_tensors
    for new_name, data_torch in (self.modify_tensors(data_torch, name, bid)):
                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jesse/llama.cpp/convert_hf_to_gguf.py", line 6684, in modify_tensors
    return [(self.map_tensor_name(name), data_torch)]
             ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jesse/llama.cpp/convert_hf_to_gguf.py", line 259, in map_tensor_name
    raise ValueError(f"Can not map tensor {name!r}")
ValueError: Can not map tensor 'model.layers.0.self_attn.indexer.k_norm.bias'
(/data/conda-envs/llama_cpp_cu128) jesse@larry:~/llama.cpp$

I'll work on this in my spare time, if I have time and no one more experienced steps in.

createthis avatar Oct 01 '25 00:10 createthis

Quick update ... I woke up early this morning and did my best to figure out the tensor mappings. I've updated my branch: https://github.com/createthis/llama.cpp/pull/9/files

Using this command the GGUF bf16 generates successfully:

python convert_hf_to_gguf.py \
  --outtype f16 \
  --split-max-size 50G \
  --outfile /data2/DeepSeek-V3.2-Exp-GGUF/bf16 \
  /data2/DeepSeek-V3.2-Exp-bf16/
Shard (30/30): 100%|█████████████████████████████████████████████████████████████████████████████| 7.52G/7.52G [00:16<00:00, 451Mbyte/s]
Writing: 100%|███████████████████████████████████████████████████████████████████████████████████| 1.34T/1.34T [49:23<00:00, 454Mbyte/s]
INFO:hf-to-gguf:Model successfully exported to /data2/DeepSeek-V3.2-Exp-GGUF/bf16/

This was monkey work, but I'm not 100% sure I named everything correctly. Can someone double check before we move forward with the hard part? Thanks!

createthis avatar Oct 02 '25 13:10 createthis

Checking https://huggingface.co/deepseek-ai/DeepSeek-V3.2-Exp there's some C++ kernel code which might help more than the Python: https://github.com/deepseek-ai/DeepGEMM/pull/200/commits/7c95b14aa4a66edd7b682e5acdde62351ca81197 https://github.com/deepseek-ai/FlashMLA/pull/98/commits/c28eca99dbc664dd2716415ed03492afe5fefade

mattepiu avatar Oct 03 '25 06:10 mattepiu

> Quick update ... I woke up early this morning and did my best to figure out the tensor mappings. I've updated my branch: https://github.com/createthis/llama.cpp/pull/9/files
>
> Using this command the GGUF bf16 generates successfully:
>
> python convert_hf_to_gguf.py \
>   --outtype f16 \
>   --split-max-size 50G \
>   --outfile /data2/DeepSeek-V3.2-Exp-GGUF/bf16 \
>   /data2/DeepSeek-V3.2-Exp-bf16/
>
> Shard (30/30): 100%|█████████████████████████████████████████████████████████████████████████████| 7.52G/7.52G [00:16<00:00, 451Mbyte/s]
> Writing: 100%|███████████████████████████████████████████████████████████████████████████████████| 1.34T/1.34T [49:23<00:00, 454Mbyte/s]
> INFO:hf-to-gguf:Model successfully exported to /data2/DeepSeek-V3.2-Exp-GGUF/bf16/
>
> This was monkey work, but I'm not 100% sure I named everything correctly. Can someone double check before we move forward with the hard part? Thanks!

I'm downloading it now, but it will be a couple of days at 4 MB/s.

jukofyork avatar Oct 03 '25 10:10 jukofyork

FYI: Aider polyglot benchmarks indicate the VLLM code is underperforming vs deepseek's own stack. See https://discord.com/channels/1131200896827654144/1422153923493498902/1423627025062690816 for more info. We should be careful when cargo culting VLLM's code and see if we can identify the bug.

createthis avatar Oct 03 '25 13:10 createthis

With the last two commits I've raised the tensor count error from expected 1391, got 1086 to expected 1391, got 1330. Seems like progress. Still missing something though. I deleted the GGUF I previously posted because I had to regenerate it after the last commit, which makes me think I'll have to do that again at least one more time.

Also, kinda funny... the last two commits have been authored by DeepSeek-V3.1-Terminus running under Open Hands, because I don't really know what I'm looking for at this point, so I'm just giving it the information I have and letting it go to work. I'm impressed it was able to make progress. The last couple of times I tried this method it didn't work.

createthis avatar Oct 06 '25 14:10 createthis

Haha. Wow. As of bb1f4e4 it is inferring on my system locally. Performance is terrible, and the last three or four commits were entirely authored by DeepSeek V3.1-Terminus, so I need to audit the snot out of them because I really don't understand what they're doing. 😂 But it's inferring.

I'll try to upload a GGUF. HF was refusing my uploads last night due to too much data being used, but I deleted the last Q4_K_M, so maybe they'll let me re-upload it.

createthis avatar Oct 06 '25 17:10 createthis

Here's the Q4_K_M GGUF: https://huggingface.co/createthis/DeepSeek-V3.2-Exp-GGUF/tree/main/q4_k_m

EDIT: I tried to upload the bf16 GGUF again, but I was hit with this gem:

Your storage patterns tripped our internal systems! Please contact us at [email protected] so we can verify your account and unlock more storage for your use-case.

EDIT2: They just got back to me. Apparently I've reached my free public storage limit and I have to upgrade to a pro account if I want to upload more.

createthis avatar Oct 06 '25 17:10 createthis

I deleted some old models. Here's the bf16 GGUF: https://huggingface.co/createthis/DeepSeek-V3.2-Exp-GGUF/tree/main/bf16

createthis avatar Oct 07 '25 12:10 createthis

I turned off the logging statement. Performance is actually just as good as V3.1-Terminus on my machine:

echo 0 | sudo tee /proc/sys/kernel/numa_balancing
echo 3 | sudo tee /proc/sys/vm/drop_caches

./build/bin/llama-server \
    --model  /data2/DeepSeek-V3.2-Exp-GGUF/q4_k_m/DeepSeek-V3.2-Exp-Q4_K_M-00001-of-00009.gguf \
    --alias DeepSeek-V3.2-Exp:671b-q4_k_m \
    --no-webui \
    --numa numactl \
    --threads 32 \
    --ctx-size 163840 \
    --n-gpu-layers 62 \
    -ot "blk\.(3|4|5|6|7|8|9|10)\.ffn_.*=CUDA0" \
    -ot exps=CPU \
    -ub 4096 -b 4096 \
    --seed 3407 \
    --temp 0.6 \
    --top-p 0.95 \
    --min-p 0.1 \
    --log-colors on \
    --flash-attn on \
    --host 0.0.0.0 \
    --prio 2 \
    --jinja \
    --port 11434
slot      release: id  0 | task 1688 | stop processing: n_past = 2082, truncated = 0
slot print_timing: id  0 | task 1688 | 
prompt eval time =   11158.46 ms /   918 tokens (   12.16 ms per token,    82.27 tokens per second)
       eval time =   49394.41 ms /  1164 tokens (   42.44 ms per token,    23.57 tokens per second)
      total time =   60552.87 ms /  2082 tokens
srv  update_slots: all slots are idle
srv  log_server_r: request: POST /v1/chat/completions 192.168.0.201 200
srv  log_server_r: request: GET /v1/models 192.168.0.201 200
srv  params_from_: Chat format: DeepSeek V3.1
slot get_availabl: id  0 | task 1688 | selected slot by lcs similarity, lcs_len = 918, similarity = 0.441 (> 0.100 thold)
slot launch_slot_: id  0 | task 2853 | processing task
slot update_slots: id  0 | task 2853 | new prompt, n_ctx_slot = 163840, n_keep = 0, n_prompt_tokens = 1050
slot update_slots: id  0 | task 2853 | kv cache rm [918, end)
slot update_slots: id  0 | task 2853 | prompt processing progress, n_past = 1050, n_tokens = 132, progress = 0.125714
slot update_slots: id  0 | task 2853 | prompt done, n_past = 1050, n_tokens = 132

I think at this point we're not using the fancy architectural features for max performance. We're just using the same patterns we've always used. I'm going to use V3.2-Exp to audit its own inference code at this point, then once I'm satisfied with it, I'll look into implementing the fancier stuff.

createthis avatar Oct 07 '25 12:10 createthis

> I think at this point we're not using the fancy architectural features for max performance. We're just using the same patterns we've always used. I'm going to use V3.2-Exp to audit its own inference code at this point, then once I'm satisfied with it, I'll look into implementing the fancier stuff.

It will probably work somewhat (at least at low context lengths) as the paper said it was trained from 3.1-terminus using around 1T tokens to learn the top-k stuff, but it won't work 100% properly as the sum going into the attention calculation will be larger than the model expects (ie: the values at the non-top-k indices should add zero into the sum). I suspect you'll start to see it break down more and more as you increase the context length and these sums become larger and larger.
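
To illustrate that intuition with a toy NumPy example (purely synthetic logits, not the model's real numbers): with softmax attention, every non-top-k logit still adds mass to the denominator, so the dense result drifts further from the top-k-only result as the context grows.

```python
import numpy as np

rng = np.random.default_rng(0)
top_k = 2048

def drift(n_ctx):
    logits = rng.normal(size=n_ctx)
    idx = np.argpartition(logits, -top_k)[-top_k:]
    # dense softmax over everything vs softmax restricted to the top-k tokens
    dense = np.exp(logits) / np.exp(logits).sum()
    sparse = np.zeros_like(logits)
    sparse[idx] = np.exp(logits[idx]) / np.exp(logits[idx]).sum()
    return np.abs(dense - sparse).sum()  # L1 gap between the two attention distributions

for n_ctx in (4096, 16384, 65536):
    print(n_ctx, round(drift(n_ctx), 3))  # the gap keeps growing with context length
```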

jukofyork avatar Oct 07 '25 14:10 jukofyork

> It will probably work somewhat (at least at low context lengths) as the paper said it was trained from 3.1-terminus using around 1T tokens to learn the top-k stuff, but it won't work 100% properly as the sum going into the attention calculation will be larger than the model expects (ie: the values at the non-top-k indices should add zero into the sum). I suspect you'll start to see it break down more and more as you increase the context length and these sums become larger and larger.

I could certainly test that using LongBench 2, or by running the Aider Polyglot, but I feel like it would be a waste of time. We have DS V3.1-Terminus working perfectly right now. The only value V3.2-Exp has at the moment, IMO, is its enhanced performance. I'm not going to open a PR against master until I feel like we've achieved at least partial performance gains. I just think it's awesome that I got it inferring at all as this project reaches pretty far outside my comfort zone.

My current plan is to audit the code that has been written, then simplify it and make any corrections once I fully understand it. Then I'll move on to the next step of implementing the performance gains.

Anyone who wants to give it a go in the meantime is welcome to fork my branch. I'm thrilled with the progress I've made so far but if anyone else can finish it up faster it won't hurt my feelings one bit.

createthis avatar Oct 07 '25 16:10 createthis

It looks like the chat template has changed since V3.1. Here's the new chat template in human readable format for reference: https://gist.github.com/createthis/1467af177656a034f14ae01053a714f9

createthis avatar Oct 07 '25 23:10 createthis

I had to switch back to 3.1-Terminus for agentic work because 3.2-Exp is falling into degenerate generation around 45k context with this code.

createthis avatar Oct 08 '25 02:10 createthis

I haven't been able to look at this all week due to paid work obligations. However, my weekend has recently cleared, so I'm going to dedicate that time to this project.

It looks like GGML is the backend llama.cpp uses. I think we're going to need some GGML changes to support sparse attention. GGML is very low level, so this is intimidating, but I have all weekend and I'm excited to learn.

createthis avatar Oct 11 '25 12:10 createthis

Just in case anyone else is super dense like me and needs to ask questions, here's that Architecture section from the PDF as markdown, so you can copy and paste it and have a conversation about it. NOTE: I audited this for accuracy several times, but I still may have missed something.

**1. Architecture**

Compared with DeepSeek-V3.1-Terminus, the last version of DeepSeek-V3.1, the only architectural modification of DeepSeek-V3.2-Exp is the introduction of **DeepSeek Sparse Attention (DSA)** through continued training.

**Prototype of DSA.** The prototype of DSA primarily consists of two components: a lightning indexer and a fine-grained token selection mechanism.

The **lightning indexer** computes an index score $I_{t,s}$ between the query token $\mathbf{h}_t\in\mathbb{R}^d$ and a preceding token $\mathbf{h}_s\in\mathbb{R}^d$, determining which tokens to be selected by the query token:

$$
I_{t,s}=\sum_{j=1}^{H^{I}} w^{I}_{t,j}\cdot\mathrm{ReLU}\left(\mathbf{q}^{I}_{t,j}\cdot \mathbf{k}^{I}_{s}\right), \tag{1}
$$

where $H^{I}$ denotes the number of indexer heads; $\mathbf{q}^{I}_{t,j}\in\mathbb{R}^{d^{I}}$ and $w^{I}_{t,j}\in\mathbb{R}$ are derived from the query token $\mathbf{h}_t$; and $\mathbf{k}^{I}_{s}\in\mathbb{R}^{d^{I}}$ is derived from the preceding token $\mathbf{h}_s$. We choose ReLU as the activation function for throughput consideration. Given that the lightning indexer has a small number of heads and can be implemented in FP8, its computational efficiency is remarkable.

Given the index scores $\{I_{t,s}\}$ for each query token $\mathbf{h}_t$, our **fine-grained token selection mechanism** retrieves only the key–value entries $\{\mathbf{c}_s\}$ corresponding to the top-$k$ index scores. Then, the attention output $\mathbf{u}_t$ is computed by applying the attention mechanism between the query token $\mathbf{h}_t$ and the sparsely selected key-value entries $\{\mathbf{c}_s\}$:

$$
\mathbf{u}_t=\mathrm{Attn}\left(\mathbf{h}_t,\{\mathbf{c}_s\mid I_{t,s}\in \mathrm{Top}\text{-}k(I_{t,:})\}\right). \tag{2}
$$
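
As a sanity check, here is a minimal NumPy sketch of equations (1) and (2). The single-head softmax used for $\mathrm{Attn}(\cdot)$ is a simplification just to keep it short, and all shapes and names are illustrative assumptions, not llama.cpp code.

```python
import numpy as np

def lightning_indexer(q_idx, w_idx, k_idx):
    """Eq (1): I[t, s] = sum_j w[t, j] * ReLU(q[t, j] . k[s]).
    q_idx: (T, H_I, d_I)  indexer queries per token and indexer head
    w_idx: (T, H_I)       per-head weights derived from the query token
    k_idx: (T, d_I)       one indexer key per (preceding) token
    """
    dots = np.einsum('thd,sd->ths', q_idx, k_idx)         # q^I_{t,j} . k^I_s
    return np.einsum('th,ths->ts', w_idx, np.maximum(dots, 0.0))

def dsa_attention(h_q, scores, keys, values, top_k):
    """Eq (2): attend only over the top-k scored (causal) positions for each query token."""
    T = h_q.shape[0]
    out = np.zeros_like(values[:T])
    for t in range(T):
        causal = scores[t, :t + 1]                        # only preceding tokens (and itself)
        k = min(top_k, causal.shape[0])
        sel = np.argpartition(causal, -k)[-k:]            # Top-k(I_{t,:})
        logits = keys[sel] @ h_q[t] / np.sqrt(h_q.shape[1])
        w = np.exp(logits - logits.max()); w /= w.sum()   # softmax over the selected entries only
        out[t] = w @ values[sel]
    return out

# toy shapes, random data
rng = np.random.default_rng(0)
T, d, d_I, H_I, top_k = 16, 64, 32, 4, 8
h = rng.normal(size=(T, d))
I = lightning_indexer(rng.normal(size=(T, H_I, d_I)), rng.normal(size=(T, H_I)), rng.normal(size=(T, d_I)))
u = dsa_attention(h, I, keys=h, values=h, top_k=top_k)
print(u.shape)  # (16, 64)
```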

createthis avatar Oct 11 '25 19:10 createthis

Some cheat sheet explanations for those of us who didn't get a Ph.D. in machine learning. I have a background in 3D programming and my math education included basic calculus, so these explanations are tailored to me (a quick NumPy check of the worked examples follows the list):

  • A dot product measures how aligned two vectors are. For example, if you have vectors like (3,4) and (1,2), you multiply corresponding parts $3\cdot{1} + 4\cdot{2} = 11$. A higher result means more similarity.
  • $ReLU$ is an activation function that outputs the input if it's positive, otherwise zero. For instance, $ReLU(5) = 5$, but $ReLU(-3) = 0$. In DSA, it filters out negative similarities, focusing only on relevant token alignments.
  • $\in$ means "is an element of" or "belongs to." For instance, in 3D, a vector like $(x,y,z) \in \mathbb{R}^3$ indicates it's part of the set of all possible 3D vectors. So, $\mathbf{h}_t \in \mathbb{R}^d$ simply means $\mathbf{h}_t$ is a vector in a d-dimensional space.
  • $\mathbf{h}_t \in \mathbb{R}^d$ means that $\mathbf{h}_t$ is a vector with $d$ real-number components, representing the embedding of token $t$. In 3D terms, it's like a point in space but with $d$ dimensions instead of 3. This vector captures the token's features for attention calculations.
  • $\mathbf{q}^{I}_{t,j}\in\mathbb{R}^{d^{I}}$ $\mathbf{q}^{I}_{t,j}$ is the query vector for the indexer, specific to token $t$ and head $j$. $\in \mathbb{R}^{d^{I}}$ means it's a vector with $d^{I}$ real-number components, where $d^{I}$ is the dimension used for the indexer's computations. This vector helps compute similarity scores in the attention mechanism. $d^{I}$ is a separate dimension used for the indexer's vectors, not a subset of $\mathbb{R}^d$. It's typically smaller than $d$ to improve efficiency. The indexer projects $\mathbf{h}_t$ into $\mathbb{R}^{d^{I}}$ for faster similarity calculations.
  • $w^{I}_{t,j} \in \mathbb{R}$ is a scalar weight for the indexer, specific to token $t$ and head $j$. It scales the dot product in Equation (1).
  • $\sum$ means summation—it adds up the terms for $j$ from 1 to $H^I$. In Equation (1), it accumulates the weighted $ReLU$ outputs across all indexer heads. It's standard addition, just applied repeatedly over a range.
  • $\{ \}$ indicate a set—here, $\{I_{t,s}\}$ refers to the collection of all index scores for query token $t$ compared to every preceding token $s$. Think of it as a list of relevance scores.
  • Top-k means selecting the k highest values from a set. In DSA, it chooses the top k index scores $I_{t,s}$ to retrieve only the most relevant key-value pairs, reducing computation. For example, if token $t$ has index scores [0.8, 0.3, 0.9, 0.5] for previous tokens, and k=2, it chooses tokens with scores 0.9 and 0.8 for attention, ignoring others. This sparsity boosts efficiency.
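
A quick NumPy check of the worked numbers above (nothing model-specific, just the arithmetic):

```python
import numpy as np

print(np.dot([3, 4], [1, 2]))            # 11: higher means the vectors are more aligned
print(np.maximum([5, -3], 0))            # [5 0]: ReLU keeps positives, zeroes negatives
scores = np.array([0.8, 0.3, 0.9, 0.5])
print(np.argsort(scores)[-2:][::-1])     # [2 0]: the top-2 tokens, with scores 0.9 and 0.8
```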

And some Q and A:

  • Q. In $\mathbf{h}_t\in\mathbb{R}^d$ and $\mathbf{h}_s\in\mathbb{R}^d$, do the subscripts t and s have significant meaning? A. Yes, t denotes the current token position, and s denotes a previous token position. In attention, $\mathbf{h}_t$ (query) is compared with $\mathbf{h}_s$ (key) to compute relevance scores.

  • Q. Can you break down each math equation for me? A. Equation (1): $I_{t,s}$ is a score showing how relevant a past token is to the current one. It's calculated by summing up weighted similarities (dot products) between vectors, after applying $ReLU$ (which ignores negative values), across multiple "heads" for better accuracy.

    Equation (2): $\mathbf{u}_t$ is the output, computed by focusing only on the top-$k$ tokens with the highest scores from (1), making attention sparse and efficient.

createthis avatar Oct 11 '25 21:10 createthis

tilelang has a deepseek_v32 example folder: https://github.com/tile-ai/tilelang/tree/main/examples/deepseek_v32

It includes:

  • FP8 lightning indexer: https://github.com/tile-ai/tilelang/blob/main/examples/deepseek_v32/fp8_lighting_indexer.py
  • top-k selector: https://github.com/tile-ai/tilelang/blob/main/examples/deepseek_v32/topk_selector.py
  • MQA examples (see README)

createthis avatar Oct 11 '25 23:10 createthis

Here is an analysis by DeepSeek V3.1-Terminus of the fp8 lightning indexer example, when given the DSA PDF context: https://gist.github.com/createthis/0cce8a250daa3a117cb2986c743c02f2

And topk_selector.py: https://gist.github.com/createthis/69417474e24ca7a8096ce5a08227ab0c

This is helpful (to me) because it points out which parts of the equation are being implemented by which parts of the code. Also, I'm not a tilelang guy, so it explains the tilelang specific things, like decorators.

createthis avatar Oct 12 '25 14:10 createthis

Quick update:

Over the weekend I transitioned from focusing on cloning the VLLM functionality to cloning the tilelang functionality. I feel like tilelang's examples are just better organized for educational purposes.

I had DeepSeek-V3.1-Terminus move the sparse attention code out into separate files:

  • src/llama-sparse-attn.cpp
  • src/llama-sparse-attn.h

I may have it further break the code down into the fp8 indexer and the topk selector like the tilelang code.

I also had it add a unit test as I got tired of loading the actual model and wanted a faster dev/debug cycle:

  • tests/test-sparse-attn.cpp

This test is supposed to be based on https://github.com/tile-ai/tilelang/blob/main/examples/deepseek_v32/test_tilelang_example_deepseek_v32.py but:

  1. I can't run that file locally because my blackwell 6000 pro is unsupported in tilelang - see https://github.com/tile-ai/tilelang/issues/370#issuecomment-3395136162
  2. I have not audited it for correctness. It is vibe coded and as such it is probably super super incorrect.

This code compiles, but the unit test core dumps.

I've read through the tilelang fp8 indexer and the topk selector code. I don't entirely understand it from an algorithmic perspective, but I do understand the individual operations, so it's probably within my ability to grasp with enough effort.

I have not read through the vibe coded llama.cpp sparse attention code yet and I suspect it is total junk.

I have to work on paid work today, but here are my next steps, just thinking out loud and as a reminder to myself in case it takes me a while to task shift back to this (in no particular order):

  • ~~See if tilelang will compile and run on my macbook pro using a conda env. I'd like to be able to run the official test case and poke around in the code.~~ UPDATE: It won't. It needs Triton, which doesn't support MacOS.
  • ~~See if I can get llama.cpp to compile on my macbook pro so that I can run the unit test locally. This would speed up my agentic dev cycle a bit as I could have Terminus just build the code and re-run the test itself without me in the loop.~~ I did manage to get it running on my Mac and in my Open Hands docker container. Interesting.
  • ~~Have Terminus break src/llama-sparse-attn.cpp down into the fp8 indexer and topk selector as individual files. This will probably reduce Terminus's hallucinations.~~
  • ~~Have Terminus output the llama.cpp sparse attention code with explanation commentary so that I can start auditing the official examples vs the vibe coded junk Terminus is creating and hopefully start getting the code to a point of sanity.~~ Just need to read it now.
  • Do a top-k leetcode example or three and watch some youtube top-k videos. I don't entirely understand the algorithm yet.

I'm having fun and learning a lot, but this is a ton of knowledge to consume and I don't think I'm going to be done anytime soon at my current burn rate. If anyone with more experience wants to step in and take over, it won't hurt my feelings. Until then, I'll just keep working on this in my spare time. The good news is that I should have more free time this week, so hopefully the pace will accelerate as I work through the backlog of context I need to be effective. We'll see.

createthis avatar Oct 13 '25 14:10 createthis

New File Structure:

  1. llama-sparse-indexer.cpp/h - Lightning Indexer (FP8 computation)
    • Corresponds to fp8_lighting_indexer.py
    • Implements: $I_{t,s}=\sum_{j=1}^{H^{I}} w^{I}_{t,j}\cdot\mathrm{ReLU}\left(\mathbf{q}^{I}_{t,j}\cdot \mathbf{k}^{I}_{s}\right)$
  2. llama-sparse-topk.cpp/h - Top-k Selector
    • Corresponds to topk_selector.py
    • Selects the most important tokens for sparse attention
  3. llama-sparse-mla-fwd.cpp/h - Sparse Multi-Query Attention Forward
    • Corresponds to sparse_mla_fwd.py
    • Applies attention only to the top-k selected tokens

createthis avatar Oct 13 '25 16:10 createthis

Now that I have a proper closed loop agentic workflow, I've had DS V3.1-Terminus running up my electricity bill all day getting the unit tests passing.

I nudged it to switch from mocking llama_model to instead use a real llama_model and programmatically populate it. That worked. Unit tests are passing.

Next, I instructed it to audit the code:

Summary

I have successfully completed the audit of the DeepSeek V3.2-Exp sparse attention implementation in llama.cpp against the tilelang reference implementations. Here's what I found and fixed:

Key Findings from the Audit:

  1. Sparse Indexer Implementation (llama-sparse-indexer.cpp vs fp8_lighting_indexer.py):

    • The C++ implementation correctly implements the mathematical formula: I_{t,s} = sum_{j=1}^{H^I} w^I_{t,j} * ReLU(q^I_{t,j} · k^I_s)
    • Main differences: Tilelang uses FP8 precision and block processing for efficiency, while C++ uses FP32 and processes entire sequences
    • Status: Functionally correct, tests passing
  2. Top-k Selector Implementation (llama-sparse-topk.cpp vs topk_selector.py):

    • The C++ implementation uses a simplified approach with ggml_top_k function
    • Tilelang uses a sophisticated radix-sort based algorithm with multiple stages
    • Status: Simplified but functional, tests passing
  3. Sparse MLA Forward Implementation (llama-sparse-mla-fwd.cpp vs sparse_mla_fwd.py):

    • The C++ implementation has a critical issue: it is taking the first top_k tokens instead of using the actual indices from top-k selection
    • I attempted to fix this using ggml_get_rows but encountered GGML assertion errors (a sketch of the intended index-based gather is below)
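
To make finding 3 concrete, here's a toy NumPy illustration (not the actual llama.cpp tensors) of the difference between what the code currently does (take the first top_k cached tokens) and what it should do (gather the rows at the indices the top-k selection actually chose, which is what the ggml_get_rows attempt was aiming for):

```python
import numpy as np

rng = np.random.default_rng(0)
n_cached, top_k, d = 1024, 8, 64
kv = rng.normal(size=(n_cached, d)).astype(np.float32)   # stand-in for the cached KV entries
scores = rng.normal(size=n_cached).astype(np.float32)    # indexer scores I_{t,s} for one query

idx = np.argpartition(scores, -top_k)[-top_k:]           # indices chosen by top-k selection

wrong = kv[:top_k]   # what the current code effectively uses: the first top_k tokens
right = kv[idx]      # what it should use: the rows at the selected indices

print(np.array_equal(wrong, right))  # almost certainly False
```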

Attempting to run inference is currently failing with:

/home/jesse/llama.cpp/ggml/src/ggml.c:3023: GGML_ASSERT(ggml_can_mul_mat(a, b)) failed
/home/jesse/llama.cpp/build/bin/libggml-base.so(+0x1817b)[0x787979e5b17b]
/home/jesse/llama.cpp/build/bin/libggml-base.so(ggml_print_backtrace+0x21c)[0x787979e5b5dc]
/home/jesse/llama.cpp/build/bin/libggml-base.so(ggml_abort+0x15b)[0x787979e5b7bb]
/home/jesse/llama.cpp/build/bin/libggml-base.so(ggml_mul_mat+0x45)[0x787979e60cb5]
/home/jesse/llama.cpp/build/bin/libllama.so(_ZN5llama19sparse_attn_indexer24compute_token_importanceEP12ggml_contextRK11llama_modeliP11ggml_tensorbRKSt8functionIFvS7_PKciEE+0x43c)[0x787979da15ec]
/home/jesse/llama.cpp/build/bin/libllama.so(_ZN21llm_build_deepseek3_2C1ERK11llama_modelRK16llm_graph_params+0x463)[0x787979d6f8d3]
/home/jesse/llama.cpp/build/bin/libllama.so(_ZNK11llama_model11build_graphERK16llm_graph_params+0x76b)[0x787979d1717b]
/home/jesse/llama.cpp/build/bin/libllama.so(_ZN13llama_context13graph_reserveEjjjPK22llama_memory_context_ib+0x129)[0x787979ca45e9]
/home/jesse/llama.cpp/build/bin/libllama.so(_ZN13llama_contextC2ERK11llama_model20llama_context_params+0xe90)[0x787979ca94a0]
/home/jesse/llama.cpp/build/bin/libllama.so(llama_init_from_model+0xda)[0x787979ca9faa]
./build/bin/llama-server(+0x1ebe06)[0x5888b5638e06]
./build/bin/llama-server(+0xbb006)[0x5888b5508006]
./build/bin/llama-server(+0x55358)[0x5888b54a2358]
/lib/x86_64-linux-gnu/libc.so.6(+0x2a1ca)[0x78797942a1ca]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x8b)[0x78797942a28b]
./build/bin/llama-server(+0x578a5)[0x5888b54a48a5]
Aborted (core dumped)

Next, I'll personally audit llama-sparse-mla-fwd.cpp and the corresponding test to see if I can figure out what's going on there.

createthis avatar Oct 15 '25 19:10 createthis