ggml-cpu: handle 3d tensors in repack mat_mul
While testing #16739, I noticed that perplexities for LFM2 skyrocketed. @ggerganov pointed out that some matrix shapes would probably not be supported.
LFM2 has some layers whose tensors have two batches, so the MUL_MATs were only partially computed, leading to incorrect results. See https://github.com/ggml-org/llama.cpp/pull/16739#issuecomment-3472806066
This patch adds basic support for tensors with ne2 > 1, using very naive chunking modeled on the non-repack MUL_MAT.
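For illustration, here is a minimal, self-contained sketch of the per-plane idea (plain float math, hypothetical helper names; it does not mirror the actual repack.cpp kernels or this patch's diff): the existing 2D matmul is reused for each i2 plane, with src0 broadcast when it has a single plane, and the rows of each plane split across threads the same way the non-repack MUL_MAT does.

```cpp
// Illustrative sketch only: a 2D matmul reused per i2 plane, with row chunking per thread.
#include <cstddef>
#include <cstdio>
#include <vector>

// 2D matmul for one plane: dst[i][j] = sum_k a[i][k] * b[j][k].
// ith/nth emulate the per-thread row chunking of the CPU compute kernels.
static void mul_mat_2d(const float * a, const float * b, float * dst,
                       int m, int n, int k, int ith, int nth) {
    for (int i = ith; i < m; i += nth) {               // this thread's rows
        for (int j = 0; j < n; ++j) {
            float sum = 0.0f;
            for (int l = 0; l < k; ++l) {
                sum += a[(size_t) i*k + l] * b[(size_t) j*k + l];
            }
            dst[(size_t) i*n + j] = sum;
        }
    }
}

// 3D wrapper: loop over ne2 planes, broadcasting 'a' when it only has one plane.
static void mul_mat_3d(const float * a, int a_planes,
                       const float * b, float * dst,
                       int m, int n, int k, int ne2, int ith, int nth) {
    for (int i2 = 0; i2 < ne2; ++i2) {
        const float * a_plane = a   + (a_planes == 1 ? 0 : (size_t) i2*m*k);
        const float * b_plane = b   + (size_t) i2*n*k;
        float       * d_plane = dst + (size_t) i2*m*n;
        // each plane is chunked independently across threads
        mul_mat_2d(a_plane, b_plane, d_plane, m, n, k, ith, nth);
    }
}

int main() {
    const int m = 2, n = 3, k = 4, ne2 = 2;            // two batches, as in the LFM2 layers
    std::vector<float> a((size_t) m*k, 1.0f);          // single-plane weights, broadcast over ne2
    std::vector<float> b((size_t) ne2*n*k, 2.0f);
    std::vector<float> dst((size_t) ne2*m*n, 0.0f);
    mul_mat_3d(a.data(), /*a_planes=*/1, b.data(), dst.data(), m, n, k, ne2, /*ith=*/0, /*nth=*/1);
    printf("dst[0] = %.1f  dst[last] = %.1f\n", dst[0], dst.back()); // 8.0 and 8.0
    return 0;
}
```

Chunking each plane independently keeps the change small, at the cost of creating ne2 times as many chunks per MUL_MAT.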
Perplexities using this patch:
# REPACK ON
[1]9.9763,[2]15.1558,[3]13.9708,[4]13.7465,[5]14.2039,[6]14.6234,[7]14.6543,[8]15.6984,[9]16.7691,[10]17.1773,[11]16.9814,[12]17.2111,[13]17.7539,[14]17.2013,[15]16.8515,[16]16.9276,[17]15.8386,[18]16.1010,[19]15.9863,[20]15.7344,
Final estimate: PPL = 15.7344 +/- 0.63198
# REPACK OFF
[1]9.9763,[2]15.1558,[3]13.9708,[4]13.7465,[5]14.2039,[6]14.6234,[7]14.6543,[8]15.6984,[9]16.7691,[10]17.1773,[11]16.9814,[12]17.2111,[13]17.7539,[14]17.2013,[15]16.8515,[16]16.9276,[17]15.8386,[18]16.1010,[19]15.9863,[20]15.7344,
Final estimate: PPL = 15.7344 +/- 0.63198
I can provide logs for other models if needed.
Repro commands:
# GGML_CPU_REPACK=ON|OFF GGML_BLAS=OFF GGML_METAL=OFF
for model in "unsloth/Qwen3-8B-128K-GGUF:Q4_0 LiquidAI/LFM2-1.2B-GGUF:Q4_0 LiquidAI/LFM2-2.6B-GGUF:Q4_0"; do
./bin/llama-perplexity -hf $model -f ./wikitext-2-raw/wiki.test.raw --chunks 100 -dev none
done
Other models:
# Qwen 3 REPACK ON
perplexities_build-cpu-aarm64_Qwen3-8B-128K-GGUF:Q4_0.txt-[1]7.6803,[2]10.1811,[3]9.4260,[4]9.0666,[5]9.2647,[6]9.6980,[7]9.8774,[8]10.4100,[9]10.9424,[10]11.4185,[11]11.4938,[12]11.6893,[13]12.1807,[14]11.7433,[15]11.5808,[16]11.7468,[17]11.0987,[18]11.2603,[19]11.0962,[20]11.1735,[21]10.8974,[22]10.9181,[23]10.4976,[24]9.9920,[25]9.7800,[26]9.5234,[27]9.2917,[28]9.1358,[29]9.1840,[30]9.1386,[31]9.1237,[32]9.1164,[33]8.9839,[34]9.0213,[35]9.0949,[36]9.2154,[37]9.3887,[38]9.4512,[39]9.4129,[40]9.4693,[41]9.4650,[42]9.3915,[43]9.4305,[44]9.4552,[45]9.4605,[46]9.4598,[47]9.6747,[48]9.7829,[49]9.7476,[50]9.8248,[51]9.8489,[52]9.8696,[53]9.9087,[54]9.9802,[55]10.0069,[56]10.0701,[57]10.0272,[58]10.0532,[59]10.1151,[60]10.1645,[61]10.1961,[62]10.2441,[63]10.3282,[64]10.3811,[65]10.4620,[66]10.5540,[67]10.6382,[68]10.6259,[69]10.6246,[70]10.6129,[71]10.6290,[72]10.6951,[73]10.7189,[74]10.7327,[75]10.6625,[76]10.6244,[77]10.6562,[78]10.6942,[79]10.6103,[80]10.5880,[81]10.5408,[82]10.5761,[83]10.5308,[84]10.5104,[85]10.5348,[86]10.6326,[87]10.6827,[88]10.6733,[89]10.6861,[90]10.6783,[91]10.7371,[92]10.6980,[93]10.7394,[94]10.7430,[95]10.7241,[96]10.7199,[97]10.6880,[98]10.6990,[99]10.6692,[100]10.7254,
perplexities_build-cpu-aarm64_Qwen3-8B-128K-GGUF:Q4_0.txt:Final estimate: PPL = 10.7254 +/- 0.20427
# Qwen 3 REPACK OFF
perplexities_build-cpu-aarm64-norepack_Qwen3-8B-128K-GGUF:Q4_0.txt-[1]7.6803,[2]10.1811,[3]9.4260,[4]9.0666,[5]9.2647,[6]9.6980,[7]9.8774,[8]10.4100,[9]10.9424,[10]11.4185,[11]11.4938,[12]11.6893,[13]12.1807,[14]11.7433,[15]11.5808,[16]11.7468,[17]11.0987,[18]11.2603,[19]11.0962,[20]11.1735,[21]10.8974,[22]10.9181,[23]10.4976,[24]9.9920,[25]9.7800,[26]9.5234,[27]9.2917,[28]9.1358,[29]9.1840,[30]9.1386,[31]9.1237,[32]9.1164,[33]8.9839,[34]9.0213,[35]9.0949,[36]9.2154,[37]9.3887,[38]9.4512,[39]9.4129,[40]9.4693,[41]9.4650,[42]9.3915,[43]9.4305,[44]9.4552,[45]9.4605,[46]9.4598,[47]9.6747,[48]9.7829,[49]9.7476,[50]9.8248,[51]9.8489,[52]9.8696,[53]9.9087,[54]9.9802,[55]10.0069,[56]10.0701,[57]10.0272,[58]10.0532,[59]10.1151,[60]10.1645,[61]10.1961,[62]10.2441,[63]10.3282,[64]10.3811,[65]10.4620,[66]10.5540,[67]10.6382,[68]10.6259,[69]10.6246,[70]10.6129,[71]10.6290,[72]10.6951,[73]10.7189,[74]10.7327,[75]10.6625,[76]10.6244,[77]10.6562,[78]10.6942,[79]10.6103,[80]10.5880,[81]10.5408,[82]10.5761,[83]10.5308,[84]10.5104,[85]10.5348,[86]10.6326,[87]10.6827,[88]10.6733,[89]10.6861,[90]10.6783,[91]10.7371,[92]10.6980,[93]10.7394,[94]10.7430,[95]10.7241,[96]10.7199,[97]10.6880,[98]10.6990,[99]10.6692,[100]10.7254,
perplexities_build-cpu-aarm64-norepack_Qwen3-8B-128K-GGUF:Q4_0.txt:Final estimate: PPL = 10.7254 +/- 0.20427
# LFM2 REPACK ON
perplexities_build-cpu-aarm64_LFM2-2.6B-GGUF:Q4_0.txt-[1]7.0724,[2]11.2417,[3]11.3736,[4]11.0566,[5]11.2978,[6]11.8576,[7]12.1547,[8]12.8728,[9]13.8226,[10]14.0957,[11]13.6415,[12]13.7865,[13]14.1242,[14]13.5275,[15]13.2750,[16]13.1469,[17]12.3869,[18]12.5628,[19]12.4196,[20]12.2570,[21]11.8653,[22]11.8657,[23]11.5625,[24]11.3099,[25]11.2837,[26]11.0172,[27]10.9685,[28]10.9421,[29]10.8844,[30]11.0062,[31]10.9984,[32]11.1214,[33]11.0812,[34]11.0926,[35]11.0572,[36]11.1630,[37]11.3042,[38]11.1564,[39]11.3252,[40]11.2555,[41]11.2296,[42]11.2722,[43]11.3182,[44]11.2066,[45]11.2418,[46]11.3877,[47]11.5001,[48]11.4392,[49]11.4613,[50]11.5636,[51]11.5742,[52]11.5927,[53]11.6412,[54]11.6469,[55]11.7139,[56]11.7273,[57]11.7956,[58]11.8651,[59]11.9185,[60]11.9757,[61]11.9816,[62]12.0535,[63]12.1499,[64]12.2589,[65]12.3879,[66]12.4853,[67]12.4684,[68]12.4438,[69]12.4475,[70]12.4592,[71]12.5043,[72]12.5274,[73]12.5598,[74]12.5025,[75]12.4682,[76]12.4976,[77]12.5186,[78]12.4596,[79]12.3959,[80]12.3615,[81]12.4195,[82]12.4745,[83]12.4321,[84]12.4450,[85]12.5002,[86]12.5583,[87]12.5979,[88]12.5772,[89]12.5398,[90]12.5321,[91]12.4828,[92]12.5500,[93]12.5727,[94]12.5613,[95]12.5658,[96]12.5653,[97]12.5379,[98]12.5156,[99]12.5447,[100]12.5589,
perplexities_build-cpu-aarm64_LFM2-2.6B-GGUF:Q4_0.txt:Final estimate: PPL = 12.5589 +/- 0.21849
# LFM2 REPACK OFF
perplexities_build-cpu-aarm64-norepack_LFM2-2.6B-GGUF:Q4_0.txt-[1]7.0724,[2]11.2417,[3]11.3736,[4]11.0566,[5]11.2978,[6]11.8576,[7]12.1547,[8]12.8728,[9]13.8226,[10]14.0957,[11]13.6415,[12]13.7865,[13]14.1242,[14]13.5275,[15]13.2750,[16]13.1469,[17]12.3869,[18]12.5628,[19]12.4196,[20]12.2570,[21]11.8653,[22]11.8657,[23]11.5625,[24]11.3099,[25]11.2837,[26]11.0172,[27]10.9685,[28]10.9421,[29]10.8844,[30]11.0062,[31]10.9984,[32]11.1214,[33]11.0812,[34]11.0926,[35]11.0572,[36]11.1630,[37]11.3042,[38]11.1564,[39]11.3252,[40]11.2555,[41]11.2296,[42]11.2722,[43]11.3182,[44]11.2066,[45]11.2418,[46]11.3877,[47]11.5001,[48]11.4392,[49]11.4613,[50]11.5636,[51]11.5742,[52]11.5927,[53]11.6412,[54]11.6469,[55]11.7139,[56]11.7273,[57]11.7956,[58]11.8651,[59]11.9185,[60]11.9757,[61]11.9816,[62]12.0535,[63]12.1499,[64]12.2589,[65]12.3879,[66]12.4853,[67]12.4684,[68]12.4438,[69]12.4475,[70]12.4592,[71]12.5043,[72]12.5274,[73]12.5598,[74]12.5025,[75]12.4682,[76]12.4976,[77]12.5186,[78]12.4596,[79]12.3959,[80]12.3615,[81]12.4195,[82]12.4745,[83]12.4321,[84]12.4450,[85]12.5002,[86]12.5583,[87]12.5979,[88]12.5772,[89]12.5398,[90]12.5321,[91]12.4828,[92]12.5500,[93]12.5727,[94]12.5613,[95]12.5658,[96]12.5653,[97]12.5379,[98]12.5156,[99]12.5447,[100]12.5589,
perplexities_build-cpu-aarm64-norepack_LFM2-2.6B-GGUF:Q4_0.txt:Final estimate: PPL = 12.5589 +/- 0.21849
@ggerganov This is ready for review now. Thanks for your patience.
@ggerganov I've addressed all your comments. Let me know if something else is required.
@Alcpz This PR causes a significant performance regression for prompt processing because it creates a lot more chunks than before.
Here is llama3.2-1B-Q4_0 running with 6 threads and instrumented matmul code.
The instrumentation simply counts the number of processed chunks and the time spent per thread: repack-chunking-inst.diff.txt
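Conceptually, that instrumentation amounts to a per-thread chunk counter plus a timer around the chunk loop; a rough, hypothetical sketch of the idea (not the attached diff, names are illustrative):

```cpp
// Hypothetical per-thread instrumentation sketch (not the attached diff):
// count the chunks processed by one thread and the wall time spent on them.
#include <chrono>
#include <cstdio>

struct chunk_stats {
    int       nchunks = 0;   // chunks processed by this thread
    long long usec    = 0;   // wall time spent in the chunk loop
};

// Wraps a chunk-processing loop; 'process' stands in for the real kernel work.
template <typename F>
static chunk_stats run_chunks(int ith, int nth, int total_chunks, F process) {
    chunk_stats st;
    const auto t0 = std::chrono::steady_clock::now();
    for (int c = ith; c < total_chunks; c += nth) {   // static chunk assignment in this sketch
        process(c);
        st.nchunks++;
    }
    const auto t1 = std::chrono::steady_clock::now();
    st.usec = std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
    return st;
}

int main() {
    const auto st = run_chunks(/*ith=*/0, /*nth=*/6, /*total_chunks=*/38, [](int){});
    printf("thread-0: nchunks %d usec %lld\n", st.nchunks, st.usec);
    return 0;
}
```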
Left column: after this PR. Right column: before this PR.
thread-2: Qcur-11 nchunks 38 usec 1844 thread-4: Qcur-11 nchunks 6 usec 1496
thread-3: Qcur-11 nchunks 38 usec 1844 thread-0: Qcur-11 nchunks 6 usec 1498
thread-4: Qcur-11 nchunks 17 usec 1874 thread-5: Qcur-11 nchunks 3 usec 1597
thread-5: Qcur-11 nchunks 17 usec 1948 thread-1: Qcur-11 nchunks 3 usec 1640
thread-1: Qcur-11 nchunks 17 usec 1894 thread-2: Qcur-11 nchunks 3 usec 1685
thread-0: Qcur-11 nchunks 17 usec 1876 thread-3: Qcur-11 nchunks 3 usec 1718
thread-4: Vcur-11 nchunks 17 usec 607 thread-5: Vcur-11 nchunks 6 usec 508
thread-5: Vcur-11 nchunks 17 usec 638 thread-4: Vcur-11 nchunks 6 usec 515
thread-2: Vcur-11 nchunks 39 usec 617 thread-0: Vcur-11 nchunks 3 usec 547
thread-1: Vcur-11 nchunks 15 usec 618 thread-2: Vcur-11 nchunks 3 usec 548
thread-0: Vcur-11 nchunks 17 usec 630 thread-1: Vcur-11 nchunks 3 usec 564
thread-3: Vcur-11 nchunks 39 usec 617 thread-3: Vcur-11 nchunks 3 usec 596
thread-5: Kcur-11 nchunks 38 usec 611 thread-5: Kcur-11 nchunks 6 usec 484
thread-1: Kcur-11 nchunks 17 usec 615 thread-0: Kcur-11 nchunks 6 usec 490
thread-0: Kcur-11 nchunks 17 usec 617 thread-1: Kcur-11 nchunks 3 usec 547
thread-2: Kcur-11 nchunks 17 usec 628 thread-3: Kcur-11 nchunks 3 usec 548
thread-4: Kcur-11 nchunks 38 usec 611 thread-4: Kcur-11 nchunks 3 usec 557
thread-3: Kcur-11 nchunks 17 usec 649 thread-2: Kcur-11 nchunks 3 usec 547
thread-3: attn_out-11 nchunks 38 usec 1835 thread-4: attn_out-11 nchunks 6 usec 1567
thread-5: attn_out-11 nchunks 38 usec 1847 thread-5: attn_out-11 nchunks 6 usec 1569
thread-0: attn_out-11 nchunks 17 usec 1880 thread-1: attn_out-11 nchunks 3 usec 1637
thread-4: attn_out-11 nchunks 17 usec 1886 thread-2: attn_out-11 nchunks 3 usec 1639
thread-1: attn_out-11 nchunks 17 usec 1890 thread-3: attn_out-11 nchunks 3 usec 1642
thread-2: attn_out-11 nchunks 17 usec 1897 thread-0: attn_out-11 nchunks 3 usec 1649
thread-3: ffn_gate-11 nchunks 38 usec 4886 thread-5: ffn_gate-11 nchunks 6 usec 4103
thread-2: ffn_gate-11 nchunks 38 usec 4887 thread-4: ffn_gate-11 nchunks 6 usec 4141
thread-5: ffn_gate-11 nchunks 17 usec 4992 thread-0: ffn_gate-11 nchunks 3 usec 4298
thread-1: ffn_gate-11 nchunks 17 usec 5010 thread-1: ffn_gate-11 nchunks 3 usec 4357
thread-4: ffn_gate-11 nchunks 17 usec 5010 thread-2: ffn_gate-11 nchunks 3 usec 4373
thread-0: ffn_gate-11 nchunks 17 usec 5032 thread-3: ffn_gate-11 nchunks 3 usec 4447
thread-5: ffn_up-11 nchunks 38 usec 4908 thread-0: ffn_up-11 nchunks 6 usec 4107
thread-3: ffn_up-11 nchunks 38 usec 4909 thread-5: ffn_up-11 nchunks 6 usec 4129
thread-4: ffn_up-11 nchunks 17 usec 5000 thread-1: ffn_up-11 nchunks 3 usec 4362
thread-0: ffn_up-11 nchunks 17 usec 5005 thread-4: ffn_up-11 nchunks 3 usec 4377
thread-1: ffn_up-11 nchunks 17 usec 5008 thread-3: ffn_up-11 nchunks 3 usec 4400
thread-2: ffn_up-11 nchunks 17 usec 5037 thread-2: ffn_up-11 nchunks 3 usec 4381
thread-5: ffn_out-11 nchunks 38 usec 4924 thread-5: ffn_out-11 nchunks 6 usec 4089
thread-4: ffn_out-11 nchunks 38 usec 4928 thread-0: ffn_out-11 nchunks 6 usec 4089
thread-1: ffn_out-11 nchunks 17 usec 5006 thread-3: ffn_out-11 nchunks 3 usec 4386
thread-2: ffn_out-11 nchunks 17 usec 5010 thread-2: ffn_out-11 nchunks 3 usec 4386
thread-0: ffn_out-11 nchunks 17 usec 5011 thread-4: ffn_out-11 nchunks 3 usec 4414
thread-3: ffn_out-11 nchunks 17 usec 5023 thread-1: ffn_out-11 nchunks 3 usec 4391
That's way too many chunks, and we burn a lot of time on synchronization. If you have an idea for a quick fix that you can test on LFM2, please start another PR and I'll verify it on my setup. Make sure to test with Llama3.2 and Qwen3 models with the instrumented code.
Mmm, let's revert this then. I will reopen the branch as a draft PR so we can work on a better solution; I'd rather not introduce a regression upstream. @ggerganov Mind doing the revert?