ggml-cpu: handle 3d tensors in repack mat_mul
While testing #16739, I noticed that perplexities for LFM2 skyrocketed. @ggerganov pointed out that some matrix shapes would probably not be supported.
LFM2 has some layers whose tensors have two batches, so the MUL_MATs were only partially computed, leading to incorrect results. See https://github.com/ggml-org/llama.cpp/pull/16739#issuecomment-3472806066
This patch adds basic support for tensors with ne2 > 1, using very naive chunking modeled on the non-repack MUL_MAT.
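For illustration, here is a minimal, self-contained sketch of the per-plane idea (plain float math, hypothetical helper names; it does not mirror the actual repack.cpp kernels or this patch's diff): the existing 2D matmul is reused for each i2 plane, with src0 broadcast when it has a single plane, and the rows of each plane split across threads the same way the non-repack MUL_MAT does.

```cpp
// Illustrative sketch only: a 2D matmul reused per i2 plane, with row chunking per thread.
#include <cstddef>
#include <cstdio>
#include <vector>

// 2D matmul for one plane: dst[i][j] = sum_k a[i][k] * b[j][k].
// ith/nth emulate the per-thread row chunking of the CPU compute kernels.
static void mul_mat_2d(const float * a, const float * b, float * dst,
                       int m, int n, int k, int ith, int nth) {
    for (int i = ith; i < m; i += nth) {               // this thread's rows
        for (int j = 0; j < n; ++j) {
            float sum = 0.0f;
            for (int l = 0; l < k; ++l) {
                sum += a[(size_t) i*k + l] * b[(size_t) j*k + l];
            }
            dst[(size_t) i*n + j] = sum;
        }
    }
}

// 3D wrapper: loop over ne2 planes, broadcasting 'a' when it only has one plane.
static void mul_mat_3d(const float * a, int a_planes,
                       const float * b, float * dst,
                       int m, int n, int k, int ne2, int ith, int nth) {
    for (int i2 = 0; i2 < ne2; ++i2) {
        const float * a_plane = a   + (a_planes == 1 ? 0 : (size_t) i2*m*k);
        const float * b_plane = b   + (size_t) i2*n*k;
        float       * d_plane = dst + (size_t) i2*m*n;
        // each plane is chunked independently across threads
        mul_mat_2d(a_plane, b_plane, d_plane, m, n, k, ith, nth);
    }
}

int main() {
    const int m = 2, n = 3, k = 4, ne2 = 2;            // two batches, as in the LFM2 layers
    std::vector<float> a((size_t) m*k, 1.0f);          // single-plane weights, broadcast over ne2
    std::vector<float> b((size_t) ne2*n*k, 2.0f);
    std::vector<float> dst((size_t) ne2*m*n, 0.0f);
    mul_mat_3d(a.data(), /*a_planes=*/1, b.data(), dst.data(), m, n, k, ne2, /*ith=*/0, /*nth=*/1);
    printf("dst[0] = %.1f  dst[last] = %.1f\n", dst[0], dst.back()); // 8.0 and 8.0
    return 0;
}
```

Chunking each plane independently keeps the change small, at the cost of creating ne2 times as many chunks per MUL_MAT.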
Perplexities using this patch:
# REPACK ON
[1]9.9763,[2]15.1558,[3]13.9708,[4]13.7465,[5]14.2039,[6]14.6234,[7]14.6543,[8]15.6984,[9]16.7691,[10]17.1773,[11]16.9814,[12]17.2111,[13]17.7539,[14]17.2013,[15]16.8515,[16]16.9276,[17]15.8386,[18]16.1010,[19]15.9863,[20]15.7344,
Final estimate: PPL = 15.7344 +/- 0.63198
# REPACK OFF
[1]9.9763,[2]15.1558,[3]13.9708,[4]13.7465,[5]14.2039,[6]14.6234,[7]14.6543,[8]15.6984,[9]16.7691,[10]17.1773,[11]16.9814,[12]17.2111,[13]17.7539,[14]17.2013,[15]16.8515,[16]16.9276,[17]15.8386,[18]16.1010,[19]15.9863,[20]15.7344,
Final estimate: PPL = 15.7344 +/- 0.63198
I can provide logs for other models if needed.
Repro commands:
# GGML_CPU_REPACK=ON|OFF GGML_BLAS=OFF GGML_METAL=OFF
for model in "unsloth/Qwen3-8B-128K-GGUF:Q4_0 LiquidAI/LFM2-1.2B-GGUF:Q4_0 LiquidAI/LFM2-2.6B-GGUF:Q4_0"; do
./bin/llama-perplexity -hf $model -f ./wikitext-2-raw/wiki.test.raw --chunks 100 -dev none
done
Other models:
# Qwen 3 REPACK ON
perplexities_build-cpu-aarm64_Qwen3-8B-128K-GGUF:Q4_0.txt-[1]7.6803,[2]10.1811,[3]9.4260,[4]9.0666,[5]9.2647,[6]9.6980,[7]9.8774,[8]10.4100,[9]10.9424,[10]11.4185,[11]11.4938,[12]11.6893,[13]12.1807,[14]11.7433,[15]11.5808,[16]11.7468,[17]11.0987,[18]11.2603,[19]11.0962,[20]11.1735,[21]10.8974,[22]10.9181,[23]10.4976,[24]9.9920,[25]9.7800,[26]9.5234,[27]9.2917,[28]9.1358,[29]9.1840,[30]9.1386,[31]9.1237,[32]9.1164,[33]8.9839,[34]9.0213,[35]9.0949,[36]9.2154,[37]9.3887,[38]9.4512,[39]9.4129,[40]9.4693,[41]9.4650,[42]9.3915,[43]9.4305,[44]9.4552,[45]9.4605,[46]9.4598,[47]9.6747,[48]9.7829,[49]9.7476,[50]9.8248,[51]9.8489,[52]9.8696,[53]9.9087,[54]9.9802,[55]10.0069,[56]10.0701,[57]10.0272,[58]10.0532,[59]10.1151,[60]10.1645,[61]10.1961,[62]10.2441,[63]10.3282,[64]10.3811,[65]10.4620,[66]10.5540,[67]10.6382,[68]10.6259,[69]10.6246,[70]10.6129,[71]10.6290,[72]10.6951,[73]10.7189,[74]10.7327,[75]10.6625,[76]10.6244,[77]10.6562,[78]10.6942,[79]10.6103,[80]10.5880,[81]10.5408,[82]10.5761,[83]10.5308,[84]10.5104,[85]10.5348,[86]10.6326,[87]10.6827,[88]10.6733,[89]10.6861,[90]10.6783,[91]10.7371,[92]10.6980,[93]10.7394,[94]10.7430,[95]10.7241,[96]10.7199,[97]10.6880,[98]10.6990,[99]10.6692,[100]10.7254,
perplexities_build-cpu-aarm64_Qwen3-8B-128K-GGUF:Q4_0.txt:Final estimate: PPL = 10.7254 +/- 0.20427
# Qwen 3 REPACK OFF
perplexities_build-cpu-aarm64-norepack_Qwen3-8B-128K-GGUF:Q4_0.txt-[1]7.6803,[2]10.1811,[3]9.4260,[4]9.0666,[5]9.2647,[6]9.6980,[7]9.8774,[8]10.4100,[9]10.9424,[10]11.4185,[11]11.4938,[12]11.6893,[13]12.1807,[14]11.7433,[15]11.5808,[16]11.7468,[17]11.0987,[18]11.2603,[19]11.0962,[20]11.1735,[21]10.8974,[22]10.9181,[23]10.4976,[24]9.9920,[25]9.7800,[26]9.5234,[27]9.2917,[28]9.1358,[29]9.1840,[30]9.1386,[31]9.1237,[32]9.1164,[33]8.9839,[34]9.0213,[35]9.0949,[36]9.2154,[37]9.3887,[38]9.4512,[39]9.4129,[40]9.4693,[41]9.4650,[42]9.3915,[43]9.4305,[44]9.4552,[45]9.4605,[46]9.4598,[47]9.6747,[48]9.7829,[49]9.7476,[50]9.8248,[51]9.8489,[52]9.8696,[53]9.9087,[54]9.9802,[55]10.0069,[56]10.0701,[57]10.0272,[58]10.0532,[59]10.1151,[60]10.1645,[61]10.1961,[62]10.2441,[63]10.3282,[64]10.3811,[65]10.4620,[66]10.5540,[67]10.6382,[68]10.6259,[69]10.6246,[70]10.6129,[71]10.6290,[72]10.6951,[73]10.7189,[74]10.7327,[75]10.6625,[76]10.6244,[77]10.6562,[78]10.6942,[79]10.6103,[80]10.5880,[81]10.5408,[82]10.5761,[83]10.5308,[84]10.5104,[85]10.5348,[86]10.6326,[87]10.6827,[88]10.6733,[89]10.6861,[90]10.6783,[91]10.7371,[92]10.6980,[93]10.7394,[94]10.7430,[95]10.7241,[96]10.7199,[97]10.6880,[98]10.6990,[99]10.6692,[100]10.7254,
perplexities_build-cpu-aarm64-norepack_Qwen3-8B-128K-GGUF:Q4_0.txt:Final estimate: PPL = 10.7254 +/- 0.20427
# LFM2 REPACK ON
perplexities_build-cpu-aarm64_LFM2-2.6B-GGUF:Q4_0.txt-[1]7.0724,[2]11.2417,[3]11.3736,[4]11.0566,[5]11.2978,[6]11.8576,[7]12.1547,[8]12.8728,[9]13.8226,[10]14.0957,[11]13.6415,[12]13.7865,[13]14.1242,[14]13.5275,[15]13.2750,[16]13.1469,[17]12.3869,[18]12.5628,[19]12.4196,[20]12.2570,[21]11.8653,[22]11.8657,[23]11.5625,[24]11.3099,[25]11.2837,[26]11.0172,[27]10.9685,[28]10.9421,[29]10.8844,[30]11.0062,[31]10.9984,[32]11.1214,[33]11.0812,[34]11.0926,[35]11.0572,[36]11.1630,[37]11.3042,[38]11.1564,[39]11.3252,[40]11.2555,[41]11.2296,[42]11.2722,[43]11.3182,[44]11.2066,[45]11.2418,[46]11.3877,[47]11.5001,[48]11.4392,[49]11.4613,[50]11.5636,[51]11.5742,[52]11.5927,[53]11.6412,[54]11.6469,[55]11.7139,[56]11.7273,[57]11.7956,[58]11.8651,[59]11.9185,[60]11.9757,[61]11.9816,[62]12.0535,[63]12.1499,[64]12.2589,[65]12.3879,[66]12.4853,[67]12.4684,[68]12.4438,[69]12.4475,[70]12.4592,[71]12.5043,[72]12.5274,[73]12.5598,[74]12.5025,[75]12.4682,[76]12.4976,[77]12.5186,[78]12.4596,[79]12.3959,[80]12.3615,[81]12.4195,[82]12.4745,[83]12.4321,[84]12.4450,[85]12.5002,[86]12.5583,[87]12.5979,[88]12.5772,[89]12.5398,[90]12.5321,[91]12.4828,[92]12.5500,[93]12.5727,[94]12.5613,[95]12.5658,[96]12.5653,[97]12.5379,[98]12.5156,[99]12.5447,[100]12.5589,
perplexities_build-cpu-aarm64_LFM2-2.6B-GGUF:Q4_0.txt:Final estimate: PPL = 12.5589 +/- 0.21849
# LFM2 REPACK OFF
perplexities_build-cpu-aarm64-norepack_LFM2-2.6B-GGUF:Q4_0.txt-[1]7.0724,[2]11.2417,[3]11.3736,[4]11.0566,[5]11.2978,[6]11.8576,[7]12.1547,[8]12.8728,[9]13.8226,[10]14.0957,[11]13.6415,[12]13.7865,[13]14.1242,[14]13.5275,[15]13.2750,[16]13.1469,[17]12.3869,[18]12.5628,[19]12.4196,[20]12.2570,[21]11.8653,[22]11.8657,[23]11.5625,[24]11.3099,[25]11.2837,[26]11.0172,[27]10.9685,[28]10.9421,[29]10.8844,[30]11.0062,[31]10.9984,[32]11.1214,[33]11.0812,[34]11.0926,[35]11.0572,[36]11.1630,[37]11.3042,[38]11.1564,[39]11.3252,[40]11.2555,[41]11.2296,[42]11.2722,[43]11.3182,[44]11.2066,[45]11.2418,[46]11.3877,[47]11.5001,[48]11.4392,[49]11.4613,[50]11.5636,[51]11.5742,[52]11.5927,[53]11.6412,[54]11.6469,[55]11.7139,[56]11.7273,[57]11.7956,[58]11.8651,[59]11.9185,[60]11.9757,[61]11.9816,[62]12.0535,[63]12.1499,[64]12.2589,[65]12.3879,[66]12.4853,[67]12.4684,[68]12.4438,[69]12.4475,[70]12.4592,[71]12.5043,[72]12.5274,[73]12.5598,[74]12.5025,[75]12.4682,[76]12.4976,[77]12.5186,[78]12.4596,[79]12.3959,[80]12.3615,[81]12.4195,[82]12.4745,[83]12.4321,[84]12.4450,[85]12.5002,[86]12.5583,[87]12.5979,[88]12.5772,[89]12.5398,[90]12.5321,[91]12.4828,[92]12.5500,[93]12.5727,[94]12.5613,[95]12.5658,[96]12.5653,[97]12.5379,[98]12.5156,[99]12.5447,[100]12.5589,
perplexities_build-cpu-aarm64-norepack_LFM2-2.6B-GGUF:Q4_0.txt:Final estimate: PPL = 12.5589 +/- 0.21849
@ggerganov This is ready for review now. Thanks for your patience.
@ggerganov I've addressed all your comments. Let me know if something else is required.
@Alcpz This PR causes a significant performance regression for prompt processing because it creates a lot more chunks than before.
Here is llama3.2-1B-Q4_0 running with 6 threads and instrumented matmul code.
The instrumentation simply counts the number of processed chunks and the time spent per thread: repack-chunking-inst.diff.txt
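Conceptually, that instrumentation amounts to a per-thread chunk counter plus a timer around the chunk loop; a rough, hypothetical sketch of the idea (not the attached diff, names are illustrative):

```cpp
// Hypothetical per-thread instrumentation sketch (not the attached diff):
// count the chunks processed by one thread and the wall time spent on them.
#include <chrono>
#include <cstdio>

struct chunk_stats {
    int       nchunks = 0;   // chunks processed by this thread
    long long usec    = 0;   // wall time spent in the chunk loop
};

// Wraps a chunk-processing loop; 'process' stands in for the real kernel work.
template <typename F>
static chunk_stats run_chunks(int ith, int nth, int total_chunks, F process) {
    chunk_stats st;
    const auto t0 = std::chrono::steady_clock::now();
    for (int c = ith; c < total_chunks; c += nth) {   // static chunk assignment in this sketch
        process(c);
        st.nchunks++;
    }
    const auto t1 = std::chrono::steady_clock::now();
    st.usec = std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
    return st;
}

int main() {
    const auto st = run_chunks(/*ith=*/0, /*nth=*/6, /*total_chunks=*/38, [](int){});
    printf("thread-0: nchunks %d usec %lld\n", st.nchunks, st.usec);
    return 0;
}
```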
Left column: after this PR. Right column: before this PR.
thread-2: Qcur-11 nchunks 38 usec 1844 thread-4: Qcur-11 nchunks 6 usec 1496
thread-3: Qcur-11 nchunks 38 usec 1844 thread-0: Qcur-11 nchunks 6 usec 1498
thread-4: Qcur-11 nchunks 17 usec 1874 thread-5: Qcur-11 nchunks 3 usec 1597
thread-5: Qcur-11 nchunks 17 usec 1948 thread-1: Qcur-11 nchunks 3 usec 1640
thread-1: Qcur-11 nchunks 17 usec 1894 thread-2: Qcur-11 nchunks 3 usec 1685
thread-0: Qcur-11 nchunks 17 usec 1876 thread-3: Qcur-11 nchunks 3 usec 1718
thread-4: Vcur-11 nchunks 17 usec 607 thread-5: Vcur-11 nchunks 6 usec 508
thread-5: Vcur-11 nchunks 17 usec 638 thread-4: Vcur-11 nchunks 6 usec 515
thread-2: Vcur-11 nchunks 39 usec 617 thread-0: Vcur-11 nchunks 3 usec 547
thread-1: Vcur-11 nchunks 15 usec 618 thread-2: Vcur-11 nchunks 3 usec 548
thread-0: Vcur-11 nchunks 17 usec 630 thread-1: Vcur-11 nchunks 3 usec 564
thread-3: Vcur-11 nchunks 39 usec 617 thread-3: Vcur-11 nchunks 3 usec 596
thread-5: Kcur-11 nchunks 38 usec 611 thread-5: Kcur-11 nchunks 6 usec 484
thread-1: Kcur-11 nchunks 17 usec 615 thread-0: Kcur-11 nchunks 6 usec 490
thread-0: Kcur-11 nchunks 17 usec 617 thread-1: Kcur-11 nchunks 3 usec 547
thread-2: Kcur-11 nchunks 17 usec 628 thread-3: Kcur-11 nchunks 3 usec 548
thread-4: Kcur-11 nchunks 38 usec 611 thread-4: Kcur-11 nchunks 3 usec 557
thread-3: Kcur-11 nchunks 17 usec 649 thread-2: Kcur-11 nchunks 3 usec 547
thread-3: attn_out-11 nchunks 38 usec 1835 thread-4: attn_out-11 nchunks 6 usec 1567
thread-5: attn_out-11 nchunks 38 usec 1847 thread-5: attn_out-11 nchunks 6 usec 1569
thread-0: attn_out-11 nchunks 17 usec 1880 thread-1: attn_out-11 nchunks 3 usec 1637
thread-4: attn_out-11 nchunks 17 usec 1886 thread-2: attn_out-11 nchunks 3 usec 1639
thread-1: attn_out-11 nchunks 17 usec 1890 thread-3: attn_out-11 nchunks 3 usec 1642
thread-2: attn_out-11 nchunks 17 usec 1897 thread-0: attn_out-11 nchunks 3 usec 1649
thread-3: ffn_gate-11 nchunks 38 usec 4886 thread-5: ffn_gate-11 nchunks 6 usec 4103
thread-2: ffn_gate-11 nchunks 38 usec 4887 thread-4: ffn_gate-11 nchunks 6 usec 4141
thread-5: ffn_gate-11 nchunks 17 usec 4992 thread-0: ffn_gate-11 nchunks 3 usec 4298
thread-1: ffn_gate-11 nchunks 17 usec 5010 thread-1: ffn_gate-11 nchunks 3 usec 4357
thread-4: ffn_gate-11 nchunks 17 usec 5010 thread-2: ffn_gate-11 nchunks 3 usec 4373
thread-0: ffn_gate-11 nchunks 17 usec 5032 thread-3: ffn_gate-11 nchunks 3 usec 4447
thread-5: ffn_up-11 nchunks 38 usec 4908 thread-0: ffn_up-11 nchunks 6 usec 4107
thread-3: ffn_up-11 nchunks 38 usec 4909 thread-5: ffn_up-11 nchunks 6 usec 4129
thread-4: ffn_up-11 nchunks 17 usec 5000 thread-1: ffn_up-11 nchunks 3 usec 4362
thread-0: ffn_up-11 nchunks 17 usec 5005 thread-4: ffn_up-11 nchunks 3 usec 4377
thread-1: ffn_up-11 nchunks 17 usec 5008 thread-3: ffn_up-11 nchunks 3 usec 4400
thread-2: ffn_up-11 nchunks 17 usec 5037 thread-2: ffn_up-11 nchunks 3 usec 4381
thread-5: ffn_out-11 nchunks 38 usec 4924 thread-5: ffn_out-11 nchunks 6 usec 4089
thread-4: ffn_out-11 nchunks 38 usec 4928 thread-0: ffn_out-11 nchunks 6 usec 4089
thread-1: ffn_out-11 nchunks 17 usec 5006 thread-3: ffn_out-11 nchunks 3 usec 4386
thread-2: ffn_out-11 nchunks 17 usec 5010 thread-2: ffn_out-11 nchunks 3 usec 4386
thread-0: ffn_out-11 nchunks 17 usec 5011 thread-4: ffn_out-11 nchunks 3 usec 4414
thread-3: ffn_out-11 nchunks 17 usec 5023 thread-1: ffn_out-11 nchunks 3 usec 4391
That's way too many chunks, and we burn a lot of time on synchronization. If you have an idea for a quick fix that you can test on LFM2, please start another PR and I'll verify it on my setup. Make sure to test with Llama3.2 and Qwen3 models with the instrumented code.
Mmm, let's revert this then. I will reopen the branch as a draft PR so we can work on a better solution; I'd rather not introduce a regression upstream. @ggerganov Mind doing the revert?