
CPU SIMD and pipeline optimizations across vec/mmq/ops/kv-cache/repack

Open · NoahOksuz opened this issue 1 month ago · 1 comment

Summary

I was bored in some lectures last week, so I scoured the repo for optimisable/improvable parts. This PR accelerates multiple hot paths in ggml-cpu via multi‑ISA SIMD, better threading/cache locality, and tighter inner loops. It touches vector activations, quantization, normalization kernels, the KV cache, and repack paths.

  • Vector ops: SIMD hardswish/hardsigmoid; improved trailing‑element handling (sketch below)
  • Matmul/quant: parallelized A‑quant for cache locality and core utilization
  • Norms: SIMD reductions in RMSNorm (fwd/bwd), GroupNorm, L2 norm
  • KV cache: reordered conditions, hoisted invariants, simplified mask generation
  • Repack: SIMD absolute‑max for generic quant flows (Q8_0 4x4/4x8, Q8_K 4x8)

Architectures: AVX512/AVX2/SSE2 (x86), NEON/SVE (ARM), RVV (RISC‑V), with scalar fallbacks.
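
For context, this is a minimal sketch of the vectorize-plus-clean-scalar-tail pattern used for the activation kernels, shown for hardswish with AVX2 only. The function name and structure are illustrative, not the actual vec.cpp code, which covers the other ISAs as well.

```cpp
// Sketch (AVX2): hardswish y = x * clamp((x + 3) / 6, 0, 1), with a scalar tail.
#include <immintrin.h>
#include <algorithm>

static void hardswish_f32(const int n, float * y, const float * x) {
    int i = 0;
#if defined(__AVX2__)
    const __m256 three = _mm256_set1_ps(3.0f);
    const __m256 sixth = _mm256_set1_ps(1.0f/6.0f);
    const __m256 zero  = _mm256_setzero_ps();
    const __m256 one   = _mm256_set1_ps(1.0f);
    for (; i + 8 <= n; i += 8) {
        const __m256 v = _mm256_loadu_ps(x + i);
        // t = clamp((x + 3) * (1/6), 0, 1)
        __m256 t = _mm256_mul_ps(_mm256_add_ps(v, three), sixth);
        t = _mm256_min_ps(_mm256_max_ps(t, zero), one);
        _mm256_storeu_ps(y + i, _mm256_mul_ps(v, t));
    }
#endif
    // clean scalar tail for the remaining elements
    for (; i < n; ++i) {
        const float t = std::min(1.0f, std::max(0.0f, (x[i] + 3.0f)/6.0f));
        y[i] = x[i]*t;
    }
}
```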

Changes by area

  • vec (ggml/src/ggml-cpu/vec.cpp, vec.h)
    • Added SIMD implementations for hardswish/hardsigmoid across ISAs
    • Reduced overhead for tails (clean scalar tails or single‑width fallbacks)
  • mmq (mmq.cpp)
    • Parallelized quantization of matrix A; chunked to preserve cache locality and reduce contention (see the chunking sketch after this list)
  • ops (ggml/src/ggml-cpu/ops.cpp)
    • RMSNorm forward: SIMD sum‑of‑squares (reduction sketch after this list)
    • RMSNorm backward: SIMD for sum‑of‑squares + dot
    • L2 norm: SIMD reduction
    • GroupNorm: SIMD sum and sum‑of‑squares
  • KV cache (llama-kv-cache.cpp)
    • Condition reordering for better branch prediction
    • Hoisted frequently accessed values outside inner loops
    • Simplified mask generation logic
  • repack (ggml/src/ggml-cpu/repack.cpp)
    • SIMD absolute‑max in generic quant functions for Q8 paths (sketch after the performance numbers)
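
The A-quantization split works roughly like this: rows are assigned to the existing worker threads in contiguous chunks, so each thread reads and writes a local block of the quantized buffer. A hedged sketch of the pattern follows; the function and parameter names (quantize_row, CHUNK_ROWS, etc.) are illustrative, not the mmq.cpp code.

```cpp
// Sketch: chunked parallel quantization of matrix A across nth worker threads.
// Thread ith handles contiguous row chunks, keeping reads/writes local and
// avoiding fine-grained contention on the output buffer.
#include <algorithm>
#include <cstddef>

constexpr int CHUNK_ROWS = 16;   // illustrative chunk size

void quantize_mat_a(const float * A, void * A_q, int nrows, int ncols,
                    size_t row_q_size, int ith, int nth,
                    void (*quantize_row)(const float *, void *, int)) {
    const int nchunks = (nrows + CHUNK_ROWS - 1) / CHUNK_ROWS;
    // static assignment: thread ith takes chunks ith, ith+nth, ith+2*nth, ...
    for (int c = ith; c < nchunks; c += nth) {
        const int r0 = c * CHUNK_ROWS;
        const int r1 = std::min(r0 + CHUNK_ROWS, nrows);
        for (int r = r0; r < r1; ++r) {
            quantize_row(A + (size_t)r * ncols,
                         (char *)A_q + (size_t)r * row_q_size, ncols);
        }
    }
}
```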

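The norm changes all follow the same shape: accumulate in SIMD registers, reduce horizontally, and finish any remainder with a scalar tail. A minimal AVX2 sketch of the RMSNorm forward sum‑of‑squares (illustrative only; the ops.cpp code also covers AVX512/NEON/SVE/RVV and the eps/scaling pass):

```cpp
// Sketch (AVX2): sum of squares for RMSNorm forward, with a horizontal
// reduction and a scalar tail. The caller computes scale = 1/sqrt(sum/n + eps).
#include <immintrin.h>

static float sum_of_squares_f32(const float * x, int n) {
    int i = 0;
    float sum = 0.0f;
#if defined(__AVX2__)
    __m256 acc = _mm256_setzero_ps();
    for (; i + 8 <= n; i += 8) {
        const __m256 v = _mm256_loadu_ps(x + i);
        acc = _mm256_add_ps(acc, _mm256_mul_ps(v, v));   // acc += v*v
    }
    // horizontal add of the 8 lanes
    __m128 lo = _mm256_castps256_ps128(acc);
    __m128 hi = _mm256_extractf128_ps(acc, 1);
    lo = _mm_add_ps(lo, hi);
    lo = _mm_hadd_ps(lo, lo);
    lo = _mm_hadd_ps(lo, lo);
    sum = _mm_cvtss_f32(lo);
#endif
    for (; i < n; ++i) {
        sum += x[i]*x[i];   // scalar tail
    }
    return sum;
}
```
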
Performance (CPU backend)

A/B vs prior commit (53d7d21e6) shows:

  • ADD_ID: up to ~3.8x (shape‑dependent), commonly 1.3–2.0x
  • MUL_MAT / MUL_MAT_ID (quantized paths): many cases 1.2–3.0x; f16/f32 often +5–30%
  • FLASH_ATTN_EXT: frequent 1.2–1.7x gains; a few small‑shape regressions
  • PAD_REFLECT_1D: ~2–6x
  • CPY / SOFT_MAX / CONV2D: mixed; many +5–30%, some regressions (−10–40%) on specific shapes

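For reference, the repack change replaces the scalar absolute‑max search in the generic Q8 quantization helpers with a vectorized one. A hedged AVX2 sketch of that pattern (not the repack.cpp code, which also has NEON/SVE/RVV variants):

```cpp
// Sketch (AVX2): absolute max over a block of floats, used to derive the
// quantization scale d = amax / 127 in Q8-style quantization.
// Scalar tail handles lengths that are not a multiple of 8.
#include <immintrin.h>
#include <cmath>

static float abs_max_f32(const float * x, int n) {
    int i = 0;
    float amax = 0.0f;
#if defined(__AVX2__)
    const __m256 sign_mask = _mm256_set1_ps(-0.0f);   // sign bit in each lane
    __m256 vmax = _mm256_setzero_ps();
    for (; i + 8 <= n; i += 8) {
        const __m256 v = _mm256_andnot_ps(sign_mask, _mm256_loadu_ps(x + i)); // |x|
        vmax = _mm256_max_ps(vmax, v);
    }
    // reduce 8 lanes to one
    __m128 m = _mm_max_ps(_mm256_castps256_ps128(vmax), _mm256_extractf128_ps(vmax, 1));
    m = _mm_max_ps(m, _mm_movehl_ps(m, m));
    m = _mm_max_ss(m, _mm_shuffle_ps(m, m, 1));
    amax = _mm_cvtss_f32(m);
#endif
    for (; i < n; ++i) {
        amax = std::fmax(amax, std::fabs(x[i]));   // scalar tail
    }
    return amax;
}
```
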
NoahOksuz · Nov 08 '25 21:11

I suggest using Add suggestion to batch and commit them all at once (from Files changed).

CISC · Nov 08 '25 22:11