
CPU SIMD and pipeline optimizations across vec/mmq/ops/kv-cache/repack

Open · NoahOksuz opened this issue 1 month ago · 1 comment

Summary

I was bored in some lectures last week, so I scoured the repo for optimisable/improvable parts. This PR accelerates multiple hot paths in ggml-cpu via multi‑ISA SIMD, better threading/cache locality, and tighter inner loops. It touches vector activations, quantization, normalization kernels, the KV cache, and repack paths.

  • Vector ops: SIMD hardswish/hardsigmoid; improved trailing‑element handling (sketch below)
  • Matmul/quant: parallelized A‑quant for cache locality and core utilization
  • Norms: SIMD reductions in RMSNorm (fwd/bwd), GroupNorm, L2 norm
  • KV cache: reordered conditions, hoisted invariants, simplified mask generation
  • Repack: SIMD absolute‑max for generic quant flows (Q8_0 4x4/4x8, Q8_K 4x8)

Architectures: AVX512/AVX2/SSE2 (x86), NEON/SVE (ARM), RVV (RISC‑V), with scalar fallbacks.
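
For context, this is a minimal sketch of the vectorize-plus-clean-scalar-tail pattern used for the activation kernels, shown for hardswish with AVX2 only. The function name and structure are illustrative, not the actual vec.cpp code, which covers the other ISAs as well.

```cpp
// Sketch (AVX2): hardswish y = x * clamp((x + 3) / 6, 0, 1), with a scalar tail.
#include <immintrin.h>
#include <algorithm>

static void hardswish_f32(const int n, float * y, const float * x) {
    int i = 0;
#if defined(__AVX2__)
    const __m256 three = _mm256_set1_ps(3.0f);
    const __m256 sixth = _mm256_set1_ps(1.0f/6.0f);
    const __m256 zero  = _mm256_setzero_ps();
    const __m256 one   = _mm256_set1_ps(1.0f);
    for (; i + 8 <= n; i += 8) {
        const __m256 v = _mm256_loadu_ps(x + i);
        // t = clamp((x + 3) * (1/6), 0, 1)
        __m256 t = _mm256_mul_ps(_mm256_add_ps(v, three), sixth);
        t = _mm256_min_ps(_mm256_max_ps(t, zero), one);
        _mm256_storeu_ps(y + i, _mm256_mul_ps(v, t));
    }
#endif
    // clean scalar tail for the remaining elements
    for (; i < n; ++i) {
        const float t = std::min(1.0f, std::max(0.0f, (x[i] + 3.0f)/6.0f));
        y[i] = x[i]*t;
    }
}
```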

Changes by area

  • vec (ggml/src/ggml-cpu/vec.cpp, vec.h)
    • Added SIMD implementations for hardswish/hardsigmoid across ISAs
    • Reduced overhead for tails (clean scalar tails or single‑width fallbacks)
  • mmq (mmq.cpp)
    • Parallelized quantization of matrix A; chunked to preserve cache locality and reduce contention (see the chunking sketch after this list)
  • ops (ggml/src/ggml-cpu/ops.cpp)
    • RMSNorm forward: SIMD sum‑of‑squares (reduction sketch after this list)
    • RMSNorm backward: SIMD for sum‑of‑squares + dot
    • L2 norm: SIMD reduction
    • GroupNorm: SIMD sum and sum‑of‑squares
  • KV cache (llama-kv-cache.cpp)
    • Condition reordering for better branch prediction
    • Hoisted frequently accessed values outside inner loops
    • Simplified mask generation logic
  • repack (ggml/src/ggml-cpu/repack.cpp)
    • SIMD absolute‑max in generic quant functions for Q8 paths (sketch after the performance numbers)
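
The A-quantization split works roughly like this: rows are assigned to the existing worker threads in contiguous chunks, so each thread reads and writes a local block of the quantized buffer. A hedged sketch of the pattern follows; the function and parameter names (quantize_row, CHUNK_ROWS, etc.) are illustrative, not the mmq.cpp code.

```cpp
// Sketch: chunked parallel quantization of matrix A across nth worker threads.
// Thread ith handles contiguous row chunks, keeping reads/writes local and
// avoiding fine-grained contention on the output buffer.
#include <algorithm>
#include <cstddef>

constexpr int CHUNK_ROWS = 16;   // illustrative chunk size

void quantize_mat_a(const float * A, void * A_q, int nrows, int ncols,
                    size_t row_q_size, int ith, int nth,
                    void (*quantize_row)(const float *, void *, int)) {
    const int nchunks = (nrows + CHUNK_ROWS - 1) / CHUNK_ROWS;
    // static assignment: thread ith takes chunks ith, ith+nth, ith+2*nth, ...
    for (int c = ith; c < nchunks; c += nth) {
        const int r0 = c * CHUNK_ROWS;
        const int r1 = std::min(r0 + CHUNK_ROWS, nrows);
        for (int r = r0; r < r1; ++r) {
            quantize_row(A + (size_t)r * ncols,
                         (char *)A_q + (size_t)r * row_q_size, ncols);
        }
    }
}
```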

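The norm changes all follow the same shape: accumulate in SIMD registers, reduce horizontally, and finish any remainder with a scalar tail. A minimal AVX2 sketch of the RMSNorm forward sum‑of‑squares (illustrative only; the ops.cpp code also covers AVX512/NEON/SVE/RVV and the eps/scaling pass):

```cpp
// Sketch (AVX2): sum of squares for RMSNorm forward, with a horizontal
// reduction and a scalar tail. The caller computes scale = 1/sqrt(sum/n + eps).
#include <immintrin.h>

static float sum_of_squares_f32(const float * x, int n) {
    int i = 0;
    float sum = 0.0f;
#if defined(__AVX2__)
    __m256 acc = _mm256_setzero_ps();
    for (; i + 8 <= n; i += 8) {
        const __m256 v = _mm256_loadu_ps(x + i);
        acc = _mm256_add_ps(acc, _mm256_mul_ps(v, v));   // acc += v*v
    }
    // horizontal add of the 8 lanes
    __m128 lo = _mm256_castps256_ps128(acc);
    __m128 hi = _mm256_extractf128_ps(acc, 1);
    lo = _mm_add_ps(lo, hi);
    lo = _mm_hadd_ps(lo, lo);
    lo = _mm_hadd_ps(lo, lo);
    sum = _mm_cvtss_f32(lo);
#endif
    for (; i < n; ++i) {
        sum += x[i]*x[i];   // scalar tail
    }
    return sum;
}
```
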
Performance (CPU backend)

A/B vs prior commit (53d7d21e6) shows:

  • ADD_ID: up to ~3.8x (shape‑dependent), commonly 1.3–2.0x
  • MUL_MAT / MUL_MAT_ID (quantized paths): many cases 1.2–3.0x; f16/f32 often +5–30%
  • FLASH_ATTN_EXT: frequent 1.2–1.7x gains; a few small‑shape regressions
  • PAD_REFLECT_1D: ~2–6x
  • CPY / SOFT_MAX / CONV2D: mixed; many +5–30%, some regressions (−10–40%) on specific shapes

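For reference, the repack change replaces the scalar absolute‑max search in the generic Q8 quantization helpers with a vectorized one. A hedged AVX2 sketch of that pattern (not the repack.cpp code, which also has NEON/SVE/RVV variants):

```cpp
// Sketch (AVX2): absolute max over a block of floats, used to derive the
// quantization scale d = amax / 127 in Q8-style quantization.
// Scalar tail handles lengths that are not a multiple of 8.
#include <immintrin.h>
#include <cmath>

static float abs_max_f32(const float * x, int n) {
    int i = 0;
    float amax = 0.0f;
#if defined(__AVX2__)
    const __m256 sign_mask = _mm256_set1_ps(-0.0f);   // sign bit in each lane
    __m256 vmax = _mm256_setzero_ps();
    for (; i + 8 <= n; i += 8) {
        const __m256 v = _mm256_andnot_ps(sign_mask, _mm256_loadu_ps(x + i)); // |x|
        vmax = _mm256_max_ps(vmax, v);
    }
    // reduce 8 lanes to one
    __m128 m = _mm_max_ps(_mm256_castps256_ps128(vmax), _mm256_extractf128_ps(vmax, 1));
    m = _mm_max_ps(m, _mm_movehl_ps(m, m));
    m = _mm_max_ss(m, _mm_shuffle_ps(m, m, 1));
    amax = _mm_cvtss_f32(m);
#endif
    for (; i < n; ++i) {
        amax = std::fmax(amax, std::fabs(x[i]));   // scalar tail
    }
    return amax;
}
```
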
NoahOksuz · Nov 08 '25 21:11

I suggest using Add suggestion to batch and commit them all at once (from Files changed).

CISC · Nov 08 '25 22:11