CPU SIMD and pipeline optimizations across vec/mmq/ops/kv-cache/repack
Summary
I was really bored in some lectures last week, so I scoured the repo for optimizable parts. This PR accelerates multiple hot paths in ggml-cpu via multi‑ISA SIMD, better threading/cache locality, and tighter inner loops. It touches vector activations, quantization, normalization kernels, the KV cache, and the repack paths.
- Vector ops: SIMD hardswish/hardsigmoid; improved trailing‑element handling
- Matmul/quant: parallelized A‑quant for cache locality and core utilization
- Norms: SIMD reductions in RMSNorm (fwd/bwd), GroupNorm, L2 norm
- KV cache: reordered conditions, hoisted invariants, simplified mask generation
- Repack: SIMD absolute‑max for generic quant flows (Q8_0 4x4/4x8, Q8_K 4x8)
Architectures: AVX512/AVX2/SSE2 (x86), NEON/SVE (ARM), RVV (RISC‑V), with scalar fallbacks.
Changes by area
- vec (`ggml/src/ggml-cpu/vec.cpp`, `vec.h`)
  - Added SIMD implementations for hardswish/hardsigmoid across ISAs (sketch below)
  - Reduced overhead for tails (clean scalar tails or single‑width fallbacks)
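As a rough illustration of the pattern (my own standalone function, not the exact kernel in `vec.cpp`): hardswish(x) = x · min(max(x + 3, 0), 6) / 6, vectorized with AVX2, with a plain scalar loop for the trailing elements.

```c
#include <immintrin.h>
#include <stddef.h>

// Illustrative only: AVX2 hardswish with a clean scalar tail for n % 8 elements.
static void hardswish_f32_avx2(const float * x, float * y, size_t n) {
    const __m256 three = _mm256_set1_ps(3.0f);
    const __m256 six   = _mm256_set1_ps(6.0f);
    const __m256 inv6  = _mm256_set1_ps(1.0f/6.0f);
    const __m256 zero  = _mm256_setzero_ps();

    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 v = _mm256_loadu_ps(x + i);
        // relu6(x + 3)
        __m256 t = _mm256_min_ps(_mm256_max_ps(_mm256_add_ps(v, three), zero), six);
        // x * relu6(x + 3) / 6
        _mm256_storeu_ps(y + i, _mm256_mul_ps(_mm256_mul_ps(v, t), inv6));
    }
    // clean scalar tail
    for (; i < n; ++i) {
        float t = x[i] + 3.0f;
        t = t < 0.0f ? 0.0f : (t > 6.0f ? 6.0f : t);
        y[i] = x[i]*t*(1.0f/6.0f);
    }
}
```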
- mmq (`mmq.cpp`)
  - Parallelized quantization of matrix A, chunked to preserve locality and reduce contention (sketch below)
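A minimal sketch of the chunking idea, assuming a ggml-style `ith`/`nth` thread split; `quantize_row` and all names here are placeholders, not the actual `mmq.cpp` code:

```c
#include <stdint.h>
#include <stddef.h>

// Illustrative chunked parallel quantization of A: each thread quantizes one
// contiguous block of rows, so reads/writes stay cache-local and threads
// never interleave writes to the same region.
static void quantize_mat_a_chunked(
        const float * A, char * A_q,
        int64_t nrows, int64_t ncols,
        size_t row_q_size, // bytes per quantized row
        void (*quantize_row)(const float * src, void * dst, int64_t k),
        int ith, int nth) { // thread index / thread count
    const int64_t per_thread = (nrows + nth - 1)/nth;
    const int64_t ir0 = ith*per_thread;
    const int64_t ir1 = ir0 + per_thread < nrows ? ir0 + per_thread : nrows;

    for (int64_t ir = ir0; ir < ir1; ++ir) {
        quantize_row(A + ir*ncols, A_q + ir*row_q_size, ncols);
    }
}
```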
- ops (`ggml/src/ggml-cpu/ops.cpp`)
  - RMSNorm forward: SIMD sum of squares (sketch below)
  - RMSNorm backward: SIMD for sum of squares + dot product
  - L2 norm: SIMD reduction
  - GroupNorm: SIMD sum and sum of squares
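For context, the RMSNorm forward reduction is just a sum of squares (the caller then uses `scale = 1/sqrt(sum/n + eps)`). A hedged AVX2 sketch of that reduction, again my own standalone function rather than the `ops.cpp` code:

```c
#include <immintrin.h>
#include <stddef.h>

// Illustrative AVX2 sum of squares: vector accumulate, horizontal reduction,
// scalar tail for the remaining elements.
static float sum_squares_f32_avx2(const float * x, size_t n) {
    __m256 acc = _mm256_setzero_ps();
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 v = _mm256_loadu_ps(x + i);
        acc = _mm256_add_ps(acc, _mm256_mul_ps(v, v));
    }
    // horizontal reduction of the 8 partial sums
    __m128 s = _mm_add_ps(_mm256_castps256_ps128(acc), _mm256_extractf128_ps(acc, 1));
    s = _mm_hadd_ps(s, s);
    s = _mm_hadd_ps(s, s);
    float sum = _mm_cvtss_f32(s);
    for (; i < n; ++i) {
        sum += x[i]*x[i]; // scalar tail
    }
    return sum;
}
```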
- KV cache (`llama-kv-cache.cpp`)
  - Reordered conditions for better branch prediction
  - Hoisted frequently accessed values out of inner loops (sketch below)
  - Simplified mask-generation logic
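To make the hoisting concrete, a schematic mask-generation loop with hypothetical types and array names (not the actual `llama-kv-cache.cpp` code):

```cpp
#include <cmath>
#include <cstdint>

// Illustrative: per-cell values are loaded once outside the inner loop, and
// the cheap sequence check is tested before the positional (causal) check.
void build_mask(const int32_t * cell_pos, const int32_t * cell_seq, int64_t n_kv,
                const int32_t * tok_pos,  const int32_t * tok_seq,  int64_t n_tokens,
                bool causal, float * mask /* [n_tokens][n_kv] */) {
    for (int64_t i = 0; i < n_kv; ++i) {
        const int32_t p = cell_pos[i]; // hoisted: read once per cell
        const int32_t s = cell_seq[i];
        for (int64_t j = 0; j < n_tokens; ++j) {
            const bool masked = (s != tok_seq[j]) || (causal && p > tok_pos[j]);
            mask[j*n_kv + i] = masked ? -INFINITY : 0.0f;
        }
    }
}
```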
- repack (`ggml/src/ggml-cpu/repack.cpp`)
  - SIMD absolute‑max in the generic quant functions for the Q8 paths (sketch below)
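The absolute max is what sets the per-block scale (e.g. `d = amax/127` for Q8). A NEON-flavored sketch of the reduction, illustrative rather than the `repack.cpp` code:

```c
#include <arm_neon.h>
#include <math.h>
#include <stddef.h>

// Illustrative NEON (AArch64) absolute-max reduction with a scalar tail.
static float absmax_f32_neon(const float * x, size_t n) {
    float32x4_t amax = vdupq_n_f32(0.0f);
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        amax = vmaxq_f32(amax, vabsq_f32(vld1q_f32(x + i)));
    }
    float r = vmaxvq_f32(amax); // horizontal max across the 4 lanes (AArch64)
    for (; i < n; ++i) {
        r = fmaxf(r, fabsf(x[i])); // scalar tail
    }
    return r;
}
```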
Performance (CPU backend)
An A/B comparison against the prior commit (53d7d21e6) shows:
- ADD_ID: up to ~3.8x (shape‑dependent), commonly 1.3–2.0x
- MUL_MAT / MUL_MAT_ID (quantized paths): many cases 1.2–3.0x; f16/f32 often +5–30%
- FLASH_ATTN_EXT: frequent 1.2–1.7x gains; a few small‑shape regressions
- PAD_REFLECT_1D: ~2–6x
- CPY / SOFT_MAX / CONV2D: mixed; many +5–30%, some regressions (−10–40%) on specific shapes
Reviewers: I suggest using "Add suggestion" in the Files changed tab so suggested edits can be batched and committed all at once.