gemma.cpp
Near-term roadmap
We're sharing a roadmap of ideas for improving and speeding up Gemma. If you'd like to join in and help us get there faster, please reach out so we can coordinate :)
Threading
- [x] (jan-wassenberg) detect: total #logical, per-logical: package, chiplet, core, smt
- [x] (jan-wassenberg) detect: CPU name, L2D/L3 size
- [x] (Z.A.) CCX-aware pinning - ready, awaiting Highway 1.2 release
- [ ] (jan-wassenberg) more efficient ThreadPool
- [x] command line arg to disable pinning
- [x] detect NUMA
Dot product
- [x] Add `_mm*_dpbf16_ps` to `HWY_AVX3_SPR` and `HWY_AVX3_ZEN4` targets, plus define `HWY_NATIVE_DOT_BF16` in `set_macros-inl.h`
- [x] Faster SFP decode via table lookup
- [x] Add new `NEON_*` target that uses `vbfdot` for `ReorderWidenMulAccumulate`
- [x] If `!defined(HWY_NATIVE_DOT_BF16) || !HWY_NATIVE_DOT_BF16`, decompress bf16->f32 to temp array before MatVec (idea by Samuel, thank you!) - in #166
- [x] Apply even/odd trick to SFP
Matmul
- [x] (pculliton) implement basic matmul and test. Not using BLAS because we want to fuse matmul and decompression.
- [x] (pculliton) 4x4 unrolled and vectorized matmul
- [x] (szabadka, B.B.) Update Prefill to use matmul (activation @ weights) instead of MatVec. Almost there.
- [x] Fused decompression inside matmul
- [x] Support offsets within the matrix, required by some call sites
- [x] (jan-wassenberg) Decompress weights to bf16 when native
- [ ] (jan-wassenberg) Cache-aware tiling/packing
- [ ] (jan-wassenberg) NUMA aware
- [ ] (jan-wassenberg) 64-bit precision
- [x] (B.B.) Larger batch size
- [x] (A.V.) Avoid allocations for decompression
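The 4x4 unrolled kernel mentioned above can be sketched in scalar form; the real kernel is vectorized with Highway and fuses weight decompression, so this only illustrates the register-blocking and accumulation pattern:

```cpp
#include <cstddef>

// 4x4 register-blocked micro-kernel sketch: C(4x4) += A(4xK) * B(Kx4).
// A is row-major with stride K; B is row-major with stride ldb; C is 4x4.
// Accumulating into a local acc[][] keeps the hot values in registers.
void MatMul4x4(const float* A, const float* B, float* C, size_t K,
               size_t ldb) {
  float acc[4][4] = {};
  for (size_t k = 0; k < K; ++k) {
    const float b0 = B[k * ldb + 0], b1 = B[k * ldb + 1],
                b2 = B[k * ldb + 2], b3 = B[k * ldb + 3];
    for (size_t i = 0; i < 4; ++i) {
      const float a = A[i * K + k];
      acc[i][0] += a * b0;
      acc[i][1] += a * b1;
      acc[i][2] += a * b2;
      acc[i][3] += a * b3;
    }
  }
  for (size_t i = 0; i < 4; ++i)
    for (size_t j = 0; j < 4; ++j) C[i * 4 + j] += acc[i][j];
}
```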
Compression
- [x] (pculliton, A.R.) Eval infrastructure
- [x] (A.R.) Arbiter model for eval
- [ ] (Ray) add metadata to tensors, add TOC to BlobStore, remove RawWeights
- [ ] decide whether NUQ is launchable
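For context on the NUQ question: nonuniform quantization decodes each stored byte through a small per-tensor codebook. A hypothetical sketch (the actual SFP/NUQ formats have their own layouts and codebook construction):

```cpp
#include <cstddef>
#include <cstdint>

// Table-based decode for a nonuniform-quantized tensor: each 8-bit code
// indexes a 256-entry f32 codebook. The codebook contents are hypothetical;
// real formats derive them from the tensor's value distribution.
void DecodeTable(const uint8_t* codes, size_t n, const float table[256],
                 float* out) {
  for (size_t i = 0; i < n; ++i) out[i] = table[codes[i]];
}
```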
Optimizations
- [x] Replace attention matVec with matmul - requires reshaping a matrix
- [ ] Convert f32 activations to bf16 beforehand if `HWY_NATIVE_DOT_BF16`
- [ ] Integrate wraparound support into matmul
- [x] Fuse softmax and sampling
- [ ] Vectorize RoPE
- [ ] Faster/more accurate hwy/contrib/math functions by updating the polynomials
- [ ] Vectorize RMSNorm
- [ ] (A.R.?, ...) Smaller KVCache: bf16, possibly reorder for better locality
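One way to fuse softmax and sampling (not necessarily gemma.cpp's approach) is the Gumbel-max trick: taking the argmax of each logit plus independent Gumbel noise draws an index from softmax(logits) in a single pass, without ever materializing the normalized probabilities.

```cpp
#include <cmath>
#include <cstddef>
#include <random>

// Gumbel-max sampling: argmax_i(logit_i + g_i), with g_i ~ Gumbel(0, 1),
// is distributed exactly as a draw from softmax(logits). This fuses the
// softmax normalization and the sampling step into one scan.
size_t SampleGumbelMax(const float* logits, size_t n, std::mt19937& rng) {
  std::uniform_real_distribution<float> uniform(1e-20f, 1.0f);
  size_t best = 0;
  float best_val = -INFINITY;
  for (size_t i = 0; i < n; ++i) {
    const float gumbel = -std::log(-std::log(uniform(rng)));
    const float val = logits[i] + gumbel;
    if (val > best_val) {
      best_val = val;
      best = i;
    }
  }
  return best;
}
```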
Usability
- [ ] warn if unknown arguments given. std::map of known arg names?
- [ ] infer model/weight type from weights filename, to avoid requiring extra flags
- [x] multiple .cc files to speed up builds
- [ ] Actionable error codes as return values: kLoadFailed, kSeqTooShort
- [x] move eval/test files to tests/
- [ ] Ctrl+C signal handler to ensure profiler results are printed without requiring %q input
- [ ] add --prompt flag to run.cc
- [ ] random prompt generation for debug_prompt.cc
- [ ] store ModelInfo in weights BlobStore
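The unknown-argument warning could look like the following sketch: keep a set of known flag names and warn on anything else. The flag names used here are hypothetical examples, not gemma.cpp's actual flags.

```cpp
#include <cstdio>
#include <set>
#include <string>
#include <vector>

// Warn about any --name or --name=value argument whose name is not in the
// known set; returns the unknown names so callers can decide how to react.
std::vector<std::string> WarnUnknownArgs(const std::vector<std::string>& args,
                                         const std::set<std::string>& known) {
  std::vector<std::string> unknown;
  for (const std::string& arg : args) {
    if (arg.rfind("--", 0) == 0) {  // starts with "--"
      const std::string name = arg.substr(2, arg.find('=') - 2);
      if (known.count(name) == 0) {
        std::fprintf(stderr, "Warning: unknown argument --%s\n", name.c_str());
        unknown.push_back(name);
      }
    }
  }
  return unknown;
}
```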
New models
- [x] (Daniel) Support PaliGemma
General infra
- [x] (pculliton) Python wrapper
- [x] (pculliton, ...) Improved CI: run on Kaggle infra
- [x] AuxOut to hold timing info instead of printing in GenerateImpl.
- [x] Sampling struct holds rng and temperature, to reduce length of args
- [x] (P. C.) use new HWY_EXPORT_T to simplify dispatch - ready, awaiting Highway 1.2 release
Making good progress :)
Is PaliGemma part of the scope of gemma.cpp?
Let's discuss in #185 :)