gemma.cpp
Near-term roadmap
We're sharing a roadmap of ideas for improving and speeding up Gemma. If you'd like to join in and help us get there faster, please reach out so we can coordinate :)
Threading
- [x] (jan-wassenberg) detect: total #logical, per-logical: package, chiplet, core, smt
- [x] (jan-wassenberg) detect: CPU name, L2D/L3 size
- [x] (Z.A.) CCX-aware pinning - ready, awaiting Highway 1.2 release
- [ ] (jan-wassenberg) more efficient ThreadPool
- [x] command line arg to disable pinning
- [x] detect NUMA
Dot product
- [x] Add `_mm*_dpbf16_ps` to `HWY_AVX3_SPR` and `HWY_AVX3_ZEN4` targets, plus define `HWY_NATIVE_DOT_BF16` in `set_macros-inl.h`
- [x] Faster SFP decode via table lookup
- [x] Add new `NEON_*` target that uses `vbfdot` for `ReorderWidenMulAccumulate`
- [x] If `!defined(HWY_NATIVE_DOT_BF16) || !HWY_NATIVE_DOT_BF16`, decompress bf16->f32 to temp array before MatVec (idea by Samuel, thank you!) - in #166
- [x] Apply even/odd trick to SFP
Matmul
- [x] (pculliton) implement basic matmul and test. Not using BLAS because we want to fuse matmul and decompression.
- [x] (pculliton) 4x4 unrolled and vectorized matmul
- [x] (szabadka, B.B.) Update Prefill to use matmul (activation @ weights) instead of MatVec. Almost there.
- [x] Fused decompression inside matmul
- [x] Support offsets within the matrix, required by some call sites
- [x] (jan-wassenberg) Decompress weights to bf16 when native
- [ ] (jan-wassenberg) Cache-aware tiling/packing
- [ ] (jan-wassenberg) NUMA aware
- [ ] (jan-wassenberg) 64-bit precision
- [x] (B.B.) Larger batch size
- [x] (A.V.) Avoid allocations for decompression
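The 4x4 unrolled kernel mentioned above can be sketched in scalar form; the real kernel is vectorized with Highway and fuses weight decompression, so this only illustrates the register-blocking and accumulation pattern:

```cpp
#include <cstddef>

// 4x4 register-blocked micro-kernel sketch: C(4x4) += A(4xK) * B(Kx4).
// A is row-major with stride K; B is row-major with stride ldb; C is 4x4.
// Accumulating into a local acc[][] keeps the hot values in registers.
void MatMul4x4(const float* A, const float* B, float* C, size_t K,
               size_t ldb) {
  float acc[4][4] = {};
  for (size_t k = 0; k < K; ++k) {
    const float b0 = B[k * ldb + 0], b1 = B[k * ldb + 1],
                b2 = B[k * ldb + 2], b3 = B[k * ldb + 3];
    for (size_t i = 0; i < 4; ++i) {
      const float a = A[i * K + k];
      acc[i][0] += a * b0;
      acc[i][1] += a * b1;
      acc[i][2] += a * b2;
      acc[i][3] += a * b3;
    }
  }
  for (size_t i = 0; i < 4; ++i)
    for (size_t j = 0; j < 4; ++j) C[i * 4 + j] += acc[i][j];
}
```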
Compression
- [x] (pculliton, A.R.) Eval infrastructure
- [x] (A.R.) Arbiter model for eval
- [ ] (Ray) add metadata to tensors, add TOC to BlobStore, remove RawWeights
- [ ] decide whether NUQ is launchable
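For context on the NUQ question: nonuniform quantization decodes each stored byte through a small per-tensor codebook. A hypothetical sketch (the actual SFP/NUQ formats have their own layouts and codebook construction):

```cpp
#include <cstddef>
#include <cstdint>

// Table-based decode for a nonuniform-quantized tensor: each 8-bit code
// indexes a 256-entry f32 codebook. The codebook contents are hypothetical;
// real formats derive them from the tensor's value distribution.
void DecodeTable(const uint8_t* codes, size_t n, const float table[256],
                 float* out) {
  for (size_t i = 0; i < n; ++i) out[i] = table[codes[i]];
}
```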
Optimizations
- [x] Replace attention matVec with matmul - requires reshaping a matrix
- [ ] Convert f32 activations to bf16 beforehand if `HWY_NATIVE_DOT_BF16`
- [ ] Integrate wraparound support into matmul
- [x] Fuse softmax and sampling
- [ ] Vectorize RoPE
- [ ] Faster/more accurate hwy/contrib/math functions by updating the polynomials
- [ ] Vectorize RMSNorm
- [ ] (A.R.?, ...) Smaller KVCache: bf16, possibly reorder for better locality
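One way to fuse softmax and sampling (not necessarily gemma.cpp's approach) is the Gumbel-max trick: taking the argmax of each logit plus independent Gumbel noise draws an index from softmax(logits) in a single pass, without ever materializing the normalized probabilities.

```cpp
#include <cmath>
#include <cstddef>
#include <random>

// Gumbel-max sampling: argmax_i(logit_i + g_i), with g_i ~ Gumbel(0, 1),
// is distributed exactly as a draw from softmax(logits). This fuses the
// softmax normalization and the sampling step into one scan.
size_t SampleGumbelMax(const float* logits, size_t n, std::mt19937& rng) {
  std::uniform_real_distribution<float> uniform(1e-20f, 1.0f);
  size_t best = 0;
  float best_val = -INFINITY;
  for (size_t i = 0; i < n; ++i) {
    const float gumbel = -std::log(-std::log(uniform(rng)));
    const float val = logits[i] + gumbel;
    if (val > best_val) {
      best_val = val;
      best = i;
    }
  }
  return best;
}
```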
Usability
- [ ] warn if unknown arguments given. std::map of known arg names?
- [ ] infer model/weight type from weights filename, to avoid requiring extra flags
- [x] multiple .cc files to speed up builds
- [ ] Actionable error codes as return values: kLoadFailed, kSeqTooShort
- [x] move eval/test files to tests/
- [ ] Ctrl+C signal handler to ensure profiler results are printed without requiring %q input
- [ ] add --prompt flag to run.cc
- [ ] random prompt generation for debug_prompt.cc
- [ ] store ModelInfo in weights BlobStore
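The unknown-argument warning could look like the following sketch: keep a set of known flag names and warn on anything else. The flag names used here are hypothetical examples, not gemma.cpp's actual flags.

```cpp
#include <cstdio>
#include <set>
#include <string>
#include <vector>

// Warn about any --name or --name=value argument whose name is not in the
// known set; returns the unknown names so callers can decide how to react.
std::vector<std::string> WarnUnknownArgs(const std::vector<std::string>& args,
                                         const std::set<std::string>& known) {
  std::vector<std::string> unknown;
  for (const std::string& arg : args) {
    if (arg.rfind("--", 0) == 0) {  // starts with "--"
      const std::string name = arg.substr(2, arg.find('=') - 2);
      if (known.count(name) == 0) {
        std::fprintf(stderr, "Warning: unknown argument --%s\n", name.c_str());
        unknown.push_back(name);
      }
    }
  }
  return unknown;
}
```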
New models
- [x] (Daniel) Support PaliGemma
General infra
- [x] (pculliton) Python wrapper
- [x] (pculliton, ...) Improved CI: run on Kaggle infra
- [x] AuxOut to hold timing info instead of printing in GenerateImpl.
- [x] Sampling struct holds rng and temperature, to reduce length of args
- [x] (P. C.) use new HWY_EXPORT_T to simplify dispatch - ready, awaiting Highway 1.2 release
Making good progress :)
Is PaliGemma part of the scope of gemma.cpp?
Let's discuss in #185 :)