compilade
> [!NOTE]
> Some changes made between now and merging could possibly require re-converting Mamba models to GGUF.
> I'll announce it in this note if/when it happens.

This should...
As promised in , I've been extracting the advanced batch splits out of the Jamba PR (#7531). I've also backported the contiguous allocation of recurrent state slots, which makes it...
This implements dequantization in Python (using NumPy) for `Q4_0`, `Q4_1`, `Q5_0`, `Q5_1`, `Q2_K`, `Q3_K`, `Q4_K`, `Q5_K`, `Q6_K`, `IQ2_XXS`, `IQ2_XS`, `IQ2_S`, `IQ3_XXS`, `IQ3_S`, `IQ1_S`, `IQ1_M`, `IQ4_NL`, and `IQ4_XS`, resulting in the...
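For illustration, a minimal NumPy sketch of the simplest of these formats, `Q4_0` (not the PR's actual code; it assumes the usual 18-byte block layout of a float16 scale followed by 16 bytes of packed 4-bit quants):

```python
import numpy as np

QK4_0 = 32  # elements per Q4_0 block

def dequantize_q4_0(data: np.ndarray) -> np.ndarray:
    # data: flat uint8 array of whole Q4_0 blocks (18 bytes each)
    blocks = data.reshape(-1, 2 + QK4_0 // 2)
    d = blocks[:, :2].copy().view(np.float16).astype(np.float32)  # per-block scale, shape (n, 1)
    qs = blocks[:, 2:]                                            # packed 4-bit quants, shape (n, 16)
    # low nibbles hold the first 16 elements of a block, high nibbles the last 16
    q = np.concatenate([qs & 0x0F, qs >> 4], axis=1).astype(np.int8) - 8
    return (d * q.astype(np.float32)).reshape(-1)
```

e.g. `dequantize_q4_0(np.frombuffer(raw, dtype=np.uint8))` on the raw bytes of a `Q4_0` tensor.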
This adds `1.6875 bpw` and `2.0625 bpw` quant types for TriLMs and BitNet b1.58 models. For now, these are named `TQ1_0` and `TQ2_0`, respectively. I had given glimpses of this...
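As a back-of-the-envelope check of where those numbers come from (my assumptions here: one float16 scale per 256-element block, 4 ternary values per byte for `TQ2_0`, and mostly 5 ternary values per byte for `TQ1_0`, since 3^5 = 243 fits in a byte):

```python
BLOCK = 256  # elements per block

# TQ2_0: 2 bits per weight (4 values per byte) plus a float16 scale
tq2_0_bytes = BLOCK // 4 + 2
print(tq2_0_bytes * 8 / BLOCK)  # 2.0625 bpw

# TQ1_0: 5 ternary values per byte for most of the block,
# a small remainder packed 4 per byte, plus a float16 scale
tq1_0_bytes = 240 // 5 + 16 // 4 + 2
print(tq1_0_bytes * 8 / BLOCK)  # 1.6875 bpw
```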
Follow-up from . This should fix #7727 and fix #8519. I've implemented [the fully recurrent mode](https://github.com/state-spaces/mamba/blob/62db608da60f6fc790b8ed9f4b3225e95ca15fde/mamba_ssm/modules/mamba2.py#L311-L322) of Mamba-2, because it's very similar to Mamba-1, and also because it seems like...
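As a rough NumPy sketch of what a single step of that fully recurrent mode looks like for one head (the SSD formulation with a scalar per-head decay; shapes and names are illustrative, not the PR's actual kernel):

```python
import numpy as np

def mamba2_step(H, x_t, dt, A, B_t, C_t, D):
    # H:        (d_head, d_state)  recurrent state
    # x_t:      (d_head,)          input for this timestep
    # dt:       scalar             discretization step (after softplus)
    # A:        scalar             per-head decay parameter (negative)
    # B_t, C_t: (d_state,)         input/output projections for this timestep
    # D:        scalar             skip connection
    decay = np.exp(dt * A)                    # scalar decay shared across the whole head
    H = decay * H + np.outer(dt * x_t, B_t)   # state update
    y_t = H @ C_t + D * x_t                   # readout
    return H, y_t
```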
This adds support for Jamba (fixes #6372). To complement `llama_kv_cache`, I propose to add `llama_rs_cache`, as well as a top-level `llama_past` to more easily manage both at once. The...
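Purely as a conceptual sketch of that split (written in Python for brevity; the real proposal is C++ structs inside llama.cpp, and every field and method below is hypothetical):

```python
class KVCache:      # stands in for llama_kv_cache (attention layers)
    def __init__(self):
        self.seqs = {}

class RSCache:      # stands in for the proposed llama_rs_cache (recurrent layers)
    def __init__(self):
        self.seqs = {}

class Past:         # stands in for the proposed llama_past
    def __init__(self):
        self.kv = KVCache()
        self.rs = RSCache()

    def seq_rm(self, seq_id):
        # hybrid models like Jamba have both layer types,
        # so removing a sequence has to clear it from both caches
        self.kv.seqs.pop(seq_id, None)
        self.rs.seqs.pop(seq_id, None)
```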
Follow-up from . Using GGUF as the format for `imatrix` files will be useful for further experiments (e.g. with [L²QER](https://github.com/ggerganov/llama.cpp/discussions/8831)) and compatibility with existing or future GGUF tooling (e.g. GGUF...
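One concrete thing this buys: an `imatrix` file stored as GGUF can be opened with the existing `GGUFReader` from gguf-py. A rough sketch (the file name is illustrative, and I'm not asserting the exact metadata keys or tensor names of the new format here):

```python
from gguf import GGUFReader

# Inspect an imatrix stored as GGUF with the existing gguf-py reader.
reader = GGUFReader("imatrix.gguf")
for tensor in reader.tensors:
    print(tensor.name, tensor.shape)
```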
I'm using Nix devShells for my development, most often with `nix develop .#default-extra`.

# Problem

I wanted to use `gguf-dump` with some model using the wrapper which that devShell puts...
This makes `GGUFWriter.write_tensors_to_file` use a thread pool to write the tensors in parallel. This should help make conversion faster, since previously nothing else happened while reading, processing, or writing the...
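As a minimal sketch of the general pattern, not `GGUFWriter`'s actual code (`prepare` is a hypothetical callable that reads/converts one tensor's bytes): worker threads prepare the data while the main thread writes the results in the original tensor order.

```python
from concurrent.futures import ThreadPoolExecutor

def write_tensors(fout, tensors, prepare, num_workers=4):
    # Prepare tensor data on worker threads; keep writes in file order.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        futures = [pool.submit(prepare, t) for t in tensors]
        for fut in futures:
            fout.write(fut.result())
```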