compilade
> [!NOTE]
> Some changes made between now and merging could possibly require re-converting Mamba models to GGUF.
> I'll announce it in this note if/when it happens.

This should...
As promised in , I've been extracting the advanced batch splits out of the Jamba PR (#7531). I've also backported the contiguous allocation of recurrent state slots, which makes it...
This implements dequantization in Python (using NumPy) for `Q4_0`, `Q4_1`, `Q5_0`, `Q5_1`, `Q2_K`, `Q3_K`, `Q4_K`, `Q5_K`, `Q6_K`, `IQ2_XXS`, `IQ2_XS`, `IQ2_S`, `IQ3_XXS`, `IQ3_S`, `IQ1_S`, `IQ1_M`, `IQ4_NL`, and `IQ4_XS`, resulting in the...
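For illustration, a minimal NumPy sketch of the simplest of these formats, `Q4_0` (not the PR's actual code; it assumes the usual 18-byte block layout of a float16 scale followed by 16 bytes of packed 4-bit quants):

```python
import numpy as np

QK4_0 = 32  # elements per Q4_0 block

def dequantize_q4_0(data: np.ndarray) -> np.ndarray:
    # data: flat uint8 array of whole Q4_0 blocks (18 bytes each)
    blocks = data.reshape(-1, 2 + QK4_0 // 2)
    d = blocks[:, :2].copy().view(np.float16).astype(np.float32)  # per-block scale, shape (n, 1)
    qs = blocks[:, 2:]                                            # packed 4-bit quants, shape (n, 16)
    # low nibbles hold the first 16 elements of a block, high nibbles the last 16
    q = np.concatenate([qs & 0x0F, qs >> 4], axis=1).astype(np.int8) - 8
    return (d * q.astype(np.float32)).reshape(-1)
```

e.g. `dequantize_q4_0(np.frombuffer(raw, dtype=np.uint8))` on the raw bytes of a `Q4_0` tensor.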
This adds `1.6875 bpw` and `2.0625 bpw` quant types for TriLMs and BitNet b1.58 models. For now, these are named `TQ1_0` and `TQ2_0`, respectively. I had given glimpses of this...
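As a back-of-the-envelope check of where those numbers come from (my assumptions here: one float16 scale per 256-element block, 4 ternary values per byte for `TQ2_0`, and mostly 5 ternary values per byte for `TQ1_0`, since 3^5 = 243 fits in a byte):

```python
BLOCK = 256  # elements per block

# TQ2_0: 2 bits per weight (4 values per byte) plus a float16 scale
tq2_0_bytes = BLOCK // 4 + 2
print(tq2_0_bytes * 8 / BLOCK)  # 2.0625 bpw

# TQ1_0: 5 ternary values per byte for most of the block,
# a small remainder packed 4 per byte, plus a float16 scale
tq1_0_bytes = 240 // 5 + 16 // 4 + 2
print(tq1_0_bytes * 8 / BLOCK)  # 1.6875 bpw
```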
Follow-up from . This should fix #7727 and fix #8519. I've implemented [the fully recurrent mode](https://github.com/state-spaces/mamba/blob/62db608da60f6fc790b8ed9f4b3225e95ca15fde/mamba_ssm/modules/mamba2.py#L311-L322) of Mamba-2, because it's very similar to Mamba-1, and also because it seems like...
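As a rough NumPy sketch of what a single step of that fully recurrent mode looks like for one head (the SSD formulation with a scalar per-head decay; shapes and names are illustrative, not the PR's actual kernel):

```python
import numpy as np

def mamba2_step(H, x_t, dt, A, B_t, C_t, D):
    # H:        (d_head, d_state)  recurrent state
    # x_t:      (d_head,)          input for this timestep
    # dt:       scalar             discretization step (after softplus)
    # A:        scalar             per-head decay parameter (negative)
    # B_t, C_t: (d_state,)         input/output projections for this timestep
    # D:        scalar             skip connection
    decay = np.exp(dt * A)                    # scalar decay shared across the whole head
    H = decay * H + np.outer(dt * x_t, B_t)   # state update
    y_t = H @ C_t + D * x_t                   # readout
    return H, y_t
```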
This adds support for Jamba (fixes #6372). To complement `llama_kv_cache`, I propose to add `llama_rs_cache`, as well as a top-level `llama_past` to more easily manage both at once. The...
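Purely as a conceptual sketch of that split (written in Python for brevity; the real proposal is C++ structs inside llama.cpp, and every field and method below is hypothetical):

```python
class KVCache:      # stands in for llama_kv_cache (attention layers)
    def __init__(self):
        self.seqs = {}

class RSCache:      # stands in for the proposed llama_rs_cache (recurrent layers)
    def __init__(self):
        self.seqs = {}

class Past:         # stands in for the proposed llama_past
    def __init__(self):
        self.kv = KVCache()
        self.rs = RSCache()

    def seq_rm(self, seq_id):
        # hybrid models like Jamba have both layer types,
        # so removing a sequence has to clear it from both caches
        self.kv.seqs.pop(seq_id, None)
        self.rs.seqs.pop(seq_id, None)
```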
Follow-up from . Using GGUF as the format for `imatrix` files will be useful for further experiments (e.g. with [L²QER](https://github.com/ggerganov/llama.cpp/discussions/8831)) and compatibility with existing or future GGUF tooling (e.g. GGUF...
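One concrete thing this buys: an `imatrix` file stored as GGUF can be opened with the existing `GGUFReader` from gguf-py. A rough sketch (the file name is illustrative, and I'm not asserting the exact metadata keys or tensor names of the new format here):

```python
from gguf import GGUFReader

# Inspect an imatrix stored as GGUF with the existing gguf-py reader.
reader = GGUFReader("imatrix.gguf")
for tensor in reader.tensors:
    print(tensor.name, tensor.shape)
```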
I'm using Nix devShells for my development, most often with `nix develop .#default-extra`.

# Problem

I wanted to use `gguf-dump` with some model using the wrapper which that devShell puts...
This makes `GGUFWriter.write_tensors_to_file` use a thread pool to write the tensors in parallel. This should help make conversion faster, since previously nothing else happened while reading, processing, or writing the...
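As a minimal sketch of the general pattern, not `GGUFWriter`'s actual code (`prepare` is a hypothetical callable that reads/converts one tensor's bytes): worker threads prepare the data while the main thread writes the results in the original tensor order.

```python
from concurrent.futures import ThreadPoolExecutor

def write_tensors(fout, tensors, prepare, num_workers=4):
    # Prepare tensor data on worker threads; keep writes in file order.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        futures = [pool.submit(prepare, t) for t in tensors]
        for fut in futures:
            fout.write(fut.result())
```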