imatrix : use GGUF to store importance matrices
Follow-up from https://github.com/ikawrakow/ik_llama.cpp/discussions/15#discussioncomment-10465793.
Using GGUF as the format for imatrix files will be useful for further experiments (e.g. with L²QER) and compatibility with existing or future GGUF tooling (e.g. GGUF previews on HuggingFace, graphical GGUF viewer(s) https://github.com/ggerganov/llama.cpp/issues/6715, some kind of gguf-diff, etc.).
There are multiple problems with `imatrix` which this PR addresses:
- Ad-hoc format which isn't really readable by other projects (and which has no way to be extended backward-compatibly except by adding more stuff at the end)
- Non-deterministic tensor order depending on `unordered_map` iteration order (makes `sha256sum` useless to compare `imatrix` files made on the same dataset)
- Broken behavior at small `-ub` (intermediate saves happen waaay too often)
- Can't use a bigger batch size than the chunk size
Summary of changes
- Use GGUF to store `imatrix` data.
  - `general.type` is `imatrix`
  - no `general.architecture`
    - can't really know the architecture from old `imatrix` files.
  - store `*.sums` and `*.counts` for each tensor with imatrix data (a minimal reading sketch follows this list).
    - `*.sums` are the sums of activations
      - Stored in `F32`, like before.
    - `*.counts` are the number of activations (also the number of tokens), useful to calculate the mean
      - Why not simply store the mean? To allow merging `imatrix` files together with `--in-file`.
      - It's stored in `F32` even though it holds integer values, because when calculating the mean it would be converted to `F32` anyway to perform the division.
- Add `convert_legacy_imatrix_to_gguf.py` to convert old `imatrix.dat` files to `imatrix.gguf`
- Like `llama-perplexity` since #5946, allow computing multiple chunks per batch with `llama-imatrix`
  - This should be useful for huge models like Llama-405B when they don't fit completely in RAM.
- Use fused multiply-add (with `std::fma`) when accumulating the sums of activations
  - Shouldn't hurt to somewhat reduce rounding errors
  - (obviously `f64` would be even better, but I'm not sure it's worth it yet. For the curious, using `double` for the intermediate accumulations can be tried by changing only one line in `IMatrixStats`: `vector<float> values` to `vector<double> values`.)
- Sort the tensor names before serializing
  - This makes the tensor order deterministic, because otherwise it depended on the iteration order of `unordered_map`.
  - Determinism between runs means `sha256sum` can be meaningfully used to compare `imatrix` files generated in very similar conditions.
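To make the layout above concrete, here is a minimal reading sketch (not part of this PR; the `*.sums`/`*.counts` naming, the output filename `imatrix.gguf`, and the one-count-per-row shape are assumptions based on the description above) using the `gguf` Python package from `gguf-py`:

```python
# Sketch: reconstruct per-channel means from an imatrix GGUF file.
import numpy as np
from gguf import GGUFReader  # provided by gguf-py in the llama.cpp repository

reader = GGUFReader("imatrix.gguf")          # hypothetical output file name
tensors = {t.name: t for t in reader.tensors}

for name, sums in tensors.items():
    if not name.endswith(".sums"):
        continue
    counts = tensors.get(name[: -len(".sums")] + ".counts")
    if counts is None:
        continue
    s = np.asarray(sums.data, dtype=np.float32)
    n = np.asarray(counts.data, dtype=np.float32).reshape(-1)
    # Assumes one count per row of sums (per expert for MoE tensors, a single
    # value otherwise); the max() guards against never-activated experts.
    mean = s.reshape(n.size, -1) / np.maximum(n, 1.0)[:, None]
    print(name, mean.shape)
```

Since both `*.sums` and `*.counts` are plain F32 tensors, merging two imatrix files amounts to element-wise addition of the corresponding tensors, which is why the mean itself is not stored.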
TODO
- [ ] Compare old `llama-quantize` with old `imatrix.dat` against new `llama-quantize` with converted `imatrix.gguf`
  - Seemed to work, but need to re-test. The resulting quantized model(s) should have the same `sha256sum`.
- [x] Test new `llama-imatrix` at different batch sizes
  - Same checksums with `-ub 64 -b 512` and `-ub 512 -b 2048` for a chunk size of 512 (`-c 512`)
- [ ] Perplexity test(s) with i-quants with old `llama-imatrix` vs new `llama-imatrix`
- [ ] Test with MoE models (perplexity with i-quants should be in the same ballpark as before)
- [ ] Test `--in-file` with `llama-imatrix`
- [ ] (maybe) Implement cleaner `general.architecture` exclusion.
  - Currently, this uses a subclass to make `self.add_architecture()` a no-op, but maybe `general.architecture` should simply be excluded when `self.arch == ""`. Not sure how to prevent using the other `self.add_*` (in `GGUFWriter`) which expect `self.arch` to be something.
  - Or maybe the architecture should be included?
    - What about conversions from older `imatrix.dat` files?
- [x] I have read the contributing guidelines
- Self-reported review complexity:
  - [x] Medium
I'm setting this to "draft", because of concerns by @ikawrakow in https://github.com/ikawrakow/ik_llama.cpp/discussions/15#discussioncomment-10615399 and https://github.com/ikawrakow/ik_llama.cpp/discussions/15#discussioncomment-10626253 (mostly related to the fact that GGUF is harder to parse than imatrix.dat files).
More details near the end of https://github.com/ikawrakow/ik_llama.cpp/discussions/15#discussioncomment-10632253.
I'll need some days to think about how to go further with this.
@compilade This is a good change and I think it would be useful to bring it to completion.
In the future, we can extend libllama with an interface for saving/loading imatrix data. This way the implementation for reading and writing the imatrix data would be localized in libllama and can be kept in-sync more easily. This can be combined with the refactoring of llama_model_quantize_params to not pass C++ objects.
Thank you for working on this, I've been thinking that storing imatrix as GGUF would be nice for investigating the use of gradients instead of activations.
I guess unless there are any objections from @bartowski1182 or @danielhanchen we can merge this after the weekend?
Objections? I've been looking forward to this change for months haha
Oh this looks fantastic great work! Re using bigger batch sizes - does this mean if memory allows, imatrix should be in fact faster to process via PP?
I'll try this out over the month, but maybe in the meantime, I'll temporarily use the legacy format - but overall this change is very welcome! Great work @compilade !
Re using bigger batch sizes - does this mean if memory allows, imatrix should be in fact faster to process via PP?
@danielhanchen
Currently, with `llama-imatrix` from the master branch, the chunk size is tied to the ubatch size, which means setting a `-ub` different from the chunk size leads to some broken behavior.
This PR makes it possible to process multiple chunks in a single ubatch, or to use multiple ubatches per chunk; they are no longer tied together. It also technically allows variable chunk sizes (which will be useful to eventually implement proper chat dataset handling).
Using bigger ubatch sizes can be useful for models which don't fit in RAM, to load the weights less often (assuming mmap is used, since the weights are read once per ubatch anyway).
Not sure how the GPU backends would handle that situation (with very large models), although when the model does fit, this still allows benefiting from pipeline parallelism (as `llama-perplexity` already does) by calculating multiple ubatches per chunk or multiple chunks per (u?)batch.
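For example, with a chunk size of 512 (`-c 512`), `-ub 64` splits each chunk across 512 / 64 = 8 ubatches, while `-b 2048` allows up to 2048 / 512 = 4 chunks to be evaluated in a single batch.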
but maybe in the meantime, I'll temporarily use the legacy format
You'll still get most of the advantages described above even with the old format; both are saved from the same internal data (which was changed to better fit the new format).
The main benefits of the new GGUF-based imatrix format (which, for now, is only used when the output imatrix file has a `.gguf` suffix) are saner handling of MoE models, especially when merging imatrix files with different chunk sizes, and readability by GGUF tooling (e.g. HF previews, `gguf-dump`, etc.).
While a round-trip conversion is possible, the legacy format contains insufficient shape and counts information for correctly merging imatrix data from MoE models. That's not really a problem when using one-off imatrix files on a single chunk size, though.
(llama-quantize should be able to read both formats)
Oh, there's another behavior which differs: when a tensor sub-matrix was not solicited by a dataset (e.g. a MoE model with unsolicited experts), the old format skips the entire tensor (even when e.g. 95% of the experts were solicited), while the new format keeps it, and it's in `llama-quantize` that the zeros are handled, by using imatrix weights of `1` for that sub-matrix, which, from my experiments around #12557, seems reasonable and mostly matches what the quantization functions do without `imatrix`. This should allow avoiding problems like #12913.
Partial data is not handled at read time for the legacy format because of insufficient shape information. (It could be handled at write time (as in @nicoboss's fork), but that workaround would incorrectly affect merges of MoE `imatrix` files (by adding a `1` value to the squared activations of unused experts, even when those experts are used in another merged `imatrix` file), although arguably the current behavior (dropping the data for the entire tensor) is also wrong. Both problems cannot be fixed simultaneously without a different format than the legacy one.)
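To illustrate the weighting behavior described above, here is a rough sketch (illustration only, not the actual `llama-quantize` code; the shapes and the helper name are assumptions) of how partial MoE data could be weighted neutrally at read time:

```python
# Illustration: give unsolicited experts a neutral weight of 1 instead of
# dropping the whole tensor, so they quantize roughly as without an imatrix.
import numpy as np

def imatrix_weights(sums: np.ndarray, counts: np.ndarray) -> np.ndarray:
    """sums: (n_expert, n_channel) accumulated activations; counts: (n_expert,) tokens seen."""
    weights = np.ones_like(sums)                              # neutral default
    active = counts > 0
    weights[active] = sums[active] / counts[active][:, None]  # mean where data exists
    return weights
```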
Using bigger ubatch sizes can be useful for models which don't fit in RAM, to load the weights less often (assuming mmap is used, since the weights are read once per ubatch anyway).
Not sure how the GPU backends would handle that situation (with very large models), although when the model does fit, this still allows benefiting from pipeline parallelism (as `llama-perplexity` already does) by calculating multiple ubatches per chunk or multiple chunks per (u?)batch.
Using a larger batch size will also help on GPU backends for models that don't fit in VRAM, since it reduces the number of times that the weights have to be copied to VRAM. However, usage of the eval callback prevents taking advantage of pipeline parallelism, since after every matrix multiplication there is a full synchronization to copy the results of the operation to the CPU.
Thanks a lot for creating this amazing new imatrix file format and generally improving imatrix computation by a lot. I'm very excited that partial data caused by missing expert activation is now handled properly, thanks to the new file format.
One of the most impactful changes of this PR seems to be imatrix support for 3D tensors. This finally allows generating imatrix quants for models using MLA such as DeepSeek (V2, V2-Lite, V2.5, V3, R1), MiniCPM3-4B, PLM-1.8B, KwaiCoder-DS-V2-Lite, hpc-coder-v2-6b, and whale-v3-base-marged without the Fix imatrix calculation for MLA models patch. This change surprisingly wasn't even mentioned in the PR description.
TODO: 4d? (is that even used in practice?)
No, there is currently no practical use case for 4D tensors, nor do I think there will ever be one. The most dimensions currently required are 3D tensors for MLA.
Oh, there's another behavior which differs: when a tensor sub-matrix was not solicited by a dataset (e.g. a MoE model with unsolicited experts), the old format skips the entire tensor (even when e.g. 95% of the experts were solicited), while the new format keeps it, and it's in `llama-quantize` that the zeros are handled, by using imatrix weights of `1` for that sub-matrix, which, from my experiments around #12557, seems reasonable and mostly matches what the quantization functions do without `imatrix`. This should allow avoiding problems like #12913.
Partial data is not handled at read time for the legacy format because of insufficient shape information. (It could be handled at write time (as in @nicoboss's fork), but that workaround would incorrectly affect merges of MoE `imatrix` files (by adding a `1` value to the squared activations of unused experts, even when those experts are used in another merged `imatrix` file), although arguably the current behavior (dropping the data for the entire tensor) is also wrong. Both problems cannot be fixed simultaneously without a different format than the legacy one.)
The solution in @nicoboss's fork was inspired by https://github.com/ikawrakow/ik_llama.cpp/pull/202 which does mention this concern (and to me seems to agree with the approach taken here):
Strictly speaking it would be better to leave the zeros in the imatrix data of experts that have never been activated. But this would require to go and add proper protection against all-zeros imatrices, along with the appropriate corrective action, for all quants, and not just for IQ1_S_R4 as I did in https://github.com/ikawrakow/ik_llama.cpp/pull/191. So, for now we go with same-importance columns for never activated experts.
Really looking forward to this PR being merged into master!
In the meantime, you may already know this but passing along a tip shared by @David-AU-github in here that has worked for me when dealing with imatrices with partial activations in MoEs: increase the model's number of active experts (if KV override is supported), then calib / imatrix.
To address some feedback I got recently, I've added a warning when writing using the legacy format so that it's more obvious what is happening.
save_imatrix: saving to legacy imatrix format because output suffix is not .gguf
I've also added back the warnings for partial data for the new format, because it can still be useful to know that is happening, even if the data is not omitted (partial data is handled at read-time in llama-quantize, this allows both correct imatrix.gguf combining and weighting missing data neutrally).
And to make the old format a bit more equivalent in quality to the new format (except when combining multiple imatrix.dat files with --in-file), I've made it write 1 values where the evaluation count is zero, a bit like in nicoboss's fork, but without modifying the internal data (and so intermediate saving will not affect the final result). (This is different from my previous stance in the last paragraph of https://github.com/ggml-org/llama.cpp/pull/9400#issuecomment-3043442903, because I realized dropping data would also affect combining imatrix files, and since most people don't combine imatrix files, having the same behavior as the new format in the most common use case is saner.)
I've also removed the need to load a model when converting between formats (it was already kind of like this when combining imatrix files), and so the following should be possible:
[!WARNING]
The syntax has changed in https://github.com/ggml-org/llama.cpp/pull/14842
$ ./bin/llama-imatrix --in-file imatrix.dat -o imatrix.gguf
$ ./bin/llama-imatrix --in-file imatrix.gguf -o imatrix-roundtrip.dat
$ ./bin/llama-imatrix --in-file imatrix-roundtrip.dat -o imatrix-roundtrip.gguf
Note that shape information for evaluation counts of MoE tensors is missing from legacy imatrix files, and so it will also be missing from the converted imatrix.gguf file, except if more data is provided or if it's merged with a fresh imatrix.gguf file of the same model. (it will still work with llama-quantize, even when the evaluation count shape is flattened; GGUF makes it easy to support that)
Preserving the shape of evaluation counts is partly why it's recommended to use .gguf for newly-generated imatrix files.
The forced suffix of .gguf for GGUF-based imatrix files might be controversial, but a .gguf suffix is necessary for HuggingFace to display its GGUF previews anyway (even though technically GGUF has a magic header and so it can be identified from its contents). This restriction will likely be removed once writing to the old format isn't supported anymore (in a future PR, not this one).
Since the old format doesn't have a magic header, llama-quantize will always try to load imatrix files as GGUF first, and fallback to the legacy format when it fails (this means the filename suffix of imatrix files technically doesn't matter at load time).
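For reference, the detection order can be pictured with a small sketch (an assumption about how such a check could look, not the loader's actual code): GGUF files begin with the 4-byte magic `GGUF`, while legacy imatrix.dat files have no magic at all.

```python
# Sketch: a GGUF file starts with the magic bytes b"GGUF"; anything else is
# treated as a legacy imatrix.dat file (which has no magic to check).
def is_gguf(path: str) -> bool:
    with open(path, "rb") as f:
        return f.read(4) == b"GGUF"
```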
@compilade Thanks a lot for your hard work. I'm really looking forward to this PR getting merged! Everything is perfect now in my opinion.
To address some feedback I got recently, I've added a warning when writing using the legacy format so that it's more obvious what is happening.
Thanks a lot for listening to our feedback and adding this warning. This should be enough to warn users that accidentally still use the legacy imatrix.dat file format after this is merged.
I've also added back the warnings for partial data for the new format, because it can still be useful to know that is happening, even if the data is not omitted (partial data is handled at read-time in llama-quantize, this allows both correct `imatrix.gguf` combining and weighting missing data neutrally).
Thanks a lot! This is super useful to judge the quality of the imatrix and the imatrix dataset, as a great imatrix dataset should cover more experts than a bad one (unfortunately there are always cases where even the best imatrix dataset can't cover them all, as training the router to eventually make use of all experts seems quite hard and so was not done well for all MoE models).
I've made it write 1 values where the evaluation count is zero, a bit like in https://github.com/nicoboss/llama.cpp/pull/1, but without modifying the internal data (and so intermediate saving will not affect the final result).
That's super cool. I hated how my patch broke intermediate saving, both by affecting the result and by hiding how many experts are covered from future saves, which is why we had to disable intermediate saves. This is such an elegant solution!
I've also removed the need to load a model when converting between formats
I really appreciate and find it super cool how easily conversion between the legacy imatrix.dat and the new imatrix.gguf file format is possible. Not only that, you went out of your way to make both backwards and forwards compatibility as good as possible for everyone.
The forced suffix of .gguf for GGUF-based imatrix files might be controversial, but a .gguf suffix is necessary for HuggingFace to display its GGUF previews anyway (even though technically GGUF has a magic header and so it can be identified from its contents).
With there now being a warning if someone doesn't specify the .gguf suffix, I find this design choice acceptable, especially given that you have valid backwards compatibility reasons for it to be this way.
This restriction will likely be removed once writing to the old format isn't supported anymore (in a future PR, not this one).
I'm looking forward to that. I appreciate that you give everyone time to slowly adopt the new file format. Please just don't forget to eventually drop writing support for the legacy imatrix.dat file format and allow arbitrary imatrix file name suffixes.
Since the old format doesn't have a magic header, llama-quantize will always try to load imatrix files as GGUF first, and fallback to the legacy format when it fails (this means the filename suffix of imatrix files technically doesn't matter at load time).
So the .gguf suffix is only forced at write time, and the file can be renamed afterwards since loading always tries GGUF first. That's really nice. Not that there really is any reason not to have them end with .gguf given that they are GGUF files, but it's great that we can name them any way we want.
Echoing @nicoboss' sentiment. This is a very nice enhancement @compilade. Thank you.
I've been testing by running different permutations of options, including roundtrips on each test, and comparing the resulting stats. As far as I can tell, everything checks out!
The only, minor, observation is that when converting an existing file to the new format, a gguf_init_from_file_impl: invalid magic characters: '????', expected 'GGUF' warning is displayed. Other than that, it works like a charm!
The only, minor, observation is that when converting an existing file to the new format, a `gguf_init_from_file_impl: invalid magic characters: '????', expected 'GGUF'` warning is displayed. Other than that, it works like a charm!
Unavoidable, but a vast improvement from previous behaviour, see #14381. :)
I released a mainline imatrix for Kimi-K2-Instruct using this PR here if anyone is looking for it, given it is challenging to compute: https://huggingface.co/ubergarm/Kimi-K2-Instruct-GGUF?show_file_info=mainline%2Fimatrix-mainline-pr9400-plus-kimi-k2-942c55cd5-Kimi-K2-Instruct-Q8_0.gguf (interestingly, the imatrix.gguf shows up in the HF model tensor viewer)
Feel free to use it for cooking your own custom mainline quants. I also have a version for ik_llama.cpp if that is your thing.
Might be cool to look inside it with Ed's imatrix stats tool: https://github.com/ggml-org/llama.cpp/pull/12718
Thanks!
@compilade Time to merge this (and adapt #12718 afterwards)?
Assuming no additional changes on this PR, the enhanced version of #12718 is ready to go as soon as this one is merged
@compilade Time to merge this (and adapt #12718 afterwards)?
@CISC Sure. I hope I've tested enough edge cases. Will merge at 16:00 UTC on 2025-07-19 (in around 10 hours), to give some buffer for last-minute problems.
(sorry for the delayed reply; recently made changes to my home network, now got symmetric fiber Internet)