imatrix : use GGUF to store importance matrices
Follow-up from https://github.com/ikawrakow/ik_llama.cpp/discussions/15#discussioncomment-10465793.
Using GGUF as the format for imatrix files will be useful for further experiments (e.g. with L²QER) and compatibility with existing or future GGUF tooling (e.g. GGUF previews on HuggingFace, graphical GGUF viewer(s) https://github.com/ggerganov/llama.cpp/issues/6715, some kind of gguf-diff, etc.).
There are multiple problems with `imatrix` which this PR addresses:
- Ad-hoc format which isn't really readable by other projects (and which has no way to be extended backward-compatibly except by adding more stuff at the end)
- Non-deterministic tensor order depending on `unordered_map` iteration order (makes `sha256sum` useless to compare `imatrix` files made on the same dataset)
- Broken behavior at small `-ub` (intermediate saves happen waaay too often)
- Can't use a bigger batch size than the chunk size
Summary of changes
- Use GGUF to store `imatrix` data.
  - `general.type` is `imatrix`
  - no `general.architecture`
    - can't really know the architecture from old `imatrix` files.
  - store `*.sums` and `*.counts` for each tensor with imatrix data (a minimal reading sketch follows this list).
    - `*.sums` are the sums of activations
      - Stored in `F32`, like before.
    - `*.counts` are the number of activations (also the number of tokens), useful to calculate the mean
      - Why not simply store the mean? To allow merging `imatrix` files together with `--in-file`.
      - It's stored in `F32` even though it holds integer values, because when calculating the mean it would be converted to `F32` anyway to perform the division.
- Add `convert_legacy_imatrix_to_gguf.py` to convert old `imatrix.dat` files to `imatrix.gguf`
- Like `llama-perplexity` since #5946, allow computing multiple chunks per batch with `llama-imatrix`
  - This should be useful for huge models like Llama-405B when they don't fit completely in RAM.
- Use fused multiply-add (with `std::fma`) when accumulating the sums of activations
  - Shouldn't hurt to somewhat reduce rounding errors
  - (obviously `f64` would be even better, but I'm not sure it's worth it yet. For the curious, using `double` for the intermediate accumulations can be tried by changing only one line in `IMatrixStats`: `vector<float> values` to `vector<double> values`.)
- Sort the tensor names before serializing
  - This makes the tensor order deterministic, because otherwise it depended on the iteration order of `unordered_map`.
  - Determinism between runs means `sha256sum` can be meaningfully used to compare `imatrix` files generated in very similar conditions.
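To make the layout above concrete, here is a minimal reading sketch (not part of this PR; the `*.sums`/`*.counts` naming, the output filename `imatrix.gguf`, and the one-count-per-row shape are assumptions based on the description above) using the `gguf` Python package from `gguf-py`:

```python
# Sketch: reconstruct per-channel means from an imatrix GGUF file.
import numpy as np
from gguf import GGUFReader  # provided by gguf-py in the llama.cpp repository

reader = GGUFReader("imatrix.gguf")          # hypothetical output file name
tensors = {t.name: t for t in reader.tensors}

for name, sums in tensors.items():
    if not name.endswith(".sums"):
        continue
    counts = tensors.get(name[: -len(".sums")] + ".counts")
    if counts is None:
        continue
    s = np.asarray(sums.data, dtype=np.float32)
    n = np.asarray(counts.data, dtype=np.float32).reshape(-1)
    # Assumes one count per row of sums (per expert for MoE tensors, a single
    # value otherwise); the max() guards against never-activated experts.
    mean = s.reshape(n.size, -1) / np.maximum(n, 1.0)[:, None]
    print(name, mean.shape)
```

Since both `*.sums` and `*.counts` are plain F32 tensors, merging two imatrix files amounts to element-wise addition of the corresponding tensors, which is why the mean itself is not stored.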
TODO
- [ ] Compare old `llama-quantize` with old `imatrix.dat` against new `llama-quantize` with converted `imatrix.gguf`
  - Seemed to work, but need to re-test. The resulting quantized model(s) should have the same `sha256sum`.
- [x] Test new `llama-imatrix` at different batch sizes
  - Same checksums with `-ub 64 -b 512` and `-ub 512 -b 2048` for a chunk size of 512 (`-c 512`)
- [ ] Perplexity test(s) with i-quants with old `llama-imatrix` vs new `llama-imatrix`
- [ ] Test with MoE models (perplexity with i-quants should be in the same ballpark as before)
- [ ] Test `--in-file` with `llama-imatrix`
- [ ] (maybe) Implement cleaner `general.architecture` exclusion.
  - Currently, this uses a subclass to make `self.add_architecture()` a no-op, but maybe `general.architecture` should simply be excluded when `self.arch == ""`. Not sure how to prevent using the other `self.add_*` (in `GGUFWriter`) which expect `self.arch` to be something.
  - Or maybe the architecture should be included?
    - What about conversions from older `imatrix.dat` files?
- [x] I have read the contributing guidelines
- Self-reported review complexity:
  - [x] Medium
I'm setting this to "draft", because of concerns by @ikawrakow in https://github.com/ikawrakow/ik_llama.cpp/discussions/15#discussioncomment-10615399 and https://github.com/ikawrakow/ik_llama.cpp/discussions/15#discussioncomment-10626253 (mostly related to the fact that GGUF is harder to parse than imatrix.dat files).
More details near the end of https://github.com/ikawrakow/ik_llama.cpp/discussions/15#discussioncomment-10632253.
I'll need some days to think about how to go further with this.
@compilade This is a good change and I think it would be useful to bring it to completion.
In the future, we can extend libllama with an interface for saving/loading imatrix data. This way the implementation for reading and writing the imatrix data would be localized in libllama and can be kept in-sync more easily. This can be combined with the refactoring of llama_model_quantize_params to not pass C++ objects.
Thank you for working on this, I've been thinking that storing imatrix as GGUF would be nice for investigating the use of gradients instead of activations.
I guess unless there are any objections from @bartowski1182 or @danielhanchen we can merge this after the weekend?
Objections? I've been looking forward to this change for months haha
Oh this looks fantastic great work! Re using bigger batch sizes - does this mean if memory allows, imatrix should be in fact faster to process via PP?
I'll try this out over the month, but maybe in the meantime, I'll temporarily use the legacy format - but overall this change is very welcome! Great work @compilade !
Re using bigger batch sizes - does this mean if memory allows, imatrix should be in fact faster to process via PP?
@danielhanchen
Currently, with `llama-imatrix` from the master branch, the chunk size is tied to the ubatch size, which means setting a `-ub` different from the chunk size leads to some broken behavior.
This PR makes it possible to process multiple chunks in a single ubatch, or to use multiple ubatches per chunk; they are no longer tied together. It also technically allows variable chunk sizes (which will be useful to eventually implement proper chat dataset handling).
Using bigger ubatch sizes can be useful for models which don't fit in RAM, to load the weights less often (assuming mmap is used, since the weights are read once per ubatch anyway).
Not sure how the GPU backends would handle that situation (with very large models), although when the model does fit, this still allows benefiting from pipeline parallelism (as `llama-perplexity` already does) by calculating multiple ubatches per chunk or multiple chunks per (u?)batch.
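For example, with a chunk size of 512 (`-c 512`), `-ub 64` splits each chunk across 512 / 64 = 8 ubatches, while `-b 2048` allows up to 2048 / 512 = 4 chunks to be evaluated in a single batch.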
but maybe in the meantime, I'll temporarily use the legacy format
You'll still get most of the advantages described above even with the old format; both are saved from the same internal data (which was changed to better fit the new format).
The main benefits of the new GGUF-based imatrix format (which, for now, is only used when the output imatrix file has a `.gguf` suffix) are saner handling of MoE models, especially when merging imatrix files with different chunk sizes, and readability by GGUF tooling (e.g. HF previews, `gguf-dump`, etc.).
While a round-trip conversion is possible, the legacy format contains insufficient shape and counts information for correctly merging imatrix data from MoE models. That's not really a problem when using one-off imatrix files on a single chunk size, though.
(llama-quantize should be able to read both formats)
Oh, there's another behavior which differs: when a tensor sub-matrix was not solicited by a dataset (e.g. a MoE model with unsolicited experts), the old format skips the entire tensor (even when e.g. 95% of the experts were solicited), while the new format keeps it, and it's in `llama-quantize` that the zeros are handled, by using imatrix weights of `1` for that sub-matrix, which, from my experiments around #12557, seems reasonable and mostly matches what the quantization functions do without `imatrix`. This should allow avoiding problems like #12913.
Partial data is not handled at read time for the legacy format because of insufficient shape information. (It could be handled at write time (as in @nicoboss's fork), but that workaround would incorrectly affect merges of MoE `imatrix` files (by adding a `1` value to the squared activations of unused experts, even when those experts are used in another merged `imatrix` file), although arguably the current behavior (dropping the data for the entire tensor) is also wrong. Both problems cannot be fixed simultaneously without a different format than the legacy one.)
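To illustrate the weighting behavior described above, here is a rough sketch (illustration only, not the actual `llama-quantize` code; the shapes and the helper name are assumptions) of how partial MoE data could be weighted neutrally at read time:

```python
# Illustration: give unsolicited experts a neutral weight of 1 instead of
# dropping the whole tensor, so they quantize roughly as without an imatrix.
import numpy as np

def imatrix_weights(sums: np.ndarray, counts: np.ndarray) -> np.ndarray:
    """sums: (n_expert, n_channel) accumulated activations; counts: (n_expert,) tokens seen."""
    weights = np.ones_like(sums)                              # neutral default
    active = counts > 0
    weights[active] = sums[active] / counts[active][:, None]  # mean where data exists
    return weights
```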
Using bigger ubatch sizes can be useful for models which don't fit in RAM, to load the weights less often (assuming mmap is used, since the weights are read once per ubatch anyway).
Not sure how the GPU backends would handle that situation (with very large models), although when the model does fit, this still allows benefiting from pipeline parallelism (as `llama-perplexity` already does) by calculating multiple ubatches per chunk or multiple chunks per (u?)batch.
Using a larger batch size will also help on GPU backends for models that don't fit in VRAM, since it reduces the number of times that the weights have to be copied to VRAM. However, usage of the eval callback prevents taking advantage of pipeline parallelism, since after every matrix multiplication there is a full synchronization to copy the results of the operation to the CPU.
Thanks a lot for creating this amazing new imatrix file format and generally improving imatrix computation by a lot. I'm very excited that partial data caused by missing expert activation is now handled properly, thanks to the new file format.
One of the most impactful changes of this PR seems to be imatrix support for 3D tensors. This finally allows generating imatrix quants for models using MLA such as DeepSeek (V2, V2-Lite, V2.5, V3, R1), MiniCPM3-4B, PLM-1.8B, KwaiCoder-DS-V2-Lite, hpc-coder-v2-6b, and whale-v3-base-marged without the Fix imatrix calculation for MLA models patch. This change surprisingly wasn't even mentioned in the PR description.
TODO: 4d? (is that even used in practice?)
No, there is currently no practical use case for 4D tensors, nor do I think there will ever be one. The most dimensions currently required are 3D tensors for MLA.
Oh, there's another behavior which differs: when a tensor sub-matrix was not solicited by a dataset (e.g. a MoE model with unsolicited experts), the old format skips the entire tensor (even when e.g. 95% of the experts were solicited), while the new format keeps it, and it's in `llama-quantize` that the zeros are handled, by using imatrix weights of `1` for that sub-matrix, which, from my experiments around #12557, seems reasonable and mostly matches what the quantization functions do without `imatrix`. This should allow avoiding problems like #12913.
Partial data is not handled at read time for the legacy format because of insufficient shape information. (It could be handled at write time (as in @nicoboss's fork), but that workaround would incorrectly affect merges of MoE `imatrix` files (by adding a `1` value to the squared activations of unused experts, even when those experts are used in another merged `imatrix` file), although arguably the current behavior (dropping the data for the entire tensor) is also wrong. Both problems cannot be fixed simultaneously without a different format than the legacy one.)
The solution in @nicoboss's fork was inspired by https://github.com/ikawrakow/ik_llama.cpp/pull/202 which does mention this concern (and to me seems to agree with the approach taken here):
Strictly speaking it would be better to leave the zeros in the imatrix data of experts that have never been activated. But this would require to go and add proper protection against all-zeros imatrices, along with the appropriate corrective action, for all quants, and not just for IQ1_S_R4 as I did in https://github.com/ikawrakow/ik_llama.cpp/pull/191. So, for now we go with same-importance columns for never activated experts.
Really looking forward to this PR being merged into master!
In the meantime, you may already know this but passing along a tip shared by @David-AU-github in here that has worked for me when dealing with imatrices with partial activations in MoEs: increase the model's number of active experts (if KV override is supported), then calib / imatrix.
To address some feedback I got recently, I've added a warning when writing using the legacy format so that it's more obvious what is happening.
save_imatrix: saving to legacy imatrix format because output suffix is not .gguf
I've also added back the warnings for partial data for the new format, because it can still be useful to know that is happening, even if the data is not omitted (partial data is handled at read-time in llama-quantize, this allows both correct imatrix.gguf combining and weighting missing data neutrally).
And to make the old format a bit more equivalent in quality to the new format (except when combining multiple imatrix.dat files with --in-file), I've made it write 1 values where the evaluation count is zero, a bit like in nicoboss's fork, but without modifying the internal data (and so intermediate saving will not affect the final result). (This is different from my previous stance in the last paragraph of https://github.com/ggml-org/llama.cpp/pull/9400#issuecomment-3043442903, because I realized dropping data would also affect combining imatrix files, and since most people don't combine imatrix files, having the same behavior as the new format in the most common use case is saner.)
I've also removed the need to load a model when converting between formats (it was already kind of like this when combining imatrix files), and so the following should be possible:
[!WARNING]
The syntax has changed in https://github.com/ggml-org/llama.cpp/pull/14842
$ ./bin/llama-imatrix --in-file imatrix.dat -o imatrix.gguf
$ ./bin/llama-imatrix --in-file imatrix.gguf -o imatrix-roundtrip.dat
$ ./bin/llama-imatrix --in-file imatrix-roundtrip.dat -o imatrix-roundtrip.gguf
Note that shape information for evaluation counts of MoE tensors is missing from legacy imatrix files, and so it will also be missing from the converted imatrix.gguf file, except if more data is provided or if it's merged with a fresh imatrix.gguf file of the same model. (it will still work with llama-quantize, even when the evaluation count shape is flattened; GGUF makes it easy to support that)
Preserving the shape of evaluation counts is partly why it's recommended to use .gguf for newly-generated imatrix files.
The forced suffix of .gguf for GGUF-based imatrix files might be controversial, but a .gguf suffix is necessary for HuggingFace to display its GGUF previews anyway (even though technically GGUF has a magic header and so it can be identified from its contents). This restriction will likely be removed once writing to the old format isn't supported anymore (in a future PR, not this one).
Since the old format doesn't have a magic header, llama-quantize will always try to load imatrix files as GGUF first, and fallback to the legacy format when it fails (this means the filename suffix of imatrix files technically doesn't matter at load time).
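For reference, the detection order can be pictured with a small sketch (an assumption about how such a check could look, not the loader's actual code): GGUF files begin with the 4-byte magic `GGUF`, while legacy imatrix.dat files have no magic at all.

```python
# Sketch: a GGUF file starts with the magic bytes b"GGUF"; anything else is
# treated as a legacy imatrix.dat file (which has no magic to check).
def is_gguf(path: str) -> bool:
    with open(path, "rb") as f:
        return f.read(4) == b"GGUF"
```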
@compilade Thanks a lot for your hard work. I'm really looking forward to this PR getting merged! Everything is perfect now in my opinion.
To address some feedback I got recently, I've added a warning when writing using the legacy format so that it's more obvious what is happening.
Thanks a lot for listening to our feedback and adding this warning. This should be enough to warn users that accidentally still use the legacy imatrix.dat file format after this is merged.
I've also added back the warnings for partial data for the new format, because it can still be useful to know that is happening, even if the data is not omitted (partial data is handled at read-time in llama-quantize, this allows both correct `imatrix.gguf` combining and weighting missing data neutrally).
Thanks a lot! This is super useful to judge the quality of the imatrix and the imatrix dataset, as a great imatrix dataset should cover more experts than a bad one (unfortunately there are always cases where even the best imatrix dataset can't cover them all, as training the router to eventually make use of all experts seems quite hard and so was not done well for all MoE models).
I've made it write 1 values where the evaluation count is zero, a bit like in https://github.com/nicoboss/llama.cpp/pull/1, but without modifying the internal data (and so intermediate saving will not affect the final result).
That's super cool. I hated how my patch broke intermediate saving, both by affecting the result and by hiding how many experts are covered from future saves, which is why we had to disable intermediate saves. This is such an elegant solution!
I've also removed the need to load a model when converting between formats
I really appreciate and find it super cool how easily conversion between the legacy imatrix.dat and the new imatrix.gguf file format is possible. Not only that, you went out of your way to make both backwards and forwards compatibility as good as possible for everyone.
The forced suffix of .gguf for GGUF-based imatrix files might be controversial, but a .gguf suffix is necessary for HuggingFace to display its GGUF previews anyway (even though technically GGUF has a magic header and so it can be identified from its contents).
With there now being a warning if someone doesn't specify the .gguf suffix, I find this design choice acceptable, especially given that you have valid backwards compatibility reasons for it to be this way.
This restriction will likely be removed once writing to the old format isn't supported anymore (in a future PR, not this one).
I'm looking forward to that. I appreciate that you give everyone time to slowly adopt the new file format. Please just don't forget to eventually drop writing support for the legacy imatrix.dat file format and allow arbitrary imatrix file name suffixes.
Since the old format doesn't have a magic header, llama-quantize will always try to load imatrix files as GGUF first, and fallback to the legacy format when it fails (this means the filename suffix of imatrix files technically doesn't matter at load time).
So the .gguf suffix is only forced at write time, and the file can be renamed afterwards since loading always tries GGUF first. That's really nice. Not that there really is any reason not to have them end with .gguf given that they are GGUF files, but it's great that we can name them any way we want.
Echoing @nicoboss' sentiment. This is a very nice enhancement @compilade. Thank you.
I've been testing by running different permutations of options, including roundtrips on each test, and comparing the resulting stats. As far as I can tell, everything checks out!
The only, minor, observation is that when converting an existing file to the new format, a gguf_init_from_file_impl: invalid magic characters: '????', expected 'GGUF' warning is displayed. Other than that, it works like a charm!
The only, minor, observation is that when converting an existing file to the new format, a `gguf_init_from_file_impl: invalid magic characters: '????', expected 'GGUF'` warning is displayed. Other than that, it works like a charm!
Unavoidable, but a vast improvement from previous behaviour, see #14381. :)
I released a mainline imatrix for Kimi-K2-Instruct using this PR here if anyone is looking for it, given it is challenging to compute: https://huggingface.co/ubergarm/Kimi-K2-Instruct-GGUF?show_file_info=mainline%2Fimatrix-mainline-pr9400-plus-kimi-k2-942c55cd5-Kimi-K2-Instruct-Q8_0.gguf (interestingly, the imatrix.gguf shows up in the HF model tensor viewer)
Feel free to use it for cooking your own custom mainline quants. I also have a version for ik_llama.cpp if that is your thing.
Might be cool to look inside it with Ed's imatrix stats tool: https://github.com/ggml-org/llama.cpp/pull/12718
Thanks!
@compilade Time to merge this (and adapt #12718 afterwards)?
Assuming no additional changes on this PR, the enhanced version of #12718 is ready to go as soon as this one is merged
@compilade Time to merge this (and adapt #12718 afterwards)?
@CISC Sure. I hope I've tested enough edge cases. Will merge at 16:00 UTC on 2025-07-19 (in around 10 hours), to give some buffer for last-minute problems.
(sorry for the delayed reply; recently made changes to my home network, now got symmetric fiber Internet)