[feature request] conversion to gguf in a more pure form.
Hello. Usually when quantizing, I first convert a Hugging Face model to an F16 gguf and then quantize that into my own quantizations. I have noticed that convert does not produce a "pure" f16. I think there should be a flag, as in the quantize program, to allow a pure f16 (all tensors) or pure bf16 conversion.
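For reference, this is a quick way to see the mixed tensor types in a converted file (a rough sketch, assuming the gguf Python package from llama.cpp's gguf-py is installed; the file name is just an example):

```python
# Sketch: list each tensor's type in a converted GGUF file.
# Assumes the `gguf` package from llama.cpp's gguf-py (pip install gguf).
from gguf import GGUFReader

reader = GGUFReader("model-f16.gguf")  # example path
for t in reader.tensors:
    # 1D tensors (norm weights, biases) typically show up as F32 here,
    # while the large 2D weight matrices are F16.
    print(t.name, t.tensor_type.name, list(t.shape))
```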
I have noticed that convert does not produce a "pure" f16.
Do you mean that some tensors are in F32 in the resulting gguf model? These are usually 1D tensors which are very small anyway. (BTW, even llama-quantize --pure ... keeps 1D tensors as F32)
Some of the ggml operators used on 1D tensors (currently) only work on F32 tensors (e.g. ggml_norm), so a pure f16 gguf model would not work without modifications in ggml.c.
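To illustrate, the conversion roughly has to make a per-tensor choice like this (just an illustrative sketch, not the actual logic in the convert script; the helper name is made up):

```python
import numpy as np

def choose_dtype(name: str, data: np.ndarray) -> np.dtype:
    # Hypothetical helper: 1D tensors (norm weights, biases) stay in F32,
    # since some ggml ops that consume them only support F32;
    # everything else is cast to the requested F16.
    if data.ndim == 1:
        return np.dtype(np.float32)
    return np.dtype(np.float16)

# Example: a 2D weight becomes float16, a 1D norm weight stays float32.
print(choose_dtype("blk.0.attn_q.weight", np.zeros((8, 8), dtype=np.float32)))
print(choose_dtype("blk.0.attn_norm.weight", np.zeros(8, dtype=np.float32)))
```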
Is there a particular reason why you'd like extremely "pure" conversions?
Is there a particular reason why you'd like extremely "pure" conversions?
Well, no. I mean I wanted to make comparisons between a "pure" f16 and my own quants (which are a mix of f16 and q5 or q6). They seem to be smaller at essentially no cost, with almost no degradation. You can find those quants on my Hugging Face profile page under models: https://huggingface.co/ZeroWw
This issue was closed because it has been inactive for 14 days since being marked as stale.