llama.cpp
llama : change fallback type IQ4_NL -> Q4_0
https://github.com/ggerganov/llama.cpp/issues/7805#issuecomment-2227826702
Fall back to Q4_0 instead of IQ4_NL, due to the lack of an implementation of the latter in some backends.
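For reviewers skimming, here is a reconstructed sketch (not the exact diff, and the helper wrapper is mine) of what the change amounts to in the fallback path of `llama_tensor_get_type()`:

```cpp
#include "ggml.h"   // ggml_type, GGML_TYPE_* (assumes building against ggml)

// Reconstructed sketch, not the exact diff: the substitution applied when a
// tensor's row size is not a multiple of the 256-wide super-block that
// k-/i-quants require. This PR changes the low-bit substitute to Q4_0.
static ggml_type sub_256_fallback(ggml_type wanted) {
    switch (wanted) {
        case GGML_TYPE_Q2_K:
        case GGML_TYPE_Q3_K:
        case GGML_TYPE_IQ4_XS:
            return GGML_TYPE_Q4_0;   // previously GGML_TYPE_IQ4_NL
        default:
            return wanted;           // other cases are unchanged by this PR
    }
}
```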
- [x] I have read the contributing guidelines
- Self-reported review complexity:
- [ ] Low
- [ ] Medium
- [ ] High
This will result in (possibly significantly) higher ppl; eventually somebody will notice and waste a lot of time trying to figure out why.
@slaren hmm I think it's worth it for the increased compatibility. It would be easy to find an unaffected model to do an apples-to-apples comparison when evaluating perplexity.
We did the same thing in https://github.com/ggerganov/llama.cpp/pull/2148 , so if it's a concern we could tack on a similar warning during the quantization process when the fallback is triggered, or even add such a warning when calculating perplexity.
I am not opposed to the change, but it is not a minor change and should be much more clear (it's not even in the PR title).
I don't have the system resources required to run Qwen 72B myself, but if anyone knows of another model that triggers the quantization fallback, I'll be happy to test that!
@LostRuins There's a bunch of very tiny models with weird sizes, one of which is https://huggingface.co/delphi-suite/v0-llama2-12.8m (12.8M, yes millions), with an embedding size of 384 (not a multiple of 256).
If you were asking for bigger models, there's https://huggingface.co/1bitLLM/bitnet_b1_58-3B (3B) which has an embedding size of 3200 (not a multiple of 256).
(for a cursed model there's also https://huggingface.co/1bitLLM/bitnet_b1_58-xl (1.3B) which has an FFN size of 5460, which is not even a multiple of 32 (there are also tiny cursed models in https://huggingface.co/delphi-suite, but by necessity (small size) instead of accidentally))
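A quick way to see why those sizes matter (a throwaway check, not something in llama.cpp; the 384/3200/5460 values come from the models listed above):

```cpp
#include <cstdio>

// Hypothetical helper, not part of llama.cpp: a row width that is not a
// multiple of 256 forces the block-32 fallback discussed in this PR, and a
// width that is not even a multiple of 32 cannot use those fallback types
// either.
static const char * classify(long n_per_row) {
    if (n_per_row % 32  != 0) return "not even a multiple of 32";
    if (n_per_row % 256 != 0) return "triggers the sub-256 fallback";
    return "k-/i-quants usable directly";
}

int main() {
    const long widths[] = { 384, 3200, 5460, 4096 };   // sizes mentioned above, plus a typical one
    for (long w : widths) {
        std::printf("%4ld -> %s\n", w, classify(w));
    }
    return 0;
}
```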
I moved the styling changes into a separate PR (#8502) and updated this one to contain just the type change.
I am not a huge fan of this change. I think this should also add an option to allow using IQ4NL instead of Q4_0 as the fallback.
@sorasoras may I ask why? I doubt the perplexity of an already heavily compressed q2k model changes all that much when a single q4_nl tensor falls back to q4_0 instead.
also remember that the current fallback for larger quants like q4_k_s is simply good old q5_0... and nobody has an issue with that.
case GGML_TYPE_Q4_K: new_type = GGML_TYPE_Q5_0; break;
If people are opposed to q4_0, then simply fall back to q5_0 like the larger quants do; the size difference will be a few percent at most anyway.
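To spell out that point, the existing fallbacks for the k-quants above Q3_K look roughly like this (reconstructed from memory rather than copied verbatim; the function wrapper is mine, check src/llama.cpp for the authoritative switch):

```cpp
#include "ggml.h"   // ggml_type, GGML_TYPE_* (assumes building against ggml)

// Reconstructed sketch of the existing sub-256 fallbacks for the larger
// k-quants; these cases are untouched by this PR.
static ggml_type large_kquant_fallback(ggml_type wanted) {
    switch (wanted) {
        case GGML_TYPE_Q4_K: return GGML_TYPE_Q5_0;
        case GGML_TYPE_Q5_K: return GGML_TYPE_Q5_1;
        case GGML_TYPE_Q6_K: return GGML_TYPE_Q8_0;
        default:             return wanted;
    }
}
```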
Even in Q2KS, the fallback to IQ4NL vs Q4_0 can have an outsized influence on PPL, and making it Q5_0 would make the model larger.
Fallback to Q4_0 could be the default behavior, but only with an option to enable the IQ4NL fallback. Q5_0 is a lot bigger than Q4_0, and especially than IQ4NL. Otherwise I would have to maintain a separate repo. Q4_0 is a lot worse than IQ4NL in my use case.
It is a kind of special case, but it does affect PPL quite a bit for translation with Qwen1 14B:
Q2KS 5.1283 +/- 0.04957 old
Q2KS 5.1090 +/- 0.04962 new with IQ4NL
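A hedged sketch of the kind of opt-in being asked for above (the struct, field name, and wiring are invented for illustration; it only mirrors the pattern of the existing output_tensor_type / token_embedding_type overrides in llama_model_quantize_params):

```cpp
#include "ggml.h"   // ggml_type, GGML_TYPE_*

// Hypothetical, not part of this PR: let the caller choose what the low-bit
// k-quants fall back to when a row is not a multiple of 256.
struct fallback_override {
    ggml_type low_bit_fallback = GGML_TYPE_Q4_0;   // compatible default
};

static ggml_type apply_fallback(const fallback_override & opt, ggml_type wanted) {
    switch (wanted) {
        case GGML_TYPE_Q2_K:
        case GGML_TYPE_Q3_K:
            return opt.low_bit_fallback;   // e.g. GGML_TYPE_IQ4_NL if the user opts in
        default:
            return wanted;
    }
}
```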
I would also suggest checking the assumption that this change will not increase the perplexity significantly. As far as I know, the main reason iq4_nl exists is precisely to be used as a fallback quant when the row size is not a multiple of 256. If this use case is removed, iq4_nl may as well be removed entirely.
@slaren I would think that IQ4_NL would be the superior option when combined with other I-Quant mixes and distributed as an iquant model. But ideally it should stay within its own category - for the same reason we avoid taking K-quanted tensors and mixing them into a q5_0 or q4_1 model.
I have not done my own perplexity tests, but @sorasoras did highlight a change from Q2KS 5.1283 to Q2KS 5.1090 with IQ4NL on Qwen, a difference of under 0.4% - to me that represents a mostly negligible change, but those who find it important can continue using the dedicated i-quanted model types.
I don't see why that is ideal. I understand that the main motivation for the request to make this change is to work around the Vulkan backend not supporting IQ quants. I am not sure that we should be making these changes based on the limitations of some backends.
Quoting ikawrakow in the PR where he introduced IQ4_NL:
> The main purpose of this PR is to provide a 4-bit quantization type that can be used when k- and i-quants that use blocks of 256 are not available (because the number of columns in some tensors are not a multiple of 256).
I am not opposed to the existence of IQ4_NL quants, but then the resulting format should not be considered a K-Quant.
When people run quantize and select Q2_K or Q3_K_S, the output should be K-Quant compatible and work on K-Quant compatible backends. If we want to utilize the IQ4_NL type, by all means, but then the output should be labelled as IQX_X instead, and it should be made clear in the quantization options that you are getting an I-Quant.
> for the same reason we avoid taking K-quanted tensors and mixing them into a q5_0 or q4_1 model.
Note that q6_K is still used for output.weight when making a q4_0, q4_1 or q5_0 model, when possible.
https://github.com/ggerganov/llama.cpp/blob/5e116e8dd51775f8f1c090570be148d5d7eea6c3/src/llama.cpp#L17863-L17864
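A rough paraphrase of what those linked lines do (simplified; the helper is mine and the condition is approximate, so read the linked source for the real check):

```cpp
#include "ggml.h"   // ggml_type, GGML_TYPE_*
#include <cstdint>
#include <string>

// Rough paraphrase of the linked lines (not an exact copy): the output tensor
// gets promoted to Q6_K even when a "classic" block-32 mix such as
// Q4_0/Q4_1/Q5_0 was requested, provided its row size allows a 256-wide block.
static ggml_type output_weight_type(const std::string & name, ggml_type requested, int64_t n_per_row) {
    if (name == "output.weight" && n_per_row % 256 == 0) {
        return GGML_TYPE_Q6_K;
    }
    return requested;
}
```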
You are making a decision to increase Q2KS PPL on a lot of backends just to make a few backends usable. I think you might want to add an option to exclude I-quants during quantization and call it compatibility mode, or default to K-quants only but with an option to include the IQ4NL mix. Ideally it should allow users to pick different quant mixes to reflect the needs of different backends. Q2KS and IQ2_S are very different in that Q2KS doesn't need an imatrix whereas IQ2_S does. Q2KS with IQ4NL is a lot more potent in my translation tests than the old variant.
If Vulkan really is the only backend that lacks iq4_nl support and it's really only one quant that's needed, it might be better to put in the effort to implement it there. I haven't looked at the iq quants yet because I don't have that much time and new quants are tedious to implement.
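For anyone weighing that effort: structurally iq4_nl is very close to q4_0 (32-weight blocks with one fp16 scale); the difference is that the 4-bit codes index a fixed non-linear table instead of mapping linearly. A CPU-side sketch modeled on ggml's implementation (layout and table reproduced from memory, so verify against ggml-common.h / ggml-quants.c):

```cpp
#include <cstdint>

// Sketch of iq4_nl dequantization modeled on ggml's CPU path; the lookup
// table and block layout should be double-checked against ggml before use.
static const int8_t kvalues_iq4nl[16] = {
    -127, -104, -83, -65, -49, -35, -22, -10, 1, 13, 25, 38, 53, 69, 89, 113,
};

// One block covers 32 weights: an fp16 scale plus 16 bytes of packed 4-bit
// indices into the table above (vs. q4_0, where the 4-bit value maps linearly).
struct block_iq4_nl_sketch {
    uint16_t d;        // raw half-precision scale bits
    uint8_t  qs[16];   // 32 x 4-bit codes
};

// Dequantize one block given its packed nibbles and the already-decoded scale.
static void dequant_block_iq4_nl(const uint8_t qs[16], float d, float * y) {
    for (int j = 0; j < 16; ++j) {
        y[j]      = d * kvalues_iq4nl[qs[j] & 0x0F];   // low nibbles -> weights 0..15
        y[j + 16] = d * kvalues_iq4nl[qs[j] >> 4];     // high nibbles -> weights 16..31
    }
}
```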
Anyway please don't take this the wrong way - I am simply highlighting the feedback other users and myself observed, when using various community created/quantized models in the wild. I will support whatever decision gg makes.
Good points on both sides. Ultimately, if you're going to damage more-popular backends, I think it makes more sense to isolate this change to Vulkan, no?
This can't be isolated to a single backend, cause it affects new models that get quantized and shared. The Vulkan backend supports k-quants, but it does not support the latest q2_k and q3_k (?) models cause they use a single iq4_nl tensor. That's what this is about.
> This can't be isolated to a single backend
Admittedly, I might not understand exactly what's going on here, but I hear this sort of statement a lot around here ("can't"), and that's rarely true. I think it should be replaced with, "I haven't figured out how to yet" (see the recent situation involving breaking stuff due to renaming - there were ways to do it without breaking things - they just didn't think of them). Considering this is proposing damaging more-popular backends to benefit the less-popular, I think it's worth brainstorming ways to do what you say can't be done. Again, I admit this is above my paygrade, but I have a hard time believing it's impossible. For example, why can't one convert the tensor to Q4_0 if they want to run Vulkan? Also, why can't Vulkan run IQ4_NL in the first place? Is it something llama.cpp can implement/fix/adapt/whatever? Just brainstorming
Obsoleted by #8613