byte-6174

19 comments by byte-6174

Following up on the little pseudo-code above, I checked in a small experiment that does weights/activations quantization on the fly. Needs more experiments, as blindly making all matmuls run on int8...
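
Not the checked-in experiment itself, but a minimal sketch of what on-the-fly weights/activations quantization can look like, assuming symmetric absmax int8 quantization; `quantize` and `matmul_q8` are illustrative names, not functions from the repo:

```c
#include <math.h>
#include <stdint.h>
#include <stdlib.h>

/* Symmetric per-tensor quantization: scale = max|x| / 127. */
static float quantize(const float* x, int8_t* q, int n) {
    float amax = 0.0f;
    for (int i = 0; i < n; i++) {
        float a = fabsf(x[i]);
        if (a > amax) amax = a;
    }
    float scale = amax > 0.0f ? amax / 127.0f : 1.0f;
    for (int i = 0; i < n; i++) {
        q[i] = (int8_t) roundf(x[i] / scale);
    }
    return scale;
}

/* int8 matmul: W is (d x n) row-major, x has length n, out has length d.
 * Both the activation vector and each weight row are quantized on the
 * fly; the dot product accumulates in int32 and is rescaled to float. */
static void matmul_q8(float* out, const float* x, const float* w, int n, int d) {
    int8_t* qx = malloc(n * sizeof(int8_t));
    int8_t* qw = malloc(n * sizeof(int8_t));
    float sx = quantize(x, qx, n);
    for (int i = 0; i < d; i++) {
        float sw = quantize(w + i * n, qw, n);  /* quantize this row on the fly */
        int32_t acc = 0;
        for (int j = 0; j < n; j++) acc += (int32_t) qx[j] * (int32_t) qw[j];
        out[i] = acc * sx * sw;  /* rescale the int32 accumulator to float */
    }
    free(qx);
    free(qw);
}
```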

Yes, that change was needed for the Llama2 7B model. Thanks @kroggen!

```
./runq data.bin -n 16 -i "why is sky blue?"
why is sky blue? Here's a theory everyone's...
```

Yes. Llama2.cpp has groups of 64, etc. Why risky?

Hmm, trying to understand this. So how do the groups of 64 avoid this? You mean outliers in the magnitude sense, I'm presuming?
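
For context, a minimal sketch of group-wise int8 quantization with groups of 64 (`GS = 64` and the function name are assumptions here, not code from either repo). Because each group of 64 gets its own scale, a single large-magnitude outlier only coarsens the 64 values in its own group, instead of stretching one per-tensor scale and crushing every other value toward zero:

```c
#include <math.h>
#include <stdint.h>

#define GS 64  /* group size; assumed, per the "groups of 64" above */

/* Symmetric int8 quantization with one scale per group of GS values.
 * Assumes n is a multiple of GS. */
static void quantize_groups(const float* x, int8_t* q, float* scales, int n) {
    for (int g = 0; g < n / GS; g++) {
        const float* xg = x + g * GS;
        float amax = 0.0f;
        for (int i = 0; i < GS; i++) {
            float a = fabsf(xg[i]);
            if (a > amax) amax = a;
        }
        float scale = amax > 0.0f ? amax / 127.0f : 1.0f;
        scales[g] = scale;  /* an outlier only inflates this group's scale */
        for (int i = 0; i < GS; i++) {
            q[g * GS + i] = (int8_t) roundf(xg[i] / scale);
        }
    }
}
```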

Btw, as a side note: there is experimental evidence, in llama.cpp and also in places like LLM.int8(), of needing mixed precision to tackle outliers. Thought we might want to / have...
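
Very loosely, the mixed-precision idea referenced here, sketched per element for illustration (this is in the spirit of LLM.int8(), not its actual implementation, which decomposes by feature column across the batch; `THRESHOLD` is an assumed cutoff): dimensions whose activation magnitude exceeds the threshold stay in float32, everything else goes through the int8 path.

```c
#include <math.h>
#include <stdint.h>

#define THRESHOLD 6.0f  /* assumed outlier cutoff, for illustration only */

/* Mixed-precision dot product. Assumes qx/qw were quantized from x/w
 * with scales sx/sw (e.g. by an absmax quantizer like the one above). */
static float dot_mixed(const float* x, const float* w,
                       const int8_t* qx, const int8_t* qw,
                       float sx, float sw, int n) {
    int32_t acc_i8 = 0;   /* int8 path for "normal" dimensions */
    float acc_fp = 0.0f;  /* float path for outlier dimensions */
    for (int j = 0; j < n; j++) {
        if (fabsf(x[j]) > THRESHOLD) {
            acc_fp += x[j] * w[j];                           /* outlier: full precision */
        } else {
            acc_i8 += (int32_t) qx[j] * (int32_t) qw[j];     /* normal: int8 */
        }
    }
    return acc_fp + acc_i8 * sx * sw;
}
```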

@jrudolph, for activation quantization, do you use data statistics? If so, what data is used? If no data is used, how are the activations quantized in your Scala implementation? Sorry,...
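
To make the question concrete, this is the distinction it is asking about, as an illustrative C sketch rather than anything from the Scala implementation: with data statistics, the activation scale is calibrated once from sample inputs and frozen; without data, it has to be recomputed from the live activations on every forward pass (dynamic quantization).

```c
#include <math.h>

/* Absmax scale for a vector of activations. */
static float absmax_scale(const float* x, int n) {
    float amax = 0.0f;
    for (int i = 0; i < n; i++) {
        float a = fabsf(x[i]);
        if (a > amax) amax = a;
    }
    return amax / 127.0f;
}

/* Static (needs data): scale = absmax_scale(calibration_acts, n), stored
 * with the model and reused for every inference. Dynamic (data-free):
 * call absmax_scale(current_acts, n) inside every matmul, trading an
 * extra pass over the activations for not needing any calibration set. */
```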

I'm trying to understand, at a basic level, how it is done from the links you provided above. I understand that there are many complex mix-and-match strategies, like keeping certain...

@atamurad:

> LLama2-7B-chat, Q8_A/Q8_B => 6.7GB model size, output is OK but slow as expected.

Slower than the float32 model?