Acts quant most likely would need fine-tuning though, wouldn't it? So much more work needed to get that into good shape - huge potential runtime gains once we have it...
@karpathy I have been poring through the literature on this for a few days now and it mostly points to the need for fine-tuning. See for example [I-BERT](https://arxiv.org/pdf/2101.01321v3.pdf). But there are many more...
btw, this PR also works for quantizing the `llama2 7B` model. Compression from 25GB to 6.2GB. 🎆
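For anyone wondering where the ~4x comes from, here is a minimal numpy sketch of group-wise int8 weight quantization (the function name and group size are illustrative, not the PR's actual code):

```python
import numpy as np

def quantize_q8(w, group_size=64):
    """Symmetric int8 quantization with one fp32 scale per group (illustrative only).
    Assumes w.size is a multiple of group_size."""
    w = w.reshape(-1, group_size)
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    scale = np.maximum(scale, 1e-10)  # avoid divide-by-zero on all-zero groups
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale.astype(np.float32)

# fp32 -> int8 plus one fp32 scale per 64 weights:
# 4 bytes/weight -> 1 + 4/64 ≈ 1.06 bytes/weight, i.e. roughly 3.8x smaller,
# which is in the same ballpark as 25GB -> 6.2GB for the 7B model.
```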
btw, re acts quant, I am looking back over the work I did in llama.cpp. :) These are the perplexity results we found there: https://github.com/ggerganov/llama.cpp/issues/2379#issuecomment-1661385125 So yes, we should definitely do this...
And just for more context re. the ggml port - there is follow-up discussion on that thread above re. how int8 and int4 didn't work at all for stories15 and...
@karpathy yeah, that was a discussion point and could very well be the source of the degradation. Which we would see here as well, yes? Given the models are the same?
Re symmetric quant: our earlier version did have that implemented, but we ran more tests with asymmetric and decided to keep that.
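For readers following along, the symmetric vs asymmetric distinction in a nutshell (a toy numpy sketch, not the code that was actually tested):

```python
import numpy as np

def quant_symmetric(w):
    # one scale, zero maps to zero: q = round(w / s), s = max|w| / 127
    s = max(np.abs(w).max(), 1e-10) / 127.0
    q = np.clip(np.round(w / s), -127, 127).astype(np.int8)
    return q, s                       # dequant: q * s

def quant_asymmetric(w):
    # scale + zero point, uses the full [0, 255] range of uint8
    lo, hi = w.min(), w.max()
    s = max(hi - lo, 1e-10) / 255.0
    zp = np.round(-lo / s)
    q = np.clip(np.round(w / s) + zp, 0, 255).astype(np.uint8)
    return q, s, zp                   # dequant: (q - zp) * s
```

Symmetric keeps the kernel simpler (no zero-point term in the matmul), while asymmetric squeezes a bit more range out of skewed distributions - which is the tradeoff being weighed here.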
Agree, in that regard symmetric > asymmetric! Digging into this more rn and reading up. It appears the following approach could work: 1. convert Ws to float16 and save. (mostly...
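A rough sketch of step 1 above (cast the weights to fp16 and dump them), assuming a flat raw-tensor layout rather than the repo's actual .bin format:

```python
import numpy as np

def export_fp16(state_dict, path):
    """Cast every tensor to float16 and write them out back-to-back (illustrative layout)."""
    with open(path, "wb") as f:
        for name, w in state_dict.items():
            w16 = np.asarray(w, dtype=np.float32).astype(np.float16)
            f.write(w16.tobytes())

# usage with a toy dict of numpy arrays:
# export_fp16({"wq": np.random.randn(4096, 4096).astype(np.float32)}, "weights_fp16.bin")
```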
> I think if I end up training more stories models I will reach for fp16 optimizer and gradient scaler instead of bf16 just so that exported models can use...
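For reference, this is roughly what the fp16-plus-gradient-scaler loop looks like in PyTorch (toy model and made-up hyperparameters, not the actual train.py changes):

```python
import torch
import torch.nn as nn

model = nn.Linear(64, 64).cuda()                         # stand-in for the real model
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
scaler = torch.cuda.amp.GradScaler()                     # needed for fp16; bf16 doesn't need it

for _ in range(10):
    x = torch.randn(8, 64, device="cuda")
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = model(x).pow(2).mean()
    scaler.scale(loss).backward()                        # scale loss so fp16 grads don't underflow
    scaler.step(optimizer)                               # unscales grads, skips step on inf/nan
    scaler.update()
```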
more discussion here: https://github.com/ggerganov/llama.cpp/issues/397#issuecomment-1493381230 and on that page...