Implement quantization on-the-fly
This feature allows quantizing FP32/FP16 models on-the-fly to any other quantized format, without the need to explicitly run quantize.py and keep quantized models on disk.
The intended use case is keeping only the FP16 model on disk, rather than wasting disk space on quantized models of every possible format.
Furthermore, if the quantization format changes again, users of on-the-fly quantization will not even notice, since the updated rwkv.cpp will simply use the new format when loading the FP16 model.
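For illustration, here is a minimal sketch of how this might look from the C API. `rwkv_init_from_file` and `rwkv_free` are the library's existing calls; the commented-out `rwkv_init_from_file_with_format` variant and its format argument are purely hypothetical, since the final signature depends on the API redesign discussed below.

```c
// Sketch only: the on-the-fly loading entry point is hypothetical.
#include <stdio.h>
#include "rwkv.h"

int main(void) {
    // Today: load the FP16 model as-is.
    struct rwkv_context * ctx = rwkv_init_from_file("model-FP16.bin", 4);

    // With this PR (hypothetical signature): load the FP16 model and
    // quantize its weights to Q5_1 in memory -- no quantized file on disk.
    // struct rwkv_context * ctx = rwkv_init_from_file_with_format("model-FP16.bin", 4, "Q5_1");

    if (!ctx) {
        fprintf(stderr, "Failed to load model\n");
        return 1;
    }

    rwkv_free(ctx);
    return 0;
}
```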
@LoganDark Thanks for describing the roadmap! Let's wait until the API redesign then. I hope it won't be too breaking :)
I'll leave this PR hanging as a draft until the new loading method is available, so that users who want on-the-fly quantization now can notice and use this branch.
> I hope it won't be too breaking :)
It should be possible to reimplement the current API in terms of the new one, in order to keep compatibility with existing programs :)
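As a sketch of that compatibility approach (everything except `rwkv_init_from_file` below is hypothetical, since the redesigned API does not exist yet): the old entry point can remain as a thin wrapper that forwards to the new loader with default options, so existing programs keep compiling and behaving the same.

```c
#include <stdint.h>
#include <stddef.h>

struct rwkv_context; // opaque handle, as in rwkv.h

// Placeholder for the redesigned loader's options (hypothetical).
struct rwkv_load_options {
    uint32_t n_threads;
    const char * target_format; // e.g. "Q5_1"; NULL keeps the on-disk format
};

// The redesigned entry point (hypothetical).
struct rwkv_context * rwkv_init_ex(const char * path, struct rwkv_load_options opts);

// The existing entry point, reimplemented in terms of the new one
// to keep compatibility with existing programs.
struct rwkv_context * rwkv_init_from_file(const char * path, const uint32_t n_threads) {
    struct rwkv_load_options opts = { n_threads, NULL };
    return rwkv_init_ex(path, opts);
}
```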