
Implement quantization on-the-fly

Open saharNooby opened this issue 2 years ago • 2 comments

This feature makes it possible to quantize FP32/FP16 models on-the-fly into any supported quantized format, without explicitly running quantize.py and keeping quantized models on disk.

The intended use case is keeping only the FP16 model on disk, instead of wasting disk space on quantized models in every possible format.

Furthermore, if the quantization format changes again, those who use on-the-fly quantization will not even notice: an updated rwkv.cpp will simply use the new format when loading the FP16 model.
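To make the intent concrete, here is a minimal sketch of what such an API could look like from the caller's side. Only `rwkv_init_from_file` and `rwkv_free` are real rwkv.cpp functions; `rwkv_init_from_file_with_format` and its `"Q5_1"` format argument are assumptions for illustration, not part of the actual API.

```c
// Sketch only: rwkv_init_from_file_with_format() is a HYPOTHETICAL loader
// that quantizes the FP16 model to the requested format while loading it.
#include <stdio.h>
#include "rwkv.h"

int main(void) {
    // Only the FP16 model is kept on disk; the target format is chosen
    // at load time instead of by running quantize.py beforehand.
    struct rwkv_context * ctx =
        rwkv_init_from_file_with_format("model-fp16.bin", /* n_threads */ 4, "Q5_1");

    if (ctx == NULL) {
        fprintf(stderr, "Failed to load and quantize the model\n");
        return 1;
    }

    // ... run inference as usual ...

    rwkv_free(ctx);
    return 0;
}
```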

saharNooby avatar Jun 14 '23 15:06 saharNooby

@LoganDark Thanks for describing the roadmap! Let's wait until the API redesign, then. I hope it won't be too breaking :)

I'll leave this PR hanging as a draft until the new loading method is available, so that users who want to use on-the-fly quantization now can notice and use this branch.

saharNooby avatar Jun 15 '23 11:06 saharNooby

> I hope it won't be too breaking :)

It should be possible to reimplement the current API in terms of the new one, in order to keep compatibility with existing programs :)
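For instance, the existing entry point could remain as a thin wrapper over the redesigned loader. This is a sketch of the idea with assumed names: only `rwkv_init_from_file` is a real rwkv.cpp function, while `rwkv_load_model` and `RWKV_FORMAT_AS_STORED` are hypothetical.

```c
// Compatibility shim sketch: the current API reimplemented in terms of a
// HYPOTHETICAL new loader, so existing programs keep working unchanged.
struct rwkv_context * rwkv_init_from_file(const char * path, const uint32_t n_threads) {
    // Old behavior: load the model in whatever format it was saved in,
    // expressed as a call to the new, more general loader.
    return rwkv_load_model(path, n_threads, RWKV_FORMAT_AS_STORED);
}
```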

LoganDark avatar Jun 15 '23 14:06 LoganDark