Consider uploading some quantized checkpoints to Hugging Face
Correct me if I'm wrong, but quantizing would require loading the models in their unquantized form (as per `torch.load` in https://github.com/saharNooby/rwkv.cpp/blob/master/rwkv/convert_pytorch_to_ggml.py, line 126). Not to mention how much heavier the unquantized models are on bandwidth.
Only the PyTorch -> rwkv.cpp conversion requires loading the whole model into RAM; quantization is done tensor-by-tensor (see the sketch below). You are right about the bandwidth, though.
I'll consider it, thanks for the suggestion!
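
For reference, a rough sketch of the two steps. The conversion script path comes from the link above; the quantization script name, argument order, and format strings (`FP16`, `Q5_1`) as well as the file names are assumptions from memory and may differ from the current repository layout.

```python
# Minimal sketch of the convert + quantize pipeline, assuming the rwkv.cpp helper scripts.
import subprocess

# Step 1: PyTorch checkpoint -> ggml. convert_pytorch_to_ggml.py calls torch.load(),
# so the entire unquantized checkpoint must fit in RAM at once.
subprocess.run([
    "python", "rwkv/convert_pytorch_to_ggml.py",
    "RWKV-4-Raven-7B-v11x-Eng99-20230429-ctx8192.pth",  # hypothetical input path
    "rwkv-4-raven-7b-f16.bin",                           # hypothetical output path
    "FP16",
], check=True)

# Step 2: ggml -> quantized ggml. Quantization streams tensor-by-tensor,
# so peak memory stays around the size of a single tensor rather than the whole model.
subprocess.run([
    "python", "rwkv/quantize.py",                        # script name assumed
    "rwkv-4-raven-7b-f16.bin",
    "rwkv-4-raven-7b-Q5_1.bin",
    "Q5_1",
], check=True)
```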
I have uploaded some quantized RWKV-4-Raven models to HuggingFace at LoganDark/rwkv-4-raven-ggml. Conversion took about 2 hours, and upload took about 24 hours and 500GB of disk space.
At the time of writing, the available models are:
| Name | f32 | f16 | Q4_0 | Q4_1 | Q4_2 | Q5_1 | Q8_0 |
|---|---|---|---|---|---|---|---|
| RWKV-4-Raven-1B5-v11-Eng99-20230425-ctx4096 | Yes | Yes | Yes | No | Yes | Yes | Yes |
| RWKV-4-Raven-3B-v11-Eng99-20230425-ctx4096 | Yes | Yes | Yes | No | Yes | Yes | Yes |
| RWKV-4-Raven-7B-v11x-Eng99-20230429-ctx8192 | Yes | Yes | Yes | No | Yes | Yes | Yes |
| RWKV-4-Raven-14B-v11x-Eng99-20230501-ctx8192 | Split | Yes | Yes | No | Yes | Yes | Yes |
Feel free to create a discussion if you have a request.