Consider uploading some quantized checkpoints to Hugging Face
Correct me if I'm wrong, but quantizing would require loading the models in their unquantized form (as per `torch.load` in https://github.com/saharNooby/rwkv.cpp/blob/master/rwkv/convert_pytorch_to_ggml.py, line 126). Not to mention how much heavier the unquantized models are on bandwidth.
Only the PyTorch -> rwkv.cpp conversion requires loading the whole model into RAM; quantization is done tensor-by-tensor (see the sketch below). You are right about the bandwidth, though.
I'll consider it, thanks for the suggestion!
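
For reference, a rough sketch of the two steps. The conversion script path comes from the link above; the quantization script name, argument order, and format strings (`FP16`, `Q5_1`) as well as the file names are assumptions from memory and may differ from the current repository layout.

```python
# Minimal sketch of the convert + quantize pipeline, assuming the rwkv.cpp helper scripts.
import subprocess

# Step 1: PyTorch checkpoint -> ggml. convert_pytorch_to_ggml.py calls torch.load(),
# so the entire unquantized checkpoint must fit in RAM at once.
subprocess.run([
    "python", "rwkv/convert_pytorch_to_ggml.py",
    "RWKV-4-Raven-7B-v11x-Eng99-20230429-ctx8192.pth",  # hypothetical input path
    "rwkv-4-raven-7b-f16.bin",                           # hypothetical output path
    "FP16",
], check=True)

# Step 2: ggml -> quantized ggml. Quantization streams tensor-by-tensor,
# so peak memory stays around the size of a single tensor rather than the whole model.
subprocess.run([
    "python", "rwkv/quantize.py",                        # script name assumed
    "rwkv-4-raven-7b-f16.bin",
    "rwkv-4-raven-7b-Q5_1.bin",
    "Q5_1",
], check=True)
```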
I have uploaded some quantized RWKV-4-Raven models to HuggingFace at LoganDark/rwkv-4-raven-ggml. Conversion took about 2 hours, and upload took about 24 hours and 500GB of disk space.
At the time of writing, the available models are:
| Name | f32 | f16 | Q4_0 | Q4_1 | Q4_2 | Q5_1 | Q8_0 |
|---|---|---|---|---|---|---|---|
| RWKV-4-Raven-1B5-v11-Eng99-20230425-ctx4096 | Yes | Yes | Yes | No | Yes | Yes | Yes |
| RWKV-4-Raven-3B-v11-Eng99-20230425-ctx4096 | Yes | Yes | Yes | No | Yes | Yes | Yes |
| RWKV-4-Raven-7B-v11x-Eng99-20230429-ctx8192 | Yes | Yes | Yes | No | Yes | Yes | Yes |
| RWKV-4-Raven-14B-v11x-Eng99-20230501-ctx8192 | Split | Yes | Yes | No | Yes | Yes | Yes |
Feel free to create a discussion if you have a request.