llama: Add support for RWKV v7 architecture
@BlinkDL's explanation of RWKV v7: RWKV-7 as a meta-in-context learner. There are also plenty of test results for trained models (currently 0.1B and 0.4B) posted on his X account, and larger models are coming in the next few days.
Currently available RWKV v7 model repos in HF format:
- https://huggingface.co/SmerkyG/RWKV7-Goose-0.1B-World2.8-HF (not an officially published one; tensor names are expected to change in the future)
- https://huggingface.co/mollysama/rwkv-7-world-0b4-hf
- https://huggingface.co/mollysama/rwkv-7-world-1b5-hf
- https://huggingface.co/RWKV-Red-Team/ARWKV-7B-Preview-0.1 (a ~~hybrid~~ distilled model with RWKV v7 "attn" and Qwen2.5 7B's MLP, distilled from Qwen2.5; it is not really appropriate to call these "hybrid" models because they don't actually have transformer attention)
Distilled DS-R1 models:
- https://huggingface.co/RWKV-Red-Team/ARWKV-R1-7B
- https://huggingface.co/RWKV-Red-Team/ARWKV-R1-1B5
This PR contains:
- `GGML_OP_L2_NORM`, which applies PyTorch-style L2 normalization along the rows. Tested with the CPU, CUDA, SYCL, Vulkan, and Metal backends (see the first sketch after this list).
- `GGML_OP_RWKV_WKV7`, which is the core of the RWKV v7 architecture. The naive recurrent wkv7 kernel is implemented for CPU, CUDA, SYCL, Vulkan, and Metal (see the second sketch after this list).
- Support for inference of RWKV7 and ARWKV7 models.
- A simple Metal kernel for the old WKV6.
- Skipping unused tokens in the last layer's FFN computation for RWKV models (8000 t/s -> 8100 t/s prefill for the 7B v7 model; see the third sketch after this list).
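As a reference for the first item, here is a minimal CPU sketch of what PyTorch-style row-wise L2 normalization computes (each row divided by `max(||row||_2, eps)`, as in `torch.nn.functional.normalize`). It only illustrates the math; the function name and layout are placeholders, not the actual ggml kernel:

```c
#include <math.h>
#include <stddef.h>

// Normalize each row of a row-major [n_rows x n_cols] float matrix to unit L2 norm,
// clamping the norm at eps (mirrors torch.nn.functional.normalize along the last dim).
static void l2_norm_rows(float * x, size_t n_rows, size_t n_cols, float eps) {
    for (size_t r = 0; r < n_rows; r++) {
        float * row = x + r * n_cols;
        float sum_sq = 0.0f;
        for (size_t c = 0; c < n_cols; c++) {
            sum_sq += row[c] * row[c];
        }
        const float norm = fmaxf(sqrtf(sum_sq), eps);
        for (size_t c = 0; c < n_cols; c++) {
            row[c] /= norm;
        }
    }
}
```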
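For the second item, the recurrence evaluated by the naive wkv7 kernel is roughly S_t = S_{t-1} diag(w_t) + (S_{t-1} a_t) b_t^T + v_t k_t^T, with output y_t = S_t r_t. Below is a single-head, single-timestep CPU sketch of that update, assuming a row-major [head_size x head_size] state; the names and tensor layout are illustrative only and differ from the actual ggml kernel:

```c
#include <stddef.h>

// One time step of a naive single-head WKV7-style recurrence, written in plain C.
// S is the row-major [head_size x head_size] state (rows: value dim, cols: key dim);
// r, w, k, v, a, b are per-token vectors of length head_size; y receives the output.
static void wkv7_step(float * S, const float * r, const float * w,
                      const float * k, const float * v,
                      const float * a, const float * b,
                      float * y, size_t head_size) {
    for (size_t i = 0; i < head_size; i++) {
        float * Si = S + i * head_size;

        // sa = (S_{t-1} a_t)_i, the in-context "removal/replacement" term
        float sa = 0.0f;
        for (size_t j = 0; j < head_size; j++) {
            sa += Si[j] * a[j];
        }

        float yi = 0.0f;
        for (size_t j = 0; j < head_size; j++) {
            // decay the old state, apply the delta-rule correction, add the new k/v outer product
            Si[j] = Si[j] * w[j] + sa * b[j] + v[i] * k[j];
            yi += Si[j] * r[j];  // y_t = S_t r_t
        }
        y[i] = yi;
    }
}
```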
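For the last item, the idea behind skipping unused tokens is that during prefill only the tokens whose logits are actually requested need to go through the final layer's FFN, so the per-token FFN work for the rest of the batch can be dropped. The toy sketch below only illustrates that restriction; `ffn_row_fn`, `out_ids`, and the function name are hypothetical and are not the graph-building code in this PR:

```c
#include <stddef.h>

// Stand-in for the last layer's per-token FFN (channel mix / MLP), applied to one row.
typedef void (*ffn_row_fn)(const float * x_in, float * x_out, size_t d);

// Apply the FFN only to the rows listed in out_ids (the tokens whose outputs are
// needed), instead of to all n_tokens rows of the batch.
static void ffn_outputs_only(const float * x, float * y, size_t d,
                             const size_t * out_ids, size_t n_outputs,
                             ffn_row_fn ffn) {
    for (size_t o = 0; o < n_outputs; o++) {
        const size_t t = out_ids[o];
        ffn(x + t * d, y + o * d, d);
    }
}
```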
TODO: ~~- [ ] (within this PR or in the future) Implement chunkwise wkv7 (and possibly wkv6 as well) as per flash-linear-attention's impl.~~
Note: Current benchmark of ARWKV7-7B f16
# molly @ molly-workstation in ~/llama.cpp on git:rwkv-v7 x [9:49:42]
$ ./build-test/bin/llama-bench -m ../ARWKV-7B-Preview-0_1-NoG/ARWKV-7B-Preview-0_1-NoG-F16.gguf -ngl 99
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| arwkv7 7B F16 | 15.42 GiB | 8.27 B | CUDA | 99 | pp512 | 8105.20 ± 15.34 |
| arwkv7 7B F16 | 15.42 GiB | 8.27 B | CUDA | 99 | tg128 | 50.62 ± 0.01 |
build: 76219859 (4579)
This is much faster than RWKV v6 7B at prefill (though still a bit slower than Qwen2.5 7B).