
llama: Add support for RWKV v7 architecture

MollySophia opened this pull request

@BlinkDL's explanation of RWKV v7: RWKV-7 as a meta-in-context learner. There are also plenty of test results for trained models (currently 0.1B and 0.4B) posted on his X account. Larger models are coming in the next several days.

Currently available RWKV v7 model repos in HF format:

  • https://huggingface.co/SmerkyG/RWKV7-Goose-0.1B-World2.8-HF (not an officially published one; tensor names are expected to change in the future)
  • https://huggingface.co/mollysama/rwkv-7-world-0b4-hf
  • https://huggingface.co/mollysama/rwkv-7-world-1b5-hf
  • https://huggingface.co/RWKV-Red-Team/ARWKV-7B-Preview-0.1 (a ~~hybrid~~ distilled model combining RWKV v7 "attn" with Qwen2.5 7B's MLP, distilled from Qwen2.5; "hybrid" isn't really the right word, since these models don't actually contain any transformer attention)

Distilled DS-R1 models:

  • https://huggingface.co/RWKV-Red-Team/ARWKV-R1-7B
  • https://huggingface.co/RWKV-Red-Team/ARWKV-R1-1B5

This PR contains:

  • GGML_OP_L2_NORM, which applies PyTorch-style L2 normalization along rows. Tested with the CPU, CUDA, SYCL, Vulkan, and Metal backends.
  • GGML_OP_RWKV_WKV7, the core of the RWKV v7 architecture. The naive recurrent wkv7 kernel is implemented for CPU, CUDA, SYCL, Vulkan, and Metal.
  • Support for inference of RWKV7 and ARWKV7 models.
  • A simple Metal kernel for the old WKV6.
  • Skipping unused tokens in the last layer's FFN computation for RWKV models (8000 t/s -> 8100 t/s prefill for the 7B v7 model).
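As a rough numerical reference for the new L2-norm op (an illustrative pure-Python sketch, not the ggml implementation itself), PyTorch-style row-wise L2 normalization divides each row by max(‖row‖₂, eps), matching the semantics of torch.nn.functional.normalize with p=2 along the last dimension:

```python
def l2_norm_rows(mat, eps=1e-12):
    """Normalize each row of `mat` to unit L2 norm: x / max(||x||_2, eps)."""
    out = []
    for row in mat:
        norm = sum(v * v for v in row) ** 0.5
        # eps guards against division by zero for all-zero rows
        out.append([v / max(norm, eps) for v in row])
    return out

print(l2_norm_rows([[3.0, 4.0]]))  # [[0.6, 0.8]]
```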

TODO: ~~(within this PR or in the future) Implement chunkwise wkv7 (and possibly wkv6 as well), following flash-linear-attention's implementation.~~
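For intuition, a single step of a naive recurrent wkv7 kernel can be sketched in pure Python as below (single head, head size n). The exact state-update form is an assumption for illustration, based on the commonly described RWKV-7 recurrence S ← S·diag(w) + (S·a)bᵀ + v·kᵀ with output y = S·r; it is not a transcription of the ggml kernel:

```python
def wkv7_step(state, r, w, k, v, a, b):
    """One timestep of an assumed naive wkv7 recurrence (illustrative only).

    state: n x n matrix, mutated in place.
      sa[i]        = sum_j state[i][j] * a[j]   (in-context state read)
      state[i][j] <- state[i][j]*w[j] + sa*b[j] + v[i]*k[j]
      y[i]         = sum_j state[i][j] * r[j]
    """
    n = len(r)
    y = [0.0] * n
    for i in range(n):
        sa = sum(state[i][j] * a[j] for j in range(n))
        for j in range(n):
            state[i][j] = state[i][j] * w[j] + sa * b[j] + v[i] * k[j]
        y[i] = sum(state[i][j] * r[j] for j in range(n))
    return y

# With zero initial state and a = b = 0 this degenerates to state = v kᵀ, y = state r:
s = [[0.0, 0.0], [0.0, 0.0]]
print(wkv7_step(s, r=[1.0, 1.0], w=[1.0, 1.0], k=[1.0, 2.0],
                v=[3.0, 4.0], a=[0.0, 0.0], b=[0.0, 0.0]))  # [9.0, 12.0]
```

A chunkwise formulation (the TODO above) would process blocks of timesteps with matrix multiplications instead of this token-by-token loop, which is why it prefills faster.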

Note: current benchmark of ARWKV7-7B F16:

# molly @ molly-workstation in ~/llama.cpp on git:rwkv-v7 x [9:49:42] 
$ ./build-test/bin/llama-bench -m ../ARWKV-7B-Preview-0_1-NoG/ARWKV-7B-Preview-0_1-NoG-F16.gguf -ngl 99
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
| model                          |       size |     params | backend    | ngl |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------: | -------------------: |
| arwkv7 7B F16                  |  15.42 GiB |     8.27 B | CUDA       |  99 |         pp512 |      8105.20 ± 15.34 |
| arwkv7 7B F16                  |  15.42 GiB |     8.27 B | CUDA       |  99 |         tg128 |         50.62 ± 0.01 |

build: 76219859 (4579)

Prefill is thus much faster than RWKV v6 7B (though still a bit slower than Qwen2.5 7B).

MollySophia · Jan 27 '25 13:01