
Can we get RWKV model family support?

ye7iaserag opened this issue 2 years ago • 86 comments

A very promising RNN-based model (instead of a transformer), claiming leaner memory consumption and faster inference.

Here is the 14b card https://huggingface.co/BlinkDL/rwkv-4-pile-14b

ye7iaserag avatar Feb 21 '23 08:02 ye7iaserag

That would be quite interesting

Slug-Cat avatar Feb 21 '23 19:02 Slug-Cat

The lack of huggingface integration makes this challenging, but it should be possible.

oobabooga avatar Feb 22 '23 00:02 oobabooga

Here is client code from the author https://github.com/BlinkDL/ChatRWKV

ye7iaserag avatar Feb 22 '23 19:02 ye7iaserag

Here is the API for RWKV: https://github.com/BlinkDL/ChatRWKV/blob/main/v2/api_demo.py Very simple to use :) Please join RWKV Discord if you have any questions

BlinkDL avatar Feb 23 '23 12:02 BlinkDL

ChatRWKV v2: with "stream" and "split" strategies. 3G VRAM is enough to run RWKV 14B :) ChatRWKV v2 API https://github.com/BlinkDL/ChatRWKV/blob/main/v2/api_demo.py

os.environ["RWKV_JIT_ON"] = '1'
from rwkv.model import RWKV                         # everything in /v2/rwkv folder
model = RWKV(model='/fsx/BlinkDL/HF-MODEL/rwkv-4-pile-1b5/RWKV-4-Pile-1B5-20220903-8040', strategy='cuda fp16')

out, state = model.forward([187, 510, 1563, 310, 247], None)   # use 20B_tokenizer.json
print(out.detach().cpu().numpy())                   # get logits
out, state = model.forward([187, 510], None)
out, state = model.forward([1563], state)           # RNN has state (use deepcopy if you want to clone it)
out, state = model.forward([310, 247], state)
print(out.detach().cpu().numpy())                   # same result as above

BlinkDL avatar Feb 23 '23 19:02 BlinkDL

@BlinkDL is it possible to construct a wrapper to the model that simply does something like this?

prompt = "What I would like to say is: "
completion = model_RWKV.generate(prompt, max_new_tokens=20)
print(completion)

That would be very helpful.

oobabooga avatar Feb 23 '23 19:02 oobabooga

@BlinkDL is it possible to construct a wrapper to the model that simply does something like this?

prompt = "What I would like to say is: "
completion = model_RWKV.generate(prompt, max_new_tokens=20)
print(completion)

That would be very helpful.

Sure. How about temp, top-p, top-k stuffs?

BlinkDL avatar Feb 23 '23 19:02 BlinkDL

Those things are nice, but not essential at first. FlexGen for instance has departed from HuggingFace classes but still provided a model.generate function that was trivial to use for clueless people (like me).

oobabooga avatar Feb 23 '23 20:02 oobabooga

Those things are nice, but not essential at first. FlexGen for instance has departed from HuggingFace classes but still provided a model.generate function that was trivial to use for clueless people (like me).

updated https://github.com/BlinkDL/ChatRWKV/blob/main/v2/api_demo.py

def generate(prompt, max_new_tokens, state=None):
    completion = ''
    all_tokens = []
    for i in range(max_new_tokens):
        # feed the whole prompt on the first step, then only the newly sampled token
        out, state = model.forward(tokenizer.encode(prompt) if i == 0 else [token], state)
        token = tokenizer.sample_logits(out, None, None, temperature=1.0, top_p=0.8)
        all_tokens += [token]
        tmp = tokenizer.decode(all_tokens)
        if '\ufffd' not in tmp: # only emit once the bytes decode to a valid utf-8 string
            completion = tmp
    return completion

prompt = "What I would like to say is: "
print(prompt, end='')
completion = generate(prompt, max_new_tokens=20)
print(completion)

Note: it's slow (not optimized yet) when your prompt is long. Better keep it as short as possible (for now).

BlinkDL avatar Feb 23 '23 20:02 BlinkDL

I added some tips to https://github.com/BlinkDL/ChatRWKV/blob/main/v2/api_demo.py

BlinkDL avatar Feb 23 '23 21:02 BlinkDL

Very nice, @BlinkDL! Thank you for this. I will try incorporating it into the web UI later.

oobabooga avatar Feb 23 '23 21:02 oobabooga

Now ChatRWKV can quickly preprocess the context if you set os.environ["RWKV_CUDA_ON"] = '1' before loading it :) It will use ninja to compile a CUDA kernel. You can distribute the compiled kernels to those without a compiler.
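
Both environment variables go in before the rwkv import, as in the api_demo; a minimal sketch (model path and strategy are just values used elsewhere in this thread):

import os
os.environ["RWKV_JIT_ON"] = '1'
os.environ["RWKV_CUDA_ON"] = '1'   # compiles a CUDA kernel with ninja on first use

from rwkv.model import RWKV
model = RWKV(model='models/RWKV-4-Pile-7B-20221115-8047', strategy='cuda fp16')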

BlinkDL avatar Feb 25 '23 06:02 BlinkDL

We have a first successful run: https://github.com/oobabooga/text-generation-webui/pull/149

[Screenshot: first successful generation with an RWKV model in the web UI]

I am very impressed with the coherence of the model so far, even though I have used the smallest version.

@BlinkDL I have a few questions:

  • What are alpha_frequency and alpha_presence in terms of Hugging Face parameters? These are the parameters that I am using at the moment:

max_new_tokens, do_sample, temperature, top_p, typical_p, repetition_penalty, top_k, min_length, no_repeat_ngram_size, num_beams, penalty_alpha, length_penalty, early_stopping, eos_token

  • Is there a way of adding the library to my project other than cloning the repository and importing from the rwkv folder?

oobabooga avatar Feb 28 '23 02:02 oobabooga

For alpha_frequency and alpha_presence, see "Frequency and presence penalties": https://platform.openai.com/docs/api-reference/parameter-details
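
In rough Python terms (an illustrative sketch of the OpenAI-style penalties, not RWKV's actual sampling code), each token's logit is reduced by how often that token has already been generated, plus a flat amount if it has appeared at all:

from collections import Counter

def apply_penalties(logits, generated_tokens, alpha_frequency=0.25, alpha_presence=0.25):
    # logits: scores indexed by token id; generated_tokens: token ids produced so far
    for token_id, count in Counter(generated_tokens).items():
        logits[token_id] -= count * alpha_frequency + alpha_presence   # frequency + presence
    return logits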

Please clone and import until I make the pip package :) Too busy at the moment, but I will do it.

BlinkDL avatar Feb 28 '23 06:02 BlinkDL

The pip package is here :) https://pypi.org/project/rwkv/
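
With the package installed (pip install rwkv), the clone-and-import step from before reduces to roughly this (mirroring the api_demo; model path illustrative):

import os
os.environ["RWKV_JIT_ON"] = '1'                     # as in the api_demo
from rwkv.model import RWKV                         # no vendored /v2/rwkv folder needed
model = RWKV(model='models/RWKV-4-Pile-1B5-20220903-8040', strategy='cuda fp16')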

BlinkDL avatar Mar 01 '23 08:03 BlinkDL

Much easier with the pip package. Some more tweaks and I should be able to merge the PR.

oobabooga avatar Mar 01 '23 15:03 oobabooga

I have merged the PR, and RWKV models now work in the web UI. Some comments:

  1. I have made successful tests with RWKV-4-Pile-7B-20221115-8047.pth and RWKV-4-Pile-169M-20220807-8023.pth. I ran out of GPU memory with RWKV-4-Pile-14B-20230213-8019.pth (RTX 3090).
  2. To install a model, just place .pth and the 20B_tokenizer.json file inside the models folder.
  3. In order to make the process of loading the model more familiar, I have created a RWKV module with the following methods: RWKVModel.from_pretrained and RWKVModel.generate (a rough sketch of such a wrapper is included at the end of this comment). Source code: https://github.com/oobabooga/text-generation-webui/blob/main/modules/RWKV.py
  4. Only the top_p and temperature parameters can be changed in the interface for now. I am reluctant to add alpha_frequency and alpha_presence sliders because these parameters are used exclusively by RWKV and this change would break the API. They are left hardcoded to the default values of 0.25.
  5. It is possible to run the models in CPU mode with --cpu. By default, they are loaded to the GPU.
  6. The best way to try the models is with python server.py --no-stream. That is, without --chat, --cai-chat, etc.
  7. The current implementation should only work on Linux because the rwkv library reads paths as strings. It would be best to accept pathlib.Path variables too.
  8. After loading the model, make sure to lower the temperature to somewhere between 0.5 and 1 for best results.
  9. The internal usage of the model required some additional special handling because there is no explicit tokenizer. In HF, the process is to take an input string, convert it into a sequence of tokens, generate, and then convert back to a string.

This is the full code for the PR: https://github.com/oobabooga/text-generation-webui/pull/149/files

There is probably lots of room for improvement.
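
For anyone curious, here is a rough, illustrative sketch of what a from_pretrained/generate wrapper along these lines can look like. The class and method names mirror the module described in item 3, but the body below is a simplified stand-in (tokenizers library plus naive top-p sampling), not the actual modules/RWKV.py code:

import os
import numpy as np
os.environ["RWKV_JIT_ON"] = '1'                     # must be set before importing rwkv

from rwkv.model import RWKV
from tokenizers import Tokenizer

class RWKVModel:
    @classmethod
    def from_pretrained(cls, model_path, tokenizer_path='models/20B_tokenizer.json', strategy='cuda fp16'):
        self = cls()
        self.model = RWKV(model=model_path, strategy=strategy)   # model path as in the demo above
        self.tokenizer = Tokenizer.from_file(tokenizer_path)
        return self

    def generate(self, prompt, token_count=20, temperature=1.0, top_p=0.8):
        out, state = self.model.forward(self.tokenizer.encode(prompt).ids, None)  # feed the whole prompt once
        all_tokens = []
        for _ in range(token_count):
            token = self._sample(out.detach().cpu().numpy(), temperature, top_p)
            all_tokens.append(token)
            out, state = self.model.forward([token], state)      # RNN: only the new token afterwards
        return prompt + self.tokenizer.decode(all_tokens)

    def _sample(self, logits, temperature, top_p):
        probs = np.exp(logits - np.max(logits))
        probs /= probs.sum()                                     # softmax
        sorted_ids = np.argsort(-probs)
        cutoff = np.searchsorted(np.cumsum(probs[sorted_ids]), top_p) + 1
        keep = sorted_ids[:cutoff]                               # nucleus (top-p) candidates
        p = probs[keep] ** (1.0 / temperature)
        p /= p.sum()
        return int(np.random.choice(keep, p=p))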

oobabooga avatar Mar 01 '23 20:03 oobabooga

Cool. Please use rwkv==0.0.6 (it fixes a bug with temperature on CPU)

You can use "strategies" to load 14B on 3090. For example 'cuda fp16 *30 -> cpu fp32' [try increasing 30, for better speed, until you run out of VRAM]
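
For example, with the constructor from the earlier demo (model path illustrative):

# keep the first 30 layers on the GPU in fp16, run the remaining layers on the CPU in fp32
model = RWKV(model='models/RWKV-4-Pile-14B-20230213-8019', strategy='cuda fp16 *30 -> cpu fp32')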

BlinkDL avatar Mar 01 '23 20:03 BlinkDL

Please use rwkv==0.0.6

Done. I have added a new flag to let users specify their custom strategies manually:

  --rwkv-strategy RWKV_STRATEGY             The strategy to use while loading RWKV models. Examples: "cpu fp32", "cuda fp16", "cuda fp16 *30 -> cpu fp32".

As you suggested, "cuda fp16 *30 -> cpu fp32" allowed me to get a response from the 14b model with 24GB VRAM.
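
So trying the 14B model now comes down to an invocation along the lines of:

python server.py --no-stream --rwkv-strategy "cuda fp16 *30 -> cpu fp32"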

oobabooga avatar Mar 01 '23 23:03 oobabooga

My streaming implementation is probably really dumb and much slower than it could be:

            for i in range(max_new_tokens//8):
                reply = shared.model.generate(question, token_count=8, temperature=temperature, top_p=top_p)
                yield formatted_outputs(reply, shared.model_name)
                question = reply

It would be best to use the callback argument but I haven't figured out how yet.
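
If generate does accept a per-token callback (the callback= keyword below is an assumption, not a confirmed signature), one way to turn it into a generator would be the usual thread-plus-queue bridge; shared and formatted_outputs are the same web UI helpers as in the snippet above:

from queue import Queue
from threading import Thread

def stream_generate(question, max_new_tokens, temperature, top_p):
    q = Queue()
    sentinel = object()

    def on_token(token_text):
        q.put(token_text)                        # assumed contract: called once per decoded token

    def worker():
        shared.model.generate(question, token_count=max_new_tokens,
                              temperature=temperature, top_p=top_p,
                              callback=on_token)  # 'callback=' is hypothetical
        q.put(sentinel)

    Thread(target=worker, daemon=True).start()
    reply = question
    while True:
        item = q.get()
        if item is sentinel:
            break
        reply += item
        yield formatted_outputs(reply, shared.model_name)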

oobabooga avatar Mar 01 '23 23:03 oobabooga

@BlinkDL You said earlier that with the "stream" and "split" strategies, 3G VRAM is enough to run RWKV 14B, yet @oobabooga said it went OOM on a 3090 (24GB of VRAM). What am I missing?

ye7iaserag avatar Mar 02 '23 06:03 ye7iaserag

@BlinkDL You said earlier that with the "stream" and "split" strategies, 3G VRAM is enough to run RWKV 14B, yet @oobabooga said it went OOM on a 3090 (24GB of VRAM). What am I missing?

see the post above you. "As you suggested, "cuda fp16 *30 -> cpu fp32" allowed me to get a response from the 14b model with 24GB VRAM."

BlinkDL avatar Mar 02 '23 06:03 BlinkDL

If it replies faster/better than a regular 13b even with the split, it's still something. Plus the faster time to train. But I guess we will not get miracles.

Why not? RWKV is faster than regular 13b and better on many tasks.

BlinkDL avatar Mar 02 '23 16:03 BlinkDL

@BlinkDL but it's 24GB, not 3GB. I really wanted to run that on a 3080 Ti, which only has 12GB

ye7iaserag avatar Mar 02 '23 21:03 ye7iaserag

@BlinkDL but it's 24GB, not 3GB. I really wanted to run that on a 3080 Ti, which only has 12GB

"cuda fp16 *12 -> cpu fp32" [try increasing 12, for better speed, until you run out of VRAM]

BlinkDL avatar Mar 03 '23 02:03 BlinkDL

Hi :) As I said before, [try increasing 30, for better speed, until you run out of VRAM]. @Ph0rk0z

Increase "30" in cuda fp16 *30 to compute more layers on your GPU.

I notice PyTorch is buggy on some CPUs, where it can only use a single CPU thread (so it is extremely slow). Are you using an AMD CPU?

In that case, try "cuda fp16 *30+" (notice the "+" symbol, and no cpu) to stream all layers on your GPU. Increase "30" for better speed, until you run out of VRAM.

BlinkDL avatar Mar 03 '23 07:03 BlinkDL

Moreover, set os.environ["RWKV_CUDA_ON"] = '1' in https://github.com/oobabooga/text-generation-webui/blob/main/modules/RWKV.py for a 10x speedup in reply time

BlinkDL avatar Mar 03 '23 07:03 BlinkDL

It's purely a PyTorch issue, because CPU utilization is fine for most Intel CPUs and AMD server CPUs. I will ask the PyTorch guys.

Please see whether "cuda fp16 *29+" will be faster than "cuda fp16 *29 -> cpu fp32" in your case.

Yes, and ctx4096 models run at the same speed as ctx1024 models. I am finetuning the 7B to ctx8k and the 14B to ctx16k :)

BlinkDL avatar Mar 03 '23 15:03 BlinkDL

I have been trying to compile the CUDA kernel with

os.environ["RWKV_CUDA_ON"] = '1'

but am getting this error:

gcc: fatal error: cannot execute 'cc1plus': execvp: No such file or directory

even though I have build-essential installed already.

Once I figure this out, I will write some proper documentation for RWKV in the wiki.

oobabooga avatar Mar 04 '23 00:03 oobabooga

All in all, this model handles 4096 context well enough. Maybe the limits should be raised.

RWKV-ctx4096 models can handle ctx4k :)

The difference between cuda and non-cuda is huge if you are running the full model on the GPU.

And it can be much faster after optimization. From someone: The current generation speed for RWKV 14B for a batch size of 1 and INT8 quantization (no quality loss compared to bf16) is 5.5 tokens/s on an EPYC 7313 CPU. On an RTX A6000 GPU it is 37 tokens/s with a max VRAM of 15 GB.

BlinkDL avatar Mar 05 '23 05:03 BlinkDL