Can we get RWKV model family support?
A very promising RNN-based model (rather than a transformer), claiming leaner memory consumption and faster inference
Here is the 14b card https://huggingface.co/BlinkDL/rwkv-4-pile-14b
That would be quite interesting
The lack of huggingface integration makes this challenging, but it should be possible.
Here is client code from the author https://github.com/BlinkDL/ChatRWKV
Here is the API for RWKV: https://github.com/BlinkDL/ChatRWKV/blob/main/v2/api_demo.py Very simple to use :) Please join RWKV Discord if you have any questions
ChatRWKV v2 comes with "stream" and "split" strategies; 3 GB of VRAM is enough to run RWKV 14B :) ChatRWKV v2 API: https://github.com/BlinkDL/ChatRWKV/blob/main/v2/api_demo.py
import os
os.environ["RWKV_JIT_ON"] = '1'
from rwkv.model import RWKV # everything in /v2/rwkv folder
model = RWKV(model='/fsx/BlinkDL/HF-MODEL/rwkv-4-pile-1b5/RWKV-4-Pile-1B5-20220903-8040', strategy='cuda fp16')
out, state = model.forward([187, 510, 1563, 310, 247], None) # use 20B_tokenizer.json
print(out.detach().cpu().numpy()) # get logits
out, state = model.forward([187, 510], None)
out, state = model.forward([1563], state) # RNN has state (use deepcopy if you want to clone it)
out, state = model.forward([310, 247], state)
print(out.detach().cpu().numpy()) # same result as above
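Since the comment above mentions cloning the state with deepcopy, here is a small sketch of branching two continuations from the same prefix (the token ids after the prefix are arbitrary examples; the deepcopy is there so the two branches do not share a mutated state):
import copy

out, state = model.forward([187, 510, 1563, 310, 247], None)
snapshot = copy.deepcopy(state)  # clone the RNN state before branching

out_a, state_a = model.forward([310], state)     # continuation A
out_b, state_b = model.forward([253], snapshot)  # continuation B, resumed from the saved prefix state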
@BlinkDL is it possible to construct a wrapper to the model that simply does something like this?
prompt = "What I would like to say is: "
completion = model_RWKV.generate(prompt, max_new_tokens=20)
print(completion)
That would be very helpful.
Sure. How about temperature, top-p, top-k, and so on?
Those things are nice, but not essential at first. FlexGen for instance has departed from HuggingFace classes but still provided a model.generate function that was trivial to use for clueless people (like me).
updated https://github.com/BlinkDL/ChatRWKV/blob/main/v2/api_demo.py
def generate(prompt, max_new_tokens, state=None):
    out_str = ''
    all_tokens = []
    for i in range(max_new_tokens):
        # feed the full prompt on the first step, then only the newly sampled token
        logits, state = model.forward(tokenizer.encode(prompt) if i == 0 else [token], state)
        token = tokenizer.sample_logits(logits, None, None, temperature=1.0, top_p=0.8)
        all_tokens += [token]
        tmp = tokenizer.decode(all_tokens)
        if '\ufffd' not in tmp:  # only keep the text once it decodes to valid UTF-8
            out_str = tmp
    return out_str
prompt = "What I would like to say is: "
print(prompt, end='')
completion = generate(prompt, max_new_tokens=20)
print(completion)
Note: it's slow (not optimized yet) when your prompt is long. Better keep it as short as possible (for now).
I added some tips to https://github.com/BlinkDL/ChatRWKV/blob/main/v2/api_demo.py
Very nice, @BlinkDL! Thank you for this. I will try incorporating it into the web UI later.
Now ChatRWKV can quickly preprocess the context if you set os.environ["RWKV_CUDA_ON"] = '1' before loading it :) It will use ninja to compile a CUDA kernel. You can distribute the compiled kernels to users who don't have a compiler.
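A minimal sketch of what that looks like (the model path is a placeholder); the important detail is that the environment variables are set before rwkv.model is imported:
import os
os.environ["RWKV_JIT_ON"] = '1'
os.environ["RWKV_CUDA_ON"] = '1'  # compiles the CUDA kernel with ninja on first load

from rwkv.model import RWKV
model = RWKV(model='models/RWKV-4-Pile-7B-20221115-8047', strategy='cuda fp16')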
We have a first successful run: https://github.com/oobabooga/text-generation-webui/pull/149

I am very impressed with the coherence of the model so far, even though I have used the smallest version.
@BlinkDL I have a few questions:
- What are alpha_frequency and alpha_presence in terms of Hugging Face parameters? These are the parameters that I am using at the moment: max_new_tokens, do_sample, temperature, top_p, typical_p, repetition_penalty, top_k, min_length, no_repeat_ngram_size, num_beams, penalty_alpha, length_penalty, early_stopping, eos_token
- Is there a way of adding the library to my project other than cloning the repository and importing from the rwkv folder?
For alpha_frequency and alpha_presence, see "Frequency and presence penalties": https://platform.openai.com/docs/api-reference/parameter-details
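In other words, something roughly like this is applied to the logits before sampling (a sketch following that description; the function name and defaults are only illustrative):
def apply_repetition_penalties(logits, generated_tokens, alpha_frequency=0.25, alpha_presence=0.25):
    # count how often each token id has already been generated
    counts = {}
    for t in generated_tokens:
        counts[t] = counts.get(t, 0) + 1
    for t, c in counts.items():
        # frequency penalty grows with the count; presence penalty is a flat one-time hit
        logits[t] -= c * alpha_frequency + alpha_presence
    return logits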
Please clone and import until I make the pip package :) Too busy at the moment, but I will do it.
The pip package is here :) https://pypi.org/project/rwkv/
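With the package, the clone-and-import step goes away; something like this should be enough (sketch):
# pip install rwkv
from rwkv.model import RWKV  # the same RWKV class as in ChatRWKV's v2/rwkv folder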
Much easier with the pip package. Some more tweaks and I should be able to merge the PR.
I have merged the PR and the RWKV models now work in the web UI. Some comments:
- I have made successful tests with RWKV-4-Pile-7B-20221115-8047.pth and RWKV-4-Pile-169M-20220807-8023.pth. I ran out of GPU memory with RWKV-4-Pile-14B-20230213-8019.pth (RTX 3090).
- To install a model, just place the .pth file and the 20B_tokenizer.json file inside the models folder.
- In order to make the process of loading the model more familiar, I have created a RWKV module with the following methods: RWKVModel.from_pretrained and RWKVModel.generate (a simplified sketch follows this list). Source code: https://github.com/oobabooga/text-generation-webui/blob/main/modules/RWKV.py
- Only the top_p and temperature parameters can be changed in the interface for now. I am reluctant to add alpha_frequency and alpha_presence sliders because these parameters are used exclusively by RWKV and this change would break the API. They are left hardcoded to their default value of 0.25.
- It is possible to run the models in CPU mode with --cpu. By default, they are loaded to the GPU.
- The best way to try the models is with python server.py --no-stream. That is, without --chat, --cai-chat, etc.
- The current implementation should only work on Linux because the rwkv library reads paths as strings. It would be best to accept pathlib.Path variables too.
- After loading the model, make sure to lower the temperature to somewhere between 0.5 and 1 for best results.
- The internal usage of the model required some additional special handling because there is no explicit tokenizer. In HF, the process is to take an input string, convert it into a sequence of tokens, generate, and then convert back to a string.
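For illustration, here is a rough, simplified sketch of what such a wrapper can look like. It only relies on the model.forward/encode/decode pattern shown earlier in this thread plus the Hugging Face tokenizers library to read 20B_tokenizer.json; the sampling helper and default paths are made up for the example, and the real module linked below handles more cases:
import os
os.environ["RWKV_JIT_ON"] = '1'

import torch
from rwkv.model import RWKV
from tokenizers import Tokenizer


def sample_logits(logits, temperature=1.0, top_p=0.8):
    # minimal top-p sampler over a 1-D logits tensor (illustrative only)
    probs = torch.softmax(logits.float() / temperature, dim=-1)
    sorted_probs, sorted_ids = torch.sort(probs, descending=True)
    cutoff = int(torch.searchsorted(torch.cumsum(sorted_probs, dim=-1), top_p))
    keep = sorted_ids[:cutoff + 1]
    keep_probs = probs[keep] / probs[keep].sum()
    return int(keep[torch.multinomial(keep_probs, 1)])


class RWKVModel:
    @classmethod
    def from_pretrained(cls, model_path, tokenizer_path='models/20B_tokenizer.json', strategy='cuda fp16'):
        result = cls()
        result.model = RWKV(model=str(model_path), strategy=strategy)  # rwkv expects a plain string path
        result.tokenizer = Tokenizer.from_file(str(tokenizer_path))
        return result

    def generate(self, prompt, token_count=20, temperature=1.0, top_p=0.8):
        # same incremental loop as in api_demo.py: feed the prompt once, then one token at a time
        tokens = self.tokenizer.encode(prompt).ids
        state, out_str, all_tokens = None, '', []
        for i in range(token_count):
            logits, state = self.model.forward(tokens if i == 0 else [token], state)
            token = sample_logits(logits, temperature=temperature, top_p=top_p)
            all_tokens.append(token)
            tmp = self.tokenizer.decode(all_tokens)
            if '\ufffd' not in tmp:  # only keep text that decodes to valid UTF-8
                out_str = tmp
        return prompt + out_str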
This is the full code for the PR: https://github.com/oobabooga/text-generation-webui/pull/149/files
There is probably lots of room for improvement.
Cool. Please use rwkv==0.0.6 (it fixes a bug with temperature on CPU).
You can use "strategies" to load 14B on 3090. For example 'cuda fp16 *30 -> cpu fp32' [try increasing 30, for better speed, until you run out of VRAM]
Done. I have added a new flag to let users specify their custom strategies manually:
--rwkv-strategy RWKV_STRATEGY The strategy to use while loading RWKV models. Examples: "cpu fp32", "cuda fp16", "cuda fp16 *30 -> cpu fp32".
As you suggested, "cuda fp16 *30 -> cpu fp32" allowed me to get a response from the 14b model with 24GB VRAM.
My streaming implementation is probably really dumb and much slower than it could be:
for i in range(max_new_tokens//8):
    reply = shared.model.generate(question, token_count=8, temperature=temperature, top_p=top_p)
    yield formatted_outputs(reply, shared.model_name)
    question = reply
It would be best to use the callback argument but I haven't figured out how yet.
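One possible shape for that, assuming the callback is invoked with each newly decoded chunk of text, is to run generation in a worker thread and push the chunks through a queue (a sketch only, not the code in the PR):
import queue
import threading

def stream_generate(question, max_new_tokens, temperature, top_p):
    q = queue.Queue()
    done = object()  # sentinel marking the end of generation

    def worker():
        # assumes generate() calls `callback` with each newly decoded piece of text
        shared.model.generate(question, token_count=max_new_tokens, temperature=temperature,
                              top_p=top_p, callback=q.put)
        q.put(done)

    threading.Thread(target=worker, daemon=True).start()
    reply = question
    while (chunk := q.get()) is not done:
        reply += chunk
        yield formatted_outputs(reply, shared.model_name)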
@BlinkDL You said earlier: "ChatRWKV v2: with 'stream' and 'split' strategies. 3G VRAM is enough to run RWKV 14B."
Yet @oobabooga said it went OOM on a 3090 (24GB of vram)
What am I missing?
see the post above you. "As you suggested, "cuda fp16 *30 -> cpu fp32" allowed me to get a response from the 14b model with 24GB VRAM."
If it replies faster or better than a regular 13B even with the split, it's still something. Plus the faster training time. But I guess we won't get miracles.
Why not? RWKV is faster than a regular 13B and better on many tasks.
@BlinkDL but it's 24 GB, not 3 GB. I really wanted to run that on a 3080 Ti, which only has 12 GB.
"cuda fp16 *12 -> cpu fp32" [try increasing 12, for better speed, until you run out of VRAM]
Hi :) As I said before, [try increasing 30, for better speed, until you run out of VRAM]. @Ph0rk0z
Increase "30" in cuda fp16 *30 to compute more layers on your GPU.
I notice PyTorch is buggy on some CPUs, where it can only use a single CPU thread (so it is extremely slow). Are you using an AMD CPU?
In that case, try "cuda fp16 *30+" (notice the "+" symbol, and no cpu) to stream all layers on your GPU. Increase "30" for better speed, until you run out of VRAM.
Moreover, set os.environ["RWKV_CUDA_ON"] = '1' in https://github.com/oobabooga/text-generation-webui/blob/main/modules/RWKV.py for a 10x speedup in reply time.
It's purely a PyTorch issue, because CPU utilization is fine on most Intel CPUs and AMD server CPUs. I will ask the PyTorch guys.
Please see whether "cuda fp16 *29+" will be faster than "cuda fp16 *29 -> cpu fp32" in your case.
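A quick way to check that on your own machine (the model path is a placeholder, and the token ids are the ones from api_demo.py):
import time
import torch
from rwkv.model import RWKV

for strategy in ('cuda fp16 *29+', 'cuda fp16 *29 -> cpu fp32'):
    model = RWKV(model='models/RWKV-4-Pile-14B-20230213-8019', strategy=strategy)
    out, state = model.forward([187, 510, 1563, 310, 247], None)  # warm up on a short prompt
    token = int(out.argmax())
    start = time.time()
    for _ in range(32):
        out, state = model.forward([token], state)
        token = int(out.argmax())  # greedy decoding, just enough to exercise the loop
    print(strategy, f'{32 / (time.time() - start):.2f} tokens/s')
    del model
    torch.cuda.empty_cache()  # free VRAM before loading with the next strategy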
Yes, and ctx4096 models run at the same speed as ctx1024 models. I am finetuning 7B to ctx8k and 14B to 16k :)
I have been trying to compile the CUDA kernel with os.environ["RWKV_CUDA_ON"] = '1', but am getting this error:
gcc: fatal error: cannot execute 'cc1plus': execvp: No such file or directory
even though I have build-essential installed already.
Once I figure this out, I will write some proper documentation for RWKV in the wiki.
All in all, this model handles a 4096 context well enough. Maybe the limits should be raised.
RWKV-ctx4096 models can handle ctx4k :)
The difference between CUDA and non-CUDA is huge if you are running the full model on the GPU.
And it can be much faster after optimization. From someone: "The current generation speed for RWKV 14B at batch size 1 with INT8 quantization (no quality loss compared to bf16) is 5.5 tokens/s on an EPYC 7313 CPU. On an RTX A6000 GPU it is 37 tokens/s with a peak VRAM usage of 15 GB."