[WIP] RWKV4Neo the RNN and GPT Hybrid Model
What does this PR do?
Adds the model requested in issue #20737. Fixes https://github.com/huggingface/transformers/issues/20737
Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
- [ ] Did you read the contributor guideline, Pull Request section?
- [ ] Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
- [ ] Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
- [ ] Did you write any new necessary tests?
Who can review?
@younesbelkada @ArthurZucker
Hi @ArEnSc! Thanks for starting over the PR 💪 Let us know whenever you need help, with @ArthurZucker!
Will do. Still doing some research; I just figured out how the training notebook works, and the model executes in the notebook, so that's a positive.
Update: traced the model and came up with a state-based API for the RNN inference mode on my own code base to experiment with.
Thanks a lot for the status update! Feel free to ping whenever you need help
I look at working on this a little from time to time. Here are my notes and possible tasks, started 2023-01-16.
- The template appears to be from a T5-style model. The RWKV state could be the encoder hidden state (a little intuitive) and/or the past key values (normative generation). It will take some algebra and tests to add input state to the GPT training form from the RNN inference form.
- [ ] The tensorflow loading code appears complicated to me. I might move it out to another file for now.
- [ ] The embeddings can likely be adjusted to reflect parts "i" and "ii" of the high level outline below.
- [ ] It could be helpful to organize the file to retain layout similarity with BlinkDL's files.
- [ ] For the outline below, the next step is reviewing timemix. Draft of the architecture (maybe leave out optional parts to start).
High level:
- word embeddings `emb`
- layernorm `ln0`
- optional 2-axis trained position embeddings, seen in training code for image modeling: `pos_emb_x`, `pos_emb_y`. This is converted to a 1-axis `pos_emb` and used prior to `ln0` in inference.
- layers of blocks:
  1. layernorm `ln1`
  2. timemix self attention: `time_mix_k`, `time_mix_v`, `time_mix_r`, `time_first`, `time_decay`, `key`, `value`, `receptance`, `output`. `time_first` and `time_decay` are kept as float32 in inference.
  3. layernorm `ln2`
  4. feedforward channelmix: `time_mix_k`, `time_mix_r`, `key`, `value`, `receptance` (see the ChannelMix section below)
  - timemix self attention is optionally replaced with feedforward channelmix for block 0 in training code
  - for one optional block, tiny attention: `tiny_ln`, `tiny_q`, `tiny_k`, `tiny_v`, `tiny_mask`, seen in training code; inference code in development
  - optionally, inference code uses what looks like a numeric stability trick to extract a factor of 2 from the weights every 6 layers
- layernorm `ln_out`
- optional "copy" attention: `head_q`, `head_k`, `copy_mask`, then summed into the head in training code; inference code in development
- linear language modeling head `head`
- for training loss, BlinkDL presently has a function after cross entropy called `L2Wrap` to reduce magnitudes
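To make that layout concrete, here is a minimal PyTorch sketch of the skeleton described above. The attribute names (`emb`, `ln0`, `ln1`, `ln2`, `ln_out`, `head`) follow BlinkDL's checkpoint naming; the class names, constructor arguments, and the `nn.Identity()` stand-ins for timemix/channelmix (sketched in the sections below) are my own placeholders, not the final implementation:

```python
import torch
import torch.nn as nn

class RwkvBlock(nn.Module):
    """One block: ln1 -> timemix (att), ln2 -> channelmix (ffn), both residual."""
    def __init__(self, n_embd):
        super().__init__()
        self.ln1 = nn.LayerNorm(n_embd)
        self.att = nn.Identity()  # placeholder; see the TimeMix sketch below
        self.ln2 = nn.LayerNorm(n_embd)
        self.ffn = nn.Identity()  # placeholder; see the ChannelMix sketch below

    def forward(self, x):
        x = x + self.att(self.ln1(x))
        x = x + self.ffn(self.ln2(x))
        return x

class RwkvModel(nn.Module):
    """emb -> ln0 -> blocks -> ln_out -> head, per the outline above."""
    def __init__(self, vocab_size, n_embd, n_layer):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, n_embd)
        self.ln0 = nn.LayerNorm(n_embd)
        self.blocks = nn.ModuleList(RwkvBlock(n_embd) for _ in range(n_layer))
        self.ln_out = nn.LayerNorm(n_embd)
        self.head = nn.Linear(n_embd, vocab_size, bias=False)

    def forward(self, idx):
        x = self.ln0(self.emb(idx))
        for block in self.blocks:
            x = block(x)
        return self.head(self.ln_out(x))
```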
GPT (training) and RNN (inference) equivalence:
- I think special training initialization values may be used in timemix and channelmix.
- For inference, `time_decay = -exp(time_decay)` is factored out when loaded, but for training this is done in the forward pass.
- 5 state elements per layer (a sketch of the layout follows this list):
  - 0 = ChannelMix/FF `xx`
  - 1 = TimeMix/SA `xx`
  - 2 = `aa`
  - 3 = `bb`
  - 4 = `pp` in inference, `o` in training
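As a working assumption, that 5-slot state could be laid out as one flat `(n_layer * 5, n_embd)` tensor and initialized like this; the function name and layout are mine, and the `-1e38` for `pp` matches the exponent initialization noted in the kernel section below:

```python
import torch

def init_rnn_state(n_layer, n_embd, device="cpu"):
    # 5 state rows per layer: 0 = ffn xx, 1 = att xx, 2 = aa, 3 = bb, 4 = pp
    state = torch.zeros(n_layer * 5, n_embd, device=device)
    state[4::5] = -1e38  # pp tracks the running max exponent; effectively -inf
    return state
```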
TimeMix:
- the previous state is shifted into the `x` vector to make `xx`. In training this is done by "time shifting" with `nn.ZeroPad2d((0, 0, 1, -1))`; in single token inference it is passed as state element 1, which is then replaced by `x`.
- linear interpolation between the old state `xx` and the new state `x`, weighting `x` by a ratio of `time_mix_k`, `time_mix_v`, and `time_mix_r` to make `xk`, `xv`, and `xr` respectively.
- k = key @ xk
- v = value @ xv
- sr = sigmoid(receptance @ xr) # called simply `r` in inference code
- the GPT training form of this is now handed off to a hand-written CUDA kernel, compiled on first run, from cuda/wkv_cuda.cu
  - kernel parameters: `B` = batch size; `T` = sequence length; `C` = channel count; `_w` = `time_decay`; `_u` = `time_first`; `_k` = `k`; `_v` = `v`; `_y` = `wkv`
  - I think this used to be a convolution; I'm not sure whether it still is
  - `o` and `no` appear to be running values for magnitude management in exponential space, initialized to -1e38; `p` and `q` are initialized to 0
  - `k` and `v` are indexed by thread, so the `token` offset may represent different subregions. I'm not quite clear on that and should test or ask.
  - kernel loop, per token (a Python sketch of this recurrence follows the TimeMix section):
    - no = max(o, time_first[channel] + k[token])
    - A = exp(o - no) # this is e1 in the RNN form
    - B = exp(time_first[channel] + k[token] - no) # this is e2 in RNN
    - wkv[token] = (A * p + B * v[token]) / (A * q + B)
    - no = max(time_decay[channel] + o, k[token])
    - A = exp(time_decay[channel] + o - no)
    - B = exp(k[token] - no)
    - p = A * p + B * v[token]
    - q = A * q + B
    - o = no; token += 1
- ... here would be the remaining core algebra and code inspection
- WIP unified summary of the wkv kernel between inference and training:
  - ww = time_first + k[token]
  - next_pp = max(pp, ww)
  - A = exp(pp - next_pp ...
- rwkv = sr * wkv
- return output @ rwkv
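Putting the pieces above together, here is a minimal Python sketch of the single-token (RNN) form of TimeMix, following the kernel algebra above and the flat state layout sketched earlier. The parameter names follow the outline; the function signature is mine, and `time_decay` is assumed to already be `-exp(time_decay)` per the loading note above:

```python
import torch

def time_mix(x, state, i, time_mix_k, time_mix_v, time_mix_r,
             time_first, time_decay, key, value, receptance, output):
    # state rows for layer i: 5*i+0 = ffn xx, 5*i+1 = att xx,
    # 5*i+2 = aa, 5*i+3 = bb, 5*i+4 = pp
    xx = state[5 * i + 1]
    xk = x * time_mix_k + xx * (1 - time_mix_k)
    xv = x * time_mix_v + xx * (1 - time_mix_v)
    xr = x * time_mix_r + xx * (1 - time_mix_r)
    state[5 * i + 1] = x  # the new token becomes the shift state

    sr = torch.sigmoid(receptance @ xr)
    k = key @ xk          # time_first/time_decay assumed float32, per the notes
    v = value @ xv

    aa, bb, pp = state[5 * i + 2], state[5 * i + 3], state[5 * i + 4]
    # wkv = (e1 * aa + e2 * v) / (e1 * bb + e2), with running max pp for stability
    ww = time_first + k
    qq = torch.maximum(pp, ww)
    e1 = torch.exp(pp - qq)
    e2 = torch.exp(ww - qq)
    wkv = (e1 * aa + e2 * v) / (e1 * bb + e2)

    # decay the accumulators and fold in the current token
    ww = pp + time_decay
    qq = torch.maximum(ww, k)
    e1 = torch.exp(ww - qq)
    e2 = torch.exp(k - qq)
    state[5 * i + 2] = e1 * aa + e2 * v
    state[5 * i + 3] = e1 * bb + e2
    state[5 * i + 4] = qq

    return output @ (sr * wkv)
```

In the GPT (training) form, the per-token shift is instead done across the whole sequence with `nn.ZeroPad2d((0, 0, 1, -1))`, and this recurrence is handled by the CUDA kernel.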
ChannelMix:
- the previous state is shifted into the `x` vector to make `xx`. In training this is done by "time shifting" with `nn.ZeroPad2d((0, 0, 1, -1))`; in single token inference it is passed as state element 0, which is then replaced by `x`.
- linear interpolation between the old state `xx` and the new state `x`, weighting `x` by a ratio of `time_mix_k` and `time_mix_r` to make `xk` and `xr` respectively.
- r = sigmoid(receptance @ xr)
- k = square(relu(key @ xk))
- kv = value @ k
- rkv = r * kv
- return rkv
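And the matching single-token ChannelMix sketch, under the same assumed state layout and with the same caveat that the signature is mine:

```python
import torch

def channel_mix(x, state, i, time_mix_k, time_mix_r, key, value, receptance):
    xx = state[5 * i + 0]
    xk = x * time_mix_k + xx * (1 - time_mix_k)
    xr = x * time_mix_r + xx * (1 - time_mix_r)
    state[5 * i + 0] = x  # the new token becomes the shift state

    r = torch.sigmoid(receptance @ xr)
    k = torch.square(torch.relu(key @ xk))  # squared-ReLU activation
    return r * (value @ k)
```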
- [ ] review or improve model file further
@ArEnSc do you need any help?
If you want to help, PM me on Discord! Otherwise I should have a minor update by end of week.
Hi @ArEnSc, can you share with us your Discord handle? Thanks!
ARENSC#5905. Yeah, still working on it haha, it will be a while.
Working on having the GPT encoder generate the context, with RNN-mode inference sharing the weights.
Deleted a bunch of stuff that wasn't needed.
Added the [WIP] Label to prevent the bot from coming back 😉
@ArEnSc Please let us know if you won't have time to finish this PR. The model is heavily requested, as you may see from the linked issue. Do you want us to take over this PR and finish it?
Sure, yes. Sorry, I've been busy at the hospital these days! I think it's probably important that you guys take this on =)
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.