RWKV-LM
Question about RWKV formula
In the first formula in the README, RWKV is rewritten into recurrent form by letting $W_n=(n-1)w$. Is there a particular reason for using $n-1$ instead of $n$? The latter seems more natural, and in *From GPT to RWKV (the formulas)* the recurrent formula of RWKV also implies the latter. So I suspect you have already tried it but found it suboptimal for some reason.
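For concreteness, the formula shape under discussion looks like the following (a sketch based on the question's notation; the README's exact form may differ):

$$O_n = \frac{\sum_{i=1}^{n} e^{W_{n-i} + K_i}\, V_i}{\sum_{i=1}^{n} e^{W_{n-i} + K_i}}, \qquad W_n = (n-1)\,w \quad \text{vs.} \quad W_n = n\,w.$$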
The $n-1$ form is more expressive. I have only tried the current formula, because I believe it's better.
Note I am treating $W_0$ differently.
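A minimal NumPy sketch of the resulting recurrence, assuming the formula shape above with one scalar $K_i$, $V_i$ per step; the parameter name `w0` for the separately-treated $W_0$ is illustrative, not the repo's actual API:

```python
import numpy as np

def rwkv_recurrent(K, V, w, w0):
    """Recurrent evaluation of
        O_n = sum_{i<=n} e^{W_{n-i}+K_i} V_i / sum_{i<=n} e^{W_{n-i}+K_i}
    with W_n = (n-1)*w for n >= 1 and W_0 = w0 treated as a free parameter."""
    a, b = 0.0, 0.0  # decayed sums over past tokens i < n (numerator / denominator)
    out = []
    for k, v in zip(K, V):
        e_cur = np.exp(w0 + k)  # current token, weighted by the free W_0
        out.append((e_cur * v + a) / (e_cur + b))
        # State update: every past token picks up one more factor of e^w,
        # while token n enters undecayed (it will carry W_1 = 0 at step n+1).
        a = np.exp(w) * a + np.exp(k) * v
        b = np.exp(w) * b + np.exp(k)
    return np.array(out)

# Example: w < 0 so that e^w < 1 acts as a decay over time.
K = np.array([0.1, -0.2, 0.3])
V = np.array([1.0, 2.0, 3.0])
print(rwkv_recurrent(K, V, w=-0.5, w0=0.2))
```

A real implementation would also guard against overflow in the exponentials (e.g., by tracking a running maximum exponent), which is omitted here for clarity.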
That sounds reasonable, because the $n-1$ form makes it possible for $\exp(K_i)V_i$ to appear undecayed in the expression of $O_{i+1}$. Still, it would be better if someone could provide some empirical evidence, so I think it's best to leave this issue open for some time :-)
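To spell out that step: under $W_n = (n-1)w$ the coefficient of the immediately preceding token vanishes,

$$W_1 = (1-1)w = 0 \quad\Longrightarrow\quad e^{W_1 + K_{n-1}}\, V_{n-1} = e^{K_{n-1}}\, V_{n-1},$$

so $\exp(K_{n-1})V_{n-1}$ enters $O_n$ with no decay, whereas under $W_n = nw$ it would already carry a factor of $e^{w}$.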