Varuna Jayasiri
They have similar shapes. Truncating the cached sin/cos to `x.shape[0]` truncates them to the sequence length, because the sequence length (the number of tokens per sample) changes from batch to batch. A sketch of this is shown below.
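Here is a minimal sketch (not the repository's exact code, and with illustrative shapes) of why the cache gets truncated: the sin/cos tables are built once for a maximum sequence length, and each batch may contain fewer tokens.

```python
import torch

def build_cache(max_seq_len: int, d: int, base: float = 10_000.0):
    # Inverse frequencies for each pair of feature dimensions
    inv_freq = 1.0 / (base ** (torch.arange(0, d, 2).float() / d))
    # Positions 0 .. max_seq_len - 1
    positions = torch.arange(max_seq_len).float()
    # Outer product of positions and frequencies: [max_seq_len, d / 2]
    angles = torch.einsum('s,f->sf', positions, inv_freq)
    return angles.sin(), angles.cos()

sin_cached, cos_cached = build_cache(max_seq_len=512, d=64)

# x is assumed to have shape [seq_len, batch_size, d]; seq_len varies per batch
x = torch.randn(128, 8, 64)
# Truncate the cached tables to the current sequence length before applying them
sin = sin_cached[: x.shape[0]]
cos = cos_cached[: x.shape[0]]
print(sin.shape, cos.shape)  # torch.Size([128, 32]) torch.Size([128, 32])
```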
I will generate the HTML when you are ready. Thanks for the contribution!
Sorry for the delay; I've been busy with work. I generated the documentation and changed the formatting a little. The generated docs are here: https://nn.labml.ai/RWKV/ I feel a little more comments...
Updated https://github.com/labmlai/labml with a fix, so that it doesn't try to connect unless you explicitly provide a labml server URL.
Our implementation assumes that `heads * d_k = d_model`. We need to change that; a sketch of the more general version follows.
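A hedged sketch of relaxing that assumption: project `d_model` to `heads * d_k` for the queries, keys, and values, and project `heads * d_k` back to `d_model` at the output. The class and variable names here are illustrative, not the repository's actual module.

```python
import torch
import torch.nn as nn

class MultiHeadAttentionSketch(nn.Module):
    def __init__(self, d_model: int, heads: int, d_k: int):
        super().__init__()
        self.heads, self.d_k = heads, d_k
        # Input projections map d_model to heads * d_k (need not equal d_model)
        self.query = nn.Linear(d_model, heads * d_k)
        self.key = nn.Linear(d_model, heads * d_k)
        self.value = nn.Linear(d_model, heads * d_k)
        # Output projection maps heads * d_k back to d_model
        self.output = nn.Linear(heads * d_k, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        seq_len, batch_size, _ = x.shape
        # Split heads: [seq_len, batch, heads, d_k]
        q = self.query(x).view(seq_len, batch_size, self.heads, self.d_k)
        k = self.key(x).view(seq_len, batch_size, self.heads, self.d_k)
        v = self.value(x).view(seq_len, batch_size, self.heads, self.d_k)
        # Scaled dot-product attention over the sequence dimension
        scores = torch.einsum('ibhd,jbhd->ijbh', q, k) / self.d_k ** 0.5
        attn = scores.softmax(dim=1)
        out = torch.einsum('ijbh,jbhd->ibhd', attn, v)
        # Merge heads and map back to d_model
        return self.output(out.reshape(seq_len, batch_size, -1))

# d_model = 48, heads = 4, d_k = 16, so heads * d_k = 64 != d_model
mha = MultiHeadAttentionSketch(d_model=48, heads=4, d_k=16)
print(mha(torch.randn(10, 2, 48)).shape)  # torch.Size([10, 2, 48])
```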
Sorry for the delay. Fixed it here: https://github.com/labmlai/annotated_deep_learning_paper_implementations/commit/2236f6383ce66bb25f1880512a4ad0ec8f37514a
Fixed it here: https://github.com/labmlai/annotated_deep_learning_paper_implementations/commit/2236f6383ce66bb25f1880512a4ad0ec8f37514a
Fixed the test code here: https://github.com/labmlai/annotated_deep_learning_paper_implementations/commit/2236f6383ce66bb25f1880512a4ad0ec8f37514a
I'm also not sure. I usually set it to 1. I have seen implementations where it's set to 0.5. I guess they do it so that some dimensions never get...
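The comment is truncated, so this is only an assumption: if "it" refers to the fraction of features that rotary embeddings are applied to (1.0 rotating all dimensions, 0.5 rotating only the first half), a minimal sketch of that split might look like the following. All names and shapes are illustrative.

```python
import torch

def apply_partial_rope(x: torch.Tensor, sin: torch.Tensor, cos: torch.Tensor,
                       rope_fraction: float = 1.0) -> torch.Tensor:
    # x: [seq_len, batch, d]; sin/cos: [seq_len, d_rope / 2]
    d = x.shape[-1]
    d_rope = int(d * rope_fraction)
    x_rope, x_pass = x[..., :d_rope], x[..., d_rope:]
    # Rotate pairs (x1, x2) -> (x1 cos - x2 sin, x1 sin + x2 cos)
    x1, x2 = x_rope[..., : d_rope // 2], x_rope[..., d_rope // 2:]
    sin = sin[:, None, :]  # broadcast over the batch dimension
    cos = cos[:, None, :]
    rotated = torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
    # Dimensions beyond d_rope never get a positional rotation
    return torch.cat([rotated, x_pass], dim=-1)

# Usage: rotate only half of the 32 feature dimensions
seq_len, batch, d = 16, 2, 32
rope_fraction = 0.5
d_rope = int(d * rope_fraction)
inv_freq = 1.0 / (10_000 ** (torch.arange(0, d_rope, 2).float() / d_rope))
angles = torch.einsum('s,f->sf', torch.arange(seq_len).float(), inv_freq)
x = torch.randn(seq_len, batch, d)
out = apply_partial_rope(x, angles.sin(), angles.cos(), rope_fraction)
print(out.shape)  # torch.Size([16, 2, 32])
```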