Varuna Jayasiri

56 comments of Varuna Jayasiri

They have similar shapes. Truncating the cached sin/cos to `x.shape[0]` truncates them to the sequence length, because the sequence length (number of tokens per sample) varies. A sketch of this is below.
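A minimal sketch (assuming a PyTorch RoPE-style module; names here are illustrative, not the repo's exact code) of why the cached sin/cos tables are sliced to `x.shape[0]`: the cache is built once for the longest sequence, and shorter batches use only a prefix of it.

```python
import torch


def build_cache(seq_len: int, d: int, base: float = 10_000.0):
    # theta_i = base^(-2i/d) for each pair of dimensions
    theta = 1.0 / base ** (torch.arange(0, d, 2).float() / d)
    # Outer product of positions and frequencies: [seq_len, d/2]
    idx_theta = torch.arange(seq_len).float()[:, None] * theta[None, :]
    return idx_theta.cos(), idx_theta.sin()


# Cache built once, for the longest sequence length we expect
cos_cached, sin_cached = build_cache(seq_len=512, d=64)

# A batch with a shorter sequence: x has shape [seq_len, batch, heads, d]
x = torch.randn(128, 4, 8, 64)

# Slice the cache to the current sequence length so it lines up with x along dim 0
cos = cos_cached[: x.shape[0]]
sin = sin_cached[: x.shape[0]]
print(cos.shape, sin.shape)  # torch.Size([128, 32]) torch.Size([128, 32])
```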

I will generate the HTML when you are ready. Thanks for the contribution!

Sorry for the delay; I've been busy with work. I generated the documentation and changed the formatting a little. The generated docs are here: https://nn.labml.ai/RWKV/ I feel a little more comments...

Updated https://github.com/labmlai/labml with a fix, so it doesn't try to connect unless you explicitly provide a labml server URL.

Our implementation assumes that `heads * d_k = d_model`. We need to change that; a sketch of the assumption is below.
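A minimal sketch (hypothetical module name, not the repo's code) of what tying `heads * d_k` to `d_model` looks like: `d_k` is derived from `d_model // heads`, so the projection always maps `d_model` to `heads * d_k`. Generalizing means letting `d_k` be an independent parameter.

```python
import torch
import torch.nn as nn


class SimpleMultiHeadProjection(nn.Module):
    def __init__(self, d_model: int, heads: int):
        super().__init__()
        assert d_model % heads == 0, "assumes heads * d_k == d_model"
        self.heads = heads
        self.d_k = d_model // heads  # the assumption being discussed
        self.linear = nn.Linear(d_model, heads * self.d_k)

    def forward(self, x: torch.Tensor):
        # x: [seq_len, batch, d_model] -> [seq_len, batch, heads, d_k]
        seq_len, batch, _ = x.shape
        return self.linear(x).view(seq_len, batch, self.heads, self.d_k)


proj = SimpleMultiHeadProjection(d_model=512, heads=8)
print(proj(torch.randn(10, 2, 512)).shape)  # torch.Size([10, 2, 8, 64])
```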

Fixed it here https://github.com/labmlai/annotated_deep_learning_paper_implementations/commit/2236f6383ce66bb25f1880512a4ad0ec8f37514a Sorry for the delay

Fixed it here https://github.com/labmlai/annotated_deep_learning_paper_implementations/commit/2236f6383ce66bb25f1880512a4ad0ec8f37514a

Fixed the test code here https://github.com/labmlai/annotated_deep_learning_paper_implementations/commit/2236f6383ce66bb25f1880512a4ad0ec8f37514a

I'm also not sure. I usually set it to 1. I have seen implementations where it's set to 0.5. I guess they do it so that some dimensions never get...
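A minimal sketch, assuming the setting in question is the fraction of head dimensions that rotary embeddings are applied to (the function and shapes here are hypothetical): with a fraction of 1.0 every dimension is rotated, while with 0.5 only the first half is rotated and the remaining dimensions pass through unchanged.

```python
import torch


def partial_rotate(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor, frac: float):
    d = x.shape[-1]
    d_rot = int(d * frac)
    x_rot, x_pass = x[..., :d_rot], x[..., d_rot:]
    # RoPE-style rotation applied only to the first d_rot dimensions
    half = d_rot // 2
    x1, x2 = x_rot[..., :half], x_rot[..., half:]
    rotated = torch.cat([x1 * cos - x2 * sin, x2 * cos + x1 * sin], dim=-1)
    return torch.cat([rotated, x_pass], dim=-1)


# Shape-only demo with dummy cos/sin values: d = 64, frac = 0.5 -> 32 rotated dims
x = torch.randn(128, 8, 64)     # [seq_len, heads, d]
cos = torch.randn(128, 1, 16)   # broadcast over heads; 16 = (64 * 0.5) // 2
sin = torch.randn(128, 1, 16)
print(partial_rotate(x, cos, sin, frac=0.5).shape)  # torch.Size([128, 8, 64])
```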