Varuna Jayasiri
They have similar shapes. Truncating the cached sin/cos to `x.shape[0]` truncates them to the sequence length, because the sequence length (the number of tokens per sample) changes from batch to batch. A sketch of this is shown below.
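Here is a minimal sketch (not the repository's exact code, and with illustrative shapes) of why the cache gets truncated: the sin/cos tables are built once for a maximum sequence length, and each batch may contain fewer tokens.

```python
import torch

def build_cache(max_seq_len: int, d: int, base: float = 10_000.0):
    # Inverse frequencies for each pair of feature dimensions
    inv_freq = 1.0 / (base ** (torch.arange(0, d, 2).float() / d))
    # Positions 0 .. max_seq_len - 1
    positions = torch.arange(max_seq_len).float()
    # Outer product of positions and frequencies: [max_seq_len, d / 2]
    angles = torch.einsum('s,f->sf', positions, inv_freq)
    return angles.sin(), angles.cos()

sin_cached, cos_cached = build_cache(max_seq_len=512, d=64)

# x is assumed to have shape [seq_len, batch_size, d]; seq_len varies per batch
x = torch.randn(128, 8, 64)
# Truncate the cached tables to the current sequence length before applying them
sin = sin_cached[: x.shape[0]]
cos = cos_cached[: x.shape[0]]
print(sin.shape, cos.shape)  # torch.Size([128, 32]) torch.Size([128, 32])
```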
I will generate the HTML when you are ready. Thanks for the contribution!
Sorry for the delay; I've been busy with work. I generated the documentation and changed the formatting a little. The generated docs are here: https://nn.labml.ai/RWKV/ I feel a little more comments...
Updated https://github.com/labmlai/labml with a fix, so that it doesn't try to connect unless you explicitly provide a labml server URL.
Our implementation assumes that `heads * d_k = d_model`. We need to change that; a sketch of the more general version follows.
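A hedged sketch of relaxing that assumption: project `d_model` to `heads * d_k` for the queries, keys, and values, and project `heads * d_k` back to `d_model` at the output. The class and variable names here are illustrative, not the repository's actual module.

```python
import torch
import torch.nn as nn

class MultiHeadAttentionSketch(nn.Module):
    def __init__(self, d_model: int, heads: int, d_k: int):
        super().__init__()
        self.heads, self.d_k = heads, d_k
        # Input projections map d_model to heads * d_k (need not equal d_model)
        self.query = nn.Linear(d_model, heads * d_k)
        self.key = nn.Linear(d_model, heads * d_k)
        self.value = nn.Linear(d_model, heads * d_k)
        # Output projection maps heads * d_k back to d_model
        self.output = nn.Linear(heads * d_k, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        seq_len, batch_size, _ = x.shape
        # Split heads: [seq_len, batch, heads, d_k]
        q = self.query(x).view(seq_len, batch_size, self.heads, self.d_k)
        k = self.key(x).view(seq_len, batch_size, self.heads, self.d_k)
        v = self.value(x).view(seq_len, batch_size, self.heads, self.d_k)
        # Scaled dot-product attention over the sequence dimension
        scores = torch.einsum('ibhd,jbhd->ijbh', q, k) / self.d_k ** 0.5
        attn = scores.softmax(dim=1)
        out = torch.einsum('ijbh,jbhd->ibhd', attn, v)
        # Merge heads and map back to d_model
        return self.output(out.reshape(seq_len, batch_size, -1))

# d_model = 48, heads = 4, d_k = 16, so heads * d_k = 64 != d_model
mha = MultiHeadAttentionSketch(d_model=48, heads=4, d_k=16)
print(mha(torch.randn(10, 2, 48)).shape)  # torch.Size([10, 2, 48])
```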
Sorry for the delay. Fixed it here: https://github.com/labmlai/annotated_deep_learning_paper_implementations/commit/2236f6383ce66bb25f1880512a4ad0ec8f37514a
Fixed it here: https://github.com/labmlai/annotated_deep_learning_paper_implementations/commit/2236f6383ce66bb25f1880512a4ad0ec8f37514a
Fixed the test code here: https://github.com/labmlai/annotated_deep_learning_paper_implementations/commit/2236f6383ce66bb25f1880512a4ad0ec8f37514a
I'm also not sure. I usually set it to 1. I have seen implementations where it's set to 0.5. I guess they do it so that some dimensions never get...
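The comment is truncated, so this is only an assumption: if "it" refers to the fraction of features that rotary embeddings are applied to (1.0 rotating all dimensions, 0.5 rotating only the first half), a minimal sketch of that split might look like the following. All names and shapes are illustrative.

```python
import torch

def apply_partial_rope(x: torch.Tensor, sin: torch.Tensor, cos: torch.Tensor,
                       rope_fraction: float = 1.0) -> torch.Tensor:
    # x: [seq_len, batch, d]; sin/cos: [seq_len, d_rope / 2]
    d = x.shape[-1]
    d_rope = int(d * rope_fraction)
    x_rope, x_pass = x[..., :d_rope], x[..., d_rope:]
    # Rotate pairs (x1, x2) -> (x1 cos - x2 sin, x1 sin + x2 cos)
    x1, x2 = x_rope[..., : d_rope // 2], x_rope[..., d_rope // 2:]
    sin = sin[:, None, :]  # broadcast over the batch dimension
    cos = cos[:, None, :]
    rotated = torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
    # Dimensions beyond d_rope never get a positional rotation
    return torch.cat([rotated, x_pass], dim=-1)

# Usage: rotate only half of the 32 feature dimensions
seq_len, batch, d = 16, 2, 32
rope_fraction = 0.5
d_rope = int(d * rope_fraction)
inv_freq = 1.0 / (10_000 ** (torch.arange(0, d_rope, 2).float() / d_rope))
angles = torch.einsum('s,f->sf', torch.arange(seq_len).float(), inv_freq)
x = torch.randn(seq_len, batch, d)
out = apply_partial_rope(x, angles.sin(), angles.cos(), rope_fraction)
print(out.shape)  # torch.Size([16, 2, 32])
```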