Y Song
Results: 2 issues of Y Song
The current linear attention can keep a $KV$ state cache. This works when normalization is not enabled. When normalization is enabled, the output should be $\frac{QKV}{QK1}$. We can see that...
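For context, here is a minimal sketch of the normalized form $\frac{QKV}{QK1}$ written as a recurrence. It is a generic illustration of causal linear attention, not the repository's implementation: the function names, the ELU+1 feature map, and the plain-list representation are all assumptions made for the example. In this sketch the numerator uses a running $KV$ state, while the denominator uses a second running state, the sum of the mapped keys (the $K1$ term).

```python
import math


def feature_map(x):
    # Hypothetical positive feature map (ELU + 1); the map used in the issue is not stated.
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]


def causal_linear_attention(queries, keys, values, normalize=True):
    """Recurrent causal linear attention over a sequence of vectors (lists of floats)."""
    d_k, d_v = len(keys[0]), len(values[0])
    kv_state = [[0.0] * d_v for _ in range(d_k)]  # running sum of phi(k) v^T
    k_state = [0.0] * d_k                         # running sum of phi(k), the K1 term
    outputs = []
    for q, k, v in zip(queries, keys, values):
        fq, fk = feature_map(q), feature_map(k)
        # Update both cached states with the current token.
        for i in range(d_k):
            k_state[i] += fk[i]
            for j in range(d_v):
                kv_state[i][j] += fk[i] * v[j]
        # Numerator: phi(q) applied to the KV state.
        out = [sum(fq[i] * kv_state[i][j] for i in range(d_k)) for j in range(d_v)]
        if normalize:
            # Denominator: phi(q) applied to the summed-key state.
            denom = sum(fq[i] * k_state[i] for i in range(d_k))
            out = [o / denom for o in out]
        outputs.append(out)
    return outputs


def quadratic_reference(queries, keys, values):
    """Same normalized output computed directly over all past tokens, with no cached state."""
    outputs = []
    for t, q in enumerate(queries):
        fq = feature_map(q)
        weights = [
            sum(a * b for a, b in zip(fq, feature_map(keys[s]))) for s in range(t + 1)
        ]
        total = sum(weights)
        outputs.append(
            [
                sum(w * values[s][j] for s, w in enumerate(weights)) / total
                for j in range(len(values[0]))
            ]
        )
    return outputs
```

Under these assumptions, the recurrent version matches the direct computation only because both `kv_state` and `k_state` are carried forward; with `normalize=False` the `k_state` term is unused and the $KV$ state alone is enough.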
In the needle-in-a-haystack section of your paper, you mentioned: "However, linearizing with passkey samples (LoLCATs Llama 3 8B (Passkey)) recovers 100% accuracy." Does this step involve LoRA fine-tuning with passkey samples?...