Vaios Laschos
The new flash-attention has sliding window built in; however, it doesn't stack with compiling the model. So it is extremely easy to try it as it is, but you will...
@VatsaDev Can you please give some references for your expectation that hallucination increases with the amount of data? I understand that there are some heuristics (Chinchilla paper) about the...
Yes, I already knew that. I was just thinking maybe it had to do with the size of the network, like some missed parameter. When you did the first run,...
I am not sure what you are referring to because it has been a while. However, if you would like to have a quick chat about what you want to...
Not with the code as it is, but you can play with it (make the blocks depend on layer_id). The code is quite modular. Just be aware of the skip...