Andrej
It makes me a bit uncomfortable that this fails when V = 0, but if V = 0 something went really wrong anyway. I'll probably merge a variation of this ty
Very cool! @leloykun could it make sense to maintain both Flash Attention 1 and 2 separately? E.g. Flash Attention 2 as kernel4? I think having multiple versions is great /...
Another question: We do eventually want to implement the backward pass for all of these. Should we not leave the variable `l` intact w.r.t. these future plans?
What is the benefit of the online softmax for us?
On my A100 I am seeing for kernel 4:

```
block_size 32  | time 0.221143 ms
block_size 64  | time 0.096894 ms
block_size 128 | time 0.069505 ms
block_size 256 | ...
```
Good idea, I tried to turn it on. I haven't used this before, let me know if it looks wrong somehow.
Oh wow, ok I see. So you're autogenerating the code directly (1). As for (2), I was thinking we would save the Tokenizer in some serialized format and then load it...
Another thing to keep in mind is that this will bloat the repo size, because we codegen the vocab into a root .c file. Still leaning towards (2) but will sleep...
I'll take a look. I don't want to bloat the repo with intermediate files though, so I'd suggest we don't push the vocab to the repo, only the script that generates...
Tbh this looks complex. Naively, shouldn't we just dump the UTF-8 encoded vocab into one single long byte sequence, delimited by \0, and read that in in C?