Andrej
It makes me a bit uncomfortable that this fails when V = 0, but if V = 0 something went really wrong anyway. I'll probably merge a variation of this ty
Very cool! @leloykun could it make sense to maintain both Flash Attention 1 and 2 separately? E.g. Flash Attention 2 as kernel4? I think having multiple versions is great /...
Another question: We do eventually want to implement the backward pass for all of these. Should we not leave the variable `l` intact w.r.t. these future plans?
What is the benefit of the online softmax for us?
On my A100 I am seeing for kernel 4:

```
block_size 32  | time 0.221143 ms
block_size 64  | time 0.096894 ms
block_size 128 | time 0.069505 ms
block_size 256 | ...
```
Good idea, I tried to turn it on. I haven't used this before, let me know if it looks wrong somehow.
Oh wow, ok I see. So you're autogenerating the code directly (1). As for (2), I was thinking we would save the Tokenizer in some serialized format and then load it...
Another thing to keep in mind is that this will bloat the repo size, because we codegen the vocab into a root .c file. Still leaning towards (2) but will sleep...
I'll take a look. I don't want to bloat the repo with intermediate files though, so I'd suggest we don't push the vocab to the repo, only the script that generates...
Tbh this looks complex. Naively, shouldn't we just dump the UTF-8 encoded vocab into one single long byte sequence, delimited by \0, and read that in in C?