Tri Dao
Then you'd need to write down the attention matrix (you call it f) of size (batch, nheads, seqlen, seqlen). Memory will be O(seqlen^2) and I don't think it'll be better...
The amount of memory reads / writes to global memory will be O(seqlen^2) if I understand correctly. Then it's not very different from calling softmax on one row block of...
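To make the point concrete, here's a rough sketch of the naive path being discussed, with illustrative sizes (all names and numbers below are made up for illustration, not from this thread):

```python
import torch

# Illustrative sizes; the quadratic intermediates grow with seqlen^2.
batch, nheads, seqlen, headdim = 1, 8, 2048, 64

q = torch.randn(batch, nheads, seqlen, headdim)
k, v = torch.randn_like(q), torch.randn_like(q)

# The attention matrix ("f") has shape (batch, nheads, seqlen, seqlen):
# if it's materialized like this, it has to live in HBM.
scores = torch.matmul(q, k.transpose(-2, -1)) / headdim ** 0.5
attn = torch.softmax(scores, dim=-1)   # another O(seqlen^2) intermediate
out = torch.matmul(attn, v)            # (batch, nheads, seqlen, headdim)

print(scores.numel() * scores.element_size() / 2**20, "MiB for the scores alone")  # 128 MiB here
```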
You'd still need to write `fi` of each block to global memory / HBM before calling matmul with V. So the total number of bytes written to global memory /...
There are 2 separate things: (1) total amount of memory required, and (2) total number of bytes written to memory. I agree that your approach would have (1) be subquadratic,...
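Back-of-the-envelope, with assumed sizes (block size, seqlen, and dtype below are illustrative, not from the thread):

```python
# (1) peak memory held at once vs (2) total bytes written to HBM,
# for a block-wise scheme that still writes each (block_size, seqlen)
# score block before the matmul with V. Counts are per (batch, head).
seqlen, block_size, bytes_per_elem = 4096, 128, 2  # fp16

n_blocks = seqlen // block_size

# (1) Peak extra memory: only one score block is alive at a time.
peak_block_bytes = block_size * seqlen * bytes_per_elem
print(f"(1) peak per-block memory: {peak_block_bytes / 2**20:.2f} MiB")               # 1.00 MiB

# (2) Total bytes written across all iterations: still quadratic,
#     n_blocks * block_size * seqlen = seqlen^2 elements.
total_written = n_blocks * peak_block_bytes
print(f"(2) total HBM writes for the score blocks: {total_written / 2**20:.2f} MiB")  # 32.00 MiB
```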
Regarding this line: `O_BLOCKS[i] = torch.matmul(local_attn,V)`. You'd need `local_attn` to be in HBM? And `local_attn` has shape (block_size, seqlen) for each iteration, which means each iteration you're writing down (block_size *...
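Roughly what that loop looks like, reconstructed from the quoted line (block size and tensor names are guesses):

```python
import torch

seqlen, block_size, headdim = 4096, 128, 64
scale = headdim ** -0.5

Q = torch.randn(seqlen, headdim)
K = torch.randn(seqlen, headdim)
V = torch.randn(seqlen, headdim)
O_BLOCKS = torch.empty(seqlen // block_size, block_size, headdim)

for i in range(seqlen // block_size):
    Qi = Q[i * block_size:(i + 1) * block_size]       # (block_size, headdim)
    # local_attn is (block_size, seqlen): too big to keep in SRAM for long
    # sequences, so each iteration it gets written to / read from HBM,
    # i.e. block_size * seqlen elements of traffic per iteration.
    local_attn = torch.softmax(torch.matmul(Qi, K.T) * scale, dim=-1)
    O_BLOCKS[i] = torch.matmul(local_attn, V)         # (block_size, headdim)

O = O_BLOCKS.reshape(seqlen, headdim)
```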
Do you have enough space in SRAM to hold a tensor of size (block_size, seqlen)? Maybe if seqlen is not too long.
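A quick sanity check, assuming roughly 192 KiB of combined shared memory / L1 per SM on an A100-class GPU (that figure is an assumption; check your own hardware):

```python
# Does a (block_size, seqlen) fp16 tile fit in SRAM (shared memory)?
block_size, seqlen, bytes_per_elem = 128, 4096, 2
tile_bytes = block_size * seqlen * bytes_per_elem
sram_bytes = 192 * 1024          # assumed per-SM shared memory / L1 budget

print(f"tile: {tile_bytes / 1024:.0f} KiB, SRAM: {sram_bytes / 1024:.0f} KiB, "
      f"fits: {tile_bytes <= sram_bytes}")   # tile: 1024 KiB -> does not fit
# Only much shorter seqlen (or a much smaller block) fits,
# e.g. a 128 x 512 fp16 tile is 128 KiB.
```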
I'd recommend writing out the algorithm and annotating which tensors live in SRAM and which are written to HBM, then (1) check that you have enough SRAM space (2)...
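As an example of that kind of annotation, here's a rough sketch of a tiled pass in the spirit of the FlashAttention tiling (simulated in PyTorch, so "SRAM" / "HBM" are just comments, and the block sizes and names are illustrative, not the actual kernel's):

```python
import torch

def tiled_attention(Q, K, V, block_q=128, block_k=128):
    """Numerically equivalent to softmax(Q K^T * scale) V, but only
    block-sized intermediates are alive at any time."""
    seqlen, headdim = Q.shape
    scale = headdim ** -0.5
    O = torch.empty_like(Q)                            # HBM: written once per Q block

    for qs in range(0, seqlen, block_q):
        Qi = Q[qs:qs + block_q]                        # HBM -> SRAM, (block_q, headdim)
        m = torch.full((Qi.shape[0],), float("-inf"))  # running row max, SRAM
        l = torch.zeros(Qi.shape[0])                   # running softmax denominator, SRAM
        acc = torch.zeros(Qi.shape[0], headdim)        # running numerator, SRAM

        for ks in range(0, seqlen, block_k):
            Kj = K[ks:ks + block_k]                    # HBM -> SRAM
            Vj = V[ks:ks + block_k]                    # HBM -> SRAM
            S = Qi @ Kj.T * scale                      # (block_q, block_k), never hits HBM
            m_new = torch.maximum(m, S.max(dim=-1).values)
            p = torch.exp(S - m_new[:, None])
            alpha = torch.exp(m - m_new)               # rescale the old running stats
            l = l * alpha + p.sum(dim=-1)
            acc = acc * alpha[:, None] + p @ Vj
            m = m_new

        O[qs:qs + block_q] = acc / l[:, None]          # SRAM -> HBM
    return O

# Sanity check against the materialized version:
Q, K, V = (torch.randn(1024, 64) for _ in range(3))
ref = torch.softmax(Q @ K.T * 64 ** -0.5, dim=-1) @ V
assert torch.allclose(tiled_attention(Q, K, V), ref, atol=1e-4)
```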
There's not enough info here (there's no error message from the compilation log pointing to any specific line). You can try the recommended Docker file from Nvidia.
Probably an issue with the PyTorch version. Can you try PyTorch 2.0.0 or 2.1.0?
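To check what's currently installed:

```python
import torch
print(torch.__version__)    # e.g. 2.1.0
print(torch.version.cuda)   # CUDA version PyTorch was built against (None for CPU builds)
```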
dropout_layer_norm is a separate extension. You don't have to use it.
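For example, the core attention function works without it (a minimal sketch assuming the flash_attn 2.x Python interface and a CUDA GPU; shapes are illustrative):

```python
import torch
from flash_attn import flash_attn_func

batch, seqlen, nheads, headdim = 2, 1024, 12, 64
q = torch.randn(batch, seqlen, nheads, headdim, dtype=torch.float16, device="cuda")
k, v = torch.randn_like(q), torch.randn_like(q)

# No import of dropout_layer_norm anywhere; that extension is only needed
# for the fused dropout + residual + LayerNorm modules.
out = flash_attn_func(q, k, v, dropout_p=0.0, causal=True)   # (batch, seqlen, nheads, headdim)
```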