Tri Dao
Then you'd need to write down the attention matrix (you call it f) of size (batch, nheads, seqlen, seqlen). Memory will be O(seqlen^2) and I don't think it'll be better...
The amount of memory reads / writes to global memory will be O(seqlen^2) if I understand correctly. Then it's not very different from calling softmax on one row block of...
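To make the point concrete, here's a rough sketch of the naive path being discussed, with illustrative sizes (all names and numbers below are made up for illustration, not from this thread):

```python
import torch

# Illustrative sizes; the quadratic intermediates grow with seqlen^2.
batch, nheads, seqlen, headdim = 1, 8, 2048, 64

q = torch.randn(batch, nheads, seqlen, headdim)
k, v = torch.randn_like(q), torch.randn_like(q)

# The attention matrix ("f") has shape (batch, nheads, seqlen, seqlen):
# if it's materialized like this, it has to live in HBM.
scores = torch.matmul(q, k.transpose(-2, -1)) / headdim ** 0.5
attn = torch.softmax(scores, dim=-1)   # another O(seqlen^2) intermediate
out = torch.matmul(attn, v)            # (batch, nheads, seqlen, headdim)

print(scores.numel() * scores.element_size() / 2**20, "MiB for the scores alone")  # 128 MiB here
```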
You'd still need to write `fi` of each block to global memory / HBM before calling matmul with V. So the total number of bytes written to global memory /...
There are 2 separate things: (1) total amount of memory required, and (2) total number of bytes written to memory. I agree that your approach would have (1) be subquadratic,...
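Back-of-the-envelope, with assumed sizes (block size, seqlen, and dtype below are illustrative, not from the thread):

```python
# (1) peak memory held at once vs (2) total bytes written to HBM,
# for a block-wise scheme that still writes each (block_size, seqlen)
# score block before the matmul with V. Counts are per (batch, head).
seqlen, block_size, bytes_per_elem = 4096, 128, 2  # fp16

n_blocks = seqlen // block_size

# (1) Peak extra memory: only one score block is alive at a time.
peak_block_bytes = block_size * seqlen * bytes_per_elem
print(f"(1) peak per-block memory: {peak_block_bytes / 2**20:.2f} MiB")               # 1.00 MiB

# (2) Total bytes written across all iterations: still quadratic,
#     n_blocks * block_size * seqlen = seqlen^2 elements.
total_written = n_blocks * peak_block_bytes
print(f"(2) total HBM writes for the score blocks: {total_written / 2**20:.2f} MiB")  # 32.00 MiB
```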
Regarding this line: `O_BLOCKS[i] = torch.matmul(local_attn,V)`. You'd need `local_attn` to be in HBM? And `local_attn` has shape (block_size, seqlen) for each iteration, which means each iteration you're writing down (block_size *...
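Roughly what that loop looks like, reconstructed from the quoted line (block size and tensor names are guesses):

```python
import torch

seqlen, block_size, headdim = 4096, 128, 64
scale = headdim ** -0.5

Q = torch.randn(seqlen, headdim)
K = torch.randn(seqlen, headdim)
V = torch.randn(seqlen, headdim)
O_BLOCKS = torch.empty(seqlen // block_size, block_size, headdim)

for i in range(seqlen // block_size):
    Qi = Q[i * block_size:(i + 1) * block_size]       # (block_size, headdim)
    # local_attn is (block_size, seqlen): too big to keep in SRAM for long
    # sequences, so each iteration it gets written to / read from HBM,
    # i.e. block_size * seqlen elements of traffic per iteration.
    local_attn = torch.softmax(torch.matmul(Qi, K.T) * scale, dim=-1)
    O_BLOCKS[i] = torch.matmul(local_attn, V)         # (block_size, headdim)

O = O_BLOCKS.reshape(seqlen, headdim)
```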
Do you have enough space in SRAM to hold a tensor of size (block_size, seqlen)? Maybe if seqlen is not too long.
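A quick sanity check, assuming roughly 192 KiB of combined shared memory / L1 per SM on an A100-class GPU (that figure is an assumption; check your own hardware):

```python
# Does a (block_size, seqlen) fp16 tile fit in SRAM (shared memory)?
block_size, seqlen, bytes_per_elem = 128, 4096, 2
tile_bytes = block_size * seqlen * bytes_per_elem
sram_bytes = 192 * 1024          # assumed per-SM shared memory / L1 budget

print(f"tile: {tile_bytes / 1024:.0f} KiB, SRAM: {sram_bytes / 1024:.0f} KiB, "
      f"fits: {tile_bytes <= sram_bytes}")   # tile: 1024 KiB -> does not fit
# Only much shorter seqlen (or a much smaller block) fits,
# e.g. a 128 x 512 fp16 tile is 128 KiB.
```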
I'd recommend writing out the algorithm and annotating which tensors live in SRAM and which are written to HBM, then (1) check that you have enough SRAM space (2)...
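As an example of that kind of annotation, here's a rough sketch of a tiled pass in the spirit of the FlashAttention tiling (simulated in PyTorch, so "SRAM" / "HBM" are just comments, and the block sizes and names are illustrative, not the actual kernel's):

```python
import torch

def tiled_attention(Q, K, V, block_q=128, block_k=128):
    """Numerically equivalent to softmax(Q K^T * scale) V, but only
    block-sized intermediates are alive at any time."""
    seqlen, headdim = Q.shape
    scale = headdim ** -0.5
    O = torch.empty_like(Q)                            # HBM: written once per Q block

    for qs in range(0, seqlen, block_q):
        Qi = Q[qs:qs + block_q]                        # HBM -> SRAM, (block_q, headdim)
        m = torch.full((Qi.shape[0],), float("-inf"))  # running row max, SRAM
        l = torch.zeros(Qi.shape[0])                   # running softmax denominator, SRAM
        acc = torch.zeros(Qi.shape[0], headdim)        # running numerator, SRAM

        for ks in range(0, seqlen, block_k):
            Kj = K[ks:ks + block_k]                    # HBM -> SRAM
            Vj = V[ks:ks + block_k]                    # HBM -> SRAM
            S = Qi @ Kj.T * scale                      # (block_q, block_k), never hits HBM
            m_new = torch.maximum(m, S.max(dim=-1).values)
            p = torch.exp(S - m_new[:, None])
            alpha = torch.exp(m - m_new)               # rescale the old running stats
            l = l * alpha + p.sum(dim=-1)
            acc = acc * alpha[:, None] + p @ Vj
            m = m_new

        O[qs:qs + block_q] = acc / l[:, None]          # SRAM -> HBM
    return O

# Sanity check against the materialized version:
Q, K, V = (torch.randn(1024, 64) for _ in range(3))
ref = torch.softmax(Q @ K.T * 64 ** -0.5, dim=-1) @ V
assert torch.allclose(tiled_attention(Q, K, V), ref, atol=1e-4)
```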
There's not enough info here (there's no error message from the compilation log pointing to any specific line). You can try the recommended Docker file from Nvidia.
Probably an issue with the PyTorch version. Can you try PyTorch 2.0.0 or 2.1.0?
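To check what's currently installed:

```python
import torch
print(torch.__version__)    # e.g. 2.1.0
print(torch.version.cuda)   # CUDA version PyTorch was built against (None for CPU builds)
```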
dropout_layer_norm is a separate extension. You don't have to use it.
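For example, the core attention function works without it (a minimal sketch assuming the flash_attn 2.x Python interface and a CUDA GPU; shapes are illustrative):

```python
import torch
from flash_attn import flash_attn_func

batch, seqlen, nheads, headdim = 2, 1024, 12, 64
q = torch.randn(batch, seqlen, nheads, headdim, dtype=torch.float16, device="cuda")
k, v = torch.randn_like(q), torch.randn_like(q)

# No import of dropout_layer_norm anywhere; that extension is only needed
# for the fused dropout + residual + LayerNorm modules.
out = flash_attn_func(q, k, v, dropout_p=0.0, causal=True)   # (batch, seqlen, nheads, headdim)
```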