flash-attention Does FlashAttention 2 use memory coalescing in Nvidia GPU?

Open AndroidSheepy opened this issue 1 year ago • 1 comments

Dear developers of flash-attention, thank you for your great work.

Recently I used Nvidia Nsight Compute to profile the performance of the FlashAttention 2 kernel (package version v2.4.2, hardware A100). The memory performance section of the generated report says the memory access pattern is not optimal because of uncoalesced memory reads/loads in the FlashAttention kernel (see the sentences in the red box in the following screenshot): issue1

Based on the results in the screenshot, I'm curious whether FlashAttention 2 takes care to use memory coalescing when accessing GPU memory. It would be very helpful if you could reply :).

May 06 '24 15:05 AndroidSheepy

Most of the reads/writes are coalesced. There are some small writes (e.g. writing to the LSE) that are not, but I don't think it matters. Let me know if you profile more and have more info on how much difference these uncoalesced accesses make.

May 06 '24 16:05 tridao
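
For readers unfamiliar with the term, here is a minimal sketch of what "coalesced" means in this discussion. It is a simplified model, not FlashAttention code: it assumes a 32-thread warp, 4-byte elements, and 32-byte global-memory sectors, and simply counts how many sectors one warp-wide load touches. The helper names (`sectors_touched`, `warp_addresses`) and the stride of 128 are hypothetical, chosen only for illustration.

```python
WARP_SIZE = 32      # threads per warp on NVIDIA GPUs
SECTOR_BYTES = 32   # assumed granularity of a global-memory transaction


def sectors_touched(byte_addresses):
    """Count the distinct 32-byte sectors covered by one warp-wide access."""
    return len({addr // SECTOR_BYTES for addr in byte_addresses})


def warp_addresses(stride_elems, elem_bytes=4, base=0):
    """Byte address accessed by each of the 32 lanes for a given element stride."""
    return [base + lane * stride_elems * elem_bytes for lane in range(WARP_SIZE)]


# Adjacent lanes read adjacent 4-byte elements: the accesses pack into few sectors.
coalesced = sectors_touched(warp_addresses(stride_elems=1))

# Adjacent lanes read elements 128 apart (illustrative stride): one sector per lane.
strided = sectors_touched(warp_addresses(stride_elems=128))

print("sectors for coalesced access:", coalesced)
print("sectors for strided access:  ", strided)
```

In this model the strided pattern moves far more sectors for the same 128 bytes of useful data. Whether that matters in practice depends on how much of the kernel's total traffic the uncoalesced accesses represent, which is the point of the reply above.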
