About the setting of bs_seqlen_vals in the benchmark_flash_attention.py
I ran benchmark_flash_attention.py on an RTX 4080 and it worked and produced results as expected. I am curious about the variable bs_seqlen_vals, which in the code is set to a list of pairs: (32, 512), (16, 1024), (8, 2048), (4, 4096), (2, 8192), (1, 16384). Is there any special consideration behind these values? In your earlier tests, did a larger batch size lead to a decrease in performance?
BTW, it might be more straightforward to report the TFLOPs/s achieved by FA relative to the achievable tensor-core TFLOPs/s of the GPU when presenting the strength of FA against its counterparts in inference/training.
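To make the suggestion concrete, here is a minimal sketch of such a utilization metric. The FLOP estimate follows the usual 4 * B * S^2 * H * D attention count used in the benchmark script; the peak tensor-core throughput (`peak_tflops`) is an assumed, user-supplied number that has to be looked up for the specific GPU and dtype.

```python
def attn_utilization(batch, seqlen, nheads, headdim, time_s, peak_tflops, causal=False):
    """Achieved attention TFLOPs/s as a fraction of the GPU's peak tensor-core TFLOPs/s.

    peak_tflops: assumed peak throughput for the GPU/dtype (e.g., FP16 tensor-core peak),
    supplied by the user; this is illustrative, not part of the benchmark script.
    """
    flops = 4 * batch * seqlen**2 * nheads * headdim / (2 if causal else 1)
    achieved_tflops = flops / time_s / 1e12
    return achieved_tflops / peak_tflops
```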
Nothing special, you can set them however you like to benchmark. Typically for language modeling one would increase the seqlen and decrease the batch size to maintain the same number of tokens per batch.
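For illustration, each pair in the list keeps the product batch_size * seqlen constant, so every configuration processes the same number of tokens per batch:

```python
# Each (batch_size, seqlen) pair keeps batch_size * seqlen = 16384 tokens,
# so the configurations are comparable at a fixed token budget.
bs_seqlen_vals = [(32, 512), (16, 1024), (8, 2048), (4, 4096), (2, 8192), (1, 16384)]
for batch_size, seqlen in bs_seqlen_vals:
    assert batch_size * seqlen == 16384  # same number of tokens per batch
```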
Hi Tridao,
Thank you for the quick feedback.
While reading the CUDA code, I noticed that the different implementations appear to target different use scenarios. Since the design of some functions (e.g., flash_attn_func and flash_attn_varlen_func) is complicated, I am having trouble following them. Could you give a brief introduction to these two functions in one or two sentences, so that I have a rough idea of how to use them correctly? That would also help me interpret the results when I run these different implementations in the benchmark_flash_attention.py script.
You can read the function docstrings and the tests: https://github.com/Dao-AILab/flash-attention/blob/main/tests/test_flash_attn.py
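For reference, a minimal sketch of how the two entry points differ, with shapes and argument names as I understand them from the docstrings (treat the exact signatures as an assumption and check test_flash_attn.py for the authoritative usage):

```python
import torch
from flash_attn import flash_attn_func, flash_attn_varlen_func

batch, seqlen, nheads, headdim = 2, 1024, 16, 64
dtype, device = torch.float16, "cuda"

# flash_attn_func: every sequence in the batch has the same length,
# so q/k/v are dense (batch, seqlen, nheads, headdim) tensors.
q = torch.randn(batch, seqlen, nheads, headdim, dtype=dtype, device=device)
k = torch.randn_like(q)
v = torch.randn_like(q)
out = flash_attn_func(q, k, v, causal=True)

# flash_attn_varlen_func: sequences of different lengths are packed into one
# (total_tokens, nheads, headdim) tensor, with cu_seqlens (cumulative sequence
# lengths, int32) marking where each sequence starts and ends.
seqlens = [1000, 1024]
total = sum(seqlens)
q_packed = torch.randn(total, nheads, headdim, dtype=dtype, device=device)
k_packed = torch.randn_like(q_packed)
v_packed = torch.randn_like(q_packed)
cu_seqlens = torch.tensor([0, 1000, 2024], dtype=torch.int32, device=device)
out_varlen = flash_attn_varlen_func(
    q_packed, k_packed, v_packed,
    cu_seqlens_q=cu_seqlens, cu_seqlens_k=cu_seqlens,
    max_seqlen_q=max(seqlens), max_seqlen_k=max(seqlens),
    causal=True,
)
```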
OK. Thanks for your suggestion.