About the setting of bs_seqlen_vals in the benchmark_flash_attention.py
I ran benchmark_flash_attention.py on an RTX 4080 and it worked and produced results as expected. I am curious about the variable bs_seqlen_vals, which in the code is set to a list of pairs: (32, 512), (16, 1024), (8, 2048), (4, 4096), (2, 8192), (1, 16384). Is there any special consideration behind these values? In your earlier tests, did a larger batch size lead to a decrease in performance?
BTW, it might be more straightforward to report the TFLOPs/s achieved by FA relative to the achievable tensor-core TFLOPs/s of the GPU when presenting the strength of FA against its counterparts in inference/training.
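To make the suggestion concrete, here is a minimal sketch of such a utilization metric. The FLOP estimate follows the usual 4 * B * S^2 * H * D attention count used in the benchmark script; the peak tensor-core throughput (`peak_tflops`) is an assumed, user-supplied number that has to be looked up for the specific GPU and dtype.

```python
def attn_utilization(batch, seqlen, nheads, headdim, time_s, peak_tflops, causal=False):
    """Achieved attention TFLOPs/s as a fraction of the GPU's peak tensor-core TFLOPs/s.

    peak_tflops: assumed peak throughput for the GPU/dtype (e.g., FP16 tensor-core peak),
    supplied by the user; this is illustrative, not part of the benchmark script.
    """
    flops = 4 * batch * seqlen**2 * nheads * headdim / (2 if causal else 1)
    achieved_tflops = flops / time_s / 1e12
    return achieved_tflops / peak_tflops
```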
Nothing special, you can set them however you like to benchmark. Typically for language modeling one would increase the seqlen and decrease the batch size to maintain the same number of tokens per batch.
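For illustration, each pair in the list keeps the product batch_size * seqlen constant, so every configuration processes the same number of tokens per batch:

```python
# Each (batch_size, seqlen) pair keeps batch_size * seqlen = 16384 tokens,
# so the configurations are comparable at a fixed token budget.
bs_seqlen_vals = [(32, 512), (16, 1024), (8, 2048), (4, 4096), (2, 8192), (1, 16384)]
for batch_size, seqlen in bs_seqlen_vals:
    assert batch_size * seqlen == 16384  # same number of tokens per batch
```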
Hi Tridao,
Thank you for the quick feedback.
While reading the CUDA code, I noticed that the different implementations appear to target different use scenarios. Since the design of some functions (e.g., flash_attn_func and flash_attn_varlen_func) is complicated, I am having trouble following them. Could you give a brief introduction to these two functions in one or two sentences, so that I have a rough idea of how to use them correctly? That would also help me interpret the results when I run these different implementations in the benchmark_flash_attention.py script.
You can read the function docstrings and the tests: https://github.com/Dao-AILab/flash-attention/blob/main/tests/test_flash_attn.py
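For reference, a minimal sketch of how the two entry points differ, with shapes and argument names as I understand them from the docstrings (treat the exact signatures as an assumption and check test_flash_attn.py for the authoritative usage):

```python
import torch
from flash_attn import flash_attn_func, flash_attn_varlen_func

batch, seqlen, nheads, headdim = 2, 1024, 16, 64
dtype, device = torch.float16, "cuda"

# flash_attn_func: every sequence in the batch has the same length,
# so q/k/v are dense (batch, seqlen, nheads, headdim) tensors.
q = torch.randn(batch, seqlen, nheads, headdim, dtype=dtype, device=device)
k = torch.randn_like(q)
v = torch.randn_like(q)
out = flash_attn_func(q, k, v, causal=True)

# flash_attn_varlen_func: sequences of different lengths are packed into one
# (total_tokens, nheads, headdim) tensor, with cu_seqlens (cumulative sequence
# lengths, int32) marking where each sequence starts and ends.
seqlens = [1000, 1024]
total = sum(seqlens)
q_packed = torch.randn(total, nheads, headdim, dtype=dtype, device=device)
k_packed = torch.randn_like(q_packed)
v_packed = torch.randn_like(q_packed)
cu_seqlens = torch.tensor([0, 1000, 2024], dtype=torch.int32, device=device)
out_varlen = flash_attn_varlen_func(
    q_packed, k_packed, v_packed,
    cu_seqlens_q=cu_seqlens, cu_seqlens_k=cu_seqlens,
    max_seqlen_q=max(seqlens), max_seqlen_k=max(seqlens),
    causal=True,
)
```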
OK. Thanks for your suggestion.