Dan Yao
FlashAttentionV1:

- forward kloop: [gridwise_batched_mha_fwd_xdl_cshuffle_v1.hpp](https://github.com/ROCmSoftwarePlatform/composable_kernel/blob/mha-train-develop/include/ck/tensor_operation/gpu/grid/gridwise_batched_mha_fwd_xdl_cshuffle_v1.hpp)
- backward kloop prototype1: [gridwise_batched_mha_bwd_xdl_cshuffle_kloop_v1.hpp](https://github.com/ROCmSoftwarePlatform/composable_kernel/blob/mha-train-develop/include/ck/tensor_operation/gpu/grid/gridwise_batched_mha_bwd_xdl_cshuffle_kloop_v1.hpp)
- backward kloop prototype2: [gridwise_batched_mha_bwd_xdl_cshuffle_kloop_v2.hpp](https://github.com/ROCmSoftwarePlatform/composable_kernel/blob/mha-train-develop/include/ck/tensor_operation/gpu/grid/gridwise_batched_mha_bwd_xdl_cshuffle_kloop_v2.hpp)

FlashAttentionV2:

- forward kloop: [gridwise_batched_mha_fwd_xdl_cshuffle_v2.hpp](https://github.com/ROCmSoftwarePlatform/composable_kernel/blob/mha-train-develop/include/ck/tensor_operation/gpu/grid/gridwise_batched_mha_fwd_xdl_cshuffle_v2.hpp)
- backward qloop from bottom to top prototype1: [gridwise_batched_mha_bwd_xdl_cshuffle_qloop_b2t_v1.hpp](https://github.com/ROCmSoftwarePlatform/composable_kernel/blob/mha-train-develop/include/ck/tensor_operation/gpu/grid/gridwise_batched_mha_bwd_xdl_cshuffle_qloop_b2t_v1.hpp)
- backward qloop from bottom to top...