123 comments of Xin Yao

Ready for review. The CI failures are unrelated to this change. @nvMelissa @timmoon10 @zhongbozhu

From the log, the Flash Attention backend is disabled because you set `NVTE_FLASH_ATTN=0`, and the cuDNN attention backend is disabled because the input is not supported. So a quick fix is...
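A minimal sketch of the env-var side of that fix: `NVTE_FLASH_ATTN` is the TransformerEngine switch named in the log, and `NVTE_DEBUG`/`NVTE_DEBUG_LEVEL` (assumed here as the usual backend-selection logging knobs) help confirm which backend is actually chosen.

```shell
# Re-enable the Flash Attention backend: either unset the variable that
# disabled it, or set it to 1 explicitly.
unset NVTE_FLASH_ATTN
export NVTE_FLASH_ATTN=1

# Optional: turn on backend-selection logging to verify the choice
# (assumed TransformerEngine debug knobs).
export NVTE_DEBUG=1
export NVTE_DEBUG_LEVEL=2

echo "NVTE_FLASH_ATTN=$NVTE_FLASH_ATTN"
```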

@sanandaraj5597 @timmoon10 Could you please review? The previous BF16 backward may lead to divergence in some cases (reported by several customers).

@RandMist You need to sign off your commits (`git commit -s`). See [this](https://github.com/NVIDIA/TransformerEngine/pull/2325/checks?check_run_id=54164971208).
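For reference, a throwaway-repo demo of what `-s` does (it appends the DCO `Signed-off-by:` trailer), plus the standard way to retroactively sign off commits already on the branch:

```shell
# Demo in a temporary repo: `git commit -s` appends a Signed-off-by trailer.
cd "$(mktemp -d)" && git init -q .
git config user.name  "Demo User"
git config user.email "demo@example.com"
echo hello > file.txt && git add file.txt
git commit -q -s -m "Example commit"
git log -1 --format=%B   # last line: Signed-off-by: Demo User <demo@example.com>

# Commits already pushed to the PR branch can be signed off in bulk
# (N is the number of commits to rewrite), then force-pushed:
#   git rebase --signoff HEAD~N
#   git push --force-with-lease
```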

> My understanding is that we have control of both edges between kernels, we can modify the launch of the current kernel with `cudaLaunchKernelEx` and we can modify if the...
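A sketch of what "modify the launch of the current kernel with `cudaLaunchKernelEx`" can look like, assuming the programmatic-dependent-launch (PDL) mechanism from CUDA 11.8+ on Hopper-class GPUs; the `producer`/`consumer` kernels here are hypothetical stand-ins, not code from the PR.

```cuda
#include <cuda_runtime.h>

__global__ void producer(float *buf) {
    buf[threadIdx.x] = 1.0f;
    // Signal that dependent work may start before this kernel fully exits.
    cudaTriggerProgrammaticLaunchCompletion();
}

__global__ void consumer(float *buf) {
    // Block until the programmatically dependent producer has triggered.
    cudaGridDependencySynchronize();
    buf[threadIdx.x] += 1.0f;
}

int main() {
    float *buf;
    cudaMalloc(&buf, 128 * sizeof(float));

    producer<<<1, 128>>>(buf);

    // Launch the consumer with a launch attribute that allows it to begin
    // before the producer completes (the "edge between kernels").
    cudaLaunchConfig_t cfg = {};
    cfg.gridDim  = dim3(1);
    cfg.blockDim = dim3(128);

    cudaLaunchAttribute attr = {};
    attr.id = cudaLaunchAttributeProgrammaticStreamSerialization;
    attr.val.programmaticStreamSerializationAllowed = 1;
    cfg.attrs    = &attr;
    cfg.numAttrs = 1;

    cudaLaunchKernelEx(&cfg, consumer, buf);
    cudaDeviceSynchronize();
    cudaFree(buf);
    return 0;
}
```

The design point is that both edges are controlled from opposite sides: the launch attribute on the second kernel relaxes the serialization, while the device-side synchronize/trigger pair re-establishes the ordering that actually matters.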