cutlass
[QST] In the TMA warp-specialized producer warp group (128 threads), are Warp1 and Warp3 idle?
In sm90_gemm_tma_warpspecialized_cooperative, the producer warp roles are defined as:
enum class ProducerWarpRole {
  Mainloop = 0,
  Warp1 = 1,
  Epilogue = 2,
  Warp3 = 3
};
I can find uses of Mainloop and Epilogue, but none of Warp1 or Warp3. Are those two warps idle?
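To make the question concrete, here is a minimal host-side sketch of how I understand the mapping. It is not CUTLASS code: `role_of` and `issues_loads` are hypothetical names, and it assumes that each of the four warps in the 128-thread producer warp group takes its role from its warp index within the group.

```cpp
// Hypothetical sketch, not the CUTLASS implementation.
enum class ProducerWarpRole {
  Mainloop = 0,
  Warp1 = 1,
  Epilogue = 2,
  Warp3 = 3
};

// Assumption: 128 producer threads / 32 threads per warp = 4 warps.
constexpr int kWarpsPerWarpGroup = 4;

// Assumption: the role is the warp's index within its warp group.
constexpr ProducerWarpRole role_of(int warp_idx) {
  return static_cast<ProducerWarpRole>(warp_idx % kWarpsPerWarpGroup);
}

// Only the two roles I found uses of would do any producer work.
constexpr bool issues_loads(ProducerWarpRole role) {
  return role == ProducerWarpRole::Mainloop ||
         role == ProducerWarpRole::Epilogue;
}
```

Under these assumptions, warps 0 and 2 of the producer warp group do work, while warps 1 and 3 match no branch.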
By the way, I noticed that with 320 threads (that is, without Warp1 and Warp3) the occupancy is 10, whereas with 384 threads (the original CUTLASS configuration) the occupancy is 12. Could that be the reason?
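The two occupancy figures match the number of warps per thread block, which the arithmetic below sketches. The helper names are hypothetical, and the sketch assumes 32 threads per warp, 128 threads per warp group, and two consumer warp groups (which is what 384 = 128 + 2 x 128 implies). Whether the reported occupancy is exactly this warp count is my assumption.

```cpp
// Hypothetical helpers for the thread-count arithmetic; not CUTLASS code.
constexpr int kThreadsPerWarp = 32;
constexpr int kThreadsPerWarpGroup = 128;

// Warps per thread block for a given number of producer warps
// and consumer warp groups.
constexpr int warps_per_cta(int producer_warps, int consumer_warp_groups) {
  return producer_warps +
         consumer_warp_groups * (kThreadsPerWarpGroup / kThreadsPerWarp);
}

constexpr int threads_per_cta(int producer_warps, int consumer_warp_groups) {
  return warps_per_cta(producer_warps, consumer_warp_groups) * kThreadsPerWarp;
}

// Original configuration: 4 producer warps + 2 consumer warp groups.
static_assert(threads_per_cta(4, 2) == 384, "original thread count");
// Without Warp1 and Warp3: 2 producer warps + 2 consumer warp groups.
static_assert(threads_per_cta(2, 2) == 320, "reduced thread count");
```

With these numbers, `warps_per_cta(4, 2)` is 12 and `warps_per_cta(2, 2)` is 10, the same values as the occupancy I observed.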