Fix potential smem misaligned address issue for ws pingpong kernel.
The current implementation places the epilogue storage before the mainloop storage in SharedStorage. When TMA loads are used and the smem size of the epilogue is not a multiple of 128B, the mainloop's smem address can end up misaligned. This PR reverses the order so that the smem address of the mainloop is 128B aligned.
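To illustrate the layout problem described above, here is a minimal host-side sketch. It assumes the base of the shared storage is itself 128B aligned, so a member's offset determines its alignment. The struct names `EpilogueStorage`, `MainloopStorage`, `EpilogueFirst`, and `MainloopFirst`, and the byte sizes, are hypothetical stand-ins and are not taken from the kernel source.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-ins for the two storage regions; sizes are illustrative only.
struct EpilogueStorage { std::uint8_t bytes[72]; };   // 72 is not a multiple of 128
struct MainloopStorage { std::uint8_t bytes[256]; };  // buffers written by TMA loads

// Epilogue placed first: the mainloop starts right after 72 bytes of epilogue.
struct EpilogueFirst {
  EpilogueStorage epilogue;
  MainloopStorage mainloop;
};

// Mainloop placed first: the mainloop starts at offset 0.
struct MainloopFirst {
  MainloopStorage mainloop;
  EpilogueStorage epilogue;
};

// True when an offset from a 128B-aligned base is itself 128B aligned.
constexpr bool is_aligned_128(std::size_t offset) { return offset % 128 == 0; }

constexpr std::size_t kEpilogueFirstMainloopOffset = offsetof(EpilogueFirst, mainloop);
constexpr std::size_t kMainloopFirstMainloopOffset = offsetof(MainloopFirst, mainloop);

// With the epilogue first, the mainloop offset inherits the epilogue's size.
static_assert(!is_aligned_128(kEpilogueFirstMainloopOffset), "mainloop is misaligned");
// With the mainloop first, the offset is 0 and the base alignment carries over.
static_assert(is_aligned_128(kMainloopFirstMainloopOffset), "mainloop is 128B aligned");
```

In this sketch the mainloop offset is 72 bytes when the epilogue comes first and 0 bytes when the order is reversed, which is the effect the reordering relies on.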
I've seen this misaligned address error as well. Same with the WS cooperative kernel.
Reversing the order of epilogue and mainloop will cause performance issues for both the pingpong kernel and the cooperative kernel. We will look for another way to solve this issue.
@tridao Yes, they are the same issue. We are working with a colleague on our compiler team to resolve it.