
support for alignment != 8 and adding a new BMM example

Open yzhaiustc opened this issue 3 years ago • 11 comments

Fixed bugs and updated the verification logic.

  • removed verification for Max, making the verification logic more consistent: since we don't check Sum, we won't check Max either.
  • fixed the correctness issue when n < 128.
  • fixed the correctness issue when Alignment != 8.
  • fixed the correctness issue for some special inputs.

yzhaiustc avatar Jul 28 '22 22:07 yzhaiustc

Thanks for fixing this issue. I tried it for the case(M=16, K=64, N=10), which needs alignment_c(softmax) = 2, then we still have inf/nan in the output softmax tensor when we print out the tensor value or set tolerance=1e-2.

terrychenism avatar Jul 30 '22 06:07 terrychenism

Fixed. Thank you for your careful tests. Please let me know if there are any further issues.

yzhaiustc avatar Jul 30 '22 17:07 yzhaiustc

Thanks for the quick fix! The latest commit solved the gemm_softmax inf/nan issue, and it works well for the M=16, K=64, N=10 case. But for N=82 (epilogue_c = 2) / N=81 (epilogue_c = 1), keeping the same M/K, the results are not perfectly matched: ~5% mismatch with tolerance = 1e-2. I also tried BMM softmax; after the latest commit I still see the inf/nan issue.

terrychenism avatar Jul 30 '22 19:07 terrychenism

Hi Terry, thanks a lot for your great patience and interest in the softmax example! The previous mismatch and the numerical errors were both due to the same issue: the algorithm I had implemented does not produce correct answers when there is more than one step in the same row of the epilogue. This has now been fixed by adopting a new algorithm.

Regarding BMM: I need some more time to get the functionality back.

Please message me at any time. Thank you!

yzhaiustc avatar Jul 30 '22 23:07 yzhaiustc

Thank you! The new algorithm works well on all of my current problem sizes, with no numerical issues now. Do you have any timeline for BMM support? Looking forward to it.

terrychenism avatar Jul 31 '22 02:07 terrychenism

Hopefully by tomorrow noon PST, but no guarantees. Let's see how it goes :) Thank you for your patience!

yzhaiustc avatar Jul 31 '22 02:07 yzhaiustc

I've brought back the batching support. Please let me know if there are issues. Thank you :-)

yzhaiustc avatar Jul 31 '22 18:07 yzhaiustc

Tested with B=16, M=16, K=64, N=24: the result of the first batch is correct, but from the 2nd batch onward the output contains inf values. I set batch_stride_Max_ and batch_stride_Sum_ to M*N.

terrychenism avatar Jul 31 '22 19:07 terrychenism

Thank you. batch_stride_Max_ and batch_stride_Sum_ should be set to block_num * options.problem_size.m(), where block_num = (N + threadblock_tile_size_N - 1) / threadblock_tile_size_N; see https://github.com/yzhaiustc/cutlass/blob/yz/softmax-misalignment/examples/35_gemm_softmax/gemm_softmax.cu#L246. Would you please try these new stride sizes? Thanks a lot!

yzhaiustc avatar Jul 31 '22 19:07 yzhaiustc

Still not working. Before PR https://github.com/NVIDIA/cutlass/pull/546, the stride should be M*N. It would be good if you could provide an example/code snippet for BMM.

I also benchmarked fused BMM + softmax (even though the numerics don't match yet) and found the fused version has a regression: for B = 16, M = 4096, N = 4096, K = 64, the fused kernel runs in 1.6 ms, while BMM (CUTLASS) plus a standalone softmax runs in 1.3 ms.

terrychenism avatar Jul 31 '22 21:07 terrychenism

Let me check with my mentor and update you regarding the bmm example. Will keep you posted. Thank you for your great patience :)

yzhaiustc avatar Jul 31 '22 21:07 yzhaiustc

This PR has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this PR if it is no longer required. Otherwise, please respond with a comment indicating any updates. This PR will be labeled inactive-90d if there is no activity in the next 60 days.

github-actions[bot] avatar Sep 02 '22 17:09 github-actions[bot]

Closing because the updates have been adopted in CUTLASS 2.10.

yzhaiustc avatar Sep 06 '22 20:09 yzhaiustc