TransformerEngine

When ub_overlap_rs_dgrad is set to True, the error "Caught signal 8 (Floating point exception: integer divide by zero)" is raised.

Open JJGSBGQ opened this issue 1 year ago • 2 comments

Setting ub_overlap_rs_dgrad to True in Megatron-LM raises the error "Caught signal 8 (Floating point exception: integer divide by zero)". I eventually traced it to the tex.gemm call in the backward pass.
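
For anyone decoding the message: signal 8 is SIGFPE, which the operating system delivers to a process when native code performs an integer division by zero (despite the "floating point" name). The snippet below is only a minimal sketch that confirms the number-to-name mapping with Python's standard library. It is not part of the original report, and `describe_signal` is a hypothetical helper name.

```python
import signal


def describe_signal(signum: int) -> str:
    """Return the symbolic name of a signal number on the current platform."""
    return signal.Signals(signum).name


# Signal 8 is the one named in the error message above.
print(describe_signal(8))
```

A division by zero in pure Python would raise ZeroDivisionError rather than kill the process with a signal, so the message suggests the fault happens in compiled code reached from the backward pass, which is consistent with the tex.gemm observation above.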


JJGSBGQ, Apr 17 '24 02:04

@minitu

JJGSBGQ, Apr 17 '24 02:04

+1

Ethan-yt, Jun 05 '24 09:06