TransformerEngine
When ub_overlap_rs_dgrad is set to True, the error "Caught signal 8 (Floating point exception: integer divide by zero)" is raised.
Setting ub_overlap_rs_dgrad to True in Megatron-LM raises the error "Caught signal 8 (Floating point exception: integer divide by zero)". The cause was eventually traced to a problem with the tex.gemm calculation in the backward pass.
@minitu
+1