RandMist

Comments by RandMist

> @RandMist @yaox12 Hello~ Our experiments found that after applying this change, the output can sometimes become **inf**. This subsequently leads to NaN gradients for the corresponding token during the...
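The failure mode described in this comment — a finite computation overflowing to **inf** in the forward pass and then poisoning the backward pass with NaN — can be reproduced with a minimal NumPy sketch (this is an illustration of the general mechanism, not the actual kernel under discussion):

```python
import numpy as np

# A large float32 logit overflows exp() to inf in the forward pass.
logits = np.array([1000.0, 1.0], dtype=np.float32)
with np.errstate(over="ignore", invalid="ignore"):
    exps = np.exp(logits)          # exp(1000) -> inf in float32
    probs = exps / exps.sum()      # inf / inf -> nan

# Once inf appears, ordinary backward arithmetic produces NaN:
# e.g. a chain-rule product of the form inf * 0.
grad = np.float32(np.inf) * np.float32(0.0)  # -> nan
```

Here `exps[0]` is inf and both `probs[0]` and `grad` are NaN, which is exactly how an inf output turns into NaN gradients for the affected token.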

> The goal of this kernel was to avoid saving the input for backward, instead writing the gradients onto the input tensor itself to reduce the peak...
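The memory-saving idea in this comment — don't keep the input around for backward, and reuse an existing buffer for the gradient so peak memory stays at one tensor instead of two — can be sketched with a hypothetical ReLU in NumPy (ReLU is convenient because its gradient mask is recoverable from the *output*, so the input never needs saving; the function names are illustrative, not the kernel's actual API):

```python
import numpy as np

def relu_forward(x: np.ndarray) -> np.ndarray:
    # Compute the activation in place: the original input values
    # are overwritten and therefore never saved for backward.
    np.maximum(x, 0.0, out=x)
    return x

def relu_backward(grad_out: np.ndarray, saved_out: np.ndarray) -> np.ndarray:
    # The gradient mask (output > 0) is derived from the saved output,
    # and the result is written back onto that same buffer, so no
    # extra gradient tensor is allocated.
    np.multiply(grad_out, saved_out > 0, out=saved_out)
    return saved_out

x = np.array([-1.0, 2.0])
y = relu_forward(x)                      # y aliases x: [0.0, 2.0]
g_in = relu_backward(np.ones(2), y)      # g_in aliases y: [0.0, 1.0]
```

Note that `g_in`, `y`, and `x` are all the same buffer, which is the point: the trade-off is that the forward output is destroyed during backward, which is fine only when nothing else still needs it.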