RandMist
> @RandMist @yaox12 Hello~ Our experiments found that after applying this change, the output can sometimes become **inf**. This subsequently leads to NaN gradients for the corresponding token during the...
> The goal of this kernel was to avoid saving the input for backward: the gradients are written onto the input tensor itself to reduce the peak...