Natalia Gimelshein

Results: 214 comments by Natalia Gimelshein

Also very curious where the CuTeDSL nvfp4 kernels are available; the ones in https://github.com/NVIDIA/cutlass/blob/main/examples/python/CuTeDSL/blackwell/grouped_blockscaled_gemm.py#L2857 synchronize like there's no tomorrow (in the linked line). Also not sure if it supports dynamic K...

@pytorchbot merge

@pytorchbot merge -i

The theoretical (and practical) benefit of stochastic rounding (SR) is that it provides an unbiased gradient estimate. As for expensive computation: for bf16 it's still bandwidth-bound on H100; nvfp4 might be different as...
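
To make the unbiasedness point concrete, here is a minimal sketch of stochastic rounding from fp32 to bf16 in plain Python. It assumes the common bit trick of adding 16 random bits to the fp32 bit pattern and truncating; that is an illustration only, not necessarily what any particular kernel does. The helper names (`bf16_nearest`, `bf16_stochastic`, `mean_of_rounded`) are hypothetical, and inf/NaN/overflow handling is left out.

```python
import random
import struct


def f32_bits(x):
    """Return the IEEE-754 float32 bit pattern of x as an unsigned int."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_f32(b):
    """Inverse of f32_bits: reinterpret a 32-bit pattern as a float."""
    return struct.unpack("<f", struct.pack("<I", b & 0xFFFFFFFF))[0]


def bf16_nearest(x):
    """Round-to-nearest-even to bf16 (keep the top 16 bits of the fp32 pattern)."""
    b = f32_bits(x)
    b += 0x7FFF + ((b >> 16) & 1)  # rounding bias, ties go to the even mantissa
    return bits_f32(b & 0xFFFF0000)


def bf16_stochastic(x, rng):
    """Stochastic rounding to bf16: round up with probability equal to the
    dropped fraction, so the expected result equals x (finite inputs only)."""
    b = f32_bits(x)
    b += rng.getrandbits(16)  # uniform noise over the 16 bits being dropped
    return bits_f32(b & 0xFFFF0000)


def mean_of_rounded(x, n, seed=0):
    """Average n stochastic roundings of x; hypothetical helper for the demo."""
    rng = random.Random(seed)
    return sum(bf16_stochastic(x, rng) for _ in range(n)) / n


if __name__ == "__main__":
    # bf16 spacing near 1.0 is 2**-7, so this value sits 1/8 of the way
    # between the neighbours 1.0 and 1.0078125.
    x = 1.0 + 2.0 ** -10
    print("nearest   :", bf16_nearest(x))              # always rounds down to 1.0
    print("stochastic:", mean_of_rounded(x, 20000))    # averages out close to x
```

Round-to-nearest drops the same small increment every time, so repeated small updates can vanish entirely; SR rounds up 1/8 of the time in this example, which keeps the average equal to the original value at the cost of extra variance and random-number generation.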