Jin Huang

3 issues by Jin Huang

I don't see a backward speedup using NATTEN, even with a kernel size only half the input size when calling na3d(). I'm not sure whether this is expected. Could anyone help clarify...
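For reference, a rough timing harness along these lines can separate forward from forward+backward cost. This is a minimal sketch, assuming NATTEN's functional `na3d(q, k, v, kernel_size)` interface and a `[batch, depth, height, width, heads, head_dim]` tensor layout; both have changed across NATTEN versions, so the shapes and call here are assumptions, not taken from the issue.

```python
import time
import torch
from natten.functional import na3d  # assumes a NATTEN version exposing the functional na3d

# Assumed layout: [B, D, H, W, heads, head_dim]; kernel ~ half the input extent (odd).
B, D, H, W, heads, dim = 1, 16, 32, 32, 8, 64
kernel = (D // 2 + 1, H // 2 + 1, W // 2 + 1)

q, k, v = (torch.randn(B, D, H, W, heads, dim, device="cuda",
                       dtype=torch.float16, requires_grad=True)
           for _ in range(3))

def timed(fn, iters=20):
    torch.cuda.synchronize()
    t0 = time.time()
    for _ in range(iters):
        fn()
    torch.cuda.synchronize()
    return (time.time() - t0) / iters

out = na3d(q, k, v, kernel_size=kernel)
grad = torch.randn_like(out)

fwd = lambda: na3d(q, k, v, kernel_size=kernel)
# Re-runs the forward and then backpropagates, so this measures fwd + bwd together.
bwd = lambda: torch.autograd.grad(na3d(q, k, v, kernel_size=kernel), (q, k, v), grad)

print(f"forward   {timed(fwd) * 1e3:.2f} ms")
print(f"fwd + bwd {timed(bwd) * 1e3:.2f} ms")
```

Comparing these numbers against a full self-attention baseline of the same shape is what would show whether the backward pass benefits from the smaller neighborhood at all.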

I created an issue earlier: https://github.com/Dao-AILab/flash-attention/issues/1157. Looking at https://github.com/Dao-AILab/flash-attention/blob/main/hopper/flash_api.cpp#L447, I think the kernels are unified. Why is FP8 enabled for mha_fwd but not for mha_varlen_fwd? What's the blocker now? I'm willing to...
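For context, the varlen path differs from the fixed-length one in how batches are laid out: sequences are packed into a single token dimension and delimited by `cu_seqlens`. Below is a minimal sketch of that calling convention using the FA2-style Python wrapper `flash_attn_varlen_func` for illustration; the Hopper (FA3) interface behind `mha_varlen_fwd` exposes an analogous entry point, and all shapes and dtypes here are assumptions, not taken from the linked issue.

```python
import torch
from flash_attn import flash_attn_varlen_func  # FA2-style wrapper, for illustration only

# Three variable-length sequences packed into one "total_tokens" dimension,
# delimited by cumulative sequence lengths (cu_seqlens).
seqlens = [5, 9, 3]
cu_seqlens = torch.tensor([0, 5, 14, 17], device="cuda", dtype=torch.int32)
total, heads, dim = sum(seqlens), 8, 64

q, k, v = (torch.randn(total, heads, dim, device="cuda", dtype=torch.float16)
           for _ in range(3))

out = flash_attn_varlen_func(
    q, k, v,
    cu_seqlens_q=cu_seqlens, cu_seqlens_k=cu_seqlens,
    max_seqlen_q=max(seqlens), max_seqlen_k=max(seqlens),
    causal=True,
)
print(out.shape)  # (total_tokens, heads, head_dim)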

`Flux diffusion model implementation using quantized fp8 matmul & remaining layers use faster half precision accumulate, which is ~2x faster on consumer devices.` Hello there! Thanks for sharing your quantization...
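To make the two tricks in that description concrete, here is a minimal sketch of per-tensor FP8 quantization for a single matmul plus reduced-precision FP16 accumulation for the remaining layers. It assumes PyTorch's `torch._scaled_mm` for the FP8 GEMM (its signature and return type have changed across releases, and it needs an Ada/Hopper-class GPU) and the `allow_fp16_reduced_precision_reduction` backend flag; the helper name `fp8_linear` and all shapes are hypothetical, not from the repository being discussed.

```python
import torch

# "Faster half precision accumulate" for the non-quantized layers: allow cuBLAS
# to reduce fp16 matmuls in fp16 instead of fp32.
torch.backends.cuda.matmul.allow_fp16_reduced_precision_reduction = True

def fp8_linear(x_fp16: torch.Tensor, w_fp16: torch.Tensor) -> torch.Tensor:
    """Hypothetical helper: quantize to fp8 e4m3 and run a scaled matmul."""
    # Per-tensor scales so values fit the fp8 e4m3 range (~448).
    x_scale = x_fp16.abs().amax().clamp(min=1e-12) / 448.0
    w_scale = w_fp16.abs().amax().clamp(min=1e-12) / 448.0
    x_fp8 = (x_fp16 / x_scale).to(torch.float8_e4m3fn)
    w_fp8 = (w_fp16 / w_scale).to(torch.float8_e4m3fn)
    # torch._scaled_mm wants the second operand column-major; output back in fp16.
    return torch._scaled_mm(
        x_fp8,
        w_fp8.t().contiguous().t(),
        scale_a=x_scale.float(),
        scale_b=w_scale.float(),
        out_dtype=torch.float16,
    )

x = torch.randn(16, 4096, device="cuda", dtype=torch.float16)
w = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
y = fp8_linear(x, w)  # result stays in fp16 for the rest of the network
```

The claimed ~2x speedup on consumer devices would come from the FP8 tensor cores on the quantized matmuls plus the cheaper fp16 reduction everywhere else; the exact gain depends on GPU generation and matrix sizes.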