Jin Huang

3 issues by Jin Huang

I don't see a backward speedup using NATTEN, even with a kernel size only half the input size when calling na3d(). I'm not sure whether this is expected. Could anyone help clarify...
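For reference, a rough timing harness along these lines can separate forward from forward+backward cost. This is a minimal sketch, assuming NATTEN's functional `na3d(q, k, v, kernel_size)` interface and a `[batch, depth, height, width, heads, head_dim]` tensor layout; both have changed across NATTEN versions, so the shapes and call here are assumptions, not taken from the issue.

```python
import time
import torch
from natten.functional import na3d  # assumes a NATTEN version exposing the functional na3d

# Assumed layout: [B, D, H, W, heads, head_dim]; kernel ~ half the input extent (odd).
B, D, H, W, heads, dim = 1, 16, 32, 32, 8, 64
kernel = (D // 2 + 1, H // 2 + 1, W // 2 + 1)

q, k, v = (torch.randn(B, D, H, W, heads, dim, device="cuda",
                       dtype=torch.float16, requires_grad=True)
           for _ in range(3))

def timed(fn, iters=20):
    torch.cuda.synchronize()
    t0 = time.time()
    for _ in range(iters):
        fn()
    torch.cuda.synchronize()
    return (time.time() - t0) / iters

out = na3d(q, k, v, kernel_size=kernel)
grad = torch.randn_like(out)

fwd = lambda: na3d(q, k, v, kernel_size=kernel)
# Re-runs the forward and then backpropagates, so this measures fwd + bwd together.
bwd = lambda: torch.autograd.grad(na3d(q, k, v, kernel_size=kernel), (q, k, v), grad)

print(f"forward   {timed(fwd) * 1e3:.2f} ms")
print(f"fwd + bwd {timed(bwd) * 1e3:.2f} ms")
```

Comparing these numbers against a full self-attention baseline of the same shape is what would show whether the backward pass benefits from the smaller neighborhood at all.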

I created an issue earlier: https://github.com/Dao-AILab/flash-attention/issues/1157. Looking at https://github.com/Dao-AILab/flash-attention/blob/main/hopper/flash_api.cpp#L447, I think the kernels are unified. Why is FP8 enabled for mha_fwd but not for mha_varlen_fwd? What's the blocker now? I'm willing to...
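For context, the varlen path differs from the fixed-length one in how batches are laid out: sequences are packed into a single token dimension and delimited by `cu_seqlens`. Below is a minimal sketch of that calling convention using the FA2-style Python wrapper `flash_attn_varlen_func` for illustration; the Hopper (FA3) interface behind `mha_varlen_fwd` exposes an analogous entry point, and all shapes and dtypes here are assumptions, not taken from the linked issue.

```python
import torch
from flash_attn import flash_attn_varlen_func  # FA2-style wrapper, for illustration only

# Three variable-length sequences packed into one "total_tokens" dimension,
# delimited by cumulative sequence lengths (cu_seqlens).
seqlens = [5, 9, 3]
cu_seqlens = torch.tensor([0, 5, 14, 17], device="cuda", dtype=torch.int32)
total, heads, dim = sum(seqlens), 8, 64

q, k, v = (torch.randn(total, heads, dim, device="cuda", dtype=torch.float16)
           for _ in range(3))

out = flash_attn_varlen_func(
    q, k, v,
    cu_seqlens_q=cu_seqlens, cu_seqlens_k=cu_seqlens,
    max_seqlen_q=max(seqlens), max_seqlen_k=max(seqlens),
    causal=True,
)
print(out.shape)  # (total_tokens, heads, head_dim)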

`Flux diffusion model implementation using quantized fp8 matmul & remaining layers use faster half precision accumulate, which is ~2x faster on consumer devices.` Hello there! Thanks for sharing your quantization...
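To make the two tricks in that description concrete, here is a minimal sketch of per-tensor FP8 quantization for a single matmul plus reduced-precision FP16 accumulation for the remaining layers. It assumes PyTorch's `torch._scaled_mm` for the FP8 GEMM (its signature and return type have changed across releases, and it needs an Ada/Hopper-class GPU) and the `allow_fp16_reduced_precision_reduction` backend flag; the helper name `fp8_linear` and all shapes are hypothetical, not from the repository being discussed.

```python
import torch

# "Faster half precision accumulate" for the non-quantized layers: allow cuBLAS
# to reduce fp16 matmuls in fp16 instead of fp32.
torch.backends.cuda.matmul.allow_fp16_reduced_precision_reduction = True

def fp8_linear(x_fp16: torch.Tensor, w_fp16: torch.Tensor) -> torch.Tensor:
    """Hypothetical helper: quantize to fp8 e4m3 and run a scaled matmul."""
    # Per-tensor scales so values fit the fp8 e4m3 range (~448).
    x_scale = x_fp16.abs().amax().clamp(min=1e-12) / 448.0
    w_scale = w_fp16.abs().amax().clamp(min=1e-12) / 448.0
    x_fp8 = (x_fp16 / x_scale).to(torch.float8_e4m3fn)
    w_fp8 = (w_fp16 / w_scale).to(torch.float8_e4m3fn)
    # torch._scaled_mm wants the second operand column-major; output back in fp16.
    return torch._scaled_mm(
        x_fp8,
        w_fp8.t().contiguous().t(),
        scale_a=x_scale.float(),
        scale_b=w_scale.float(),
        out_dtype=torch.float16,
    )

x = torch.randn(16, 4096, device="cuda", dtype=torch.float16)
w = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
y = fp8_linear(x, w)  # result stays in fp16 for the rest of the network
```

The claimed ~2x speedup on consumer devices would come from the FP8 tensor cores on the quantized matmuls plus the cheaper fp16 reduction everywhere else; the exact gain depends on GPU generation and matrix sizes.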