Tri Dao
Feel free to work on it if you need it.
Right, the Ada architecture doesn't have WGMMA or TMA. FA2 might already be close to optimal on Ada.
Of course, that's welcome. It depends on whether people want to contribute.
It's not commonly done. FA2 is already close to optimal on A100 (70% max theoretical FLOPS).
Warp specialization will be difficult without the async features. Overlapping GEMM and softmax would still be useful.
Thanks for the bug report. There was a mistake in the mapping between old and new parameter names, which we've now fixed.
Please use triton >= 2.1.0
triton 2.1.0 should have cumsum. If not you can try >= 2.2.0
No, we require tl.cumsum.
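For anyone unsure whether their install meets this requirement, here is a minimal sketch of a check. It assumes only that the package exposes `triton.__version__` and that `tl` is `triton.language`; the helper names `parse_version` and `check_triton` are made up for illustration and are not part of this repo or of Triton.

```python
import importlib
import re

# Minimum version mentioned above; tl.cumsum is the feature actually required.
MIN_TRITON = (2, 1, 0)


def parse_version(text):
    """Return the first three numeric components of a version string as ints."""
    # Drop any local suffix such as "+git1234" before extracting numbers.
    parts = [int(p) for p in re.findall(r"\d+", text.split("+")[0])[:3]]
    # Pad short versions like "3.0" to three components.
    return tuple(parts + [0] * (3 - len(parts)))


def check_triton():
    """Return a short human-readable status string (hypothetical helper)."""
    try:
        triton = importlib.import_module("triton")
        tl = importlib.import_module("triton.language")
    except Exception:
        # Any import failure is treated as "not usable" in this sketch.
        return "triton could not be imported"
    version = getattr(triton, "__version__", "0")
    if parse_version(version) < MIN_TRITON:
        return f"triton {version} is older than 2.1.0, please upgrade"
    if not hasattr(tl, "cumsum"):
        return f"triton {version} has no tl.cumsum, try >= 2.2.0"
    return f"triton {version} provides tl.cumsum"


if __name__ == "__main__":
    print(check_triton())
```

The version comparison is only a first filter; the `hasattr` check on `tl.cumsum` is what matters, since that is the function actually needed.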