Tri Dao
Results: 473 comments by Tri Dao
Yeah, overlapping works reasonably well on Hopper; for older architectures it might be harder to do.
I don't think that's easy to measure. For small hdim (e.g. 64) and for fp8, softmax is still a bottleneck.
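A rough back-of-envelope sketch of why small hdim and fp8 make softmax relatively more expensive. The throughput numbers are assumptions (approximate H100 SXM5 peak figures, not measurements), the function name is made up for illustration, and only the exponential is counted, so the real softmax cost (max, sum, rescaling) would be higher than this estimate.

```python
# Back-of-envelope estimate only. All throughput figures are ASSUMED
# approximate H100 SXM5 peaks, not measured values.
MATMUL_TFLOPS = {"fp16": 989.0, "fp8": 1978.0}  # assumed dense tensor-core peak
EXP_TFLOPS = 3.9  # assumed special-function (exp) throughput


def exp_to_matmul_time_ratio(hdim: int, dtype: str) -> float:
    """Estimated exp time divided by matmul time, per attention score (forward).

    Hypothetical helper for illustration, not part of any library.
    """
    # Each score element costs 2*hdim FLOPs in Q @ K^T and 2*hdim in P @ V.
    matmul_flops_per_score = 4 * hdim
    matmul_time = matmul_flops_per_score / MATMUL_TFLOPS[dtype]
    # One exponential per score element.
    exp_time = 1 / EXP_TFLOPS
    return exp_time / matmul_time


if __name__ == "__main__":
    for dtype in ("fp16", "fp8"):
        for hdim in (64, 128, 256):
            ratio = exp_to_matmul_time_ratio(hdim, dtype)
            print(f"{dtype} hdim={hdim}: exp/matmul time ratio = {ratio:.2f}")
```

Under these assumptions the ratio doubles each time hdim halves and doubles again going from fp16 to fp8, since the matmul gets cheaper while the number of exponentials stays the same.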
Yes, I think that should work. You should still test it, though.