Tri Dao

Results 473 comments of Tri Dao

Yeah, overlapping works reasonably well on Hopper; on older architectures it might be harder to do.

I don't think that's easy to measure. For small hdim (e.g. 64) and for fp8, softmax is still a bottleneck.
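A rough back-of-the-envelope sketch of why small hdim and fp8 make softmax matter more, not a measurement. It assumes each (query, key) pair costs 4 × hdim matmul FLOPs (QK^T plus PV) and one exponential, and it ignores overlap, memory traffic, and the rest of the softmax work. The function name and the throughput figures in the usage lines are illustrative assumptions (approximate H100 peak numbers), not values from the comment above.

```python
def exp_to_matmul_time_ratio(head_dim, matmul_tflops, exp_tops):
    """Estimate (time spent on exp) / (time spent on matmul) in attention.

    Assumed cost model per (query, key) pair:
      - QK^T: 2 * head_dim FLOPs
      - PV:   2 * head_dim FLOPs
      - softmax: 1 exponential
    Throughputs are peak rates in tera-ops per second (assumed inputs).
    """
    matmul_flops_per_pair = 4 * head_dim
    matmul_time = matmul_flops_per_pair / matmul_tflops
    exp_time = 1 / exp_tops
    return exp_time / matmul_time


# Assumed, approximate peak rates: ~989 TFLOPS BF16 matmul,
# ~1978 TFLOPS fp8 matmul, ~3.9 T exponentials per second.
bf16_hdim128 = exp_to_matmul_time_ratio(128, 989, 3.9)
bf16_hdim64 = exp_to_matmul_time_ratio(64, 989, 3.9)
fp8_hdim64 = exp_to_matmul_time_ratio(64, 1978, 3.9)
```

Under these assumptions the exponentials take about half as long as the matmuls at hdim 128 in BF16, about as long at hdim 64, and about twice as long at hdim 64 in fp8. Halving hdim or doubling matmul throughput each doubles the relative cost of softmax.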

Yes, I think that should work. You should still test it, though.