
[NVIDIA GPU] Fully unroll windowed einsum loops to hide DUS overheads

Open Tixxx opened this issue 1 year ago • 6 comments

Windowed einsum loops used to be unrolled by a factor of 2 so that two GEMMs could overlap. That leaves some of the dynamic update slices (DUSes) at the end of the loop exposed. This PR fully unrolls the loop so that the DUSes can also be overlapped with independent GEMMs. To keep the while-loop simplifier pass from inlining the fully unrolled loop, we add the attribute "skip-simplify-while-loops/trip-count-one=true" to it. The PR also includes a minor fix to the while-loop double-buffering pass so that it skips unrolling when the trip count is 1.
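As a rough illustration of the idea (a toy Python model, not the XLA implementation), the sketch below flattens a loop with a known trip count into a single-iteration loop and tags it with the attribute named above. `ToyLoop` and `fully_unroll` are hypothetical names invented for this sketch; only the attribute string comes from the description.

```python
from dataclasses import dataclass, field


@dataclass
class ToyLoop:
    """Hypothetical stand-in for a while loop with a known trip count."""
    trip_count: int
    body: list  # names of the ops executed in one iteration
    attributes: dict = field(default_factory=dict)


# Attribute key taken from the PR description; everything else is illustrative.
SKIP_SIMPLIFY_ATTR = "skip-simplify-while-loops/trip-count-one"


def fully_unroll(loop: ToyLoop) -> ToyLoop:
    """Return a single-iteration loop whose body holds every iteration's ops."""
    # Mirrors the double-buffering fix: nothing to unroll when trip count is 1.
    if loop.trip_count <= 1:
        return loop
    # Copy the body once per iteration, suffixing each op with its iteration index.
    unrolled_body = [
        f"{op}.{i}" for i in range(loop.trip_count) for op in loop.body
    ]
    # Tag the result so a simplifier would not inline the trip-count-one loop.
    attrs = {**loop.attributes, SKIP_SIMPLIFY_ATTR: "true"}
    return ToyLoop(trip_count=1, body=unrolled_body, attributes=attrs)
```

With every iteration's ops in one body, a scheduler is free to place a DUS from one iteration next to an independent GEMM from another, which is the overlap the PR is after.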

Tixxx, Jul 09 '24 17:07