Vijay Thakkar

81 comments by Vijay Thakkar

btw, I do not want to discourage you from using the 3.x API on Ampere; it's totally kosher. We just recommend the 2.x API for the best-performing Ampere kernels since they...

The alternative solution is to just launch two different kernels on two separate streams, which will likely give you equivalent or perhaps even better perf depending on the problem shapes...

For CUTLASS 3.x epilogues based on CuTe, it's trivial to inject the coordinate from the collective epi into the thread functor. We already create the coordinate tensor for the purposes...

because shrinking a single load per tile from 4 bytes to 1 byte does not change the perf at all. Changing fp32 multiplication to int8 will also not move...
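A rough back-of-the-envelope check of the claim above, as a minimal C++ sketch. The 128x128 tile shape and the 2-byte (fp16) output element are assumptions chosen purely for illustration; they do not come from the original comment, and all names below are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical tile parameters, chosen only for illustration.
constexpr std::size_t kTileM = 128;
constexpr std::size_t kTileN = 128;
constexpr std::size_t kOutputElemBytes = 2;  // assumption: fp16 output

constexpr std::size_t kFp32Bytes = 4;  // size of an fp32 alpha/beta scalar
constexpr std::size_t kInt8Bytes = 1;  // size of an int8 alpha/beta scalar

// Bytes written for one output tile.
constexpr std::size_t tile_output_bytes() {
  return kTileM * kTileN * kOutputElemBytes;
}

// Bytes saved per tile by loading one int8 scalar instead of one fp32 scalar.
constexpr std::size_t scalar_bytes_saved() {
  return kFp32Bytes - kInt8Bytes;
}

// Saving expressed as a fraction of the tile's output traffic.
double saved_fraction() {
  assert(tile_output_bytes() > 0);
  return static_cast<double>(scalar_bytes_saved()) /
         static_cast<double>(tile_output_bytes());
}
```

Under these assumed numbers the tile writes 32768 bytes while the scalar change saves 3 bytes, which is roughly 0.009% of the tile's output traffic alone, before counting the operand loads of the mainloop.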

Although I doubt it, you can certainly try int8 alpha/beta to see if it would help in this case. What you would have to do is modify the epilogue thread...

Without more info than what you've given, all I can say is "yes". The int8 atoms exist for all archs.

@rawnhenry are we missing a static assert somewhere in the collective for valid tile shapes?

A tile is a tuple of layouts. If you divide with a shape, that is equivalent to dividing with a tile of trivial layouts (layouts that have the same shape,...

Sounds like you want a grouped gemm that supports gather/scatter? Have you taken a look at example 52 for inspiration? Happy to help with the design, but using CuTe is...