Hongtao Yu
> BTW, @htyu can you help me understand when it's safe to apply `tl.multiple_of` or `tl.max_contiguous` here?
>
> ```
> ram = tl.max_contiguous(tl.multiple_of(rm % M, BLOCK_M), BLOCK_M)
> ```
> ...
> Here are some perf tests:
>
> 1. Changing `stride_ak` from 50304 to 50257 (i.e. removing the padding), perf is 27.69 ms. In this case I think even if we can...
> `rm % 50304` is not always in bounds. The last row may not have extra items after it. The source tensor X does not have an out-of-bounds issue but...
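On the question of when the `tl.multiple_of` / `tl.max_contiguous` hints on `rm % M` are safe, here is a minimal plain-Python sketch (no Triton required) of the two properties the hints roughly assert, as I understand them: each block of `rm % M` starts at a multiple of `BLOCK_M`, and its `BLOCK_M` values are consecutive. The helper names and `BLOCK_M = 128` are assumptions for illustration only, and the two sizes from the thread are used here as example values of `M`.

```python
def hints_hold(M, BLOCK_M, pid_m):
    """Check, for one program id, the properties the hints would assert.

    Hypothetical helper: mirrors rm = pid_m * BLOCK_M + arange(BLOCK_M).
    """
    rm = [pid_m * BLOCK_M + i for i in range(BLOCK_M)]
    ram = [r % M for r in rm]  # wrapped row offsets, as in `rm % M`
    starts_on_multiple = ram[0] % BLOCK_M == 0
    contiguous = all(b - a == 1 for a, b in zip(ram, ram[1:]))
    return starts_on_multiple and contiguous


def hints_hold_everywhere(M, BLOCK_M):
    """Check every block along M, including the last (possibly partial) one."""
    num_blocks = -(-M // BLOCK_M)  # ceiling division
    return all(hints_hold(M, BLOCK_M, p) for p in range(num_blocks))


BLOCK_M = 128  # assumed block size, not taken from the thread
print(hints_hold_everywhere(50304, BLOCK_M))  # M divisible by BLOCK_M
print(hints_hold_everywhere(50257, BLOCK_M))  # last block wraps around
```

Under these assumptions, the properties hold for every block only when `M` is a multiple of `BLOCK_M`. Otherwise the last block wraps past `M` back to 0, so its values are no longer consecutive.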