shunting314
Not sure if the stack trace is accurate; right now it fails in the torch.cuda.random module: https://gist.github.com/shunting314/bcb7b72aecac95c63d115058a111aa03 . The root cause should not be related to comprehensive padding since the padding...
OK, here is the pure Triton repro: https://gist.github.com/shunting314/cb04b62434ddedac0cc1ad5f6685f5c5 . It looks like the Triton matmul kernel does not handle non-contiguous input tensors well. We get a misaligned memory access even though the tensor...
> Padding to 50304 makes sense but I think you may want to align M with stride_ak to ensure that pointer A moves in an aligned way.

Hmm, this actually...
Also, changing M from 50257 to 50304 risks out-of-range memory accesses for tensor A and the output tensor.
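To make the risk concrete, here is a back-of-the-envelope sketch (the N below is just an illustrative output width, not something from the repro): with M bumped to 50304 while A and the output are still sized for 50257 rows, the last column of A and the tail of the output fall outside their allocations.
```
# Storage actually allocated by torch.empty_strided((50257, 32768), (1, 50304)):
rows, cols, padded_rows = 50257, 32768, 50304
allocated = (cols - 1) * padded_rows + rows   # largest offset touched, plus one

# Elements a kernel would touch if it indexed padded_rows rows in every column:
touched = cols * padded_rows
print(touched - allocated)  # 47: reads in the last column run past the allocation

# For a contiguous output of shape (rows, N), writing padded_rows rows would
# overflow the buffer by (padded_rows - rows) * N elements.
N = 1024  # illustrative only
print((padded_rows - rows) * N)  # 48128 elements written out of bounds
```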
> Does the stride(1) give the correct element when moving down that direction? It looks like skipping 50304 elements is not going to give the same element on the next...
Here is the code (https://gist.github.com/shunting314/cb04b62434ddedac0cc1ad5f6685f5c5#file-t-py-L76) that creates the X tensor:
```
x = torch.empty_strided((50257, 32768), ((1, 50304)), dtype=torch.bfloat16, device='cuda')
```
You can see that the stride(1) is 50304. It's...
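For reference, here is a scaled-down sketch of what that layout means (small sizes so it runs anywhere; 7, 4 and 8 stand in for 50257, 32768 and 50304): element (i, j) lives at flat offset i * 1 + j * stride(1), so each column of valid elements is followed by a run of padding elements before the next column starts.
```
import torch

rows, cols, padded_rows = 7, 4, 8   # stand-ins for 50257, 32768, 50304
x = torch.empty_strided((rows, cols), (1, padded_rows), dtype=torch.bfloat16)
print(x.stride())                   # (1, 8): stride(1) skips padded_rows elements
print(x.is_contiguous())            # False
print(x.stride(1) - x.size(0))      # 1 padding element per column (47 in the repro)
```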
> So torch.empty_strided automatically adds padding when stride and shape don't match? Will the padding be all zeros? If so changing M to 50304 should be safe?

The padding can...
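A quick way to see what is in that gap (scaled-down sizes again; as_strided is only used here to view the whole underlying storage): torch.empty_strided, like torch.empty, returns uninitialized memory, so the padding holds whatever happened to be there unless it is zeroed explicitly.
```
import torch

rows, cols, padded_rows = 7, 4, 8
x = torch.empty_strided((rows, cols), (1, padded_rows), dtype=torch.float32)

# Flat view over the entire underlying storage of x.
flat = torch.as_strided(x, ((cols - 1) * padded_rows + rows,), (1,))
print(flat[rows:padded_rows])  # padding slot after the first column: not guaranteed to be zero

# Making reads of the padded rows well defined would require zeroing the
# storage explicitly, e.g. flat.zero_().
```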
@htyu changing
```
ram = tl.max_contiguous(tl.multiple_of(rm % M, BLOCK_M), BLOCK_M)
```
to
```
ram = rm % M
```
(i.e. removing the compiler hints) makes it work. Do you know...
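For context, here is a condensed sketch of where that line sits (not the full matmul kernel from the gist; the kernel name, the USE_HINTS switch, and the plain block copy are made up for illustration, and it assumes A has at least BLOCK_K columns and a grid of cdiv(M, BLOCK_M) programs):
```
import triton
import triton.language as tl

@triton.jit
def load_a_block(a_ptr, out_ptr, M, stride_am, stride_ak,
                 BLOCK_M: tl.constexpr, BLOCK_K: tl.constexpr,
                 USE_HINTS: tl.constexpr):
    pid_m = tl.program_id(0)
    rm = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    if USE_HINTS:
        # Promise the compiler that the row indices start at a multiple of
        # BLOCK_M and are contiguous for BLOCK_M elements, so it can emit
        # wide vectorized loads.
        ram = tl.max_contiguous(tl.multiple_of(rm % M, BLOCK_M), BLOCK_M)
    else:
        # The variant that makes the repro pass: plain modulo, no hints.
        ram = rm % M
    rk = tl.arange(0, BLOCK_K)
    a_ptrs = a_ptr + ram[:, None] * stride_am + rk[None, :] * stride_ak
    a = tl.load(a_ptrs)
    # Copy the block out so the load is not optimized away; out is a
    # contiguous buffer with cdiv(M, BLOCK_M) * BLOCK_M rows and BLOCK_K columns.
    out_ptrs = out_ptr + rm[:, None] * BLOCK_K + rk[None, :]
    tl.store(out_ptrs, a)
```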
> The cost would be to lose vectorization as the memory accesses are not considered aligned or contiguous.

But not doing padding would also lose vectorization since the tensor shape...
BTW, @htyu can you help me understand when it's safe to apply `tl.multiple_of` or `tl.max_contiguous` here?
```
ram = tl.max_contiguous(tl.multiple_of(rm % M, BLOCK_M), BLOCK_M)
```
I think we can apply...
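Here is a small numerical check of what those hints promise (nothing gist-specific; pid 785 is just the last block when BLOCK_M = 64): `rm % M` only starts at a multiple of BLOCK_M and stays contiguous for BLOCK_M elements when M itself is a multiple of BLOCK_M; otherwise the last block wraps around.
```
import torch

def block_row_indices(pid_m, M, BLOCK_M):
    rm = pid_m * BLOCK_M + torch.arange(BLOCK_M)
    return rm % M

BLOCK_M = 64

# M a multiple of BLOCK_M: the last block is a contiguous, BLOCK_M-aligned
# run, so the hints hold.
print(block_row_indices(785, 50304, BLOCK_M))  # 50240, 50241, ..., 50303

# M = 50257: the last block wraps around, so the indices are neither
# contiguous for BLOCK_M elements nor safe to annotate with
# tl.max_contiguous / tl.multiple_of.
print(block_row_indices(785, 50257, BLOCK_M))  # 50240, ..., 50256, 0, 1, ..., 46
```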