shunting314
Not sure if the stack trace is accurate; right now it fails in the torch.cuda.random module: https://gist.github.com/shunting314/bcb7b72aecac95c63d115058a111aa03 . The root cause should not be related to comprehensive padding since the padding...
OK, here is the pure Triton repro: https://gist.github.com/shunting314/cb04b62434ddedac0cc1ad5f6685f5c5 . It looks like the Triton matmul kernel does not handle non-contiguous input tensors well. We get a misaligned memory access even though the tensor...
> Padding to 50304 makes sense but I think you may want to align M with stride_ak to ensure that pointer A moves in an aligned way.

Hmm, this actually...
Also, changing M from 50257 to 50304 risks out-of-range memory accesses for tensor A and the output tensor.
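To make the risk concrete, here is a back-of-the-envelope sketch (the N below is just an illustrative output width, not something from the repro): with M bumped to 50304 while A and the output are still sized for 50257 rows, the last column of A and the tail of the output fall outside their allocations.
```
# Storage actually allocated by torch.empty_strided((50257, 32768), (1, 50304)):
rows, cols, padded_rows = 50257, 32768, 50304
allocated = (cols - 1) * padded_rows + rows   # largest offset touched, plus one

# Elements a kernel would touch if it indexed padded_rows rows in every column:
touched = cols * padded_rows
print(touched - allocated)  # 47: reads in the last column run past the allocation

# For a contiguous output of shape (rows, N), writing padded_rows rows would
# overflow the buffer by (padded_rows - rows) * N elements.
N = 1024  # illustrative only
print((padded_rows - rows) * N)  # 48128 elements written out of bounds
```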
> Does the stride(1) give the correct element when moving down that direction? It looks like skipping 50304 elements is not going to give the same element on the next...
Here is the code (https://gist.github.com/shunting314/cb04b62434ddedac0cc1ad5f6685f5c5#file-t-py-L76) that creates the X tensor:
```
x = torch.empty_strided((50257, 32768), ((1, 50304)), dtype=torch.bfloat16, device='cuda')
```
You can see that the stride(1) is 50304. It's...
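For reference, here is a scaled-down sketch of what that layout means (small sizes so it runs anywhere; 7, 4 and 8 stand in for 50257, 32768 and 50304): element (i, j) lives at flat offset i * 1 + j * stride(1), so each column of valid elements is followed by a run of padding elements before the next column starts.
```
import torch

rows, cols, padded_rows = 7, 4, 8   # stand-ins for 50257, 32768, 50304
x = torch.empty_strided((rows, cols), (1, padded_rows), dtype=torch.bfloat16)
print(x.stride())                   # (1, 8): stride(1) skips padded_rows elements
print(x.is_contiguous())            # False
print(x.stride(1) - x.size(0))      # 1 padding element per column (47 in the repro)
```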
> So torch.empty_strided automatically adds padding when stride and shape don't match? Will the padding be all zeros? If so changing M to 50304 should be safe?

The padding can...
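A quick way to see what is in that gap (scaled-down sizes again; as_strided is only used here to view the whole underlying storage): torch.empty_strided, like torch.empty, returns uninitialized memory, so the padding holds whatever happened to be there unless it is zeroed explicitly.
```
import torch

rows, cols, padded_rows = 7, 4, 8
x = torch.empty_strided((rows, cols), (1, padded_rows), dtype=torch.float32)

# Flat view over the entire underlying storage of x.
flat = torch.as_strided(x, ((cols - 1) * padded_rows + rows,), (1,))
print(flat[rows:padded_rows])  # padding slot after the first column: not guaranteed to be zero

# Making reads of the padded rows well defined would require zeroing the
# storage explicitly, e.g. flat.zero_().
```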
@htyu changing
```
ram = tl.max_contiguous(tl.multiple_of(rm % M, BLOCK_M), BLOCK_M)
```
to
```
ram = rm % M
```
(i.e. removing the compiler hints) makes it work. Do you know...
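For context, here is a condensed sketch of where that line sits (not the full matmul kernel from the gist; the kernel name, the USE_HINTS switch, and the plain block copy are made up for illustration, and it assumes A has at least BLOCK_K columns and a grid of cdiv(M, BLOCK_M) programs):
```
import triton
import triton.language as tl

@triton.jit
def load_a_block(a_ptr, out_ptr, M, stride_am, stride_ak,
                 BLOCK_M: tl.constexpr, BLOCK_K: tl.constexpr,
                 USE_HINTS: tl.constexpr):
    pid_m = tl.program_id(0)
    rm = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    if USE_HINTS:
        # Promise the compiler that the row indices start at a multiple of
        # BLOCK_M and are contiguous for BLOCK_M elements, so it can emit
        # wide vectorized loads.
        ram = tl.max_contiguous(tl.multiple_of(rm % M, BLOCK_M), BLOCK_M)
    else:
        # The variant that makes the repro pass: plain modulo, no hints.
        ram = rm % M
    rk = tl.arange(0, BLOCK_K)
    a_ptrs = a_ptr + ram[:, None] * stride_am + rk[None, :] * stride_ak
    a = tl.load(a_ptrs)
    # Copy the block out so the load is not optimized away; out is a
    # contiguous buffer with cdiv(M, BLOCK_M) * BLOCK_M rows and BLOCK_K columns.
    out_ptrs = out_ptr + rm[:, None] * BLOCK_K + rk[None, :]
    tl.store(out_ptrs, a)
```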
> The cost would be to lose vectorization as the memory accesses are not considered aligned or contiguous.

But not doing padding would also lose vectorization since the tensor shape...
BTW, @htyu can you help me understand when it's safe to apply `tl.multiple_of` or `tl.max_contiguous` here?
```
ram = tl.max_contiguous(tl.multiple_of(rm % M, BLOCK_M), BLOCK_M)
```
I think we can apply...
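Here is a small numerical check of what those hints promise (nothing gist-specific; pid 785 is just the last block when BLOCK_M = 64): `rm % M` only starts at a multiple of BLOCK_M and stays contiguous for BLOCK_M elements when M itself is a multiple of BLOCK_M; otherwise the last block wraps around.
```
import torch

def block_row_indices(pid_m, M, BLOCK_M):
    rm = pid_m * BLOCK_M + torch.arange(BLOCK_M)
    return rm % M

BLOCK_M = 64

# M a multiple of BLOCK_M: the last block is a contiguous, BLOCK_M-aligned
# run, so the hints hold.
print(block_row_indices(785, 50304, BLOCK_M))  # 50240, 50241, ..., 50303

# M = 50257: the last block wraps around, so the indices are neither
# contiguous for BLOCK_M elements nor safe to annotate with
# tl.max_contiguous / tl.multiple_of.
print(block_row_indices(785, 50257, BLOCK_M))  # 50240, ..., 50256, 0, 1, ..., 46
```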