
Different configs will give different outputs for the same kernel

Open pyjhzwh opened this issue 2 years ago • 1 comments

repro: https://gist.github.com/pyjhzwh/a19de7882aff600ee4472398b3017758

kernel0 basically does a matmul, multiplies the result by 1.0, then stores it back to the output buffer. buf1 is the output buffer of kernel0 with the config (BLOCK_M=64, BLOCK_N=256, BLOCK_K=32, SPLIT_K=1, num_stages=4, num_warps=4). buf2 is the output buffer of kernel0 with the config (BLOCK_M=128, BLOCK_N=256, BLOCK_K=32, SPLIT_K=1, num_stages=3, num_warps=8).

But the buf1 and buf2 outputs do not match, and the only difference between the two runs is the configuration. buf2 should be the correct result. I feel that there is a bug in Triton. Could you take a look? Thanks!
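To make the expectation concrete, here is a minimal pure-Python sketch (not the Triton kernel from the gist) of why two tile configurations should agree. The names `tiled_matmul` and `max_abs_diff` are hypothetical helpers written for this illustration, and the matrix and block sizes are scaled down so it runs quickly on a CPU. Because each output element is accumulated over k in the same order regardless of how the M and N dimensions are tiled, changing only BLOCK_M should not change the result.

```python
import random


def tiled_matmul(a, b, block_m, block_n, block_k):
    """Reference tiled matmul on nested lists (illustrative, not Triton)."""
    m, k, n = len(a), len(b), len(b[0])
    out = [[0.0] * n for _ in range(m)]
    # Walk over output tiles, then over K tiles, like a blocked GEMM.
    for m0 in range(0, m, block_m):
        for n0 in range(0, n, block_n):
            for k0 in range(0, k, block_k):
                for i in range(m0, min(m0 + block_m, m)):
                    for j in range(n0, min(n0 + block_n, n)):
                        acc = out[i][j]
                        for kk in range(k0, min(k0 + block_k, k)):
                            acc += a[i][kk] * b[kk][j]
                        out[i][j] = acc
    return out


def max_abs_diff(x, y):
    """Largest element-wise absolute difference between two matrices."""
    return max(abs(p - q) for rx, ry in zip(x, y) for p, q in zip(rx, ry))


random.seed(0)
M, K, N = 16, 12, 20  # assumed small sizes, not the shapes from the repro
a = [[random.uniform(-1, 1) for _ in range(K)] for _ in range(M)]
b = [[random.uniform(-1, 1) for _ in range(N)] for _ in range(K)]

# Two configs that differ only in BLOCK_M, mirroring the 64 vs 128 case.
buf1 = tiled_matmul(a, b, block_m=4, block_n=8, block_k=4)
buf2 = tiled_matmul(a, b, block_m=8, block_n=8, block_k=4)

diff = max_abs_diff(buf1, buf2)
print("max abs diff between configs:", diff)
```

On a GPU, the same kind of check would compare the two real output buffers within a floating-point tolerance. A difference well beyond rounding error, as in the repro, points at the kernel or compiler rather than at the tiling itself.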

pyjhzwh avatar Aug 09 '22 20:08 pyjhzwh

You are right. This seems like a bug with non-TF32 float32 inputs (!). The results match with allow_tf32=True or by making the inputs float16. I'm sure it has to do with how Triton re-coalesces results before writing them back to DRAM when tensor cores aren't used.

ptillet avatar Aug 09 '22 20:08 ptillet