Different configs will give different outputs for the same kernel
repro: https://gist.github.com/pyjhzwh/a19de7882aff600ee4472398b3017758
kernel0 essentially does a matmul, multiplies the result by 1.0, and then stores it back to the output buffer.
buf1 is the output buffer of kernel0 with the config (BLOCK_M=64, BLOCK_N=256, BLOCK_K=32, SPLIT_K=1, num_stages=4, num_warps=4).
buf2 is the output buffer of kernel0 with the config (BLOCK_M=128, BLOCK_N=256, BLOCK_K=32, SPLIT_K=1, num_stages=3, num_warps=8).
However, buf1 and buf2 do not match, even though the only difference is the configuration. buf2 should be the correct result. I think this is a bug in Triton. Could you take a look? Thanks!
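For context, here is a rough CPU-only sketch of why the two buffers are expected to agree. It is not the kernel from the gist: `tiled_matmul` and `max_abs_diff` are hypothetical helpers written in plain Python, and the matrix and block sizes are scaled-down stand-ins chosen for illustration. The point is only that, with BLOCK_K and SPLIT_K held fixed, changing BLOCK_M or BLOCK_N changes which tile an output element belongs to, not how that element is accumulated.

```python
import random


def tiled_matmul(a, b, block_m, block_n, block_k):
    """Tiled matmul on lists of lists (CPU stand-in, not the Triton kernel)."""
    m, k, n = len(a), len(b), len(b[0])
    out = [[0.0] * n for _ in range(m)]
    # Outer loops pick an output tile, as a program instance would.
    for m0 in range(0, m, block_m):
        for n0 in range(0, n, block_n):
            for i in range(m0, min(m0 + block_m, m)):
                for j in range(n0, min(n0 + block_n, n)):
                    acc = 0.0
                    # Walk K in chunks of block_k and accumulate partial sums.
                    for k0 in range(0, k, block_k):
                        partial = 0.0
                        for kk in range(k0, min(k0 + block_k, k)):
                            partial += a[i][kk] * b[kk][j]
                        acc += partial
                    # Mirrors the "multiply the result by 1.0" step.
                    out[i][j] = acc * 1.0
    return out


def max_abs_diff(x, y):
    """Largest element-wise absolute difference between two matrices."""
    return max(abs(p - q) for rx, ry in zip(x, y) for p, q in zip(rx, ry))


random.seed(0)
M, K, N = 16, 64, 32  # assumed small sizes, not the shapes from the repro
a = [[random.uniform(-1, 1) for _ in range(K)] for _ in range(M)]
b = [[random.uniform(-1, 1) for _ in range(N)] for _ in range(K)]

# Two configs that differ only in block_m, like the two configs above.
buf1 = tiled_matmul(a, b, block_m=4, block_n=16, block_k=32)
buf2 = tiled_matmul(a, b, block_m=8, block_n=16, block_k=32)

diff = max_abs_diff(buf1, buf2)
print("max abs diff between configs:", diff)
```

On a real GPU the accumulation order inside a tile may differ between configs, so a small tolerance is more realistic than exact equality when comparing buf1 and buf2. A mismatch well beyond float32 rounding error is what would point to a bug rather than reordering.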
You are right. This seems to be a bug with non-TF32 float32 inputs (!). The results match with allow_tf32=True or when the inputs are float16. I'm sure it has to do with how Triton re-coalesces results before writing them back to DRAM when tensor cores aren't used.