[BACKEND] Consider mask alignment in memory coalescing
Mask alignment is considered during vectorization in ttgir lowering. This can be inconsistent with the memory coalescing pass, where mask alignment is not considered, and that inconsistency can lead to performance issues. I'm fixing this by having the coalescing pass also take mask alignment into consideration.
For example, the memory coalescing pass may decide that one thread should handle 8 consecutive elements, but the vectorization pass later rejects that because of the unaligned mask. The 8 consecutive elements then end up as 8 individual scalar loads, and each load accesses non-consecutive addresses across threads, breaking memory coalescing.
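For illustration, here is a minimal Triton kernel sketch (the kernel and its names are hypothetical, not taken from any affected workload) where this interaction can show up: the coalescing pass may assign 8 consecutive fp16 elements per thread, but if `n_elements` is not a multiple of that vector width, the mask is unaligned and the vectorizer falls back to scalar loads.

```python
import triton
import triton.language as tl

@triton.jit
def copy_kernel(x_ptr, y_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program handles BLOCK_SIZE contiguous elements.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    # If n_elements is not a multiple of the per-thread vector width chosen
    # by the coalescing pass, this mask is unaligned and the vectorizer
    # rejects the wide access, leaving scalar loads whose addresses are
    # strided across threads.
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask, other=0.0)
    tl.store(y_ptr + offsets, x, mask=mask)
```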
This tanks some of our kernels' perf somehow :(. Did you run it on the pytorch benchmarks? Are there no regressions on the workloads you have? If not, I'll have to try to figure out why it affects our kernels.
> This tanks some of our kernels' perf somehow :(. Did you run it on the pytorch benchmarks? Are there no regressions on the workloads you have? If not, I'll have to try to figure out why it affects our kernels.
Thanks for giving it a shot! Interesting. Yes, I did a torch inductor benchmark run; it improves one benchmark in particular but is neutral overall:
https://hud.pytorch.org/benchmark/dynamic/inductor_with_cudagraphs?startTime=Fri,%2022%20Mar%202024%2018:52:33%20GMT&stopTime=Fri,%2029%20Mar%202024%2018:52:33%20GMT&granularity=hour&mode=training&dtype=amp&lBranch=hoy/triton-coalescing-vec&lCommit=14e03e560931d6353c4e129ac3e39cb342438615&rBranch=main&rCommit=a697d972b16c5c9c938f7422e1cb2dcec129ca97
I also see a little regression (about 4%) on one small kernel, as discussed here: https://github.com/pytorch/pytorch/issues/122840#issuecomment-2027545288. I'm still figuring it out. Codegen looks better though.
How much regression did you see?
> This tanks some of our kernels' perf somehow :(. Did you run it on the pytorch benchmarks? Are there no regressions on the workloads you have? If not, I'll have to try to figure out why it affects our kernels.
>
> Thanks for giving it a shot! Interesting. Yes, I did a torch inductor benchmark run; it improves one benchmark in particular but is neutral overall:
> https://hud.pytorch.org/benchmark/dynamic/inductor_with_cudagraphs?startTime=Fri,%2022%20Mar%202024%2018:52:33%20GMT&stopTime=Fri,%2029%20Mar%202024%2018:52:33%20GMT&granularity=hour&mode=training&dtype=amp&lBranch=hoy/triton-coalescing-vec&lCommit=14e03e560931d6353c4e129ac3e39cb342438615&rBranch=main&rCommit=a697d972b16c5c9c938f7422e1cb2dcec129ca97
> I also see a little regression (about 4%) on one small kernel, as discussed here: pytorch/pytorch#122840 (comment). I'm still figuring it out. Codegen looks better though.
Ok, it would be interesting to track the regressions, as I don't understand why those happen.
> How much regression did you see?
The regression was very large, > 50%, which points to something bad like spilling or similar. I need to look at the IR, but I probably won't get to it right away.
> This tanks some of our kernels' perf somehow :(. Did you run it on the pytorch benchmarks? Are there no regressions on the workloads you have? If not, I'll have to try to figure out why it affects our kernels.
>
> Thanks for giving it a shot! Interesting. Yes, I did a torch inductor benchmark run; it improves one benchmark in particular but is neutral overall: https://hud.pytorch.org/benchmark/dynamic/inductor_with_cudagraphs?startTime=Fri,%2022%20Mar%202024%2018:52:33%20GMT&stopTime=Fri,%2029%20Mar%202024%2018:52:33%20GMT&granularity=hour&mode=training&dtype=amp&lBranch=hoy/triton-coalescing-vec&lCommit=14e03e560931d6353c4e129ac3e39cb342438615&rBranch=main&rCommit=a697d972b16c5c9c938f7422e1cb2dcec129ca97 I also see a little regression (about 4%) on one small kernel, as discussed here: pytorch/pytorch#122840 (comment). I'm still figuring it out. Codegen looks better though.
>
> Ok, it would be interesting to track the regressions, as I don't understand why those happen.
So in my kernel there are two loads and one store. Since the store does not access aligned addresses, it is never vectorized, so its behavior is the same before and after this diff. The only difference is in the two loads, which have the same access pattern; from NCU I see that the L1 cache miss rate is dramatically higher for the new version.
Let's look at one of them:
Before:

```mlir
#blocked = #triton_gpu.blocked<{sizePerThread = [1, 8], threadsPerWarp = [1, 32], warpsPerCTA = [1, 8], order = [1, 0]}>
%25 = tt.load %24, %22, %cst_1 {cache = 1 : i32, evict = 3 : i32, isVolatile = false} : tensor<1x2048xf16, #blocked> loc(#loc28)
```

After:

```mlir
#blocked = #triton_gpu.blocked<{sizePerThread = [1, 1], threadsPerWarp = [1, 32], warpsPerCTA = [1, 8], order = [1, 0]}>
%23 = tt.load %22, %20, %cst {cache = 1 : i32, evict = 3 : i32, isVolatile = false} : tensor<1x2048xf16, #blocked> loc(#loc10)
```
I suspect that, since this is an fp16 load, loading 2 bytes per thread consecutively causes a 64-byte memory transaction to be issued each time for a warp. On the contrary, loading 2 bytes per thread in a strided way causes a 128-byte memory transaction to be issued, which loses memory coalescing and is exactly what this patch fixes. However, the 64 bytes out of the 128 that are not used by the current load can be reused by the next load instruction, if we are lucky enough that the cache is not polluted. I guess that's what happened here. In the first case, the next load instruction would not benefit from caching.
I hacked the compiler to generate a sizePerThread = [1, 2] encoding and the performance was back.
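To make the arithmetic behind this theory concrete, here is a small standalone Python sketch (not part of the patch; the 32-byte sector size and 32-thread warp are assumptions about the target GPU) that estimates the per-warp footprint of the first scalar load under the three layouts discussed above:

```python
# Estimate the per-warp footprint of the first scalar load issued under a
# given sizePerThread, assuming fp16 elements and a contiguous row.
ELEM_BYTES = 2          # fp16
THREADS_PER_WARP = 32
SECTOR_BYTES = 32       # assumed L2 sector size

def first_load_footprint(size_per_thread):
    # Thread t owns elements [t * size_per_thread, (t + 1) * size_per_thread);
    # without vectorization, its first scalar load touches the first of them.
    addrs = [t * size_per_thread * ELEM_BYTES for t in range(THREADS_PER_WARP)]
    sectors = {a // SECTOR_BYTES for a in addrs}
    return len(sectors) * SECTOR_BYTES

for spt in (8, 1, 2):
    print(f"sizePerThread = [1, {spt}]: ~{first_load_footprint(spt)} bytes per warp")
# [1, 8]: 512 bytes pulled in for 64 useful bytes, but the rest can serve the
#         remaining 7 scalar loads if it stays in cache.
# [1, 1]: fully coalesced, 64 bytes per warp.
# [1, 2]: 128 bytes per warp, half of which feeds each thread's second load.
```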
Does the theory sound reasonable to you, especially for the first case, where a 64-byte instead of a 128-byte memory transaction is used?
@ThomasRaoux, is there a way to share the benchmarks regressed by this patch so I can investigate? We think this could be a general fix and would like to keep it going. Thanks!
> @ThomasRaoux, is there a way to share the benchmarks regressed by this patch so I can investigate? We think this could be a general fix and would like to keep it going. Thanks!
Unfortunately I cannot do that.
I do think it is a good patch in general, but it would be great if you could fix the regression in the torch benchmarks; then I can check if it fixes our internal workload.
> I do think it is a good patch in general, but it would be great if you could fix the regression in the torch benchmarks; then I can check if it fixes our internal workload.
Actually, the latest version fixes the torch bench regression by ensuring a 32-byte memory transaction for loads even if they are not vectorizable. The loaded data that is not used by the current load instruction will be in cache for the next load instructions. This is motivated by the higher cache miss rate seen in the NCU profile of the regressed test run.
> I do think it is a good patch in general, but it would be great if you could fix the regression in the torch benchmarks; then I can check if it fixes our internal workload.
>
> Actually, the latest version fixes the torch bench regression by ensuring a 32-byte memory transaction for loads even if they are not vectorizable. The loaded data that is not used by the current load instruction will be in cache for the next load instructions. This is motivated by the higher cache miss rate seen in the NCU profile of the regressed test run.
Ok, thanks. I'll give running that on our benchmarks another shot. It might only be next week though.
> I suspect that, since this is an fp16 load, loading 2 bytes per thread consecutively causes a 64-byte memory transaction to be issued each time for a warp. On the contrary, loading 2 bytes per thread in a strided way causes a 128-byte memory transaction to be issued, which loses memory coalescing and is exactly what this patch fixes. However, the 64 bytes out of the 128 that are not used by the current load can be reused by the next load instruction, if we are lucky enough that the cache is not polluted. I guess that's what happened here. In the first case, the next load instruction would not benefit from caching.
So the load is vectorized but not coalesced prior to the fix. The initial fix has sizePerThread = [1, 1], which means 2 bytes per thread and 64 bytes per warp. Prior to the fix, each thread accesses 16 bytes in a vectorized way?
> Actually, the latest version fixes the torch bench regression by ensuring a 32-byte memory transaction for loads even if they are not vectorizable. The loaded data that is not used by the current load instruction will be in cache for the next load instructions.
So the latest fix will generate sizePerThread=[1,2]?
> So the load is vectorized but not coalesced prior to the fix. The initial fix has sizePerThread = [1, 1], which means 2 bytes per thread and 64 bytes per warp. Prior to the fix, each thread accesses 16 bytes in a vectorized way?
Before the fix, each thread still accessed 2 bytes at a time because of the unaligned mask, and each thread did eight consecutive loads. This means neighboring threads did not load consecutive bytes, which appeared to waste a 128-byte memory transaction. However, the performance wasn't bad, and was actually better. I suspect the memory transaction was still beneficial, not for this round of loads but for the next round, as the data could be cached.
> So the latest fix will generate sizePerThread=[1,2]?
Yes.
@ThomasRaoux @htyu any news on this?
Sorry, looks like I dropped the ball on trying this on our benchmarks. Let me try that later today.
I still see functional and perf failures in our internal tests. Is this expected? Unfortunately, I didn't have time to check the reason for those.
> I still see functional and perf failures in our internal tests. Is this expected? Unfortunately, I didn't have time to check the reason for those.
Thanks for giving it a shot! Hmm, correctness issues are not expected. I'll see if I can repro it on my side.
closing this PR due to no activity. Feel free to reopen.