[QST] possible performance bug due to disabling inlining for ldmatrix
I recently noticed that inlining for the ldmatrix wrappers in ldsm.h was disabled in this PR: https://github.com/tile-ai/tilelang/pull/227. Whenever ldmatrix from ldsm.h is invoked, nvcc is no longer able to keep the loaded matrices in registers and spills them to local memory, which can cause a significant slowdown. When tilelang calls ldmatrix through cute this does not happen, since that path is inlined. What was the reason for disabling inlining for the ldmatrix instructions in that PR?
Specifically, this is what happens when ldmatrix is called through the tilelang wrapper (non-inlined):
ldmatrix.sync.aligned.x4.trans.m8n8.shared.b16 {%r1, %r2, %r3, %r4}, [%r5];
st.local.u32 [%rd3], %r1;
st.local.u32 [%rd3+4], %r2;
st.local.u32 [%rd3+8], %r3;
st.local.u32 [%rd3+12], %r4;
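For context, here is a minimal standalone sketch of the kind of wrapper I am talking about. The function name and signature are my own illustration, not the exact ldsm.h code; the point is only that when nvcc emits a real call instead of inlining, the pointer outputs have to live in addressable local memory, which is where the st.local stores above come from. Compiling with nvcc -Xptxas -v should also show the per-thread local memory usage.

#include <cstdint>

// Illustrative wrapper only, not the exact ldsm.h implementation. With
// __forceinline__ the four results stay in registers; if nvcc instead emits a
// real call, the pointer outputs must be addressable, which is what produces
// the st.local stores shown above.
__device__ __forceinline__ void ldmatrix_x4_trans(uint32_t *r0, uint32_t *r1,
                                                  uint32_t *r2, uint32_t *r3,
                                                  void const *smem_ptr) {
  // Convert the generic shared-memory pointer to a 32-bit shared-space address.
  uint32_t addr = static_cast<uint32_t>(__cvta_generic_to_shared(smem_ptr));
  asm volatile(
      "ldmatrix.sync.aligned.x4.trans.m8n8.shared.b16 {%0, %1, %2, %3}, [%4];\n"
      : "=r"(*r0), "=r"(*r1), "=r"(*r2), "=r"(*r3)
      : "r"(addr));
}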
Thanks in advance!
@davidpissarra Sorry for the late reply — this is a really interesting issue.
We disabled ldmatrix inlining because it looks like there’s a bug in nvcc. When we inline ldmatrix_x4, in some cases the compiler applies an incorrect optimization.
For example, the intended code is:
uint32_t reg[32];
ldmatrix_x4(&reg[0], &reg[8], &reg[16], &reg[24]);
However, under certain circumstances nvcc will optimize this to:
uint32_t reg[4];
ldmatrix_x4(&reg[0], &reg[1], &reg[2], &reg[3]);
This incorrect transformation changes the register layout, which can break the kernel's correctness.
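To make the pattern concrete, here is a minimal sketch of the intended strided-destination layout (the function and parameter names are illustrative, not our actual generated code):

#include <cstdint>

// One x4 load is meant to fill four slots that are 8 entries apart in a 32-entry
// fragment array, so later mma steps can index reg[i], reg[i + 8], reg[i + 16],
// reg[i + 24]. Rewriting the destinations to a 4-entry reg[0..3], as in the
// miscompilation described above, silently changes this layout.
__device__ void load_fragments(uint32_t (&reg)[32], void const *smem_base) {
  uint32_t addr = static_cast<uint32_t>(__cvta_generic_to_shared(smem_base));
  asm volatile(
      "ldmatrix.sync.aligned.x4.trans.m8n8.shared.b16 {%0, %1, %2, %3}, [%4];\n"
      : "=r"(reg[0]), "=r"(reg[8]), "=r"(reg[16]), "=r"(reg[24])
      : "r"(addr));
}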
This only happens on kernels with patterns that cute does not generate, which is why the cute path is unaffected.
The problem is interesting, though. @yyttt6, would you mind running a benchmark on a GEMM RS kernel to check whether we can observe the local memory reads and writes?
Hi @davidpissarra, thank you very much for raising this issue. Here are my experimental results:
In my experiments, I did not observe any difference in local memory accesses when switching between the non-inlined and the force-inlined ldmatrix versions. However, I did notice a small performance improvement of about 0.1% with force inlining.
[Setup]
GPU: RTX 4090
Benchmark: python example_gemm.py
Problem size: m = n = k = 1024
Tile size: block_m = block_n = block_k = 128
Could you please share more details on how we can reproduce the local memory access issue?
Thanks again!
Thanks @davidpissarra, you're right, and the spill to local memory may lead to some kernel performance regression. We have now restored the inlined code. Closing :)