Load with ldmatrix

Open liyanc opened this issue 9 months ago • 2 comments

Hello,

I'm curious whether the implementation uses the ldmatrix instruction to load tiles from shared memory into registers. It seems the current version does not implement load() with an explicit ldmatrix, per https://github.com/HazyResearch/ThunderKittens/blob/a562ed2569c45b0ffea844688594158cb7c6e858/src/ops/group/memory/tile/shared_to_register.cuh#L27. Will nvcc compile the function to ldmatrix, or do the authors intend to add ldmatrix in a future step?

liyanc, May 16 '24 01:05

Neither? We didn't really think it was worth dealing explicitly with the shared memory layout implied by ldmatrix/stmatrix, and doing it directly with swizzling seemed fast enough. So at the moment, there are no plans to add it. But if we're wrong on this point and it would meaningfully unlock more performance, we could be persuaded.

benjaminfspector, May 16 '24 02:05

With the .x4 qualifier, ldmatrix loads a 16x16 tile of 16-bit elements (four 8x8 matrices) with a single instruction, whereas LDS.32 requires 4 instructions for the same data. ldmatrix also offers an optional transpose (.trans).

luliyucoordinate, May 16 '24 13:05