Benjamin Spector
@amirakazadeh I had the same issue for a moment when my indenting was wrong. (The indenting provided in the previous answer does not match that of the repository.) Perhaps you...
@tbenthompson and I were just talking about another alternative which could work. Essentially, the idea would be to create a third, outermost loop in which one samples a tractable column...
Ugh, the annoying thing about FP8 is that the transpose instructions don't work for 8-bit types -- IDK why NVIDIA went only halfway on adding 8-bit hardware support. So FP8...
Sorry for the delay. TL;DR is that ThunderKittens as-is will NOT run on pre-SM80 hardware, but it would not be very hard to modify it to support down to SM75 with a...
The reason to do it would be if you happen to really like the TK programming model of working with tiles. But there are no tensor cores, so the MMA...
@ethxnp https://github.com/HazyResearch/ThunderKittens/blob/fp8/fp8-todo.md (This is on a very old branch, but I updated the todo just now so that it is correct for the current main branch. I would definitely fork...
Reverse engineering however CUTLASS handles it is definitely the right call; I'm sure they do something sensible, and frankly that was our only hope of getting the WGMMA swizzling...
neither? we didn't really think it was worth dealing explicitly with the shared memory layout implied by ldmatrix/stmatrix, and doing it directly with swizzling seemed fast enough. So at the...
That is on the eventual to-do list!
looping in @Aaryan0404 for this