
Tile primitives for speedy kernels

82 ThunderKittens issues

- What's the key improvement of TK compared to TensorRT? Will TK provide easy-to-use interfaces such as a Python wrapper?

I really like the simplicity of TK and think it could be broadly applicable to kernel authoring beyond attention. Has there been any benchmarking done of pure GEMM operations? If...
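A minimal sketch of the kind of pure-GEMM benchmark being asked about. NumPy's `matmul` stands in for the kernel under test; a real comparison would call the TK (or cuBLAS) GEMM at the marked spot instead. The function name `bench_gemm` and the problem sizes are illustrative, not part of TK.

```python
import time
import numpy as np

def bench_gemm(m, n, k, iters=10):
    """Time a pure GEMM and report achieved GFLOP/s.

    NumPy matmul is a stand-in; swap in the kernel under test here.
    """
    a = np.random.rand(m, k).astype(np.float32)
    b = np.random.rand(k, n).astype(np.float32)
    a @ b  # warm-up run, excluded from timing
    start = time.perf_counter()
    for _ in range(iters):
        a @ b
    elapsed = time.perf_counter() - start
    flops = 2.0 * m * n * k * iters  # one multiply + one add per MAC
    return flops / elapsed / 1e9

print(f"{bench_gemm(512, 512, 512):.1f} GFLOP/s")
```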

I've been working on porting FlashAttention-2 to pre-SM80 architectures (Turing and Volta) and was wondering if TK supports SM70 and SM75 hardware. Writing 100 lines of TK primitives sounds a...

Hello, I'm curious whether the implementation uses the `ldmatrix` instruction for loading tiles from shared memory to registers. It seems the current version doesn't implement `load()` with explicit `ldmatrix` per...

Most recent models use hdim=128; it would be great if ThunderKittens also supported that. https://github.com/HazyResearch/ThunderKittens/blob/a562ed2569c45b0ffea844688594158cb7c6e858/examples/attn/h100/h100_train_atn.py#L25-L26

Using the same random seed, the results of the TK H100 attn_causal kernel vary with each run. In some cases, the max diff between the TK and PyTorch results can be larger...
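Run-to-run variation plus reduced-precision arithmetic means a kernel-vs-reference comparison needs a tolerance rather than exact equality. A minimal NumPy sketch of the kind of max-diff check described above, comparing a float32 attention computation against a float64 reference (the `attention` helper is illustrative, not the TK or PyTorch API):

```python
import numpy as np

def attention(q, k, v):
    """Reference scaled-dot-product attention (no causal mask)."""
    s = q @ k.T / np.sqrt(q.shape[-1])
    p = np.exp(s - s.max(axis=-1, keepdims=True))  # stable softmax
    p /= p.sum(axis=-1, keepdims=True)
    return p @ v

rng = np.random.default_rng(0)  # fixed seed for reproducibility
q, k, v = (rng.standard_normal((64, 64)) for _ in range(3))

ref = attention(q, k, v)  # float64 reference
approx = attention(q.astype(np.float32), k.astype(np.float32),
                   v.astype(np.float32))  # reduced-precision run

print("max diff:", np.abs(ref - approx).max())
```

With a real fp16/bf16 kernel the max diff is far larger than this float32 example, which is why issue reports usually quote it relative to an expected tolerance.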

AFAIK https://github.com/Dao-AILab/flash-attention/ did not have the bandwidth to support custom `attn_bias` (needed for relpos) - I think it's supported for the Triton version there, but I saw reports that it's...
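For context, a custom `attn_bias` for relative positions is an additive term on the attention scores before the softmax. A minimal NumPy sketch of the idea (the names `attention_with_bias` and `rel_table` are illustrative; this is not the FlashAttention or TK API):

```python
import numpy as np

def attention_with_bias(q, k, v, bias):
    """Scaled-dot-product attention with an additive score bias.

    `bias` has shape (seq_q, seq_k); for relative positions it is
    built by indexing a learned table with the offset (i - j).
    """
    s = q @ k.T / np.sqrt(q.shape[-1]) + bias
    p = np.exp(s - s.max(axis=-1, keepdims=True))  # stable softmax
    p /= p.sum(axis=-1, keepdims=True)
    return p @ v

seq, d = 8, 16
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((seq, d)) for _ in range(3))

rel_table = rng.standard_normal(2 * seq - 1)             # one entry per offset
idx = np.arange(seq)[:, None] - np.arange(seq)[None, :]  # offset i - j
bias = rel_table[idx + seq - 1]                          # shift into table range

out = attention_with_bias(q, k, v, bias)
print(out.shape)  # (8, 16)
```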