LiYu Lu
@HaydenFaulkner I have the same problem. Have you solved it?
@microsoft-github-policy-service agree
How long does training take on 20 million texts? When I tried to reproduce it myself, it felt like BPE would take forever to run QAQ
I provided a [simple GEMM implementation](https://github.com/HazyResearch/ThunderKittens/pull/28), but a more optimized GEMM requires support for ldmatrix and pipelining, neither of which I have implemented yet.
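ldmatrix lets a warp load a 16x16 tile of 16-bit elements (four 8x8 matrices, using the `.x4` form) from shared memory with a single instruction, whereas the same load takes 4 LDS.32 instructions per thread. ldmatrix also offers an optional transpose (the `.trans` qualifier).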
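To make the ldmatrix comparison concrete, here is a small pure-Python model of which elements each thread ends up holding after one `ldmatrix.sync.aligned.m8n8.x4.shared.b16`. It is only a sketch of the register layout as I understand it from the PTX documentation, not code from the PR: the function name `ldmatrix_x4`, the `(row, col)` address encoding, and the example addressing pattern are all made up for illustration.

```python
def ldmatrix_x4(tile, row_ptrs, trans=False):
    """Model one warp-wide ldmatrix .x4 load of 16-bit elements.

    tile     : 2D list standing in for a shared-memory tile.
    row_ptrs : 32 (row, col) pairs, one per thread (hypothetical encoding).
               Threads 8*i .. 8*i+7 give the 8 row starts of matrix i.
    trans    : model the optional .trans qualifier.
    Returns regs[t][i] = the two 16-bit elements packed into the i-th
    32-bit destination register of thread t.
    """
    regs = [[None] * 4 for _ in range(32)]
    for i in range(4):
        # Gather 8x8 matrix i from the rows addressed by threads 8*i .. 8*i+7.
        mat = [tile[row][col:col + 8] for row, col in row_ptrs[8 * i:8 * i + 8]]
        if trans:
            mat = [list(column) for column in zip(*mat)]
        for t in range(32):
            # Thread t receives row t // 4, columns 2*(t % 4) and 2*(t % 4) + 1.
            r, c = t // 4, (t % 4) * 2
            regs[t][i] = (mat[r][c], mat[r][c + 1])
    return regs


# A 16x16 tile whose element at (r, c) is r * 16 + c, so values identify positions.
tile = [[r * 16 + c for c in range(16)] for r in range(16)]

# Assumed addressing: threads 0-15 point at rows 0-15 of the left 8 columns,
# threads 16-31 at rows 0-15 of the right 8 columns.
row_ptrs = [(t % 16, (t // 16) * 8) for t in range(32)]

regs = ldmatrix_x4(tile, row_ptrs)
regs_t = ldmatrix_x4(tile, row_ptrs, trans=True)
print(regs[0])  # the four registers held by thread 0
```

Each thread gets 4 registers of 2 elements each, so 32 threads cover all 256 elements of the tile in one instruction. With plain 32-bit shared loads, every thread would have to issue one load per register, which is where the 4 LDS.32 figure comes from.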