LiYu Lu

Results 6 comments of LiYu Lu

@HaydenFaulkner I have the same problem. Have you solved it?

@microsoft-github-policy-service agree

How long does it take to train on 20 million texts? When I tried to reproduce it myself, BPE seemed to take forever QAQ

I provided a [simple GEMM implementation](https://github.com/HazyResearch/ThunderKittens/pull/28), but a more optimized GEMM requires support for ldmatrix and software pipelining, neither of which I have implemented yet.

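ldmatrix can load a 16x16 tile of 16-bit elements from shared memory with a single instruction (the `.x4` form, which loads four 8x8 matrices), whereas the same load takes four LDS.32 instructions per thread. ldmatrix can also transpose each 8x8 matrix as it loads, via the `.trans` qualifier.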
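
To make the register count concrete, here is a rough host-side C++ model of which elements each lane of a warp ends up holding after an `ldmatrix ... .x4 ... .b16` load. It is only a sketch, not device code: `ldmatrix_x4_model`, `make_tile`, and `pack` are hypothetical helper names, and the placement of the four 8x8 sub-tiles inside the 16x16 tile is an assumption (on real hardware the caller chooses it through the row addresses each lane supplies).

```cpp
#include <array>
#include <cstdint>

// Host-side model of the fragment layout produced by
// ldmatrix.sync.aligned.m8n8.x4.shared.b16 (illustration only, not device code).
constexpr int kTile = 16;
using Tile = std::array<std::array<uint16_t, kTile>, kTile>;

// Pack two 16-bit elements into one 32-bit register, low half first.
inline uint32_t pack(uint16_t lo, uint16_t hi) {
    return static_cast<uint32_t>(lo) | (static_cast<uint32_t>(hi) << 16);
}

// Hypothetical helper: fill the tile so that element (r, c) holds r * 16 + c.
inline Tile make_tile() {
    Tile t{};
    for (int r = 0; r < kTile; ++r)
        for (int c = 0; c < kTile; ++c)
            t[r][c] = static_cast<uint16_t>(r * kTile + c);
    return t;
}

// Returns the four 32-bit registers that `lane` (0..31) holds after the load.
// Assumption: 8x8 matrix j starts at row (j % 2) * 8, column (j / 2) * 8.
inline std::array<uint32_t, 4> ldmatrix_x4_model(const Tile& t, int lane, bool trans) {
    std::array<uint32_t, 4> regs{};
    const int a = lane / 4;        // row of the 8x8 fragment owned by this lane
    const int b = (lane % 4) * 2;  // first of two adjacent columns in that row
    for (int j = 0; j < 4; ++j) {
        const int r0 = (j % 2) * 8;
        const int c0 = (j / 2) * 8;
        if (!trans) {
            // Plain load: the lane holds elements (a, b) and (a, b + 1).
            regs[j] = pack(t[r0 + a][c0 + b], t[r0 + a][c0 + b + 1]);
        } else {
            // .trans: the lane holds (a, b) and (a, b + 1) of the transposed
            // matrix, i.e. (b, a) and (b + 1, a) of the original.
            regs[j] = pack(t[r0 + b][c0 + a], t[r0 + b + 1][c0 + a]);
        }
    }
    return regs;
}
```

Each lane holds 4 registers of 2 elements each, so 32 lanes x 4 x 2 = 256 elements, which is the full 16x16 tile. Without ldmatrix, every lane would issue one LDS.32 per register, which is where the four instructions come from.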