ziyu huang

Results 14 issues of ziyu huang

Hi! I have written code for sliced-K in GEMM, but it seems very slow. I tried to understand CUTLASS's sliced-K but could not, so I am posting my code here...

question

Hi! I am learning 'tall' matmul and find it **hard to find the code** describing how sliced-K reduces the values.... I think each warp will calculate 32*64 values (each...

question

Hi! I am learning CUTLASS, and I see something like: (from official post) ```C++ /// CUTLASS SGEMM example __global__ void gemm_kernel( float *C, float const *A, float...

question
inactive-30d

Hi! I am learning CUTLASS, and I read this post: [CUTLASS: Fast Linear Algebra in CUDA C++ | NVIDIA Technical Blog](https://developer.nvidia.com/blog/cutlass-linear-algebra-cuda/) But I cannot find the official "dispatch_policies.h", only find...

question
inactive-30d

Hi! I am using Windows, and I have heard that CUTLASS can be used on Windows. Could you write some documentation to guide us? Thank you!!!

documentation
inactive-30d

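Hello, Teacher Fan! I have finished reading your book, and I am now using CUDA to develop a high-performance algorithm that can run faster than one of NVIDIA's official libraries. Do you have an HPC discussion group, something like a readers' group? I am a master's student at a Project 985 university, so I should be able to contribute to the community as well. Many thanks!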

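![8967D61FB388618CD624DFBA2BE54F35](https://user-images.githubusercontent.com/65449458/157788762-78c90707-5455-4407-8b83-d55cca668c89.png) The FMA here is only described as d=a*b+c; how exactly is it calculated? Since a formula like this might be cited in papers, perhaps the next edition could add a short introduction, for example a dedicated chapter on performance metrics for CUDA programs, with the corresponding formulas and tools. The book is very well written; thanks to the author!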

- OS: win10
- PyTorch version: 1.10
- How you installed PyTorch (conda, pip, source): pip
- Python version: 3.9
- CUDA/cuDNN version: 11.3
- GPU models and configuration: 1650...

Hi! I am wondering how to debug in such an environment. I tried inserting a "printf("hello wolrd")" statement in the .cu file, but it fails to compile! If I delete it,...

Hi! Thank you for this repo! It is very helpful to me!!! I have a question: in the wiki, the last comparison figure shows Max64-8 and Max64-16. I...