ziyu huang
Hi! I have written code for sliced-K in GEMM, but it seems very slow. I tried to understand CUTLASS's sliced-K implementation, but I cannot follow it, so I am posting my code here...
Hi! I am learning 'tall' matmul and find it **hard to find the code** describing how sliced-K reduces the values. I think each warp will calculate 32*64 values (each...
Hi! I am learning CUTLASS, and I see something like this (from the official post): ```C++ /// CUTLASS SGEMM example __global__ void gemm_kernel( float *C, float const *A, float...
Hi! I am learning CUTLASS, and I read this post: [CUTLASS: Fast Linear Algebra in CUDA C++ | NVIDIA Technical Blog](https://developer.nvidia.com/blog/cutlass-linear-algebra-cuda/). But I cannot find the official “dispatch_policies.h”; I can only find...
Hi! I am using Windows, and I have heard that CUTLASS can be used on Windows. Could you write some documentation to guide us? Thank you!
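Hello, Professor Fan! I have finished reading your book and am now using CUDA to develop a high-performance algorithm that can run faster than one of NVIDIA's official libraries. Do you have an HPC discussion group, such as a readers' group? I am a master's student at a 985 university and should be able to contribute to the community. Many thanks!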
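The FMA here is only described as d=a*b+c; how exactly is it computed? Since a formula like this might be cited in a paper, perhaps the next edition could add a short introduction, for example a dedicated chapter on performance metrics for CUDA programs, with the corresponding formulas and tools. The book is very well written; thanks to the author!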
- OS: win10 - PyTorch version: 1.10 - How you installed PyTorch (conda, pip, source): pip - Python version: 3.9 - CUDA/cuDNN version: 11.3 - GPU models and configuration: 1650...
Hi! I am wondering how to debug in such an environment. I tried to insert a `printf("hello wolrd")` statement in the .cu file, but then compilation fails! If I delete it,...
Hi! Thank you for this repo! It is very helpful to me! I have a question: in the wiki, the last comparison figure shows Max64-8 or Max64-16. I...