Haicheng Wu
BTW, if you want to run batched GEMM for a transformer model, Grouped GEMM may be more useful to you. Check https://github.com/NVIDIA/cutlass/tree/master/examples/24_gemm_grouped . Grouped GEMM is not runnable in the...
36 is not a multiple of 8. The kernel instantiated in the example requires M to be a multiple of 8. You can change the alignment to run M = 36. Or you can...
- alignmentA is set at https://github.com/NVIDIA/cutlass/blob/master/include/cutlass/gemm/kernel/default_gemm_grouped.h#L73
- alignmentB is set at https://github.com/NVIDIA/cutlass/blob/master/include/cutlass/gemm/kernel/default_gemm_grouped.h#L81
- alignmentC is set at https://github.com/NVIDIA/cutlass/blob/master/include/cutlass/epilogue/thread/linear_combination.h#L58

Both A and C need to be changed to 4 if M is 36 and the layouts are col...
Back to your original batched GEMM profiling problem: you can use this CMake command

```
cmake .. -DCUTLASS_NVCC_ARCHS=80 -DCUTLASS_LIBRARY_KERNELS=sgemm
```

to build and run just the FP32 GEMM kernels on A100.
We actually tried `CUDA_VERSION`, but it caused issues for us. We do not recommend mixing different versions of CUDA and nvcc, especially when the major version numbers differ. They are...
I honestly don't remember. We need to support the big matrix of different versions of {nvcc, nvrtc, clang, msvc, nvc++} x different GPUs. One or a few of the combinations have...
The same as cuBLAS, your case is not supported by CUTLASS without code changes. However, you can use a semaphore to control the order of stores to global memory. This...
> Am I right, that the only reason I get wrong answer are the threadblocks using the memory that should have been locked? If I introduce a proper timing between...
Congratulations! If possible, maybe you can use `blockIdx.z` to decide the write order. When you finish your code, you can first make your repository public. We can help guide people...
not in 2.6 :smile: