CUDA_gemm
A simple, high-performance CUDA implementation of GEMM, block-sparse GEMM, and non-uniform quantized GEMM.
I believe the boundary checks for matrices A and B are using the bounds of C: https://github.com/Cjkkkk/CUDA_gemm/blob/14b517370609d322647c55fe9136b6d81c2ba9a7/src/cuda/dense.cu#L107 https://github.com/Cjkkkk/CUDA_gemm/blob/14b517370609d322647c55fe9136b6d81c2ba9a7/src/cuda/dense.cu#L125
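To illustrate the concern, here is a minimal host-side C++ sketch of a tiled GEMM in which each matrix is guarded by its own dimensions. It is not the code from `dense.cu`. It assumes row-major storage with A of shape M×K, B of shape K×N, and C of shape M×N, and the names `load_a`, `load_b`, and `gemm_tiled` are hypothetical. The point is that a load from A should be checked against (M, K) and a load from B against (K, N); only the store to C should be checked against (M, N).

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: guarded load from A (M x K, row-major).
// Out-of-range elements read as zero so partial tiles stay correct.
inline float load_a(const std::vector<float>& A, int M, int K, int row, int k) {
    return (row < M && k < K) ? A[static_cast<std::size_t>(row) * K + k] : 0.0f;
}

// Hypothetical helper: guarded load from B (K x N, row-major).
inline float load_b(const std::vector<float>& B, int K, int N, int k, int col) {
    return (k < K && col < N) ? B[static_cast<std::size_t>(k) * N + col] : 0.0f;
}

// CPU loop nest that mimics the tile structure of a GPU kernel:
// the (row0, col0) loops play the role of thread blocks, and the
// k0 loop walks tiles along the shared K dimension.
std::vector<float> gemm_tiled(const std::vector<float>& A,
                              const std::vector<float>& B,
                              int M, int N, int K, int tile = 4) {
    std::vector<float> C(static_cast<std::size_t>(M) * N, 0.0f);
    for (int row0 = 0; row0 < M; row0 += tile) {
        for (int col0 = 0; col0 < N; col0 += tile) {
            for (int k0 = 0; k0 < K; k0 += tile) {
                for (int r = 0; r < tile; ++r) {
                    for (int c = 0; c < tile; ++c) {
                        const int row = row0 + r;
                        const int col = col0 + c;
                        // Only the store to C is guarded by (M, N).
                        if (row >= M || col >= N) continue;
                        float acc = 0.0f;
                        for (int kk = 0; kk < tile; ++kk) {
                            const int k = k0 + kk;
                            // A is guarded by (M, K), B by (K, N).
                            acc += load_a(A, M, K, row, k) * load_b(B, K, N, k, col);
                        }
                        C[static_cast<std::size_t>(row) * N + col] += acc;
                    }
                }
            }
        }
    }
    return C;
}
```

With shapes that are not multiples of the tile size (for example M=2, N=3, K=5 and a tile of 4), guarding the A and B loads with C's bounds could either read out of range along K or skip valid elements, which is why the checks are kept separate here.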
Added CMake compilation options to the project.
See https://github.com/Cjkkkk/CUDA_gemm/issues/6.