Optimizing-SGEMM-on-NVIDIA-Turing-GPUs
Optimizing SGEMM kernel functions on NVIDIA GPUs to a close-to-cuBLAS performance.
Dear @yzhaiustc, thanks for your amazing effort on this repo. However, we cannot build this project because ``helper_functions.h`` is missing. Could you provide this file? Thanks!
```cpp
for (n_count = 0; n_count < N; n_count++) {
    cudaEventRecord(beg);
    test_kernel(kernel_num, m, n, k, alpha, dA, dB, beta, dC, err);
    cudaEventRecord(end);
    cudaEventSynchronize(beg);
    cudaEventSynchronize(end);
    cudaEventElapsedTime(&ms, beg, end);
    elapsed_time +=...
```
kernel3
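Hello, I'll ask my question in Chinese. In kernel3 you changed the block size from (32,32) to (1024), and you list three benefits of doing so: 1. storing threadIdx.x before re-using it massively 2. in order to reduce living registers 3. benefit the compiler optimization. I don't really understand what any of these mean, and I can't find a matching explanation in books or online. Could you explain them in more detail? If you could also point to some reference material, that would be even better!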
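For readers following this question, here is a minimal host-side sketch of the index arithmetic that a switch from a (32,32) block to a 1-D block of 1024 threads implies. It is not the repository's kernel. It assumes the kernel reads the flat thread index once into a local variable and derives both tile coordinates from it with bit operations. The names `kTile`, `kThreads`, `Coord`, `flat_to_2d` and `flat_mapping_covers_tile` are hypothetical and used only for illustration. The one fact it relies on is that CUDA linearizes a 2-D block as `threadIdx.x + threadIdx.y * blockDim.x`.

```cpp
#include <array>
#include <cassert>

// Hypothetical names for illustration; none are taken from the repository.
constexpr int kTile = 32;                // tile edge, matching the (32,32) block
constexpr int kThreads = kTile * kTile;  // 1024 threads in the 1-D block

struct Coord {
    int x;  // what threadIdx.x would be in the (32,32) launch
    int y;  // what threadIdx.y would be in the (32,32) launch
};

// Map a flat thread index t (the value threadIdx.x holds in a 1-D block of
// 1024 threads) to the (x, y) pair a (32,32) block would have provided.
// Because 32 is a power of two, the split is a mask and a shift.
inline Coord flat_to_2d(int t) {
    assert(t >= 0 && t < kThreads);
    return Coord{t & (kTile - 1), t >> 5};
}

// Returns true if the flat mapping visits every (x, y) in the tile exactly
// once, i.e. the 1-D launch covers the same 32x32 tile as the 2-D launch.
inline bool flat_mapping_covers_tile() {
    std::array<int, kThreads> hits{};  // zero-initialized visit counters
    for (int t = 0; t < kThreads; ++t) {
        const Coord c = flat_to_2d(t);
        hits[c.y * kTile + c.x] += 1;
    }
    for (int h : hits) {
        if (h != 1) return false;
    }
    return true;
}
```

Under these assumptions, a single cached index replaces separate reads of `threadIdx.x` and `threadIdx.y`. Whether that actually lowers register pressure or helps the compiler depends on the kernel and toolchain, so it is worth checking the register count reported by the compiler rather than taking it for granted.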