CUDALibrarySamples

CublasLtMatMul seems slow compared with Gemm

Open wangyems opened this issue 4 years ago • 3 comments

I tried replacing SGemm() with CublasLtMatMul() because of its wider choice of algorithm knobs, such as tile size, but found that CublasLtMatMul() is generally slower than Gemm(). Is that expected?

Here is a profiling tool you can use to reproduce this: https://github.com/jeng1220/cuGemmProf

e.g. ./cuGemmProf -m 512 -n 768 -k 3072 --type 5,6 -l 1000 --all_algo
Device, Op(A), Op(B), M, N, K, ComputeType, A, B, C, DP4A.Restrictions(lda.ldb), TensorCoreRestrictions(m.k.A.B.C.lda.ldb.ldc), Algo, Time(ms), GFLOPS, LtAlgoId, TileId, SpliteK, Red.Sch, Swizzle, CustomId, WorkSpaceSize, WaveCount
Tesla V100-PCIE-16GB, CUBLAS_OP_N, CUBLAS_OP_N, 512, 768, 3072, CUBLAS_COMPUTE_32F, CUDA_R_32F, CUDA_R_32F, CUDA_R_32F, all meet, all meet, CUBLAS_GEMM_DEFAULT, 0.214479, 11264.1
Tesla V100-PCIE-16GB, CUBLAS_OP_N, CUBLAS_OP_N, 512, 768, 3072, CUBLAS_COMPUTE_32F, CUDA_R_32F, CUDA_R_32F, CUDA_R_32F, all meet, all meet, CUBLAS_GEMM_ALGO22, 0.215919, 11189
Tesla V100-PCIE-16GB, CUBLAS_OP_N, CUBLAS_OP_N, 512, 768, 3072, CUBLAS_COMPUTE_32F, CUDA_R_32F, CUDA_R_32F, CUDA_R_32F, all meet, all meet, CUBLAS_GEMM_ALGO9, 0.223748, 10797.5
Tesla V100-PCIE-16GB, CUBLAS_OP_N, CUBLAS_OP_N, 512, 768, 3072, CUBLAS_COMPUTE_32F, CUDA_R_32F, CUDA_R_32F, CUDA_R_32F, all meet, all meet, CUBLAS_GEMM_ALGO1_TENSOR_OP, 0.0716704, 33708.8
Tesla V100-PCIE-16GB, CUBLAS_OP_N, CUBLAS_OP_N, 512, 768, 3072, CUBLAS_COMPUTE_32F, CUDA_R_32F, CUDA_R_32F, CUDA_R_32F, all meet, all meet, CUBLAS_GEMM_ALGO0_TENSOR_OP, 0.170531, 14167.1
Tesla V100-PCIE-16GB, CUBLAS_OP_N, CUBLAS_OP_N, 512, 768, 3072, CUBLAS_COMPUTE_32F, CUDA_R_32F, CUDA_R_32F, CUDA_R_32F, all meet, all meet, CUBLAS_GEMM_ALGO7_TENSOR_OP, 0.198502, 12170.8
Tesla V100-PCIE-16GB, CUBLAS_OP_N, CUBLAS_OP_N, 512, 768, 3072, CUBLAS_COMPUTE_32F, CUDA_R_32F, CUDA_R_32F, CUDA_R_32F, all meet, all meet, CUBLASLT_1ST_HEURISTIC_ALG, 0.240428, 10048.4, 1, CUBLASLT_MATMUL_TILE_64x32, 1, CUBLASLT_REDUCTION_SCHEME_NONE, 0, 0, 0, 1.000000
Tesla V100-PCIE-16GB, CUBLAS_OP_N, CUBLAS_OP_N, 512, 768, 3072, CUBLAS_COMPUTE_64F, CUDA_R_64F, CUDA_R_64F, CUDA_R_64F, all meet, all meet, CUBLAS_GEMM_ALGO5, 0.406726, 5939.92
Tesla V100-PCIE-16GB, CUBLAS_OP_N, CUBLAS_OP_N, 512, 768, 3072, CUBLAS_COMPUTE_64F, CUDA_R_64F, CUDA_R_64F, CUDA_R_64F, all meet, all meet, CUBLAS_GEMM_DEFAULT, 0.406746, 5939.63
Tesla V100-PCIE-16GB, CUBLAS_OP_N, CUBLAS_OP_N, 512, 768, 3072, CUBLAS_COMPUTE_64F, CUDA_R_64F, CUDA_R_64F, CUDA_R_64F, all meet, all meet, CUBLAS_GEMM_ALGO4, 0.410825, 5880.65
Tesla V100-PCIE-16GB, CUBLAS_OP_N, CUBLAS_OP_N, 512, 768, 3072, CUBLAS_COMPUTE_64F, CUDA_R_64F, CUDA_R_64F, CUDA_R_64F, all meet, all meet, CUBLAS_GEMM_ALGO14_TENSOR_OP, 0.406679, 5940.61
Tesla V100-PCIE-16GB, CUBLAS_OP_N, CUBLAS_OP_N, 512, 768, 3072, CUBLAS_COMPUTE_64F, CUDA_R_64F, CUDA_R_64F, CUDA_R_64F, all meet, all meet, CUBLAS_GEMM_ALGO12_TENSOR_OP, 0.406684, 5940.53
Tesla V100-PCIE-16GB, CUBLAS_OP_N, CUBLAS_OP_N, 512, 768, 3072, CUBLAS_COMPUTE_64F, CUDA_R_64F, CUDA_R_64F, CUDA_R_64F, all meet, all meet, CUBLAS_GEMM_ALGO3_TENSOR_OP, 0.406689, 5940.45
Tesla V100-PCIE-16GB, CUBLAS_OP_N, CUBLAS_OP_N, 512, 768, 3072, CUBLAS_COMPUTE_64F, CUDA_R_64F, CUDA_R_64F, CUDA_R_64F, all meet, all meet, CUBLASLT_1ST_HEURISTIC_ALG, 0.583007, 4143.89, 0, CUBLASLT_MATMUL_TILE_128x64, 1, CUBLASLT_REDUCTION_SCHEME_NONE, 0, 0, 0, 0.000000

wangyems · Apr 01 '21 23:04

Any updates?

wangyems · Apr 15 '21 21:04

The best algorithm in cuBLAS should match the best algorithm in cuBLASLt. Did you run an algorithm search in cuBLASLt to find it? cuBLAS has built-in heuristics that choose the best algorithm for you, whereas in cuBLASLt the first heuristic result is not guaranteed to be the fastest one.
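A sketch of such an algorithm search, under the problem sizes from the profile above (m=512, n=768, k=3072, FP32, column-major, no transposes): request several heuristic candidates from cublasLtMatmulAlgoGetHeuristic instead of only the first, then time each one and keep the fastest. The iteration count, workspace size, and lack of error checking are simplifications for brevity, not recommendations.

```cpp
// Hedged sketch: enumerate cuBLASLt heuristic candidates and time each,
// rather than taking only the first result. Requires a CUDA GPU and
// linking against cublasLt. Cleanup and status checks are omitted.
#include <cublasLt.h>
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    const int m = 512, n = 768, k = 3072;   // sizes from the profile above
    const float alpha = 1.0f, beta = 0.0f;

    float *A, *B, *C;
    cudaMalloc(&A, sizeof(float) * m * k);
    cudaMalloc(&B, sizeof(float) * k * n);
    cudaMalloc(&C, sizeof(float) * m * n);

    cublasLtHandle_t lt;
    cublasLtCreate(&lt);

    cublasLtMatmulDesc_t op;
    cublasLtMatmulDescCreate(&op, CUBLAS_COMPUTE_32F, CUDA_R_32F);

    // Column-major layouts, CUBLAS_OP_N / CUBLAS_OP_N as in the profile.
    cublasLtMatrixLayout_t la, lb, lc;
    cublasLtMatrixLayoutCreate(&la, CUDA_R_32F, m, k, m);
    cublasLtMatrixLayoutCreate(&lb, CUDA_R_32F, k, n, k);
    cublasLtMatrixLayoutCreate(&lc, CUDA_R_32F, m, n, m);

    // Allowing a workspace widens the set of eligible algorithms;
    // a zero-byte limit can rule out the fastest kernels.
    cublasLtMatmulPreference_t pref;
    cublasLtMatmulPreferenceCreate(&pref);
    size_t wsBytes = 32u * 1024 * 1024;  // arbitrary 32 MiB budget for this sketch
    cublasLtMatmulPreferenceSetAttribute(
        pref, CUBLASLT_MATMUL_PREF_MAX_WORKSPACE_BYTES, &wsBytes, sizeof(wsBytes));
    void *workspace;
    cudaMalloc(&workspace, wsBytes);

    // Ask for several candidates, not just the first heuristic result.
    const int requested = 8;
    cublasLtMatmulHeuristicResult_t results[requested];
    int returned = 0;
    cublasLtMatmulAlgoGetHeuristic(lt, op, la, lb, lc, lc, pref,
                                   requested, results, &returned);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    for (int i = 0; i < returned; ++i) {
        // One warm-up launch, then time a batch of iterations.
        cublasLtMatmul(lt, op, &alpha, A, la, B, lb, &beta, C, lc, C, lc,
                       &results[i].algo, workspace, wsBytes, 0);
        cudaEventRecord(start);
        for (int it = 0; it < 100; ++it)
            cublasLtMatmul(lt, op, &alpha, A, la, B, lb, &beta, C, lc, C, lc,
                           &results[i].algo, workspace, wsBytes, 0);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("candidate %d: %.4f ms/iter\n", i, ms / 100.0f);
    }
    return 0;
}
```

The fastest candidate from a loop like this is the fair point of comparison against cuBLAS_GEMM_DEFAULT; comparing only the first heuristic result (CUBLASLT_1ST_HEURISTIC_ALG in the table above) can make cuBLASLt look slower than it is.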

mnicely · Dec 15 '22 12:12

@wangyems I know it has been quite some time, but I hope you have found an answer. Please let me know if this is still relevant, and we can help.

hbabak · May 10 '23 04:05