GEMM much slower than GEMV for multiplying column or row vectors
### What is the expected behavior
GEMM where one of m or n is 1 should perform similarly to the equivalent GEMV call, because it could simply call the GEMV kernel (as cuBLAS does).
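The equivalence this relies on can be sanity-checked on the CPU. A minimal NumPy sketch (NumPy stands in here for the BLAS call; the shapes mirror the reproducer below), showing that a GEMM with m == 1 computes the same result as the corresponding GEMV:

```python
import numpy as np

# With m == 1, the GEMM
#   C (1 x n) = A (1 x k) @ B^T (k x n)
# is the same computation as the GEMV
#   y (n,) = B (n x k) @ x (k,)  with x = A[0],
# so a GEMM implementation can legitimately dispatch to the GEMV kernel.
rng = np.random.default_rng(0)
u = rng.random((1, 10000))   # "matrix" with m == 1
V = rng.random((10, 10000))

out_gemm = u @ V.T           # GEMM view: (1, 10000) @ (10000, 10) -> (1, 10)
out_gemv = V @ u[0]          # GEMV view: (10, 10000) @ (10000,)   -> (10,)

assert np.allclose(out_gemm[0], out_gemv)
```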
### What actually happens
GEMM performs much worse than GEMV.
### How to reproduce
For simplicity I'm using CuPy's wrappers, but these just call the underlying rocBLAS functions with appropriate argument checks. I have a script `gemm.py`:
```python
import cupy
from cupy_backends.cuda.libs.cublas import CUBLAS_OP_N, CUBLAS_OP_T

u = cupy.random.random((1, 10000))
V = cupy.random.random((10, 10000))
out = cupy.empty((1, 10))

for _ in range(100):
    cupy.cublas.gemm(CUBLAS_OP_N, CUBLAS_OP_T, u, V, alpha=1, beta=0, out=out)
for _ in range(100):
    cupy.cublas.gemv(CUBLAS_OP_N, 1, V, u[0], 0, out[0])
```
and run the script under rocprof to get kernel timings:

```shell
rocprof --stats python gemm.py
```
This shows two distinct kernels, each called 100 times. The gemv kernel averages around 12.5 µs per call, but the gemm kernel takes about 1273 µs to perform the same computation:
"Name","Calls","TotalDurationNs","AverageNs","Percentage"
"Cijk_Ailk_Bljk_DB_MT64x32x8_SE_1LDSB0_APM1_AF0EM1_AF1EM1_AMAS0_ASAE01_ASCE01_ASEM1_BL1_DTL0_DVO0_EPS1_FL0_GRVW1_GSU1_ISA906_IU1_K1_KLA_LBSPP0_LPA0_LPB0_LDL1_LRVW1_MAC_MDA2_NLCA1_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR1_PLR1_RK0_SIA1_SS0_SU32_SUM0_SUS256_SRVW0_SVW2_SNLL0_TT4_4_TLDS0_USFGRO1_VAW1_VS1_VW1_WSGRA0_WSGRB0_WS64_WG16_8_1_WGM1.kd",100,127322978,1273229,98.11602450966042
"void gemvt_kernel<false, 256, double, double, double, double>(int, int, double, long, double const*, long, int, long, double const*, long, int, long, double, long, double*, long, int, long) [clone .kd]",100,1246876,12468,0.9608518281476839
### Environment
| Hardware | description |
|---|---|
| GPU | AMD Vega 20 |
| CPU | AMD EPYC 7742 64-Core Processor |

| Software | version |
|---|---|
| ROCK | Not sure? |
| ROCR | v4.3.1 |
| HCC | v4.3.21331-94fc2572 |
| Library | v4.3.1 |
Hi @peterbell10, thanks for bringing this up. I am also seeing the slowdown of gemm relative to gemv using our rocblas-bench tool. I'll add this to my list and get back to you when I have some changes ready.
Just wanted to update this and let you know that we have changes in the works. The performance advantage of gemv over gemm with `m == 1 || n == 1` isn't consistent across all architectures, matrix operations, and sizes, so this might take a little while longer to ensure performance improves all around as expected.