GEMM much slower than GEMV for multiplying column or row vectors
### What is the expected behavior
GEMM where one of m or n is 1 should perform similarly to the equivalent GEMV call, because it could simply call the GEMV kernel (as cuBLAS does).
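The equivalence this relies on can be sanity-checked on the CPU. A minimal NumPy sketch (NumPy stands in here for the BLAS call; the shapes mirror the reproducer below), showing that a GEMM with m == 1 computes the same result as the corresponding GEMV:

```python
import numpy as np

# With m == 1, the GEMM
#   C (1 x n) = A (1 x k) @ B^T (k x n)
# is the same computation as the GEMV
#   y (n,) = B (n x k) @ x (k,)  with x = A[0],
# so a GEMM implementation can legitimately dispatch to the GEMV kernel.
rng = np.random.default_rng(0)
u = rng.random((1, 10000))   # "matrix" with m == 1
V = rng.random((10, 10000))

out_gemm = u @ V.T           # GEMM view: (1, 10000) @ (10000, 10) -> (1, 10)
out_gemv = V @ u[0]          # GEMV view: (10, 10000) @ (10000,)   -> (10,)

assert np.allclose(out_gemm[0], out_gemv)
```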
### What actually happens
GEMM performs much worse than GEMV.
### How to reproduce
For simplicity I'm using CuPy's wrappers, but these just call the underlying rocBLAS functions with appropriate argument checks. I have a script `gemm.py`:
```python
import cupy
from cupy_backends.cuda.libs.cublas import CUBLAS_OP_N, CUBLAS_OP_T

u = cupy.random.random((1, 10000))
V = cupy.random.random((10, 10000))
out = cupy.empty((1, 10))

for _ in range(100):
    cupy.cublas.gemm(CUBLAS_OP_N, CUBLAS_OP_T, u, V, alpha=1, beta=0, out=out)
for _ in range(100):
    cupy.cublas.gemv(CUBLAS_OP_N, 1, V, u[0], 0, out[0])
```
and run the script under rocprof to get kernel timings:

```shell
rocprof --stats python gemm.py
```
This shows two distinct kernels, each called 100 times. The gemv kernel averages around 12.5 µs per call, but the gemm kernel takes about 1273 µs to perform the same computation:
"Name","Calls","TotalDurationNs","AverageNs","Percentage"
"Cijk_Ailk_Bljk_DB_MT64x32x8_SE_1LDSB0_APM1_AF0EM1_AF1EM1_AMAS0_ASAE01_ASCE01_ASEM1_BL1_DTL0_DVO0_EPS1_FL0_GRVW1_GSU1_ISA906_IU1_K1_KLA_LBSPP0_LPA0_LPB0_LDL1_LRVW1_MAC_MDA2_NLCA1_NLCB1_ONLL1_OPLV0_PK0_PAP0_PGR1_PLR1_RK0_SIA1_SS0_SU32_SUM0_SUS256_SRVW0_SVW2_SNLL0_TT4_4_TLDS0_USFGRO1_VAW1_VS1_VW1_WSGRA0_WSGRB0_WS64_WG16_8_1_WGM1.kd",100,127322978,1273229,98.11602450966042
"void gemvt_kernel<false, 256, double, double, double, double>(int, int, double, long, double const*, long, int, long, double const*, long, int, long, double, long, double*, long, int, long) [clone .kd]",100,1246876,12468,0.9608518281476839
### Environment
| Hardware | description |
|---|---|
| GPU | AMD Vega 20 |
| CPU | AMD EPYC 7742 64-Core Processor |

| Software | version |
|---|---|
| ROCK | Not sure? |
| ROCR | v4.3.1 |
| HCC | v4.3.21331-94fc2572 |
| Library | v4.3.1 |
Hi @peterbell10, thanks for bringing this up. I am also seeing the slowdown of gemm relative to gemv using our rocblas-bench tool. I'll add this to my list and get back to you when I have some changes ready.
Just wanted to update this and let you know that we have changes in the works. The performance advantage of gemv over gemm with `m == 1 || n == 1` isn't consistent across all architectures, matrix operations, and sizes, so this might take a little while longer to ensure performance improves all around as expected.