CLBlast
CGEMM performance lower than SGEMM?
Hi, Thanks again for this great library.
I used clBLAS a lot before switching to CLBlast, especially because CLBlast behaves better on non-AMD platforms thanks to the autotuning, and because some functions, including GEMM, have batched variants. On an AMD GPU, SGEMM performs better with clBLAS than with CLBlast, except on small matrices when using the batched version. On big matrices it achieves about 50 to 70% of the theoretical peak. CGEMM, however, does not behave as well. What I observed with clBLAS is that CGEMM can achieve better performance than SGEMM (say 900 GFLOPS on 4K by 4K matrices, instead of 700 GFLOPS). With CLBlast (I have tested master, version 1.5.0, and version 1.4.1), where SGEMM achieves 500 GFLOPS at 4K, CGEMM only reaches 200-250 GFLOPS. I have tested on both AMD devices with ROCm 4.0 and NVIDIA hardware (an old Kepler GT650m), and I observed the same behavior.
Is it to be expected?
Good to hear that you've discovered CLBlast!
First of all, CLBlast also uses tuning, but it relies on tuning data from others to get decent speed out of the box. So if a device similar to yours is already in the database, it might be fast; if not, there might be something to gain. Second, there are smaller supporting kernels around the main GEMM kernel that might be slow in your case. Third, the implementation might be slow for your specific matrix sizes and not for the default ones it was tuned for.
We can test all three of these easily by running the tuner instead of your own code. That way we'll only run a single GEMM kernel and will be able to find the best performance. So, first make sure you compile the library with -DTUNERS=ON passed to CMake. Then run the tuner on both your devices, inspect the best reported GFLOPS values, and report them here. Example run of the tuner:
./clblast_tuner_xgemm --precision 3232 -m 512 -n 512 -k 512
Note that it runs 4 times, with 4 different kernel/search strategies. And note that higher values of m/n/k will typically give you better speed in terms of GFLOPS.
Hi, the performance numbers I observed were obtained after following the CLBlast tuning guide (make alltuners + the Python script).
For good measure, I have re-run the specific command you asked for on both the GT650m (an ASUS laptop) and the RX460 (a desktop).
GT650m, SGEMM:
- xgemm_1:
  - Got average result of 5.51 ms: 48.7 GFLOPS
  - Found best result 1.61 ms: 167.0 GFLOPS
  - Best parameters: GEMMK=0 KREG=1 KWG=32 KWI=2 MDIMA=8 MDIMC=8 MWG=32 NDIMB=8 NDIMC=8 NWG=64 PRECISION=32 SA=0 SB=0 STRM=0 STRN=0 VWM=4 VWN=4
- xgemm_2:
  - Got average result of 12.52 ms: 21.4 GFLOPS
  - Found best result 1.26 ms: 213.3 GFLOPS
  - Best parameters: GEMMK=0 KREG=1 KWG=16 KWI=2 MDIMA=16 MDIMC=16 MWG=64 NDIMB=16 NDIMC=16 NWG=128 PRECISION=32 SA=1 SB=0 STRM=0 STRN=0 VWM=4 VWN=4
- xgemm_11:
  - Got average result of 13.35 ms: 20.1 GFLOPS
  - Found best result 3.37 ms: 79.6 GFLOPS
  - Best parameters: GEMMK=1 KREG=2 KWG=1 KWI=1 MDIMA=16 MDIMC=16 MWG=64 NDIMB=4 NDIMC=4 NWG=32 PRECISION=32 SA=0 SB=0 STRM=0 STRN=0 VWM=2 VWN=2
- xgemm_12:
  - Got average result of 54.56 ms: 4.9 GFLOPS
  - Found best result 5.43 ms: 49.4 GFLOPS
  - Best parameters: GEMMK=1 KREG=4 KWG=1 KWI=1 MDIMA=8 MDIMC=8 MWG=32 NDIMB=16 NDIMC=16 NWG=128 PRECISION=32 SA=0 SB=0 STRM=0 STRN=0 VWM=4 VWN=2

GT650m, CGEMM:
- xgemm_1:
  - Got average result of 22.60 ms: 11.9 GFLOPS
  - Found best result 6.09 ms: 44.1 GFLOPS
  - Best parameters: GEMMK=0 KREG=1 KWG=32 KWI=2 MDIMA=16 MDIMC=16 MWG=32 NDIMB=8 NDIMC=8 NWG=64 PRECISION=3232 SA=0 SB=0 STRM=0 STRN=0 VWM=2 VWN=2
- xgemm_2:
  - Got average result of 46.01 ms: 5.8 GFLOPS
  - Found best result 4.72 ms: 56.9 GFLOPS
  - Best parameters: GEMMK=0 KREG=1 KWG=16 KWI=2 MDIMA=32 MDIMC=8 MWG=32 NDIMB=32 NDIMC=32 NWG=128 PRECISION=3232 SA=1 SB=0 STRM=0 STRN=0 VWM=1 VWN=4
- xgemm_11:
  - Got average result of 37.98 ms: 7.1 GFLOPS
  - Found best result 6.73 ms: 39.9 GFLOPS
  - Best parameters: GEMMK=1 KREG=1 KWG=1 KWI=1 MDIMA=16 MDIMC=16 MWG=32 NDIMB=4 NDIMC=4 NWG=32 PRECISION=3232 SA=0 SB=0 STRM=0 STRN=0 VWM=1 VWN=1
- xgemm_12:
  - Got average result of 172.19 ms: 1.6 GFLOPS
  - Found best result 8.07 ms: 33.3 GFLOPS
  - Best parameters: GEMMK=1 KREG=2 KWG=1 KWI=1 MDIMA=32 MDIMC=32 MWG=32 NDIMB=8 NDIMC=8 NWG=128 PRECISION=3232 SA=0 SB=0 STRM=0 STRN=0 VWM=1 VWN=1

RX460, SGEMM:
- xgemm_1:
  - Got average result of 0.97 ms: 275.4 GFLOPS
  - Found best result 0.36 ms: 737.4 GFLOPS
  - Best parameters: GEMMK=0 KREG=1 KWG=32 KWI=2 MDIMA=16 MDIMC=16 MWG=64 NDIMB=16 NDIMC=16 NWG=64 PRECISION=32 SA=1 SB=1 STRM=0 STRN=0 VWM=4 VWN=2
- xgemm_2:
  - Got average result of 1.01 ms: 265.3 GFLOPS
  - Found best result 0.36 ms: 751.6 GFLOPS
  - Best parameters: GEMMK=0 KREG=1 KWG=32 KWI=2 MDIMA=8 MDIMC=8 MWG=32 NDIMB=8 NDIMC=32 NWG=128 PRECISION=32 SA=1 SB=1 STRM=0 STRN=1 VWM=4 VWN=2
- xgemm_11:
  - Got average result of 1.98 ms: 135.4 GFLOPS
  - Found best result 0.62 ms: 435.7 GFLOPS
  - Best parameters: GEMMK=1 KREG=4 KWG=1 KWI=1 MDIMA=16 MDIMC=16 MWG=64 NDIMB=4 NDIMC=4 NWG=32 PRECISION=32 SA=0 SB=0 STRM=0 STRN=0 VWM=4 VWN=4
- xgemm_12: fails (related to another PR)

RX460, CGEMM:
- xgemm_1:
  - Got average result of 2.56 ms: 104.9 GFLOPS
  - Found best result 1.21 ms: 222.3 GFLOPS
  - Best parameters: GEMMK=0 KREG=1 KWG=32 KWI=2 MDIMA=32 MDIMC=32 MWG=64 NDIMB=8 NDIMC=8 NWG=64 PRECISION=3232 SA=1 SB=1 STRM=0 STRN=0 VWM=1 VWN=2
- xgemm_2:
  - Got average result of 11.47 ms: 23.4 GFLOPS
  - Found best result 1.18 ms: 227.9 GFLOPS
  - Best parameters: GEMMK=0 KREG=1 KWG=32 KWI=2 MDIMA=8 MDIMC=8 MWG=64 NDIMB=32 NDIMC=32 NWG=64 PRECISION=3232 SA=1 SB=1 STRM=1 STRN=0 VWM=4 VWN=2
- xgemm_11:
  - Got average result of 11.95 ms: 22.5 GFLOPS
  - Found best result 1.28 ms: 209.0 GFLOPS
  - Best parameters: GEMMK=1 KREG=4 KWG=1 KWI=1 MDIMA=16 MDIMC=16 MWG=64 NDIMB=16 NDIMC=16 NWG=64 PRECISION=3232 SA=0 SB=0 STRM=0 STRN=0 VWM=4 VWN=4
- xgemm_12: fails (related to another PR)
Thanks for running the experiments. So in summary you have:
device | SGEMM (GFLOPS) | CGEMM (GFLOPS)
---|---|---
GT650m | 213.3 | 56.9
RX460 | 751.6 | 227.9
I'm not sure what's normal for these devices, but indeed CGEMM is a lot slower than SGEMM in your experiments. Unfortunately I only stored SGEMM graphs for the devices I've tested in the past (e.g. https://cnugteren.github.io/clblast/results/tahiti.html), so those are of no help as a reference. I just tested on my own device and indeed I also get much lower CGEMM performance than SGEMM.
So back to the main question: is it normal that the performance is lower? First of all, keep in mind that in both cases the formula for computing GFLOPS from the execution time is the same (at least the one used in CLBlast), so if SGEMM finishes in 10 ms for a given matrix size and CGEMM also finishes in 10 ms, they will get the same number. Given that, I would say it is expected that CGEMM is slower, for two reasons:
- It needs to load twice as much data from memory, so that can lead to a little bit of a slowdown depending on the device/parameters.
- It needs to do 6 operations (4 multiplications, an addition, and a subtraction) for a single 'multiplication' of two complex numbers (if I'm correct).
Given that, I don't think your results are bad; actually they seem rather good, given that you only see a factor 3 or 4 slowdown.
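To make the second point concrete, here is the arithmetic written out as plain code (just an illustration of the operation count, not CLBlast's kernel code): a real multiply-add costs 2 flops, while a complex multiply-add costs 8.

```cpp
// Illustration of the operation counts only, not CLBlast kernel code.
#include <cstdio>

int main() {
  // Real (SGEMM-style) inner-loop step: 1 multiply + 1 add = 2 flops.
  float a = 1.5f, b = 2.5f, acc = 0.0f;
  acc += a * b;

  // Complex (CGEMM-style) inner-loop step, written out in real arithmetic:
  //   (a_re + i*a_im) * (b_re + i*b_im)
  //     = (a_re*b_re - a_im*b_im) + i*(a_re*b_im + a_im*b_re)
  // That is 4 multiplies, 1 addition and 1 subtraction for the product (the
  // 6 operations mentioned above), plus 2 additions for the accumulation:
  // 8 flops per inner-loop step, i.e. 4x the real case.
  float a_re = 1.5f, a_im = -0.5f, b_re = 2.5f, b_im = 0.25f;
  float acc_re = 0.0f, acc_im = 0.0f;
  acc_re += a_re * b_re - a_im * b_im;
  acc_im += a_re * b_im + a_im * b_re;

  std::printf("real acc = %f, complex acc = (%f, %f)\n", acc, acc_re, acc_im);
  return 0;
}
```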
Now these are the numbers you get from the tuner. In reality you might get somewhat worse performance when you run it from your own program (you can use the CLBlast clients to test that: compile with -DCLIENTS=ON and run e.g. clblast_client_xgemm), because of pre/post-processing kernels depending on your memory layout (row/col-major) and the transpose options for the two matrices.
> keep in mind that in both cases the formula for computing GFLOPS from the execution time is the same
I think I now understand why it seemed weird! What formula does CLBlast use? The formulas I use for computing the flop count of SGEMM and CGEMM in my program are:
- SGEMM: M*N*(2*K - 1)
- CGEMM: M*N*(8*K - 2)
Hence, if CLBlast uses the same flop count for both operations, it is normal that the reported performance gets lower, since CGEMM needs roughly 4 times more operations. The 56.9 GFLOPS reported would then instead be about 226 GFLOPS, a little more than what SGEMM achieved. It makes more sense to me now, since CGEMM is more compute-bound than SGEMM for the same matrix size. I use a CGEMM benchmark as a thermal stress test and to check the maximum achievable GFLOPS (versus the theoretical peak).
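For what it's worth, here is a small sketch I put together (my own helper, not CLBlast code) that plugs the best GT650m CGEMM time from above (4.72 ms at 512x512x512) into both conventions: the real-valued count reproduces the ~57 GFLOPS the tuner printed, while the complex-valued count gives the ~227 GFLOPS figure.

```cpp
// My own helper for converting a measured time into GFLOPS under the two
// flop-counting conventions above; not code from CLBlast.
#include <cstdio>

double to_gflops(double flops, double time_ms) {
  return flops / (time_ms * 1e6);  // flops / seconds, scaled to giga
}

int main() {
  const double m = 512.0, n = 512.0, k = 512.0;
  const double time_ms = 4.72;  // best GT650m CGEMM time reported above

  const double real_count    = m * n * (2.0 * k - 1.0);  // SGEMM-style count
  const double complex_count = m * n * (8.0 * k - 2.0);  // CGEMM-style count

  std::printf("counted as real:    %.1f GFLOPS\n", to_gflops(real_count, time_ms));
  std::printf("counted as complex: %.1f GFLOPS\n", to_gflops(complex_count, time_ms));
  return 0;
}
```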
Indeed, so I think that solves this mystery :-)
But you are right, perhaps that should be computed differently. For the GEMM tuner this is done here: https://github.com/CNugteren/CLBlast/blob/master/src/tuning/kernels/xgemm.hpp#L162 And in the performance test 'client' that is done here in the same way: https://github.com/CNugteren/CLBlast/blob/master/test/routines/level3/xgemm.hpp#L196 Feel free to make a PR to improve this for the (single & double precision) complex number cases.
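For reference, a rough sketch of what such a precision-aware flop count could look like; the names and structure here are mine and do not match the actual code at the links above, they only illustrate the idea behind such a PR.

```cpp
// Hypothetical sketch of a precision-aware flop count for GEMM; illustrative
// only, not the existing CLBlast tuner/client code.
#include <cstddef>

// Assumed convention: 2*M*N*K flops for real precisions, and 4x that for
// complex precisions (4 multiplies + 4 additions per complex multiply-add).
double GemmFlopCount(std::size_t m, std::size_t n, std::size_t k, bool is_complex) {
  const double ops_per_multiply_add = is_complex ? 8.0 : 2.0;
  return ops_per_multiply_add * static_cast<double>(m)
                              * static_cast<double>(n)
                              * static_cast<double>(k);
}
```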