CLBlast
Sub-optimal performance on Qualcomm Adreno GPUs
As reported by sivagnanamn in issue #181, performance on Qualcomm Adreno GPUs is far from the theoretical maximum. This should be addressed by comparing reference GEMM implementations on Adreno hardware with the design of CLBlast. One example is a Qualcomm tutorial (part 1 and part 2), but not much else seems to be available.
Any links to fast OpenCL implementations on Adreno GPUs are welcome here. When I have some time I'll try to get hold of such a device and try some things out.
I've added a test branch (adreno_tryout) in CLBlast to test the Qualcomm-provided kernel from the tutorial mentioned above. This is a very hacky integration of that kernel and is by no means meant to be actually used. However, it is there to find out whether that kernel fixes the performance issues with CLBlast. If so, I can work towards integrating such a kernel properly. If not, we'll have to continue investigating. One thing suggested by the tutorial is using an OpenCL image for matrix B, but I didn't implement that.
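For illustration only (this is not CLBlast or tutorial code, and fetch_b_demo, Bbuf and Bimg are made-up names): a minimal OpenCL C kernel contrasting the two ways of fetching matrix B, a plain buffer load versus a samplerless 2D image read as used in the image-based kernels further down this thread. Bimg is assumed to be a CL_RGBA/CL_FLOAT image of width n/4 and height k, and Bbuf a row-major k x n buffer.
__kernel void fetch_b_demo(__global const float* Bbuf,
                           __read_only image2d_t Bimg,
                           __global float4* out,
                           const int n) {
  const int col4 = get_global_id(0);  // which group of 4 consecutive floats in the row
  const int row  = get_global_id(1);  // which row of B
  const float4 via_buffer = vload4(col4, Bbuf + row * n);          // regular global-memory path
  const float4 via_image  = read_imagef(Bimg, (int2)(col4, row));  // texture path (OpenCL 1.2 samplerless read)
  out[row * (n / 4) + col4] = via_buffer + via_image;              // keep both reads live
}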
The branch adreno_tryout contains the Qualcomm-provided kernel and also a modified tuner to tune the local workgroup size. Could anyone with an Adreno GPU (e.g. @sivagnanamn) please do the following for me:
- First test performance with the latest master branch for reference, e.g. ./clblast_client_xgemm -m 256 -n 256 -k 256 -num_steps 4 -step 256
- Check out adreno_tryout and run ./clblast_test_xgemm to see if everything works OK.
- Run the Qualcomm-specific tuner: ./clblast_tuner_xgemm and share the output here.
- Modify the numbers in src/database/kernels/xgemm/xgemm_32.hpp according to the output of the tuner.
- Test performance again using ./clblast_client_xgemm and compare with what you had before.
Notes: this branch is currently single-precision FP32 only and assumes alpha=1 and beta=0.
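As a side note for anyone wiring this into their own code, a hedged sketch (not part of the branch) of how this restricted case (FP32, alpha=1, beta=0) maps onto a regular GEMM call through CLBlast's C API. Buffer setup and error handling are omitted, and a_mem, b_mem and c_mem are assumed to be valid cl_mem objects holding row-major m x k, k x n and m x n floats.
#include <clblast_c.h>

CLBlastStatusCode gemm_alpha1_beta0(cl_mem a_mem, cl_mem b_mem, cl_mem c_mem,
                                    size_t m, size_t n, size_t k,
                                    cl_command_queue queue) {
  const float alpha = 1.0f;  /* the try-out kernel assumes alpha = 1 */
  const float beta = 0.0f;   /* ... and beta = 0 */
  return CLBlastSgemm(CLBlastLayoutRowMajor,
                      CLBlastTransposeNo, CLBlastTransposeNo,
                      m, n, k,
                      alpha,
                      a_mem, 0, k,   /* lda = k for row-major A (m x k) */
                      b_mem, 0, n,   /* ldb = n for row-major B (k x n) */
                      beta,
                      c_mem, 0, n,   /* ldc = n for row-major C (m x n) */
                      &queue, NULL);
}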
@CNugteren Thank you for your time.
I'm getting a memory error while building the latest master with -DTUNERS=ON & -DCLIENTS=ON.
root@linaro:/home/linaro/CLBlast/build# make
-- Building CLBlast with OpenCL API (default)
-- Configuring done
-- Generating done
-- Build files have been written to: /home/linaro/CLBlast/build
Scanning dependencies of target clblast
[ 0%] Building CXX object CMakeFiles/clblast.dir/src/database/database.cpp.o
cc1plus: out of memory allocating 1126128 bytes after a total of 102879232 bytes
make[2]: *** [CMakeFiles/clblast.dir/src/database/database.cpp.o] Error 1
make[1]: *** [CMakeFiles/clblast.dir/all] Error 2
make: *** [all] Error 2
Available RAM (before "make"):
root@linaro-developer:/home/linaro/CLBlast/build# free -m
total used free shared buffers cached
Mem: 1626 98 1527 0 2 15
-/+ buffers/cache: 80 1546
Available HDD space (before "make"):
root@linaro:/home/linaro/CLBlast/build# df -h
Size Used Avail Use% Mounted on
7.9G 5.6G 2.3G 71% /
I'm currently checking with an older commit (249bdaa8e9a111573f5c3a821230bba6437817c7, Dec 19, 2017). I'll share the output ASAP.
@CNugteren
First test performance with the latest master branch for reference, e.g. ./clblast_client_xgemm -m 256 -n 256 -k 256 -num_steps 4 -step 256
| <-- CLBlast --> | <-- clBLAS --> | <-- CPU BLAS --> |
m; n; k; layout; transA; transB; lda; ldb; ldc; offa; offb; offc; alpha; beta; ms_1; GFLOPS_1; GBs_1; ms_2; GFLOPS_2; GBs_2; ms_3; GFLOPS_3; GBs_3
256; 256; 256; 101; 111; 111; 256; 256; 256; 0; 0; 0; 2.00; 2.00; 24.93; 1.3; 0.0; 35.19; 1.0; 0.0; 31.43; 1.1; 0.0
512; 512; 512; 101; 111; 111; 512; 512; 512; 0; 0; 0; 2.00; 2.00; 36.69; 7.3; 0.1; 238.70; 1.1; 0.0; 253.38; 1.1; 0.0
768; 768; 768; 101; 111; 111; 768; 768; 768; 0; 0; 0; 2.00; 2.00; 131.05; 6.9; 0.1; 818.74; 1.1; 0.0; 874.82; 1.0; 0.0
1K; 1K; 1K; 101; 111; 111; 1K; 1K; 1K; 0; 0; 0; 2.00; 2.00; 253.07; 8.5; 0.1; 1874.76; 1.1; 0.0; 2319.10; 0.9; 0.0
Check out adreno_tryout and run ./clblast_test_xgemm to see if everything works OK.
* Completed all test-cases for this routine. Results:
234 test(s) passed
278 test(s) skipped
0 test(s) failed
Run the Qualcomm-specific tuner: ./clblast_tuner_xgemm and share the output here.
* Options given/available:
-platform 0 [=default]
-device 0 [=default]
-precision 32 (single) [=default]
-m 1024 [=default]
-n 1024 [=default]
-k 1024 [=default]
-alpha 2.00 [=default]
-beta 2.00 [=default]
-fraction 1.00 [=default]
-runs 2 [=default]
-max_l2_norm 0.00 [=default]
* Found best result 276.22 ms: 7.8 GFLOPS
* Best parameters: KWG=4 KWI=1 MDIMA=8 MDIMC=8 MWG=32 NDIMB=4 NDIMC=4 NWG=32 PRECISION=32 SA=0 SB=0 STRM=0 STRN=0 VWM=4 VWN=4
Qualcomm XGEMM Tuner JSON output ==> clblast_xgemm_1_32.json.zip
Modify the numbers in src/database/kernels/xgemm/xgemm_32.hpp according to the output of the tuner. Test performance again using ./clblast_client_xgemm and compare with what you had before.
| <-- CLBlast --> | <-- clBLAS --> | <-- CPU BLAS --> |
m; n; k; layout; transA; transB; lda; ldb; ldc; offa; offb; offc; ms_1; GFLOPS_1; GBs_1; ms_2; GFLOPS_2; GBs_2; ms_3; GFLOPS_3; GBs_3
256; 256; 256; 101; 111; 111; 256; 256; 256; 0; 0; 0; 7.24; 4.6; 0.1; 31.69; 1.1; 0.0; 31.43; 1.1; 0.0
512; 512; 512; 101; 111; 111; 512; 512; 512; 0; 0; 0; 36.15; 7.4; 0.1; 248.76; 1.1; 0.0; 280.43; 1.0; 0.0
768; 768; 768; 101; 111; 111; 768; 768; 768; 0; 0; 0; 109.09; 8.3; 0.1; 848.87; 1.1; 0.0; 1120.44; 0.8; 0.0
1K; 1K; 1K; 101; 111; 111; 1K; 1K; 1K; 0; 0; 0; 274.32; 7.8; 0.1; 1960.96; 1.1; 0.0; 3284.81; 0.7; 0.0
Overall performance improvement:
With the adreno_tryout branch, I got a ~150-200 ms improvement in the overall execution speed of my DNN inference engine.
Ex: ./clblast_client_xgemm -m 32 -n 50176 -k 144 -alpha 1.00 -beta 0.00 -precision 32
master branch (249bdaa8e9a111573f5c3a821230bba6437817c7)
| <-- CLBlast --> | <-- clBLAS --> | <-- CPU BLAS --> |
m; n; k; layout; transA; transB; lda; ldb; ldc; offa; offb; offc; ms_1; GFLOPS_1; GBs_1; ms_2; GFLOPS_2; GBs_2; ms_3; GFLOPS_3; GBs_3
32; 49K; 144; 101; 111; 111; 144; 49K; 49K; 0; 0; 0; 187.26; 2.5; 0.2; 482.41; 1.0; 0.1; 633.15; 0.7; 0.1
adreno_tryout branch
| <-- CLBlast --> | <-- clBLAS --> | <-- CPU BLAS --> |
m; n; k; layout; transA; transB; lda; ldb; ldc; offa; offb; offc; alpha; beta; ms_1; GFLOPS_1; GBs_1; ms_2; GFLOPS_2; GBs_2; ms_3; GFLOPS_3; GBs_3
32; 49K; 144; 101; 111; 111; 144; 49K; 49K; 0; 0; 0; 1.00; 0.00; 51.29; 9.0; 0.8; 489.78; 0.9; 0.1; 830.61; 0.6; 0.1
Thanks again for your time.
@CNugteren
Actually, the master branch (bd540829ea1954c3b367ec70aa8c5811b643422b) works better than the adreno_tryout branch on my Xiaomi 5 with Snapdragon 820 and Adreno 530 GPU. With the master branch, clblast_tuner_xgemm reports about 42 GFLOPS for the default setting, while with the adreno_tryout branch the number is 31.2 GFLOPS.
@wang-jinwei Thank you very much for sharing the GFLOPS details.
GPU | Theoretical GFLOPS | CLBlast GFLOPS
---|---|---
Adreno 330 | 166.5 (578 MHz) | 7~10
Adreno 530 | 519 (650 MHz) | 42
Source: Adreno Wiki
So I believe there's still scope for improvement. Thanks to @CNugteren for the efforts to reduce this gap.
Please correct me if I'm wrong.
Using the current adreno_tryout branch, I tried to tune GEMM for a custom MNK size (results shown below):
The best tuner configuration gave 115 GFLOPS at 1.51ms.
./clblast_tuner_xgemm -m 16 -n 200704 -k 27 -alpha 1.00 -beta 0.00 -precision 32
| ID | total | param | compiles | time | GFLOPS | status |
x------x-------x-----------------------------------------------------------------------x----------------x--------------x--------x-------------------x
| ref | - | - | OK | 35.66 ms | - | reference OK |
x------x-------x-----------------------------------------------------------------------x----------------x--------------x--------x-------------------x
| 1 | 30 | 8 16 4 2 2 2 2 1 4 4 0 0 0 0 | OK 270 ms | 2.29 ms | - | L2 error 1.22e-02 | <-- skipping
| 2 | 30 | 8 32 4 2 4 2 4 1 4 4 0 0 0 0 | OK 312 ms | 1.31 ms | - | L2 error 1.22e-02 | <-- skipping
| 3 | 30 | 8 64 4 2 8 2 8 1 4 4 0 0 0 0 | OK 311 ms | 0.81 ms | - | L2 error 1.22e-02 | <-- skipping
| 4 | 30 | 8 128 4 2 16 2 16 1 4 4 0 0 0 0 | OK 289 ms | 0.57 ms | - | L2 error 1.22e-02 | <-- skipping
| 5 | 30 | 8 256 4 2 32 2 32 1 4 4 0 0 0 0 | OK 330 ms | 2.61 ms | - | L2 error 1.22e-02 | <-- skipping
| 6 | 30 | 16 16 4 4 2 4 2 1 4 4 0 0 0 0 | OK 271 ms | 1.30 ms | - | L2 error 1.22e-02 | <-- skipping
| 7 | 30 | 16 32 4 4 4 4 4 1 4 4 0 0 0 0 | OK 273 ms | 0.80 ms | - | L2 error 1.22e-02 | <-- skipping
| 8 | 30 | 16 64 4 4 8 4 8 1 4 4 0 0 0 0 | OK 291 ms | 0.65 ms | - | L2 error 1.22e-02 | <-- skipping
| 9 | 30 | 16 128 4 4 16 4 16 1 4 4 0 0 0 0 | OK 284 ms | 2.76 ms | - | L2 error 1.22e-02 | <-- skipping
| 10 | 30 | 16 256 4 4 32 4 32 1 4 4 0 0 0 0 | OK 292 ms | 1.88 ms | - | L2 error 1.22e-02 | <-- skipping
| 11 | 30 | 32 16 4 8 2 8 2 1 4 4 0 0 0 0 | OK 273 ms | 1.27 ms | - | L2 error 8.36e-03 | <-- skipping
| 12 | 30 | 32 32 4 8 4 8 4 1 4 4 0 0 0 0 | OK 277 ms | 0.88 ms | - | L2 error 8.36e-03 | <-- skipping
| 13 | 30 | 32 64 4 8 8 8 8 1 4 4 0 0 0 0 | OK 311 ms | 5.48 ms | - | L2 error 8.36e-03 | <-- skipping
| 14 | 30 | 32 128 4 8 16 8 16 1 4 4 0 0 0 0 | OK 291 ms | 3.51 ms | - | L2 error 8.36e-03 | <-- skipping
| 15 | 30 | 32 256 4 8 32 8 32 1 4 4 0 0 0 0 | OK 324 ms | 3.69 ms | - | L2 error 8.36e-03 | <-- skipping
| 16 | 30 | 64 16 4 16 2 16 2 1 4 4 0 0 0 0 | OK 275 ms | 1.51 ms | 115.1 | results match |
| 17 | 30 | 64 32 4 16 4 16 4 1 4 4 0 0 0 0 | OK 289 ms | 7.14 ms | 24.3 | results match |
| 18 | 30 | 64 64 4 16 8 16 8 1 4 4 0 0 0 0 | OK 299 ms | 4.71 ms | 36.8 | results match |
| 19 | 30 | 64 128 4 16 16 16 16 1 4 4 0 0 0 0 | OK 307 ms | 11.06 ms | 15.7 | results match |
| 20 | 30 | 64 256 4 16 32 16 32 1 4 4 0 0 0 0 | OK 304 ms | error -54 | - | invalid config. | <-- skipping
| 21 | 30 | 128 16 4 32 2 32 2 1 4 4 0 0 0 0 | OK 292 ms | 21.01 ms | - | L2 error 1.60e-02 | <-- skipping
| 22 | 30 | 128 32 4 32 4 32 4 1 4 4 0 0 0 0 | OK 330 ms | 8.93 ms | - | L2 error 1.60e-02 | <-- skipping
| 23 | 30 | 128 64 4 32 8 32 8 1 4 4 0 0 0 0 | OK 300 ms | 21.38 ms | - | L2 error 1.60e-02 | <-- skipping
| 24 | 30 | 128 128 4 32 16 32 16 1 4 4 0 0 0 0 | OK 328 ms | error -54 | - | invalid config. | <-- skipping
| 25 | 30 | 128 256 4 32 32 32 32 1 4 4 0 0 0 0 | OK 324 ms | error -54 | - | invalid config. | <-- skipping
| 26 | 30 | 256 16 4 64 2 64 2 1 4 4 0 0 0 0 | OK 308 ms | 18.21 ms | - | L2 error 4.63e-02 | <-- skipping
| 27 | 30 | 256 32 4 64 4 64 4 1 4 4 0 0 0 0 | OK 325 ms | 22.91 ms | - | L2 error 4.63e-02 | <-- skipping
| 28 | 30 | 256 64 4 64 8 64 8 1 4 4 0 0 0 0 | compilation error: -6 | - | - | <-- skipping
| 29 | 30 | 256 128 4 64 16 64 16 1 4 4 0 0 0 0 | OK 337 ms | error -54 | - | invalid config. | <-- skipping
| 30 | 30 | 256 256 4 64 32 64 32 1 4 4 0 0 0 0 | OK 290 ms | error -54 | - | invalid config. | <-- skipping
x------x-------x-----------------------------------------------------------------------x----------------x--------------x--------x-------------------x
* Found best result 1.51 ms: 115.1 GFLOPS
* Best parameters: KWG=4 KWI=1 MDIMA=16 MDIMC=16 MWG=64 NDIMB=2 NDIMC=2 NWG=16 PRECISION=32 SA=0 SB=0 STRM=0 STRN=0 VWM=4 VWN=4
* Writing a total of 4 results to 'clblast_xgemm_1_32.json'
* Completed tuning process
After tuning, I updated src/database/kernels/xgemm/xgemm_32.hpp with the above parameter setting & tested using ./clblast_client_xgemm -m 16 -n 200704 -k 27 -alpha 1.00 -beta 0.00 -precision 32
GEMM tuner ==> 115 GFLOPS GEMM client ==> 0.1 GFLOPS
This is just an experimental branch, but still, is this expected behaviour? Please share your thoughts.
@CNugteren For your reference, I found a sample GEMM example in the official Qualcomm Adreno SDK (SDK link).
Thanks both for trying out, very useful!
@sivagnanamn: I'm working on improving compilation. This database.cpp was always a tricky one; I improved things over time, but I will now try to split it up to avoid long compilation times and excessive memory usage.
@sivagnanamn: I wouldn't try other sizes right now for tuning. Given that almost all results report some error, don't trust them. I guess this is because the kernel assumes multiples of 16 or 32 or so, and you have e.g. k=27. That won't work.
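To illustrate the point (this helper is not part of CLBlast), one common workaround for such a kernel is to zero-pad the matrices up to the next multiple of the block size before running GEMM, e.g. k=27 becomes k=32:
#include <stddef.h>

/* Round x up to the next multiple of block, e.g. ceil_to_multiple(27, 16) == 32.
   The padded rows/columns are filled with zeros so the extra multiply-adds
   contribute nothing to the result. */
static size_t ceil_to_multiple(size_t x, size_t block) {
  return ((x + block - 1) / block) * block;
}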
So, about the results now: it seems the Qualcomm-provided kernel I plugged in is not ideal yet. I would want to focus first on the use-case of m=n=k=1024; afterwards we will look at other cases. Also, the numbers reported by the tuner are the most important, since the client might also run other pre/post kernels which we are currently not investigating.
So, at m=n=k=1024, @sivagnanamn reports roughly the same performance for the master branch and the try-out branch (both around 8 GFLOPS). @wang-jinwei even reports slightly decreased performance (40 -> 30 GFLOPS). So it seems the Qualcomm-provided kernel is not the holy grail, although it might be better under certain circumstances (I even saw better performance on my Intel GPU). The conclusion for now is that I'll try to properly integrate this kernel/configuration into the whole CLBlast infrastructure, and if it is useful for a particular device, the tuner will find it.
By the way, is there any reference for what we could expect? The tutorial has a graph with 3 results at m=n=k=1024, with 23 ms for the Adreno 530. That would translate to ~80 GFLOPS if I'm right, so it seems we are a factor of ~2 off.
The next thing I would like to try is using images instead of regular OpenCL memory, as pointed out in the tutorial. I will also look at the official SDK you pointed out above, thanks.
@CNugteren
Thanks for your time again. Actually, I found that the kernel provided by this tutorial can only deliver ~30 GFLOPS with m=n=k=1024 on Adreno 530 in my own project too, despite the ~23 ms (~80 GFLOPS) reported by the author.
But there is a kernel that may achieve ~55 GFLOPS for m=n=k=1024 on Adreno 530 (use global size = {128, 128} and local size = {8, 8}):
__kernel void gemm(int m, int n, int k, __global const float* a, int lda,
__global const float* b, int ldb, __global float* c, int ldc) {
const int local_col=get_local_id(0);
const int local_row=get_local_id(1);
const int group_col=get_group_id(0);
const int group_row=get_group_id(1);
__local float8 asub[64][8];
__local float8 bsub[64][8];
__local float* asub_buf=(__local float*)asub;
__local float* bsub_buf=(__local float*)bsub;
float8 res0=0;
float8 res1=0;
float8 res2=0;
float8 res3=0;
float8 res4=0;
float8 res5=0;
float8 res6=0;
float8 res7=0;
const int tile=k/64;
for(int t=0; t<tile; t++) {
for(int i=0; i<8; i++) {
for(int j=0; j<8; j++) {
asub_buf[(local_row+i*8)*64+(local_col+j*8)]=a[(group_row*64+local_row+i*8)*lda+(t*64+local_col+j*8)];
bsub_buf[(local_row+i*8)*64+(local_col+j*8)]=b[(t*64+local_row+i*8)*ldb+(group_col*64+local_col+j*8)];
}
}
barrier(CLK_LOCAL_MEM_FENCE);
for(int i=0; i<8; i++) {
res0+=asub[local_row+0*8][i].s0*bsub[8*i+0][local_col];
res1+=asub[local_row+1*8][i].s0*bsub[8*i+0][local_col];
res2+=asub[local_row+2*8][i].s0*bsub[8*i+0][local_col];
res3+=asub[local_row+3*8][i].s0*bsub[8*i+0][local_col];
res4+=asub[local_row+4*8][i].s0*bsub[8*i+0][local_col];
res5+=asub[local_row+5*8][i].s0*bsub[8*i+0][local_col];
res6+=asub[local_row+6*8][i].s0*bsub[8*i+0][local_col];
res7+=asub[local_row+7*8][i].s0*bsub[8*i+0][local_col];
res0+=asub[local_row+0*8][i].s1*bsub[8*i+1][local_col];
res1+=asub[local_row+1*8][i].s1*bsub[8*i+1][local_col];
res2+=asub[local_row+2*8][i].s1*bsub[8*i+1][local_col];
res3+=asub[local_row+3*8][i].s1*bsub[8*i+1][local_col];
res4+=asub[local_row+4*8][i].s1*bsub[8*i+1][local_col];
res5+=asub[local_row+5*8][i].s1*bsub[8*i+1][local_col];
res6+=asub[local_row+6*8][i].s1*bsub[8*i+1][local_col];
res7+=asub[local_row+7*8][i].s1*bsub[8*i+1][local_col];
res0+=asub[local_row+0*8][i].s2*bsub[8*i+2][local_col];
res1+=asub[local_row+1*8][i].s2*bsub[8*i+2][local_col];
res2+=asub[local_row+2*8][i].s2*bsub[8*i+2][local_col];
res3+=asub[local_row+3*8][i].s2*bsub[8*i+2][local_col];
res4+=asub[local_row+4*8][i].s2*bsub[8*i+2][local_col];
res5+=asub[local_row+5*8][i].s2*bsub[8*i+2][local_col];
res6+=asub[local_row+6*8][i].s2*bsub[8*i+2][local_col];
res7+=asub[local_row+7*8][i].s2*bsub[8*i+2][local_col];
res0+=asub[local_row+0*8][i].s3*bsub[8*i+3][local_col];
res1+=asub[local_row+1*8][i].s3*bsub[8*i+3][local_col];
res2+=asub[local_row+2*8][i].s3*bsub[8*i+3][local_col];
res3+=asub[local_row+3*8][i].s3*bsub[8*i+3][local_col];
res4+=asub[local_row+4*8][i].s3*bsub[8*i+3][local_col];
res5+=asub[local_row+5*8][i].s3*bsub[8*i+3][local_col];
res6+=asub[local_row+6*8][i].s3*bsub[8*i+3][local_col];
res7+=asub[local_row+7*8][i].s3*bsub[8*i+3][local_col];
res0+=asub[local_row+0*8][i].s4*bsub[8*i+4][local_col];
res1+=asub[local_row+1*8][i].s4*bsub[8*i+4][local_col];
res2+=asub[local_row+2*8][i].s4*bsub[8*i+4][local_col];
res3+=asub[local_row+3*8][i].s4*bsub[8*i+4][local_col];
res4+=asub[local_row+4*8][i].s4*bsub[8*i+4][local_col];
res5+=asub[local_row+5*8][i].s4*bsub[8*i+4][local_col];
res6+=asub[local_row+6*8][i].s4*bsub[8*i+4][local_col];
res7+=asub[local_row+7*8][i].s4*bsub[8*i+4][local_col];
res0+=asub[local_row+0*8][i].s5*bsub[8*i+5][local_col];
res1+=asub[local_row+1*8][i].s5*bsub[8*i+5][local_col];
res2+=asub[local_row+2*8][i].s5*bsub[8*i+5][local_col];
res3+=asub[local_row+3*8][i].s5*bsub[8*i+5][local_col];
res4+=asub[local_row+4*8][i].s5*bsub[8*i+5][local_col];
res5+=asub[local_row+5*8][i].s5*bsub[8*i+5][local_col];
res6+=asub[local_row+6*8][i].s5*bsub[8*i+5][local_col];
res7+=asub[local_row+7*8][i].s5*bsub[8*i+5][local_col];
res0+=asub[local_row+0*8][i].s6*bsub[8*i+6][local_col];
res1+=asub[local_row+1*8][i].s6*bsub[8*i+6][local_col];
res2+=asub[local_row+2*8][i].s6*bsub[8*i+6][local_col];
res3+=asub[local_row+3*8][i].s6*bsub[8*i+6][local_col];
res4+=asub[local_row+4*8][i].s6*bsub[8*i+6][local_col];
res5+=asub[local_row+5*8][i].s6*bsub[8*i+6][local_col];
res6+=asub[local_row+6*8][i].s6*bsub[8*i+6][local_col];
res7+=asub[local_row+7*8][i].s6*bsub[8*i+6][local_col];
res0+=asub[local_row+0*8][i].s7*bsub[8*i+7][local_col];
res1+=asub[local_row+1*8][i].s7*bsub[8*i+7][local_col];
res2+=asub[local_row+2*8][i].s7*bsub[8*i+7][local_col];
res3+=asub[local_row+3*8][i].s7*bsub[8*i+7][local_col];
res4+=asub[local_row+4*8][i].s7*bsub[8*i+7][local_col];
res5+=asub[local_row+5*8][i].s7*bsub[8*i+7][local_col];
res6+=asub[local_row+6*8][i].s7*bsub[8*i+7][local_col];
res7+=asub[local_row+7*8][i].s7*bsub[8*i+7][local_col];
}
barrier(CLK_LOCAL_MEM_FENCE);
}
vstore8(res0, 0, c+(group_row*64+local_row+0*8)*ldc+(group_col*64+local_col*8));
vstore8(res1, 0, c+(group_row*64+local_row+1*8)*ldc+(group_col*64+local_col*8));
vstore8(res2, 0, c+(group_row*64+local_row+2*8)*ldc+(group_col*64+local_col*8));
vstore8(res3, 0, c+(group_row*64+local_row+3*8)*ldc+(group_col*64+local_col*8));
vstore8(res4, 0, c+(group_row*64+local_row+4*8)*ldc+(group_col*64+local_col*8));
vstore8(res5, 0, c+(group_row*64+local_row+5*8)*ldc+(group_col*64+local_col*8));
vstore8(res6, 0, c+(group_row*64+local_row+6*8)*ldc+(group_col*64+local_col*8));
vstore8(res7, 0, c+(group_row*64+local_row+7*8)*ldc+(group_col*64+local_col*8));
}
BTW, the performance on Adreno 530 is really weird. For example, if I remove some code lines from this kernel's multiply-accumulate loop, the performance drops significantly.
@CNugteren @wang-jinwei I came across this GEMM implementation (moskewcz/boda#13) that gave 70 GFLOPS on an Adreno 530.
typedef unsigned uint32_t;
__constant uint32_t const U32_MAX = 0xffffffff;
typedef int int32_t;
#72 "out_0.cl"
kernel void sgemm_simd__K_512__M_512__N_512__Mg_4__Ng_4__Mb_16__Nb_16__Kb_1__Mt_8__Nt_8__prof_variant_0__use_local_mem_2__vw_8( global float const * const a,
global float const * const b,
global float * const c )
{
float c_r[8*8] = {0};
float8 a_r[8/8];
float8 b_r[8/8];
int const a_off_thr = ( (get_group_id(0)/4)*16 + (get_local_id(0)/16) )*8/8*1;
int const b_off_thr = ( (get_group_id(0)%4)*16 + (get_local_id(0)%16) )*8/8*1;
int32_t a_off = a_off_thr;
int32_t b_off = b_off_thr;
for( int32_t k = 0; k < 512; k += 1 ) {
a_r[0] = ((global float8 const *)a)[a_off+0];
b_r[0] = ((global float8 const *)b)[b_off+0];
c_r[0] += a_r[0].s0*b_r[0].s0;
c_r[1] += a_r[0].s0*b_r[0].s1;
c_r[2] += a_r[0].s0*b_r[0].s2;
c_r[3] += a_r[0].s0*b_r[0].s3;
c_r[4] += a_r[0].s0*b_r[0].s4;
c_r[5] += a_r[0].s0*b_r[0].s5;
c_r[6] += a_r[0].s0*b_r[0].s6;
c_r[7] += a_r[0].s0*b_r[0].s7;
c_r[8] += a_r[0].s1*b_r[0].s0;
c_r[9] += a_r[0].s1*b_r[0].s1;
c_r[10] += a_r[0].s1*b_r[0].s2;
c_r[11] += a_r[0].s1*b_r[0].s3;
c_r[12] += a_r[0].s1*b_r[0].s4;
c_r[13] += a_r[0].s1*b_r[0].s5;
c_r[14] += a_r[0].s1*b_r[0].s6;
c_r[15] += a_r[0].s1*b_r[0].s7;
c_r[16] += a_r[0].s2*b_r[0].s0;
c_r[17] += a_r[0].s2*b_r[0].s1;
c_r[18] += a_r[0].s2*b_r[0].s2;
c_r[19] += a_r[0].s2*b_r[0].s3;
c_r[20] += a_r[0].s2*b_r[0].s4;
c_r[21] += a_r[0].s2*b_r[0].s5;
c_r[22] += a_r[0].s2*b_r[0].s6;
c_r[23] += a_r[0].s2*b_r[0].s7;
c_r[24] += a_r[0].s3*b_r[0].s0;
c_r[25] += a_r[0].s3*b_r[0].s1;
c_r[26] += a_r[0].s3*b_r[0].s2;
c_r[27] += a_r[0].s3*b_r[0].s3;
c_r[28] += a_r[0].s3*b_r[0].s4;
c_r[29] += a_r[0].s3*b_r[0].s5;
c_r[30] += a_r[0].s3*b_r[0].s6;
c_r[31] += a_r[0].s3*b_r[0].s7;
c_r[32] += a_r[0].s4*b_r[0].s0;
c_r[33] += a_r[0].s4*b_r[0].s1;
c_r[34] += a_r[0].s4*b_r[0].s2;
c_r[35] += a_r[0].s4*b_r[0].s3;
c_r[36] += a_r[0].s4*b_r[0].s4;
c_r[37] += a_r[0].s4*b_r[0].s5;
c_r[38] += a_r[0].s4*b_r[0].s6;
c_r[39] += a_r[0].s4*b_r[0].s7;
c_r[40] += a_r[0].s5*b_r[0].s0;
c_r[41] += a_r[0].s5*b_r[0].s1;
c_r[42] += a_r[0].s5*b_r[0].s2;
c_r[43] += a_r[0].s5*b_r[0].s3;
c_r[44] += a_r[0].s5*b_r[0].s4;
c_r[45] += a_r[0].s5*b_r[0].s5;
c_r[46] += a_r[0].s5*b_r[0].s6;
c_r[47] += a_r[0].s5*b_r[0].s7;
c_r[48] += a_r[0].s6*b_r[0].s0;
c_r[49] += a_r[0].s6*b_r[0].s1;
c_r[50] += a_r[0].s6*b_r[0].s2;
c_r[51] += a_r[0].s6*b_r[0].s3;
c_r[52] += a_r[0].s6*b_r[0].s4;
c_r[53] += a_r[0].s6*b_r[0].s5;
c_r[54] += a_r[0].s6*b_r[0].s6;
c_r[55] += a_r[0].s6*b_r[0].s7;
c_r[56] += a_r[0].s7*b_r[0].s0;
c_r[57] += a_r[0].s7*b_r[0].s1;
c_r[58] += a_r[0].s7*b_r[0].s2;
c_r[59] += a_r[0].s7*b_r[0].s3;
c_r[60] += a_r[0].s7*b_r[0].s4;
c_r[61] += a_r[0].s7*b_r[0].s5;
c_r[62] += a_r[0].s7*b_r[0].s6;
c_r[63] += a_r[0].s7*b_r[0].s7;
a_off += 1*512/8;
b_off += 1*512/8;
}
int32_t c_off =
((get_group_id(0)/4)*16+(get_local_id(0)/16))*8*512/8 +
((get_group_id(0)%4)*16+(get_local_id(0)%16))*8*1/8;
for( int32_t Mt = 0; Mt < 8; ++Mt ) {
switch(Mt) {
case 0:
b_r[0].s0 = c_r[0];
b_r[0].s1 = c_r[1];
b_r[0].s2 = c_r[2];
b_r[0].s3 = c_r[3];
b_r[0].s4 = c_r[4];
b_r[0].s5 = c_r[5];
b_r[0].s6 = c_r[6];
b_r[0].s7 = c_r[7];
break;
case 1:
b_r[0].s0 = c_r[8];
b_r[0].s1 = c_r[9];
b_r[0].s2 = c_r[10];
b_r[0].s3 = c_r[11];
b_r[0].s4 = c_r[12];
b_r[0].s5 = c_r[13];
b_r[0].s6 = c_r[14];
b_r[0].s7 = c_r[15];
break;
case 2:
b_r[0].s0 = c_r[16];
b_r[0].s1 = c_r[17];
b_r[0].s2 = c_r[18];
b_r[0].s3 = c_r[19];
b_r[0].s4 = c_r[20];
b_r[0].s5 = c_r[21];
b_r[0].s6 = c_r[22];
b_r[0].s7 = c_r[23];
break;
case 3:
b_r[0].s0 = c_r[24];
b_r[0].s1 = c_r[25];
b_r[0].s2 = c_r[26];
b_r[0].s3 = c_r[27];
b_r[0].s4 = c_r[28];
b_r[0].s5 = c_r[29];
b_r[0].s6 = c_r[30];
b_r[0].s7 = c_r[31];
break;
case 4:
b_r[0].s0 = c_r[32];
b_r[0].s1 = c_r[33];
b_r[0].s2 = c_r[34];
b_r[0].s3 = c_r[35];
b_r[0].s4 = c_r[36];
b_r[0].s5 = c_r[37];
b_r[0].s6 = c_r[38];
b_r[0].s7 = c_r[39];
break;
case 5:
b_r[0].s0 = c_r[40];
b_r[0].s1 = c_r[41];
b_r[0].s2 = c_r[42];
b_r[0].s3 = c_r[43];
b_r[0].s4 = c_r[44];
b_r[0].s5 = c_r[45];
b_r[0].s6 = c_r[46];
b_r[0].s7 = c_r[47];
break;
case 6:
b_r[0].s0 = c_r[48];
b_r[0].s1 = c_r[49];
b_r[0].s2 = c_r[50];
b_r[0].s3 = c_r[51];
b_r[0].s4 = c_r[52];
b_r[0].s5 = c_r[53];
b_r[0].s6 = c_r[54];
b_r[0].s7 = c_r[55];
break;
case 7:
b_r[0].s0 = c_r[56];
b_r[0].s1 = c_r[57];
b_r[0].s2 = c_r[58];
b_r[0].s3 = c_r[59];
b_r[0].s4 = c_r[60];
b_r[0].s5 = c_r[61];
b_r[0].s6 = c_r[62];
b_r[0].s7 = c_r[63];
break;
}
((global float8 *)c)[c_off+0] = b_r[0];
c_off += 512/8;
}
}
Thanks for letting us know! I'm planning to work on this again a bit in the coming weeks, I'll try to cover all the above kernels within CLBlast such that they will be automatically considered when running the tuners.
@wang-jinwei "Actually, I found that the kernel provided by this tutorial can just deliver ~30 GFLOPS with m=n=k=1024 and Adreno 530 in my own project too, despite the ~23ms (~80 GFLOPS) reported by the author." I use their kernel and I can get ~28.5ms on Adreno 530 at 510MHz. If they use Adreno 530 at 624MHz it should be ~23.3ms. So their result repeatable.
The following kernel uses global size = {1024 / 4, 1024 / 8} and local size = {32, 16}, with M_SIZE = 1024:
__kernel void SGEMM(__global float* A,
__read_only image2d_t B,
__global float* C,
const int M_SIZE)
{
int gx = get_global_id(0);
int gy = get_global_id(1) << 3;
if ( (gx << 2) >= M_SIZE || gy >= M_SIZE ) {
return;
}
float4 a[8];
float4 b[4];
float4 c[8];
#pragma unroll
for (int i = 0; i < 8; i++)
{
c[i] = 0.0f;
}
int A_y_off = gy * M_SIZE;
for (int pos = 0; pos < M_SIZE; pos += 4)
{
int A_off = A_y_off + pos;
#pragma unroll
for (int i = 0; i < 4; i++)
{
b[i] = read_imagef(B, (int2)(gx, pos + i));
}
#pragma unroll
for (int i = 0; i < 8; i++)
{
a[i] = vload4(0, A + A_off);
A_off += M_SIZE;
}
#pragma unroll
for (int i = 0; i < 8; i++)
{
c[i] += a[i].x * b[0] + a[i].y * b[1] + a[i].z * b[2] + a[i].w * b[3];
}
}
#pragma unroll
for (int i = 0; i < 8; i++)
{
int C_offs = (gy + i) * M_SIZE + (gx << 2);
vstore4(c[i], 0, C + C_offs);
}
}
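For completeness, a hedged host-side sketch of how this kernel could be driven (this is my own illustration, not code from the tutorial or from the comment above): matrix B is packed into a CL_RGBA/CL_FLOAT image of width M_SIZE/4 and height M_SIZE so that each texel holds 4 consecutive floats of one row, and the kernel is launched with the global/local sizes given above. Error handling is omitted; ctx, queue and program are assumed to exist already, and clCreateImage requires OpenCL 1.2 (on a 1.1 device clCreateImage2D would be used instead).
#include <CL/cl.h>

void run_sgemm_image(cl_context ctx, cl_command_queue queue, cl_program program,
                     const float* A, const float* B, float* C, int M_SIZE) {
  cl_int err = CL_SUCCESS;

  // Pack B into a 2D image: (M_SIZE/4) texels wide, M_SIZE texels high,
  // each RGBA/FLOAT texel holding 4 consecutive floats of one row of B.
  cl_image_format fmt = { CL_RGBA, CL_FLOAT };
  cl_image_desc desc = { 0 };
  desc.image_type = CL_MEM_OBJECT_IMAGE2D;
  desc.image_width = (size_t)M_SIZE / 4;
  desc.image_height = (size_t)M_SIZE;
  cl_mem b_img = clCreateImage(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                               &fmt, &desc, (void*)B, &err);

  cl_mem a_buf = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                sizeof(float) * M_SIZE * M_SIZE, (void*)A, &err);
  cl_mem c_buf = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY,
                                sizeof(float) * M_SIZE * M_SIZE, NULL, &err);

  cl_kernel kernel = clCreateKernel(program, "SGEMM", &err);
  clSetKernelArg(kernel, 0, sizeof(cl_mem), &a_buf);
  clSetKernelArg(kernel, 1, sizeof(cl_mem), &b_img);
  clSetKernelArg(kernel, 2, sizeof(cl_mem), &c_buf);
  clSetKernelArg(kernel, 3, sizeof(int), &M_SIZE);

  // Global/local sizes as stated above: {M/4, M/8} and {32, 16}.
  size_t global[2] = { (size_t)M_SIZE / 4, (size_t)M_SIZE / 8 };
  size_t local[2] = { 32, 16 };
  clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global, local, 0, NULL, NULL);
  clEnqueueReadBuffer(queue, c_buf, CL_TRUE, 0, sizeof(float) * M_SIZE * M_SIZE,
                      C, 0, NULL, NULL);

  clReleaseKernel(kernel);
  clReleaseMemObject(b_img);
  clReleaseMemObject(a_buf);
  clReleaseMemObject(c_buf);
}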
@sivagnanamn About theoretical performance: I don't think you can rely on the Adreno Wiki. It doesn't reference any official Qualcomm source, and I wasn't able to find any official information about the latest (series 4xx, 5xx) Adreno GPUs on their site. moskewcz mentioned that their microbenchmark can achieve ~256 GFLOPS, and I can achieve ~230 GFLOPS with my own microbenchmark on an Adreno 530 at 510 MHz. When I run this benchmark, the Snapdragon Profiler shows ~196% ALU utilization (I don't know why most of their metrics are limited by 200% and not 100%). I also have a microbenchmark for half precision, where I can get 430 GFLOPS (ALU utilization ~180% according to the Snapdragon Profiler).
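For context, a minimal sketch of the kind of ALU microbenchmark being described (this is not the posters' actual benchmark): a long dependent chain of mad() operations on float8 so the kernel stays ALU-bound, with GFLOPS estimated on the host as global_size * ITERS * 8 lanes * 2 flops per mad, divided by the measured time.
#define ITERS 4096

__kernel void peak_flops(__global float* out, const float seed) {
  float8 x = (float8)(seed);
  float8 y = (float8)(0.9999f);
  for (int i = 0; i < ITERS; i++) {
    x = mad(x, y, y);  // 8 lanes * 2 flops per iteration, dependent chain
  }
  // Write a value derived from x so the compiler cannot remove the loop.
  out[get_global_id(0)] = x.s0 + x.s1 + x.s2 + x.s3 + x.s4 + x.s5 + x.s6 + x.s7;
}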
@sivagnanamn You mentioned these numbers: "GEMM tuner ==> 115 GFLOPS GEMM client ==> 0.1 GFLOPS". Can you share the GEMM code where you achieve 115 GFLOPS?
"GEMM tuner ==> 115 GFLOPS GEMM client ==> 0.1 GFLOPS" Can you share code of GEMM where you can achieve 115GFLOPS?
@roserg I got those results while trying out the adreno_tryout branch of CLBlast.
@roserg I'm using an Adreno 330 device with Linaro OS for my experiments. I couldn't use the Snapdragon Profiler with my device (no Android OS). All the matrices that I'm dealing with are rectangular, for example: -m 16 -n 200704 -k 27.
Using image2d_t comes with max width & height restrictions (it supports at most 4096x4096 on Adreno 330 with the OpenCL 1.1 Embedded profile), so I cannot use the GEMM sample provided in the Snapdragon blog for such matrices.
Have you come across any kernels that can handle such rectangular matrices better?
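(For reference, a hedged host-side sketch, not from CLBlast, of how the image2d_t limits mentioned above can be queried on a device:)
#include <CL/cl.h>
#include <stdio.h>

void print_image_limits(cl_device_id device) {
  size_t max_w = 0, max_h = 0;
  clGetDeviceInfo(device, CL_DEVICE_IMAGE2D_MAX_WIDTH, sizeof(max_w), &max_w, NULL);
  clGetDeviceInfo(device, CL_DEVICE_IMAGE2D_MAX_HEIGHT, sizeof(max_h), &max_h, NULL);
  printf("image2d_t limits: %zu x %zu\n", max_w, max_h);  // e.g. 4096 x 4096 on Adreno 330
}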
@sivagnanamn I'm not really interested in SGEMM; I want to do convolutions on Adreno GPUs. I know that a convolution can be calculated as a matrix multiplication, so I searched for good samples in this area. You can try changing the data layout used to represent your matrices so they fit within 4096x4096. Or you can try the sample from @moskewcz; it has the same performance without textures. But you have interesting dimensions, and I think it can be optimized for this particular case.
So you got 115 GFLOPS on an Adreno 330 for these particular dimensions (-m 16 -n 200704 -k 27), am I right? Can you also tell me the frequency?
@roserg 578 MHz, but as @CNugteren suggested, I wouldn't trust the 115 GFLOPS. The adreno_tryout branch isn't fully implemented; it's still at the trial & error stage, so we have to wait until it's stable.
Indeed, in the adreno_tryout branch the kernel assumes multiples of 16 or 32 or so, and you have e.g. k=27. That won't work. I hope to work on this again soon.
For the record, I've been working on this issue over the last week. Basically I took the kernel from the adreno_tryout branch and am generalising it with the regular CLBlast tuning parameters, making it possible to integrate it with the current GEMM kernel so that there are not 2 kernels, but rather one kernel with new tuning possibilities.
It is not an easy task, but the end is in sight :-) Soon I will ask anyone who's interested to try out some branch.
As I said above, I took the kernel from the adreno_tryout branch and have fully integrated it into CLBlast. That means it is also tuneable with many of the same parameters as the regular kernel, so there is some potential now for Adreno GPUs.
Could you have a try? The branch is CLBlast-228-2d-register-gemm-kernel and you should just be able to run the tuner as usual. It will first tune the regular kernel (a fixed set of parameters followed by a random subset of parameters) and then the new kernel (again first a fixed set and then a random set). I'm curious to see what the results in terms of GFLOPS are for the tuner with the new kernel on your devices.
Getting seg fault during tuning.
| 307 | 370 | 0 64 64 32 8 8 8 8 2 1 1 0 0 0 0 1 | OK 447 ms | 2117.35 ms | 1.0 | results match |
| 308 | 370 | 0 64 64 32 8 8 8 8 2 1 2 0 0 0 0 1 | OK 404 ms | 1964.66 ms | 1.1 | results match |
| 309 | 370 | 0 64 64 32 8 8 8 8 2 1 4 0 0 0 0 1 | OK 500 ms | 2294.15 ms | 0.9 | results match |
| 310 | 370 | 0 64 64 32 8 8 8 8 2 2 1 0 0 0 0 1 | OK 336 ms | 3460.13 ms | 0.6 | results match |
| 311 | 370 | 0 64 64 32 8 8 8 8 2 2 2 0 0 0 0 1 | OK 356 ms | 3037.90 ms | 0.7 | results match |
| 312 | 370 | 0 64 64 32 8 8 8 8 2 2 4 0 0 0 0 1 | OK 405 ms | 4484.14 ms | 0.5 | results match |
Segmentation fault
OK, but that's perhaps an issue on your device/platform? I'll continue development a bit (some related tests still fail: SYRK/HERK) and then I'll test on some other platforms as well. Let's test again afterwards.
I have completely finished the implementation and also tested on another machine; no issues seen so far. I'll soon merge the branch into master.
Did you see good performance while tuning so far? Otherwise you can try to limit the search space of the tuning parameters here to remove the configurations that might fail, e.g. remove MWG==64 from the search space.
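To make the suggestion concrete (this is only an illustration of the shape of the edit; the actual contents of src/tuning/kernels/xgemm.hpp differ between CLBlast versions), the tuner's search space is a list of parameter names with candidate values, and "removing MWG==64" simply means deleting 64 from that list:
#include <string>
#include <utility>
#include <vector>

using ParameterList = std::vector<std::pair<std::string, std::vector<size_t>>>;

// Hypothetical, trimmed search space: 64 (and larger) removed from the
// work-group related parameters so the failing configurations are never
// generated by the tuner.
ParameterList XgemmSearchSpaceSketch() {
  return {
    {"MWG", {16, 32}},
    {"NWG", {16, 32}},
    {"KWG", {16, 32}},
    {"MDIMC", {8, 16}},
    {"NDIMC", {8, 16}},
    {"VWM", {1, 2, 4}},
    {"VWN", {1, 2, 4}},
  };
}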
FYI, this branch is now merged into master.
My Intel results from this change (if relevant) are here.
Thank you for your time @CNugteren .
I did try removing MWG==64 from src/tuning/kernels/xgemm.hpp, but tuning still failed and caused a device reboot. I've removed every option > 32 and started tuning again. I'll share the results once it is completed.
Can anyone explain why the Adreno OpenCL GPU performance is so bad?
Can I increase the Adreno OpenCL GPU performance? The reported Adreno OpenCL GPU frequency is always only 1 MHz.
Snapdragon 865 with Adreno 650 GPU.
Your question has no relation to CLBlast, but is more a question about your system. Before concluding too much, consider that it might just be a display issue of the clinfo tool (or of the OpenCL driver installed on your system). Most likely everything is just fast.
Is it possible to optimise the Adreno OpenCL kernel? I just got a reply from Qualcomm:
https://developer.qualcomm.com/forum/qdn-forums/software/adreno-gpu-sdk/68728
Is it possible to optimise the Adreno OpenCL kernel? I just got a reply from Qualcomm: https://developer.qualcomm.com/forum/qdn-forums/software/adreno-gpu-sdk/68728
Thanks for sharing. Unfortunately I don't have time myself, but I'm happy to review any pull requests made.
I'm closing this issue for now thanks to changes made in https://github.com/CNugteren/CLBlast/pull/451 and https://github.com/CNugteren/CLBlast/pull/452. If any particular issue shows up for a specific device, feel free to open an issue specific for it.