CLBlast
Sub-optimal performance on Qualcomm Adreno GPUs
As reported by sivagnanamn in issue #181, performance on Qualcomm Adreno GPUs is far from the theoretical maximum. This should be addressed by comparing reference GEMM implementations on Adreno hardware with the design of CLBlast. One example is a Qualcomm tutorial (part 1 and part 2), but not much else seems to be available.
Any links to fast OpenCL implementations on Adreno GPUs are welcome here. When I have some time I'll try to get hold of such a device and try some things out.
I've added a test branch (adreno_tryout) in CLBlast to test the Qualcomm-provided kernel from the tutorial mentioned above. This is a very hacky integration of that kernel and is by no means meant to be actually used. However, it is there to find out whether that kernel fixes the performance issues with CLBlast. If so, I can work towards integrating such a kernel properly. If not, we'll have to continue investigating. One thing suggested by the tutorial is using an OpenCL image for matrix B, but I didn't implement that.
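For illustration only (this is not CLBlast or tutorial code, and fetch_b_demo, Bbuf and Bimg are made-up names): a minimal OpenCL C kernel contrasting the two ways of fetching matrix B, a plain buffer load versus a samplerless 2D image read as used in the image-based kernels further down this thread. Bimg is assumed to be a CL_RGBA/CL_FLOAT image of width n/4 and height k, and Bbuf a row-major k x n buffer.
__kernel void fetch_b_demo(__global const float* Bbuf,
                           __read_only image2d_t Bimg,
                           __global float4* out,
                           const int n) {
  const int col4 = get_global_id(0);  // which group of 4 consecutive floats in the row
  const int row  = get_global_id(1);  // which row of B
  const float4 via_buffer = vload4(col4, Bbuf + row * n);          // regular global-memory path
  const float4 via_image  = read_imagef(Bimg, (int2)(col4, row));  // texture path (OpenCL 1.2 samplerless read)
  out[row * (n / 4) + col4] = via_buffer + via_image;              // keep both reads live
}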
The branch adreno_tryout contains the Qualcomm-provided kernel and also a modified tuner to tune the local workgroup size. Could anyone with an Adreno GPU (e.g. @sivagnanamn) please do the following for me:
- First test performance with the latest master branch for reference, e.g. ./clblast_client_xgemm -m 256 -n 256 -k 256 -num_steps 4 -step 256
- Check out adreno_tryout and run ./clblast_test_xgemm to see if everything works OK.
- Run the Qualcomm-specific tuner: ./clblast_tuner_xgemm and share the output here.
- Modify the numbers in src/database/kernels/xgemm/xgemm_32.hpp according to the output of the tuner.
- Test performance again using ./clblast_client_xgemm and compare with what you had before.
Notes: this branch is currently single-precision FP32 only and assumes alpha=1 and beta=0.
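As a side note for anyone wiring this into their own code, a hedged sketch (not part of the branch) of how this restricted case (FP32, alpha=1, beta=0) maps onto a regular GEMM call through CLBlast's C API. Buffer setup and error handling are omitted, and a_mem, b_mem and c_mem are assumed to be valid cl_mem objects holding row-major m x k, k x n and m x n floats.
#include <clblast_c.h>

CLBlastStatusCode gemm_alpha1_beta0(cl_mem a_mem, cl_mem b_mem, cl_mem c_mem,
                                    size_t m, size_t n, size_t k,
                                    cl_command_queue queue) {
  const float alpha = 1.0f;  /* the try-out kernel assumes alpha = 1 */
  const float beta = 0.0f;   /* ... and beta = 0 */
  return CLBlastSgemm(CLBlastLayoutRowMajor,
                      CLBlastTransposeNo, CLBlastTransposeNo,
                      m, n, k,
                      alpha,
                      a_mem, 0, k,   /* lda = k for row-major A (m x k) */
                      b_mem, 0, n,   /* ldb = n for row-major B (k x n) */
                      beta,
                      c_mem, 0, n,   /* ldc = n for row-major C (m x n) */
                      &queue, NULL);
}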
@CNugteren Thank you for your time.
I'm getting a memory error while building the latest master with -DTUNERS=ON & -DCLIENTS=ON.
root@linaro:/home/linaro/CLBlast/build# make
-- Building CLBlast with OpenCL API (default)
-- Configuring done
-- Generating done
-- Build files have been written to: /home/linaro/CLBlast/build
Scanning dependencies of target clblast
[ 0%] Building CXX object CMakeFiles/clblast.dir/src/database/database.cpp.o
cc1plus: out of memory allocating 1126128 bytes after a total of 102879232 bytes
make[2]: *** [CMakeFiles/clblast.dir/src/database/database.cpp.o] Error 1
make[1]: *** [CMakeFiles/clblast.dir/all] Error 2
make: *** [all] Error 2
Available RAM (before "make"):
root@linaro-developer:/home/linaro/CLBlast/build# free -m
total used free shared buffers cached
Mem: 1626 98 1527 0 2 15
-/+ buffers/cache: 80 1546
Available HDD space (before "make"):
root@linaro:/home/linaro/CLBlast/build# df -h
Size Used Avail Use% Mounted on
7.9G 5.6G 2.3G 71% /
I'm currently checking with an older commit (249bdaa8e9a111573f5c3a821230bba6437817c7, Dec 19, 2017). I'll share the output ASAP.
@CNugteren
First test performance with the latest master branch for reference, e.g. ./clblast_client_xgemm -m 256 -n 256 -k 256 -num_steps 4 -step 256
| <-- CLBlast --> | <-- clBLAS --> | <-- CPU BLAS --> |
m; n; k; layout; transA; transB; lda; ldb; ldc; offa; offb; offc; alpha; beta; ms_1; GFLOPS_1; GBs_1; ms_2; GFLOPS_2; GBs_2; ms_3; GFLOPS_3; GBs_3
256; 256; 256; 101; 111; 111; 256; 256; 256; 0; 0; 0; 2.00; 2.00; 24.93; 1.3; 0.0; 35.19; 1.0; 0.0; 31.43; 1.1; 0.0
512; 512; 512; 101; 111; 111; 512; 512; 512; 0; 0; 0; 2.00; 2.00; 36.69; 7.3; 0.1; 238.70; 1.1; 0.0; 253.38; 1.1; 0.0
768; 768; 768; 101; 111; 111; 768; 768; 768; 0; 0; 0; 2.00; 2.00; 131.05; 6.9; 0.1; 818.74; 1.1; 0.0; 874.82; 1.0; 0.0
1K; 1K; 1K; 101; 111; 111; 1K; 1K; 1K; 0; 0; 0; 2.00; 2.00; 253.07; 8.5; 0.1; 1874.76; 1.1; 0.0; 2319.10; 0.9; 0.0
Check out adreno_tryout and run ./clblast_test_xgemm to see if everything works OK.
* Completed all test-cases for this routine. Results:
234 test(s) passed
278 test(s) skipped
0 test(s) failed
Run the Qualcomm-specific tuner: ./clblast_tuner_xgemm and share the output here.
* Options given/available:
-platform 0 [=default]
-device 0 [=default]
-precision 32 (single) [=default]
-m 1024 [=default]
-n 1024 [=default]
-k 1024 [=default]
-alpha 2.00 [=default]
-beta 2.00 [=default]
-fraction 1.00 [=default]
-runs 2 [=default]
-max_l2_norm 0.00 [=default]
* Found best result 276.22 ms: 7.8 GFLOPS
* Best parameters: KWG=4 KWI=1 MDIMA=8 MDIMC=8 MWG=32 NDIMB=4 NDIMC=4 NWG=32 PRECISION=32 SA=0 SB=0 STRM=0 STRN=0 VWM=4 VWN=4
Qualcomm XGEMM Tuner JSON output ==> clblast_xgemm_1_32.json.zip
Modify the numbers in src/database/kernels/xgemm/xgemm_32.hpp according to the output of the tuner. Test performance again using ./clblast_client_xgemm and compare with what you had before.
| <-- CLBlast --> | <-- clBLAS --> | <-- CPU BLAS --> |
m; n; k; layout; transA; transB; lda; ldb; ldc; offa; offb; offc; ms_1; GFLOPS_1; GBs_1; ms_2; GFLOPS_2; GBs_2; ms_3; GFLOPS_3; GBs_3
256; 256; 256; 101; 111; 111; 256; 256; 256; 0; 0; 0; 7.24; 4.6; 0.1; 31.69; 1.1; 0.0; 31.43; 1.1; 0.0
512; 512; 512; 101; 111; 111; 512; 512; 512; 0; 0; 0; 36.15; 7.4; 0.1; 248.76; 1.1; 0.0; 280.43; 1.0; 0.0
768; 768; 768; 101; 111; 111; 768; 768; 768; 0; 0; 0; 109.09; 8.3; 0.1; 848.87; 1.1; 0.0; 1120.44; 0.8; 0.0
1K; 1K; 1K; 101; 111; 111; 1K; 1K; 1K; 0; 0; 0; 274.32; 7.8; 0.1; 1960.96; 1.1; 0.0; 3284.81; 0.7; 0.0
Overall performance improvement:
With the adreno_tryout branch, I got a ~150-200 ms improvement in the overall execution speed of my DNN inference engine.
Ex: ./clblast_client_xgemm -m 32 -n 50176 -k 144 -alpha 1.00 -beta 0.00 -precision 32
master branch (249bdaa8e9a111573f5c3a821230bba6437817c7)
| <-- CLBlast --> | <-- clBLAS --> | <-- CPU BLAS --> |
m; n; k; layout; transA; transB; lda; ldb; ldc; offa; offb; offc; ms_1; GFLOPS_1; GBs_1; ms_2; GFLOPS_2; GBs_2; ms_3; GFLOPS_3; GBs_3
32; 49K; 144; 101; 111; 111; 144; 49K; 49K; 0; 0; 0; 187.26; 2.5; 0.2; 482.41; 1.0; 0.1; 633.15; 0.7; 0.1
adreno_tryout branch
| <-- CLBlast --> | <-- clBLAS --> | <-- CPU BLAS --> |
m; n; k; layout; transA; transB; lda; ldb; ldc; offa; offb; offc; alpha; beta; ms_1; GFLOPS_1; GBs_1; ms_2; GFLOPS_2; GBs_2; ms_3; GFLOPS_3; GBs_3
32; 49K; 144; 101; 111; 111; 144; 49K; 49K; 0; 0; 0; 1.00; 0.00; 51.29; 9.0; 0.8; 489.78; 0.9; 0.1; 830.61; 0.6; 0.1
Thanks again for your time.
@CNugteren
Actually, the master branch (bd540829ea1954c3b367ec70aa8c5811b643422b) works better than the adreno_tryout branch on my Xiaomi 5 with Snapdragon 820 and Adreno 530 GPU. With the master branch, clblast_tuner_xgemm reports about 42 GFLOPS for the default setting, while with the adreno_tryout branch the number is 31.2 GFLOPS.
@wang-jinwei Thank you very much for sharing the GFLOPS details.
GPU | Theoretical GFLOPS | CLBlast GFLOPS
---|---|---
Adreno 330 | 166.5 (578 MHz) | 7~10
Adreno 530 | 519 (650 MHz) | 42
Source: Adreno Wiki
So I believe there's still scope for improvement. Thanks to @CNugteren for the efforts to reduce this gap.
Please correct me if I'm wrong.
Using the current adreno_tryout branch, I tried to tune GEMM for a custom MNK size (results shown below):
The best tuner configuration gave 115 GFLOPS at 1.51ms.
./clblast_tuner_xgemm -m 16 -n 200704 -k 27 -alpha 1.00 -beta 0.00 -precision 32
| ID | total | param | compiles | time | GFLOPS | status |
x------x-------x-----------------------------------------------------------------------x----------------x--------------x--------x-------------------x
| ref | - | - | OK | 35.66 ms | - | reference OK |
x------x-------x-----------------------------------------------------------------------x----------------x--------------x--------x-------------------x
| 1 | 30 | 8 16 4 2 2 2 2 1 4 4 0 0 0 0 | OK 270 ms | 2.29 ms | - | L2 error 1.22e-02 | <-- skipping
| 2 | 30 | 8 32 4 2 4 2 4 1 4 4 0 0 0 0 | OK 312 ms | 1.31 ms | - | L2 error 1.22e-02 | <-- skipping
| 3 | 30 | 8 64 4 2 8 2 8 1 4 4 0 0 0 0 | OK 311 ms | 0.81 ms | - | L2 error 1.22e-02 | <-- skipping
| 4 | 30 | 8 128 4 2 16 2 16 1 4 4 0 0 0 0 | OK 289 ms | 0.57 ms | - | L2 error 1.22e-02 | <-- skipping
| 5 | 30 | 8 256 4 2 32 2 32 1 4 4 0 0 0 0 | OK 330 ms | 2.61 ms | - | L2 error 1.22e-02 | <-- skipping
| 6 | 30 | 16 16 4 4 2 4 2 1 4 4 0 0 0 0 | OK 271 ms | 1.30 ms | - | L2 error 1.22e-02 | <-- skipping
| 7 | 30 | 16 32 4 4 4 4 4 1 4 4 0 0 0 0 | OK 273 ms | 0.80 ms | - | L2 error 1.22e-02 | <-- skipping
| 8 | 30 | 16 64 4 4 8 4 8 1 4 4 0 0 0 0 | OK 291 ms | 0.65 ms | - | L2 error 1.22e-02 | <-- skipping
| 9 | 30 | 16 128 4 4 16 4 16 1 4 4 0 0 0 0 | OK 284 ms | 2.76 ms | - | L2 error 1.22e-02 | <-- skipping
| 10 | 30 | 16 256 4 4 32 4 32 1 4 4 0 0 0 0 | OK 292 ms | 1.88 ms | - | L2 error 1.22e-02 | <-- skipping
| 11 | 30 | 32 16 4 8 2 8 2 1 4 4 0 0 0 0 | OK 273 ms | 1.27 ms | - | L2 error 8.36e-03 | <-- skipping
| 12 | 30 | 32 32 4 8 4 8 4 1 4 4 0 0 0 0 | OK 277 ms | 0.88 ms | - | L2 error 8.36e-03 | <-- skipping
| 13 | 30 | 32 64 4 8 8 8 8 1 4 4 0 0 0 0 | OK 311 ms | 5.48 ms | - | L2 error 8.36e-03 | <-- skipping
| 14 | 30 | 32 128 4 8 16 8 16 1 4 4 0 0 0 0 | OK 291 ms | 3.51 ms | - | L2 error 8.36e-03 | <-- skipping
| 15 | 30 | 32 256 4 8 32 8 32 1 4 4 0 0 0 0 | OK 324 ms | 3.69 ms | - | L2 error 8.36e-03 | <-- skipping
| 16 | 30 | 64 16 4 16 2 16 2 1 4 4 0 0 0 0 | OK 275 ms | 1.51 ms | 115.1 | results match |
| 17 | 30 | 64 32 4 16 4 16 4 1 4 4 0 0 0 0 | OK 289 ms | 7.14 ms | 24.3 | results match |
| 18 | 30 | 64 64 4 16 8 16 8 1 4 4 0 0 0 0 | OK 299 ms | 4.71 ms | 36.8 | results match |
| 19 | 30 | 64 128 4 16 16 16 16 1 4 4 0 0 0 0 | OK 307 ms | 11.06 ms | 15.7 | results match |
| 20 | 30 | 64 256 4 16 32 16 32 1 4 4 0 0 0 0 | OK 304 ms | error -54 | - | invalid config. | <-- skipping
| 21 | 30 | 128 16 4 32 2 32 2 1 4 4 0 0 0 0 | OK 292 ms | 21.01 ms | - | L2 error 1.60e-02 | <-- skipping
| 22 | 30 | 128 32 4 32 4 32 4 1 4 4 0 0 0 0 | OK 330 ms | 8.93 ms | - | L2 error 1.60e-02 | <-- skipping
| 23 | 30 | 128 64 4 32 8 32 8 1 4 4 0 0 0 0 | OK 300 ms | 21.38 ms | - | L2 error 1.60e-02 | <-- skipping
| 24 | 30 | 128 128 4 32 16 32 16 1 4 4 0 0 0 0 | OK 328 ms | error -54 | - | invalid config. | <-- skipping
| 25 | 30 | 128 256 4 32 32 32 32 1 4 4 0 0 0 0 | OK 324 ms | error -54 | - | invalid config. | <-- skipping
| 26 | 30 | 256 16 4 64 2 64 2 1 4 4 0 0 0 0 | OK 308 ms | 18.21 ms | - | L2 error 4.63e-02 | <-- skipping
| 27 | 30 | 256 32 4 64 4 64 4 1 4 4 0 0 0 0 | OK 325 ms | 22.91 ms | - | L2 error 4.63e-02 | <-- skipping
| 28 | 30 | 256 64 4 64 8 64 8 1 4 4 0 0 0 0 | compilation error: -6 | - | - | <-- skipping
| 29 | 30 | 256 128 4 64 16 64 16 1 4 4 0 0 0 0 | OK 337 ms | error -54 | - | invalid config. | <-- skipping
| 30 | 30 | 256 256 4 64 32 64 32 1 4 4 0 0 0 0 | OK 290 ms | error -54 | - | invalid config. | <-- skipping
x------x-------x-----------------------------------------------------------------------x----------------x--------------x--------x-------------------x
* Found best result 1.51 ms: 115.1 GFLOPS
* Best parameters: KWG=4 KWI=1 MDIMA=16 MDIMC=16 MWG=64 NDIMB=2 NDIMC=2 NWG=16 PRECISION=32 SA=0 SB=0 STRM=0 STRN=0 VWM=4 VWN=4
* Writing a total of 4 results to 'clblast_xgemm_1_32.json'
* Completed tuning process
After tuning, I updated src/database/kernels/xgemm/xgemm_32.hpp with the above parameter setting & tested using ./clblast_client_xgemm -m 16 -n 200704 -k 27 -alpha 1.00 -beta 0.00 -precision 32
GEMM tuner ==> 115 GFLOPS GEMM client ==> 0.1 GFLOPS
This is just an experimental branch, but still, is this expected behaviour? Please share your thoughts.
@CNugteren For your reference, I found a sample GEMM example in the official Qualcomm Adreno SDK (SDK link).
Thanks both for trying out, very useful!
@sivagnanamn: I'm working on improving compilation. This database.cpp was always a tricky one; I improved things over time, but I will now try to split it up to avoid long compilation times and excessive memory usage.
@sivagnanamn: I wouldn't try other sizes right now for tuning. Given that almost all results report some error, don't trust them. I guess this is because the kernel assumes multiples of 16 or 32 or so, and you have e.g. k=27. That won't work.
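To illustrate the point (this helper is not part of CLBlast), one common workaround for such a kernel is to zero-pad the matrices up to the next multiple of the block size before running GEMM, e.g. k=27 becomes k=32:
#include <stddef.h>

/* Round x up to the next multiple of block, e.g. ceil_to_multiple(27, 16) == 32.
   The padded rows/columns are filled with zeros so the extra multiply-adds
   contribute nothing to the result. */
static size_t ceil_to_multiple(size_t x, size_t block) {
  return ((x + block - 1) / block) * block;
}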
So, about the results now: it seems the Qualcomm-provided kernel I plugged in is not ideal yet. I would want to focus first on the use-case of m=n=k=1024; afterwards we will look at other cases. Also, the numbers reported by the tuner are the most important, since the client might also run other pre/post kernels which we are currently not investigating.
So, at m=n=k=1024, @sivagnanamn reports roughly the same performance for the master branch and the try-out branch (both around 8 GFLOPS). @wang-jinwei even reports slightly decreased performance (40 -> 30 GFLOPS). So it seems the Qualcomm-provided kernel is not the holy grail, although it might be better under certain circumstances (I even saw better performance on my Intel GPU). The conclusion for now is that I'll try to properly integrate this kernel/configuration into the whole CLBlast infrastructure, and if it is useful for a particular device, the tuner will find it.
By the way, is there any reference for what we could expect? The tutorial has a graph with 3 results at m=n=k=1024, with 23 ms for the Adreno 530. That would translate to ~80 GFLOPS if I'm right, so it seems we are a factor of ~2 off.
The next thing I would like to try is using images instead of regular OpenCL memory, as pointed out in the tutorial. I will also look at the official SDK you pointed out above, thanks.
@CNugteren
Thanks for your time again. Actually, I found that the kernel provided by this tutorial can only deliver ~30 GFLOPS with m=n=k=1024 on Adreno 530 in my own project too, despite the ~23 ms (~80 GFLOPS) reported by the author.
But there is a kernel that may achieve ~55 GFLOPS for m=n=k=1024 on Adreno 530 (use global size = {128, 128} and local size = {8, 8}):
__kernel void gemm(int m, int n, int k, __global const float* a, int lda,
__global const float* b, int ldb, __global float* c, int ldc) {
const int local_col=get_local_id(0);
const int local_row=get_local_id(1);
const int group_col=get_group_id(0);
const int group_row=get_group_id(1);
__local float8 asub[64][8];
__local float8 bsub[64][8];
__local float* asub_buf=(__local float*)asub;
__local float* bsub_buf=(__local float*)bsub;
float8 res0=0;
float8 res1=0;
float8 res2=0;
float8 res3=0;
float8 res4=0;
float8 res5=0;
float8 res6=0;
float8 res7=0;
const int tile=k/64;
for(int t=0; t<tile; t++) {
for(int i=0; i<8; i++) {
for(int j=0; j<8; j++) {
asub_buf[(local_row+i*8)*64+(local_col+j*8)]=a[(group_row*64+local_row+i*8)*lda+(t*64+local_col+j*8)];
bsub_buf[(local_row+i*8)*64+(local_col+j*8)]=b[(t*64+local_row+i*8)*ldb+(group_col*64+local_col+j*8)];
}
}
barrier(CLK_LOCAL_MEM_FENCE);
for(int i=0; i<8; i++) {
res0+=asub[local_row+0*8][i].s0*bsub[8*i+0][local_col];
res1+=asub[local_row+1*8][i].s0*bsub[8*i+0][local_col];
res2+=asub[local_row+2*8][i].s0*bsub[8*i+0][local_col];
res3+=asub[local_row+3*8][i].s0*bsub[8*i+0][local_col];
res4+=asub[local_row+4*8][i].s0*bsub[8*i+0][local_col];
res5+=asub[local_row+5*8][i].s0*bsub[8*i+0][local_col];
res6+=asub[local_row+6*8][i].s0*bsub[8*i+0][local_col];
res7+=asub[local_row+7*8][i].s0*bsub[8*i+0][local_col];
res0+=asub[local_row+0*8][i].s1*bsub[8*i+1][local_col];
res1+=asub[local_row+1*8][i].s1*bsub[8*i+1][local_col];
res2+=asub[local_row+2*8][i].s1*bsub[8*i+1][local_col];
res3+=asub[local_row+3*8][i].s1*bsub[8*i+1][local_col];
res4+=asub[local_row+4*8][i].s1*bsub[8*i+1][local_col];
res5+=asub[local_row+5*8][i].s1*bsub[8*i+1][local_col];
res6+=asub[local_row+6*8][i].s1*bsub[8*i+1][local_col];
res7+=asub[local_row+7*8][i].s1*bsub[8*i+1][local_col];
res0+=asub[local_row+0*8][i].s2*bsub[8*i+2][local_col];
res1+=asub[local_row+1*8][i].s2*bsub[8*i+2][local_col];
res2+=asub[local_row+2*8][i].s2*bsub[8*i+2][local_col];
res3+=asub[local_row+3*8][i].s2*bsub[8*i+2][local_col];
res4+=asub[local_row+4*8][i].s2*bsub[8*i+2][local_col];
res5+=asub[local_row+5*8][i].s2*bsub[8*i+2][local_col];
res6+=asub[local_row+6*8][i].s2*bsub[8*i+2][local_col];
res7+=asub[local_row+7*8][i].s2*bsub[8*i+2][local_col];
res0+=asub[local_row+0*8][i].s3*bsub[8*i+3][local_col];
res1+=asub[local_row+1*8][i].s3*bsub[8*i+3][local_col];
res2+=asub[local_row+2*8][i].s3*bsub[8*i+3][local_col];
res3+=asub[local_row+3*8][i].s3*bsub[8*i+3][local_col];
res4+=asub[local_row+4*8][i].s3*bsub[8*i+3][local_col];
res5+=asub[local_row+5*8][i].s3*bsub[8*i+3][local_col];
res6+=asub[local_row+6*8][i].s3*bsub[8*i+3][local_col];
res7+=asub[local_row+7*8][i].s3*bsub[8*i+3][local_col];
res0+=asub[local_row+0*8][i].s4*bsub[8*i+4][local_col];
res1+=asub[local_row+1*8][i].s4*bsub[8*i+4][local_col];
res2+=asub[local_row+2*8][i].s4*bsub[8*i+4][local_col];
res3+=asub[local_row+3*8][i].s4*bsub[8*i+4][local_col];
res4+=asub[local_row+4*8][i].s4*bsub[8*i+4][local_col];
res5+=asub[local_row+5*8][i].s4*bsub[8*i+4][local_col];
res6+=asub[local_row+6*8][i].s4*bsub[8*i+4][local_col];
res7+=asub[local_row+7*8][i].s4*bsub[8*i+4][local_col];
res0+=asub[local_row+0*8][i].s5*bsub[8*i+5][local_col];
res1+=asub[local_row+1*8][i].s5*bsub[8*i+5][local_col];
res2+=asub[local_row+2*8][i].s5*bsub[8*i+5][local_col];
res3+=asub[local_row+3*8][i].s5*bsub[8*i+5][local_col];
res4+=asub[local_row+4*8][i].s5*bsub[8*i+5][local_col];
res5+=asub[local_row+5*8][i].s5*bsub[8*i+5][local_col];
res6+=asub[local_row+6*8][i].s5*bsub[8*i+5][local_col];
res7+=asub[local_row+7*8][i].s5*bsub[8*i+5][local_col];
res0+=asub[local_row+0*8][i].s6*bsub[8*i+6][local_col];
res1+=asub[local_row+1*8][i].s6*bsub[8*i+6][local_col];
res2+=asub[local_row+2*8][i].s6*bsub[8*i+6][local_col];
res3+=asub[local_row+3*8][i].s6*bsub[8*i+6][local_col];
res4+=asub[local_row+4*8][i].s6*bsub[8*i+6][local_col];
res5+=asub[local_row+5*8][i].s6*bsub[8*i+6][local_col];
res6+=asub[local_row+6*8][i].s6*bsub[8*i+6][local_col];
res7+=asub[local_row+7*8][i].s6*bsub[8*i+6][local_col];
res0+=asub[local_row+0*8][i].s7*bsub[8*i+7][local_col];
res1+=asub[local_row+1*8][i].s7*bsub[8*i+7][local_col];
res2+=asub[local_row+2*8][i].s7*bsub[8*i+7][local_col];
res3+=asub[local_row+3*8][i].s7*bsub[8*i+7][local_col];
res4+=asub[local_row+4*8][i].s7*bsub[8*i+7][local_col];
res5+=asub[local_row+5*8][i].s7*bsub[8*i+7][local_col];
res6+=asub[local_row+6*8][i].s7*bsub[8*i+7][local_col];
res7+=asub[local_row+7*8][i].s7*bsub[8*i+7][local_col];
}
barrier(CLK_LOCAL_MEM_FENCE);
}
vstore8(res0, 0, c+(group_row*64+local_row+0*8)*ldc+(group_col*64+local_col*8));
vstore8(res1, 0, c+(group_row*64+local_row+1*8)*ldc+(group_col*64+local_col*8));
vstore8(res2, 0, c+(group_row*64+local_row+2*8)*ldc+(group_col*64+local_col*8));
vstore8(res3, 0, c+(group_row*64+local_row+3*8)*ldc+(group_col*64+local_col*8));
vstore8(res4, 0, c+(group_row*64+local_row+4*8)*ldc+(group_col*64+local_col*8));
vstore8(res5, 0, c+(group_row*64+local_row+5*8)*ldc+(group_col*64+local_col*8));
vstore8(res6, 0, c+(group_row*64+local_row+6*8)*ldc+(group_col*64+local_col*8));
vstore8(res7, 0, c+(group_row*64+local_row+7*8)*ldc+(group_col*64+local_col*8));
}
BTW, the performance on Adreno 530 is really weird. For example, if I remove some code lines from this kernel's multiply-accumulate loop, the performance drops significantly.
@CNugteren @wang-jinwei I came across this GEMM implementation (moskewcz/boda#13) that gave 70 GFLOPS on an Adreno 530.
typedef unsigned uint32_t;
__constant uint32_t const U32_MAX = 0xffffffff;
typedef int int32_t;
#72 "out_0.cl"
kernel void sgemm_simd__K_512__M_512__N_512__Mg_4__Ng_4__Mb_16__Nb_16__Kb_1__Mt_8__Nt_8__prof_variant_0__use_local_mem_2__vw_8( global float const * const a,
global float const * const b,
global float * const c )
{
float c_r[8*8] = {0};
float8 a_r[8/8];
float8 b_r[8/8];
int const a_off_thr = ( (get_group_id(0)/4)*16 + (get_local_id(0)/16) )*8/8*1;
int const b_off_thr = ( (get_group_id(0)%4)*16 + (get_local_id(0)%16) )*8/8*1;
int32_t a_off = a_off_thr;
int32_t b_off = b_off_thr;
for( int32_t k = 0; k < 512; k += 1 ) {
a_r[0] = ((global float8 const *)a)[a_off+0];
b_r[0] = ((global float8 const *)b)[b_off+0];
c_r[0] += a_r[0].s0*b_r[0].s0;
c_r[1] += a_r[0].s0*b_r[0].s1;
c_r[2] += a_r[0].s0*b_r[0].s2;
c_r[3] += a_r[0].s0*b_r[0].s3;
c_r[4] += a_r[0].s0*b_r[0].s4;
c_r[5] += a_r[0].s0*b_r[0].s5;
c_r[6] += a_r[0].s0*b_r[0].s6;
c_r[7] += a_r[0].s0*b_r[0].s7;
c_r[8] += a_r[0].s1*b_r[0].s0;
c_r[9] += a_r[0].s1*b_r[0].s1;
c_r[10] += a_r[0].s1*b_r[0].s2;
c_r[11] += a_r[0].s1*b_r[0].s3;
c_r[12] += a_r[0].s1*b_r[0].s4;
c_r[13] += a_r[0].s1*b_r[0].s5;
c_r[14] += a_r[0].s1*b_r[0].s6;
c_r[15] += a_r[0].s1*b_r[0].s7;
c_r[16] += a_r[0].s2*b_r[0].s0;
c_r[17] += a_r[0].s2*b_r[0].s1;
c_r[18] += a_r[0].s2*b_r[0].s2;
c_r[19] += a_r[0].s2*b_r[0].s3;
c_r[20] += a_r[0].s2*b_r[0].s4;
c_r[21] += a_r[0].s2*b_r[0].s5;
c_r[22] += a_r[0].s2*b_r[0].s6;
c_r[23] += a_r[0].s2*b_r[0].s7;
c_r[24] += a_r[0].s3*b_r[0].s0;
c_r[25] += a_r[0].s3*b_r[0].s1;
c_r[26] += a_r[0].s3*b_r[0].s2;
c_r[27] += a_r[0].s3*b_r[0].s3;
c_r[28] += a_r[0].s3*b_r[0].s4;
c_r[29] += a_r[0].s3*b_r[0].s5;
c_r[30] += a_r[0].s3*b_r[0].s6;
c_r[31] += a_r[0].s3*b_r[0].s7;
c_r[32] += a_r[0].s4*b_r[0].s0;
c_r[33] += a_r[0].s4*b_r[0].s1;
c_r[34] += a_r[0].s4*b_r[0].s2;
c_r[35] += a_r[0].s4*b_r[0].s3;
c_r[36] += a_r[0].s4*b_r[0].s4;
c_r[37] += a_r[0].s4*b_r[0].s5;
c_r[38] += a_r[0].s4*b_r[0].s6;
c_r[39] += a_r[0].s4*b_r[0].s7;
c_r[40] += a_r[0].s5*b_r[0].s0;
c_r[41] += a_r[0].s5*b_r[0].s1;
c_r[42] += a_r[0].s5*b_r[0].s2;
c_r[43] += a_r[0].s5*b_r[0].s3;
c_r[44] += a_r[0].s5*b_r[0].s4;
c_r[45] += a_r[0].s5*b_r[0].s5;
c_r[46] += a_r[0].s5*b_r[0].s6;
c_r[47] += a_r[0].s5*b_r[0].s7;
c_r[48] += a_r[0].s6*b_r[0].s0;
c_r[49] += a_r[0].s6*b_r[0].s1;
c_r[50] += a_r[0].s6*b_r[0].s2;
c_r[51] += a_r[0].s6*b_r[0].s3;
c_r[52] += a_r[0].s6*b_r[0].s4;
c_r[53] += a_r[0].s6*b_r[0].s5;
c_r[54] += a_r[0].s6*b_r[0].s6;
c_r[55] += a_r[0].s6*b_r[0].s7;
c_r[56] += a_r[0].s7*b_r[0].s0;
c_r[57] += a_r[0].s7*b_r[0].s1;
c_r[58] += a_r[0].s7*b_r[0].s2;
c_r[59] += a_r[0].s7*b_r[0].s3;
c_r[60] += a_r[0].s7*b_r[0].s4;
c_r[61] += a_r[0].s7*b_r[0].s5;
c_r[62] += a_r[0].s7*b_r[0].s6;
c_r[63] += a_r[0].s7*b_r[0].s7;
a_off += 1*512/8;
b_off += 1*512/8;
}
int32_t c_off =
((get_group_id(0)/4)*16+(get_local_id(0)/16))*8*512/8 +
((get_group_id(0)%4)*16+(get_local_id(0)%16))*8*1/8;
for( int32_t Mt = 0; Mt < 8; ++Mt ) {
switch(Mt) {
case 0:
b_r[0].s0 = c_r[0];
b_r[0].s1 = c_r[1];
b_r[0].s2 = c_r[2];
b_r[0].s3 = c_r[3];
b_r[0].s4 = c_r[4];
b_r[0].s5 = c_r[5];
b_r[0].s6 = c_r[6];
b_r[0].s7 = c_r[7];
break;
case 1:
b_r[0].s0 = c_r[8];
b_r[0].s1 = c_r[9];
b_r[0].s2 = c_r[10];
b_r[0].s3 = c_r[11];
b_r[0].s4 = c_r[12];
b_r[0].s5 = c_r[13];
b_r[0].s6 = c_r[14];
b_r[0].s7 = c_r[15];
break;
case 2:
b_r[0].s0 = c_r[16];
b_r[0].s1 = c_r[17];
b_r[0].s2 = c_r[18];
b_r[0].s3 = c_r[19];
b_r[0].s4 = c_r[20];
b_r[0].s5 = c_r[21];
b_r[0].s6 = c_r[22];
b_r[0].s7 = c_r[23];
break;
case 3:
b_r[0].s0 = c_r[24];
b_r[0].s1 = c_r[25];
b_r[0].s2 = c_r[26];
b_r[0].s3 = c_r[27];
b_r[0].s4 = c_r[28];
b_r[0].s5 = c_r[29];
b_r[0].s6 = c_r[30];
b_r[0].s7 = c_r[31];
break;
case 4:
b_r[0].s0 = c_r[32];
b_r[0].s1 = c_r[33];
b_r[0].s2 = c_r[34];
b_r[0].s3 = c_r[35];
b_r[0].s4 = c_r[36];
b_r[0].s5 = c_r[37];
b_r[0].s6 = c_r[38];
b_r[0].s7 = c_r[39];
break;
case 5:
b_r[0].s0 = c_r[40];
b_r[0].s1 = c_r[41];
b_r[0].s2 = c_r[42];
b_r[0].s3 = c_r[43];
b_r[0].s4 = c_r[44];
b_r[0].s5 = c_r[45];
b_r[0].s6 = c_r[46];
b_r[0].s7 = c_r[47];
break;
case 6:
b_r[0].s0 = c_r[48];
b_r[0].s1 = c_r[49];
b_r[0].s2 = c_r[50];
b_r[0].s3 = c_r[51];
b_r[0].s4 = c_r[52];
b_r[0].s5 = c_r[53];
b_r[0].s6 = c_r[54];
b_r[0].s7 = c_r[55];
break;
case 7:
b_r[0].s0 = c_r[56];
b_r[0].s1 = c_r[57];
b_r[0].s2 = c_r[58];
b_r[0].s3 = c_r[59];
b_r[0].s4 = c_r[60];
b_r[0].s5 = c_r[61];
b_r[0].s6 = c_r[62];
b_r[0].s7 = c_r[63];
break;
}
((global float8 *)c)[c_off+0] = b_r[0];
c_off += 512/8;
}
}
Thanks for letting us know! I'm planning to work on this again a bit in the coming weeks, I'll try to cover all the above kernels within CLBlast such that they will be automatically considered when running the tuners.
@wang-jinwei "Actually, I found that the kernel provided by this tutorial can just deliver ~30 GFLOPS with m=n=k=1024 and Adreno 530 in my own project too, despite the ~23ms (~80 GFLOPS) reported by the author." I use their kernel and I can get ~28.5ms on Adreno 530 at 510MHz. If they use Adreno 530 at 624MHz it should be ~23.3ms. So their result repeatable.
The following kernel uses global size = {1024 / 4, 1024 / 8} and local size = {32, 16}, with M_SIZE = 1024:
__kernel void SGEMM(__global float* A,
__read_only image2d_t B,
__global float* C,
const int M_SIZE)
{
int gx = get_global_id(0);
int gy = get_global_id(1) << 3;
if ( (gx << 2) >= M_SIZE || gy >= M_SIZE ) {
return;
}
float4 a[8];
float4 b[4];
float4 c[8];
#pragma unroll
for (int i = 0; i < 8; i++)
{
c[i] = 0.0f;
}
int A_y_off = gy * M_SIZE;
for (int pos = 0; pos < M_SIZE; pos += 4)
{
int A_off = A_y_off + pos;
#pragma unroll
for (int i = 0; i < 4; i++)
{
b[i] = read_imagef(B, (int2)(gx, pos + i));
}
#pragma unroll
for (int i = 0; i < 8; i++)
{
a[i] = vload4(0, A + A_off);
A_off += M_SIZE;
}
#pragma unroll
for (int i = 0; i < 8; i++)
{
c[i] += a[i].x * b[0] + a[i].y * b[1] + a[i].z * b[2] + a[i].w * b[3];
}
}
#pragma unroll
for (int i = 0; i < 8; i++)
{
int C_offs = (gy + i) * M_SIZE + (gx << 2);
vstore4(c[i], 0, C + C_offs);
}
}
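For completeness, a hedged host-side sketch of how this kernel could be driven (this is my own illustration, not code from the tutorial or from the comment above): matrix B is packed into a CL_RGBA/CL_FLOAT image of width M_SIZE/4 and height M_SIZE so that each texel holds 4 consecutive floats of one row, and the kernel is launched with the global/local sizes given above. Error handling is omitted; ctx, queue and program are assumed to exist already, and clCreateImage requires OpenCL 1.2 (on a 1.1 device clCreateImage2D would be used instead).
#include <CL/cl.h>

void run_sgemm_image(cl_context ctx, cl_command_queue queue, cl_program program,
                     const float* A, const float* B, float* C, int M_SIZE) {
  cl_int err = CL_SUCCESS;

  // Pack B into a 2D image: (M_SIZE/4) texels wide, M_SIZE texels high,
  // each RGBA/FLOAT texel holding 4 consecutive floats of one row of B.
  cl_image_format fmt = { CL_RGBA, CL_FLOAT };
  cl_image_desc desc = { 0 };
  desc.image_type = CL_MEM_OBJECT_IMAGE2D;
  desc.image_width = (size_t)M_SIZE / 4;
  desc.image_height = (size_t)M_SIZE;
  cl_mem b_img = clCreateImage(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                               &fmt, &desc, (void*)B, &err);

  cl_mem a_buf = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                sizeof(float) * M_SIZE * M_SIZE, (void*)A, &err);
  cl_mem c_buf = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY,
                                sizeof(float) * M_SIZE * M_SIZE, NULL, &err);

  cl_kernel kernel = clCreateKernel(program, "SGEMM", &err);
  clSetKernelArg(kernel, 0, sizeof(cl_mem), &a_buf);
  clSetKernelArg(kernel, 1, sizeof(cl_mem), &b_img);
  clSetKernelArg(kernel, 2, sizeof(cl_mem), &c_buf);
  clSetKernelArg(kernel, 3, sizeof(int), &M_SIZE);

  // Global/local sizes as stated above: {M/4, M/8} and {32, 16}.
  size_t global[2] = { (size_t)M_SIZE / 4, (size_t)M_SIZE / 8 };
  size_t local[2] = { 32, 16 };
  clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global, local, 0, NULL, NULL);
  clEnqueueReadBuffer(queue, c_buf, CL_TRUE, 0, sizeof(float) * M_SIZE * M_SIZE,
                      C, 0, NULL, NULL);

  clReleaseKernel(kernel);
  clReleaseMemObject(b_img);
  clReleaseMemObject(a_buf);
  clReleaseMemObject(c_buf);
}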
@sivagnanamn About theoretical performance: I don't think you can rely on the Adreno Wiki. It doesn't reference any official Qualcomm source, and I wasn't able to find any official information about the latest (series 4xx, 5xx) Adreno GPUs on their site. moskewcz mentioned that their microbenchmark can achieve ~256 GFLOPS, and I can achieve ~230 GFLOPS with my own microbenchmark on an Adreno 530 at 510 MHz. When I run this benchmark, the Snapdragon Profiler shows ~196% ALU utilization (I don't know why most of their metrics are limited by 200% and not 100%). I also have a microbenchmark for half precision, where I can get 430 GFLOPS (ALU utilization ~180% according to the Snapdragon Profiler).
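For context, a minimal sketch of the kind of ALU microbenchmark being described (this is not the posters' actual benchmark): a long dependent chain of mad() operations on float8 so the kernel stays ALU-bound, with GFLOPS estimated on the host as global_size * ITERS * 8 lanes * 2 flops per mad, divided by the measured time.
#define ITERS 4096

__kernel void peak_flops(__global float* out, const float seed) {
  float8 x = (float8)(seed);
  float8 y = (float8)(0.9999f);
  for (int i = 0; i < ITERS; i++) {
    x = mad(x, y, y);  // 8 lanes * 2 flops per iteration, dependent chain
  }
  // Write a value derived from x so the compiler cannot remove the loop.
  out[get_global_id(0)] = x.s0 + x.s1 + x.s2 + x.s3 + x.s4 + x.s5 + x.s6 + x.s7;
}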
@sivagnanamn You mentioned these numbers: "GEMM tuner ==> 115 GFLOPS GEMM client ==> 0.1 GFLOPS". Can you share the GEMM code where you achieve 115 GFLOPS?
"GEMM tuner ==> 115 GFLOPS GEMM client ==> 0.1 GFLOPS" Can you share code of GEMM where you can achieve 115GFLOPS?
@roserg I got those results while trying out the adreno_tryout branch of CLBlast.
@roserg I'm using an Adreno 330 device with Linaro OS for my experiments. I couldn't use the Snapdragon Profiler with my device (no Android OS). All the matrices that I'm dealing with are rectangular, for example: -m 16 -n 200704 -k 27.
Using image2d_t comes with max width & height restrictions (it supports at most 4096x4096 on Adreno 330 with the OpenCL 1.1 Embedded profile), so I cannot use the GEMM sample provided in the Snapdragon blog for such matrices.
Have you come across any kernels that can handle such rectangular matrices better?
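(For reference, a hedged host-side sketch, not from CLBlast, of how the image2d_t limits mentioned above can be queried on a device:)
#include <CL/cl.h>
#include <stdio.h>

void print_image_limits(cl_device_id device) {
  size_t max_w = 0, max_h = 0;
  clGetDeviceInfo(device, CL_DEVICE_IMAGE2D_MAX_WIDTH, sizeof(max_w), &max_w, NULL);
  clGetDeviceInfo(device, CL_DEVICE_IMAGE2D_MAX_HEIGHT, sizeof(max_h), &max_h, NULL);
  printf("image2d_t limits: %zu x %zu\n", max_w, max_h);  // e.g. 4096 x 4096 on Adreno 330
}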
@sivagnanamn I'm not really interested in SGEMM; I want to do convolutions on Adreno GPUs. I know that a convolution can be calculated as a matrix multiplication, so I searched for good samples in this area. You can try changing the data layout used to represent your matrices so they fit within 4096x4096. Or you can try the sample from @moskewcz; it has the same performance without textures. But you have interesting dimensions, and I think it can be optimized for this particular case.
So you got 115 GFLOPS on an Adreno 330 for these particular dimensions (-m 16 -n 200704 -k 27), am I right? Can you also tell me the frequency?
@roserg 578 MHz, but as @CNugteren suggested, I wouldn't trust the 115 GFLOPS. The adreno_tryout branch isn't fully implemented; it's still at the trial & error stage, so we have to wait until it's stable.
Indeed, in the adreno_tryout branch the kernel assumes multiples of 16 or 32 or so, and you have e.g. k=27. That won't work. I hope to work on this again soon.
For the record, I've been working on this issue over the last week. Basically I took the kernel from the adreno_tryout branch and am generalising it with the regular CLBlast tuning parameters, making it possible to integrate it with the current GEMM kernel so that there are not 2 kernels, but rather one kernel with new tuning possibilities.
It is not an easy task, but the end is in sight :-) Soon I will ask anyone who's interested to try out some branch.
As I said above, I took the kernel from the adreno_tryout branch and have fully integrated it into CLBlast. That means it is also tuneable with many of the same parameters as the regular kernel, so there is some potential now for Adreno GPUs.
Could you have a try? The branch is CLBlast-228-2d-register-gemm-kernel and you should just be able to run the tuner as usual. It will first tune the regular kernel (a fixed set of parameters followed by a random subset of parameters) and then the new kernel (again first a fixed set and then a random set). I'm curious to see what the results in terms of GFLOPS are for the tuner with the new kernel on your devices.
Getting seg fault during tuning.
| 307 | 370 | 0 64 64 32 8 8 8 8 2 1 1 0 0 0 0 1 | OK 447 ms | 2117.35 ms | 1.0 | results match |
| 308 | 370 | 0 64 64 32 8 8 8 8 2 1 2 0 0 0 0 1 | OK 404 ms | 1964.66 ms | 1.1 | results match |
| 309 | 370 | 0 64 64 32 8 8 8 8 2 1 4 0 0 0 0 1 | OK 500 ms | 2294.15 ms | 0.9 | results match |
| 310 | 370 | 0 64 64 32 8 8 8 8 2 2 1 0 0 0 0 1 | OK 336 ms | 3460.13 ms | 0.6 | results match |
| 311 | 370 | 0 64 64 32 8 8 8 8 2 2 2 0 0 0 0 1 | OK 356 ms | 3037.90 ms | 0.7 | results match |
| 312 | 370 | 0 64 64 32 8 8 8 8 2 2 4 0 0 0 0 1 | OK 405 ms | 4484.14 ms | 0.5 | results match |
Segmentation fault
OK, but that's perhaps an issue on your device/platform? I'll continue development a bit (some related tests still fail: SYRK/HERK) and then I'll test on some other platforms as well. Let's test again afterwards.
I have completely finished the implementation and also tested on another machine; no issues seen so far. I'll soon merge the branch into master.
Did you see good performance while tuning so far? Otherwise you can try to limit the search space of the tuning parameters here to remove the configurations that might fail, e.g. remove MWG==64 from the search space.
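To make the suggestion concrete (this is only an illustration of the shape of the edit; the actual contents of src/tuning/kernels/xgemm.hpp differ between CLBlast versions), the tuner's search space is a list of parameter names with candidate values, and "removing MWG==64" simply means deleting 64 from that list:
#include <string>
#include <utility>
#include <vector>

using ParameterList = std::vector<std::pair<std::string, std::vector<size_t>>>;

// Hypothetical, trimmed search space: 64 (and larger) removed from the
// work-group related parameters so the failing configurations are never
// generated by the tuner.
ParameterList XgemmSearchSpaceSketch() {
  return {
    {"MWG", {16, 32}},
    {"NWG", {16, 32}},
    {"KWG", {16, 32}},
    {"MDIMC", {8, 16}},
    {"NDIMC", {8, 16}},
    {"VWM", {1, 2, 4}},
    {"VWN", {1, 2, 4}},
  };
}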
FYI, this branch is now merged into master.
My Intel results from this change (if relevant) are here.
Thank you for your time @CNugteren .
I did try removing MWG==64 from src/tuning/kernels/xgemm.hpp, but tuning still failed and caused a device reboot. I've removed every option > 32 and started tuning again. I'll share the results once it is completed.
Can anyone explain why the Adreno OpenCL GPU performance is so bad?
Can I increase the Adreno OpenCL GPU performance? The reported Adreno OpenCL GPU frequency is always only 1 MHz.
Snapdragon 865 with Adreno 650 GPU.
Your question has no relation to CLBlast, but is more a question about your system. Before concluding too much, consider that it might just be a display issue of the clinfo tool (or of the OpenCL driver installed on your system). Most likely everything is just fast.
Is it possible to optimise the Adreno OpenCL kernel? I just got a reply from Qualcomm:
https://developer.qualcomm.com/forum/qdn-forums/software/adreno-gpu-sdk/68728
Is it possible to optimise the Adreno OpenCL kernel? I just got a reply from Qualcomm: https://developer.qualcomm.com/forum/qdn-forums/software/adreno-gpu-sdk/68728
Thanks for sharing. Unfortunately I don't have time myself, but I'm happy to review any pull requests made.
I'm closing this issue for now thanks to changes made in https://github.com/CNugteren/CLBlast/pull/451 and https://github.com/CNugteren/CLBlast/pull/452. If any particular issue shows up for a specific device, feel free to open an issue specific for it.