CLBlast
New tuning results
(See the README for details)
This is the place to post new tuning results. If you compiled with `-DTUNERS=ON`, ran one of the tuners on your device (or all of them, perhaps?), and feel that these results should be included in the next release of CLBlast, please post them here.
You can do this by attaching the JSON files to this issue (archived in a .ZIP file).
Here are some tuning results from an NVIDIA Titan Black, AMD Radeon HD 7970 and an ARM Mali T-628.
Just to let you know about JSON files: GitHub says "Unfortunately, we don't support that file type. Try again with a PNG, GIF, JPG, DOCX, PPTX, XLSX, TXT, PDF, or ZIP." Archive.zip
Thanks for the tuning results! However, they seem to have been run with non-default settings (using specific values for `alpha` and `beta`). Could you perhaps run them again with the default settings?
By the way, the latest version already includes results for Tahiti (the HD 7970) and the ARM Mali T-628, so perhaps those are superfluous.
(I've updated the post regarding JSON-files and GitHub)
Here are the results for AMD's Pitcairn (R9 270X). I'll also upload the results for Hawaii (R9 290X), but I am getting an error during Xgemm. I'll open another issue for that. pitcairn.zip
Thanks! The results for Pitcairn are added to the `development` branch.
Hawaii (AMD R9 290X): hawaii.zip
And i7 4790k: i7-4790k.zip
The results for Hawaii will be added. As for the i7 results: the zip archive seems to include only a Makefile?
Sorry, I messed up that zip. As I do not have those files any more, I'll send them when I manage to do that tuning.
@FongHou Thanks! The tuning results are added to the database. They are currently in the `development` branch but will be automatically included in the next release.
Here are the results for the Intel i5-4210U iGPU:
Device name: 'Intel(R) HD Graphics Haswell Ultrabook GT2 Mobile' (OpenCL 1.2 beignet 1.2 (git-1b076ec))
i5-4210U_GPU.zip
@OursDesCavernes Added, thanks!
GTX 670, GTX 750 (non-Ti), and GTX 1070 tunings attached. One of the GEMV tunings took ages (or hung) on the latter two, but curiously enough not on the (older) first card. Luckily, it looks like GEMV is the last one to be tuned so these are fairly complete anyway.
@gcp Thanks for running all the tuners on those devices! The results are added to CLBlast, currently in the `development` branch but they will be automatically included in the next release. Indeed, I saw long compilation times for GEMV kernels on NVIDIA as well - it is the last one to be tuned for exactly this reason. NVIDIA promises to reduce compilation times significantly with CUDA 8.0, so hopefully that also fixes these kernels.
Intel HD530 (desktop Skylake iGPU) IntelHD530.zip
@gcp Thanks, they are added.
Issue #83 caused a complete re-write of the third GEMV kernel (`XgemvFastRot`), so I had to throw away the corresponding tuning results. If it's not too much effort, I welcome updated `clblast_xgemv_fast_rot_*.json` tuning results based on the `development` branch. The other GEMV tuning results are still valid and included in CLBlast. Thanks!
Intel(R) HD Graphics 5500 BroadWell U-Processor GT2: hd5500.zip Intel(R) HD Graphics Haswell Ultrabook GT2 Mobile: hd4400.zip
@OursDesCavernes Thanks, HD5500 is added and HD4400 is updated.
Intel(R) HD Graphics 4000 intel-hd4000.zip
@yingted Thanks! The tuning results for the IvyBridge GPU are added.
Radeon R9 380 (Tonga) tuning results: Tobago_TuningResults.zip
Of course, the device is called Tonga; the spelling mistake is just in the zip-file name.
@MigMuc The results for Tonga are added, thanks!
Here are the results for the GTX Titan Black. Unfortunately, I had the same problem as @gcp on the last run. But again, should be fairly complete.
@matze Thanks a lot for your contribution. The tuning results are added.
Hi, since I'm having problems with attaching files, here are the links for:
Amd Radeon HD6770m (Turks) https://www.dropbox.com/s/wabso93trny8fae/amd%20hd6770m%20%28turks%29.zip?dl=0
Intel Core i7-2670qm https://www.dropbox.com/s/3as860nlbshmdvo/i7-2670qm.zip?dl=0
from my laptop. In a few days I will be able to test an MSI Nvidia GTX 970.
Thanks for the tuning data! The results are added to CLBlast, currently in the `development` branch but they will be automatically included in the next release.
Tuning results for Nvidia GTX 1080 nvidia_gtx_1080.zip
Results for i7-4790k: i7-4790k.zip
Thanks a lot! Both the GTX 1080 and i7 results are added.
Tuning results for AMD RX480 (with amdgpu driver and amdgpu-pro opencl stack) amd-rx480.zip
@OursDesCavernes Added, thanks! Nice to see FP16 support from AMD's side as well.
My two cents: tuning results for an AMD Radeon HD 6750M (unfortunately no support for 16-bit or 64-bit precision).
The HD 6750M results are added, thanks!
See the attachment for some tuning results on AMD Radeon FuryX (using driver 1800.8). amd-radeon-furyx.zip
@bramveenboer The Fiji results are added, thanks!
Hi! Here are the results for an Intel i7-920 on Linux using Intel's OpenCL driver `dev-util/intel-ocl-sdk-4.4.0.117-r1`.
Thanks for your work!
Thanks, the Core i7-920 tuning data is added to CLBlast!
I have got results for the Radeon R9 390. There were already Hawaii results in the database, but the direct GEMM kernel was missing. Overall, this improved things for me.
(Note: is it possible to get a tuning for GemmBatched? I think it makes a difference whether a single matrix is computed vs. a set of matrices. Also, I have the feeling that GEMM is tuned too aggressively towards m=n=k=1024: performance drops by 30-40% at m=n=k=2048 and stays there for larger matrices.)
@Ulfgard I tuned those Hawaii results on an R9 290X. In my case it would be impossible for performance to drop 30-40% for larger matrices, since I get (if memory serves me well) around 3.7 TFLOPS on 8192x8192, with the theoretical limit being 5.4 TFLOPS. If such a performance drop happened, that would mean Cedric programmed GEMM to run at the maximum theoretical rate while disregarding memory access, which seems impossible to me.
OTOH, maybe the drop in performance is simply because these cards are identified as the same device (Hawaii), but have some internal hardware difference that influences the optimal settings?
Hi, it is impossible to mix up the tunings, because you have to remove the old tuning to be able to add the new one in the database script; otherwise it will fail. While I agree that the kernel itself gives okay performance according to the tuner, for some reason the whole GEMM call seems to die after exceeding some matrix size. I did some benchmarking of the whole procedure to see real-world performance.
The numbers reported are the wall-clock times between enqueuing several trials of the GEMM routine and clFinish (disregarding the first trial, for possible kernel setup, of course). Thus they are a lower bound on performance. The numbers are roughly in line with the timings reported by clGetEventProfilingInfo on the event supplied to GEMM, but this does not necessarily make sense because I do not know which kernel this actually measures.
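For reference, the conversion from a measured wall-clock time to a GFLOPS figure like those in the table is the standard one for GEMM — a minimal Python sketch (the 0.9 ms example time is purely illustrative, not one of my measurements):

```python
def gemm_gflops(m, n, k, seconds):
    """FLOP rate of a single GEMM call, in GFLOPS.

    An m x n x k GEMM performs 2*m*n*k floating-point
    operations (one multiply and one add per term).
    """
    return 2.0 * m * n * k / seconds / 1e9

# e.g. m = n = k = 1024 finishing in ~0.9 ms comes out
# at roughly 2386 GFLOPS:
print(gemm_gflops(1024, 1024, 1024, 0.9e-3))
```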
(Columns are row/column-major for A/B; column C indicates whether C is row- or column-major. m=n=k=size. Numbers are GFLOPS.)

```
size  C    r/r      c/r      r/c      c/c
256   r    781.738  794.147  781.738  781.738
256   c    806.956  820.184  820.184  806.956
512   r    1005     1005     1005     440.789
512   c    1168.6   1116.67  1142.05  1142.05
1024  r    2080     2080     2080     712.329
1024  c    2363.64  2363.64  2363.64  2363.64  // fits quite closely with what the tuner reports
2048  r    1523.81  1523.81  1523.81  1523.81  // something here dies
2048  c    1523.81  1523.81  1488.37  1523.81
4096  r    1523.81  1542.17  1542.17  1580.25
4096  c    1542.17  1542.17  1560.98  1560.98
```
Beforehand, i.e. with your tuning, the larger matrices were another 50% worse. So even if the GEMM kernel is okay, maybe one of the other kernels is at fault here.
For completeness, the same results with the timings returned by clGetEventProfilingInfo for the event passed to the GEMM routine (modulo possible errors, because I quickly hacked this together):

```
size  C    r/r      c/r      r/c      c/c
256   r    608.416  606.492  607.536  607.19
256   c    622.785  621.137  620.953  619.315
512   r    24963.3  25338    24816.4  25607.1  // indication that this measures the wrong thing?
512   c    813.057  811.845  813.214  812.822
1024  r    62673.7  63890.4  63529.4  63143.3  // indication that this measures the wrong thing?
1024  c    2559.24  2560.99  2557.5   2557.88
2048  r    1487.71  1492.89  1526.47  1530.32
2048  c    1488.77  1490.06  1524.83  1537.17
4096  r    1518.29  1522.16  1557.49  1567.47
4096  c    1524.63  1528.65  1563.42  1567.73
```
I was also talking about wall-clock time in my (Clojure on the JVM) program, not CLTune results. An 8192x8192 SGEMM runs in 293 milliseconds on the R9 290X (5.4 TFLOPS max).
GTX 1080 (8.2 TFLOPS) runs in 220 ms, which makes the numbers pretty consistent in my case.
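Those two timings can be sanity-checked with the usual 2·n³ FLOP count for an n×n×n SGEMM — a quick Python sketch (the efficiency percentages are derived here, not quoted from the posts above):

```python
def tflops(n, seconds):
    # An n x n x n GEMM performs 2*n^3 floating-point operations.
    return 2.0 * n**3 / seconds / 1e12

# R9 290X: 8192^3 SGEMM in 293 ms -> ~3.75 TFLOPS (~69% of the 5.4 TFLOPS peak)
# GTX 1080: same problem in 220 ms -> ~5.00 TFLOPS (~61% of the 8.2 TFLOPS peak)
print(tflops(8192, 0.293), tflops(8192, 0.220))
```

So both cards land at a broadly similar fraction of their theoretical peak, which is what makes the numbers look consistent.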
@blueberry @Ulfgard I've opened issue #169 to have a more detailed discussion on the future of the tuner in CLBlast.
I'll add your tuning data soon to the database, thanks.
Here is my ubuntu16.04 with intel cpu driver: i7-6770hq.zip Tuned for 1.0.1 release. Impressive tool! Let me know if I included the wrong files.
Thanks @theoden8. It took a bit longer than normal since I was in the middle of some database changes, but the results are now added!
Here are the tuning results for a i5-4570 and a GTX580 GTX580.zip i5-4570.zip
Thanks @fzimmermann89, they are both added.
Some more results. Note that beignet (which I used) is 10-20% slower than Intel NEO.
Thank you for your great work! Here are some tuning results for NVidia GeForce GTX 1070 Ti.
Here are some tuning results using POCL (1.2-pre/master) on an Intel i5-4590S. The other tuners segfaulted (#293). i5_4590S_POCL.zip
A little late, but I've added the HD Graphics 6000, GTX 1070 Ti, and i5-4590S results. Thanks all!
Here are some tuning results from Intel Xeon E5-2630 v3 and v4, as well as Nvidia Tesla P100 PCI-E 16 GB. CLBlast_tuners.zip
Tuning results from a Hikey 970 with a Mali-G72 GPU. Do not use these results: when I use Gemm with a size greater than 8, it causes an error in the library. Mali-G72.zip
I tuned CLBlast on an FT-2000plus CPU (2.3 GHz, 64 cores), which is an ARMv8-based many-core CPU. tuned-FT-2000Plus-CPU.tar.gz
Sorry I had overlooked this issue for a while. I've just added tuning results for:
- Intel Xeon E5-2630 v3
- Intel Xeon E5-2630 v4
- NVIDIA Tesla P100
I've not added the results for the ARMv8 machine, since it shows the CPU as device '0x662' from vendor '0x70' in PoCL, perhaps that is not so meaningful. If anyone else is interested they can always take the results from here.
Thanks all for sharing!
I ran tuning using CLBlast 1.5.0 on a NVIDIA Titan RTX (using driver 415.125): titanrtx-415.125.tar.gz
Results for AMD Radeon RX Vega Radeon RX Vega.zip
Thanks for sharing the tuning results! I've just added both the RX Vega and also the Titan RTX (sorry I forgot about it) to CLBlast.
AMD RX 6800 XT (Navi21): amd_rx_6800_xt.tar.gz
My latest results on an RX 6500 XT (this is the Windows 11 22.3 driver; performance on Linux may be a bit better) and a Qualcomm Adreno 540 on an SD835 phone.
I got several compilation error messages on Adreno on Android. The return value -6 means out of host memory (CL_OUT_OF_HOST_MEMORY); I'd look into the memory management to find some clues.