CLBlast
New tuning results
(See the README for details)
This is the place to post new tuning results. If you compiled with `-DTUNERS=ON`, ran one of the tuners on your device (or all of them, perhaps?), and feel that these results should be included in the next release of CLBlast, please post them here.
You can do this by attaching the JSON files to this issue (archived in a .ZIP file).
Here are some tuning results from an NVIDIA Titan Black, AMD Radeon HD 7970 and an ARM Mali T-628.
Just to let you know about JSON files: GitHub says "Unfortunately, we don't support that file type. Try again with a PNG, GIF, JPG, DOCX, PPTX, XLSX, TXT, PDF, or ZIP." Archive.zip
Thanks for the tuning results! However, they seem to have been run with non-default settings (using specific values for `alpha` and `beta`). Could you perhaps run them again with the default settings?
By the way, the latest version already includes results for Tahiti (the HD 7970) and the ARM Mali T-628, so perhaps those are superfluous.
(I've updated the post regarding JSON-files and GitHub)
Here are the results for AMD's Pitcairn (R9 270X). I'll also upload the results for Hawaii (R9 290X), but I am getting an error during Xgemm. I'll open another issue for that. pitcairn.zip
Thanks! The results for Pitcairn are added to the `development` branch.
Hawaii (AMD R9 290X): hawaii.zip
And i7 4790k: i7-4790k.zip
The results for Hawaii will be added. As for the i7 results: the zip archive seems to include only a Makefile?
Sorry, I messed up that zip. As I do not have those files any more, I'll send them when I manage to do that tuning.
@FongHou Thanks! The tuning results are added to the database. They are currently in the `development` branch but will be automatically included in the next release.
Here are the results for the Intel i5-4210U iGPU:
Device name: 'Intel(R) HD Graphics Haswell Ultrabook GT2 Mobile' (OpenCL 1.2 beignet 1.2 (git-1b076ec))
i5-4210U_GPU.zip
@OursDesCavernes Added, thanks!
GTX 670, GTX 750 (non-Ti), and GTX 1070 tunings attached. One of the GEMV tunings took ages (or hung) on the latter two, but curiously enough not on the (older) first card. Luckily, it looks like GEMV is the last one to be tuned so these are fairly complete anyway.
@gcp Thanks for running all the tuners on those devices! The results are added to CLBlast, currently in the `development` branch but they will be automatically included in the next release. Indeed, I saw long compilation times for GEMV kernels on NVIDIA as well - it is the last one to be tuned for exactly this reason. NVIDIA promises to reduce compilation times significantly with CUDA 8.0, so hopefully that also fixes these kernels.
Intel HD530 (desktop Skylake iGPU) IntelHD530.zip
@gcp Thanks, they are added.
Issue #83 caused a complete re-write of the third GEMV kernel (`XgemvFastRot`), so I had to throw away the corresponding tuning results. If it's not too much effort, I welcome updated `clblast_xgemv_fast_rot_*.json` tuning results based on the `development` branch. The other GEMV tuning results are still valid and included in CLBlast. Thanks!
Intel(R) HD Graphics 5500 BroadWell U-Processor GT2: hd5500.zip Intel(R) HD Graphics Haswell Ultrabook GT2 Mobile: hd4400.zip
@OursDesCavernes Thanks, HD5500 is added and HD4400 is updated.
Intel(R) HD Graphics 4000 intel-hd4000.zip
@yingted Thanks! The tuning results for the IvyBridge GPU are added.
Radeon R9 380 (Tonga) tuning results: Tobago_TuningResults.zip
Of course, the device is called Tonga; the spelling mistake is just in the zip-file name.
@MigMuc The results for Tonga are added, thanks!
Here are the results for the GTX Titan Black. Unfortunately, I had the same problem as @gcp on the last run. But again, should be fairly complete.
@matze Thanks a lot for your contribution. The tuning results are added.
Hi, since I'm having problems with attaching files, here are the links for:
Amd Radeon HD6770m (Turks) https://www.dropbox.com/s/wabso93trny8fae/amd%20hd6770m%20%28turks%29.zip?dl=0
Intel Core i7-2670qm https://www.dropbox.com/s/3as860nlbshmdvo/i7-2670qm.zip?dl=0
from my laptop. In a few days I will be able to test an MSI Nvidia GTX 970.
Thanks for the tuning data! The results are added to CLBlast, currently in the `development` branch but they will be automatically included in the next release.
Tuning results for Nvidia GTX 1080 nvidia_gtx_1080.zip
Results for i7-4790k: i7-4790k.zip
Thanks a lot! Both the GTX 1080 and i7 results are added.
Tuning results for AMD RX480 (with amdgpu driver and amdgpu-pro opencl stack) amd-rx480.zip
@OursDesCavernes Added, thanks! Nice to see FP16 support from AMD's side as well.
My two cents: tuning results for an AMD Radeon HD 6750M (unfortunately no support for 16-bit or 64-bit precision).
The HD 6750M results are added, thanks!
See the attachment for some tuning results on AMD Radeon FuryX (using driver 1800.8). amd-radeon-furyx.zip
@bramveenboer The Fiji results are added, thanks!
Hi! Here are the results for an Intel i7-920 on Linux using Intel's OpenCL driver `dev-util/intel-ocl-sdk-4.4.0.117-r1`.
Thanks for your work!
Thanks, the Core i7-920 tuning data is added to CLBlast!
I have got results for the Radeon R9 390. There were already Hawaii results in the database, but the direct GEMM kernel was missing. Overall, this improved things for me.
(Note: is it possible to get a tuning for GemmBatched? I think it makes a difference whether a single matrix is computed vs. a set of matrices. Also, I have the feeling that GEMM is tuned too aggressively towards m=n=k=1024: performance drops by 30-40% at m=n=k=2048 and stays there for larger matrices.)
@Ulfgard I tuned those Hawaii results on an R9 290X. In my case it would be impossible for performance to drop 30-40% for larger matrices, since I get (if memory serves me well) around 3.7 TFLOPS on 8192x8192, with the theoretical limit being 5.4 TFLOPS. If such a performance drop happened, that would mean Cedric programmed GEMM to run at the maximum theoretical rate while disregarding memory access, which seems impossible to me.
OTOH, maybe the drop in performance is simply because these cards are identified as the same device (Hawaii), but have some internal hardware difference that influences the optimal settings?
Hi, it is impossible to mix up the tunings, because you have to remove the old tuning to be able to add the new one in the database script; otherwise it will fail. While I agree that the kernel itself gives okay performance according to the tuner, for some reason the whole GEMM call seems to die after exceeding some matrix size. I did some benchmarking of the whole procedure to see real-world performance.
The numbers reported are the wall-clock times between enqueuing several trials of the GEMM routine and clFinish (disregarding the first trial, for possible kernel setup, of course). Thus they are a lower bound on performance. The numbers are roughly in line with the timings reported by clGetEventProfilingInfo on the event supplied to GEMM, but this does not necessarily make sense because I do not know which kernel this actually measures.
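For reference, the conversion from a measured wall-clock time to a GFLOPS figure like those in the table is the standard one for GEMM — a minimal Python sketch (the 0.9 ms example time is purely illustrative, not one of my measurements):

```python
def gemm_gflops(m, n, k, seconds):
    """FLOP rate of a single GEMM call, in GFLOPS.

    An m x n x k GEMM performs 2*m*n*k floating-point
    operations (one multiply and one add per term).
    """
    return 2.0 * m * n * k / seconds / 1e9

# e.g. m = n = k = 1024 finishing in ~0.9 ms comes out
# at roughly 2386 GFLOPS:
print(gemm_gflops(1024, 1024, 1024, 0.9e-3))
```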
(Columns are row/column-major for A/B; column C indicates whether C is row- or column-major. m=n=k=size. Numbers are GFLOPS.)

```
size  C    r/r      c/r      r/c      c/c
256   r    781.738  794.147  781.738  781.738
256   c    806.956  820.184  820.184  806.956
512   r    1005     1005     1005     440.789
512   c    1168.6   1116.67  1142.05  1142.05
1024  r    2080     2080     2080     712.329
1024  c    2363.64  2363.64  2363.64  2363.64  // fits quite closely with what the tuner reports
2048  r    1523.81  1523.81  1523.81  1523.81  // something here dies
2048  c    1523.81  1523.81  1488.37  1523.81
4096  r    1523.81  1542.17  1542.17  1580.25
4096  c    1542.17  1542.17  1560.98  1560.98
```
Beforehand, i.e. with your tuning, the larger matrices were another 50% worse. So even if the GEMM kernel is okay, maybe one of the other kernels is at fault here.
For completeness, the same results with the timings returned by clGetEventProfilingInfo for the event passed to the GEMM routine (modulo possible errors, because I quickly hacked this together):

```
size  C    r/r      c/r      r/c      c/c
256   r    608.416  606.492  607.536  607.19
256   c    622.785  621.137  620.953  619.315
512   r    24963.3  25338    24816.4  25607.1  // indication that this measures the wrong thing?
512   c    813.057  811.845  813.214  812.822
1024  r    62673.7  63890.4  63529.4  63143.3  // indication that this measures the wrong thing?
1024  c    2559.24  2560.99  2557.5   2557.88
2048  r    1487.71  1492.89  1526.47  1530.32
2048  c    1488.77  1490.06  1524.83  1537.17
4096  r    1518.29  1522.16  1557.49  1567.47
4096  c    1524.63  1528.65  1563.42  1567.73
```
I was also talking about wall-clock time in my (Clojure on the JVM) program, not CLTune results. An 8192x8192 SGEMM runs in 293 milliseconds on the R9 290X (5.4 TFLOPS max).
GTX 1080 (8.2 TFLOPS) runs in 220 ms, which makes the numbers pretty consistent in my case.
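Those two timings can be sanity-checked with the usual 2·n³ FLOP count for an n×n×n SGEMM — a quick Python sketch (the efficiency percentages are derived here, not quoted from the posts above):

```python
def tflops(n, seconds):
    # An n x n x n GEMM performs 2*n^3 floating-point operations.
    return 2.0 * n**3 / seconds / 1e12

# R9 290X: 8192^3 SGEMM in 293 ms -> ~3.75 TFLOPS (~69% of the 5.4 TFLOPS peak)
# GTX 1080: same problem in 220 ms -> ~5.00 TFLOPS (~61% of the 8.2 TFLOPS peak)
print(tflops(8192, 0.293), tflops(8192, 0.220))
```

So both cards land at a broadly similar fraction of their theoretical peak, which is what makes the numbers look consistent.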
@blueberry @Ulfgard I've opened issue #169 to have a more detailed discussion on the future of the tuner in CLBlast.
I'll add your tuning data soon to the database, thanks.
Here is my ubuntu16.04 with intel cpu driver: i7-6770hq.zip Tuned for 1.0.1 release. Impressive tool! Let me know if I included the wrong files.
Thanks @theoden8. It took a bit longer than normal since I was in the middle of some database changes, but the results are now added!
Here are the tuning results for a i5-4570 and a GTX580 GTX580.zip i5-4570.zip
Thanks @fzimmermann89, they are both added.
Some more results. Note that beignet (which I used) is 10-20% slower than Intel NEO.
Thank you for your great work! Here are some tuning results for NVidia GeForce GTX 1070 Ti.
Here are some tuning results using POCL (1.2-pre/master) on an Intel i5-4590S. The other tuners segfaulted (#293). i5_4590S_POCL.zip
A little late, but I've added the HD Graphics 6000, GTX 1070 Ti, and i5-4590S results. Thanks all!
Here are some tuning results from Intel Xeon E5-2630 v3 and v4, as well as Nvidia Tesla P100 PCI-E 16 GB. CLBlast_tuners.zip
Tuning results from a Hikey 970 with a Mali-G72 GPU. Do not use these results: when I use Gemm with a size greater than 8, it causes an error in the library. Mali-G72.zip
I tuned CLBlast on an FT-2000plus CPU (2.3 GHz, 64 cores), which is an ARMv8-based many-core CPU. tuned-FT-2000Plus-CPU.tar.gz
Sorry I had overlooked this issue for a while. I've just added tuning results for:
- Intel Xeon E5-2630 v3
- Intel Xeon E5-2630 v4
- NVIDIA Tesla P100
I've not added the results for the ARMv8 machine, since it shows the CPU as device '0x662' from vendor '0x70' in PoCL, perhaps that is not so meaningful. If anyone else is interested they can always take the results from here.
Thanks all for sharing!
I ran tuning using CLBlast 1.5.0 on a NVIDIA Titan RTX (using driver 415.125): titanrtx-415.125.tar.gz
Results for AMD Radeon RX Vega Radeon RX Vega.zip
Thanks for sharing the tuning results! I've just added both the RX Vega and also the Titan RTX (sorry I forgot about it) to CLBlast.
AMD RX 6800 XT (Navi21): amd_rx_6800_xt.tar.gz
My latest results on an RX 6500 XT (this is the Windows 11 22.3 driver; performance on Linux may be a bit better) and a Qualcomm Adreno 540 on an SD835 phone.
I got several compilation error messages on Adreno on Android. The return value -6 means out of host memory (CL_OUT_OF_HOST_MEMORY); I'd look into the memory management to find some clues.