GPU performance
I wasn't sure whether to open this issue here or on KernelAbstractions, so I figured I'd open it here first for discussion, and then we can move it to KernelAbstractions later if need be.
Right now, Tullio's GPU performance doesn't seem to scale well to large matrices: it stays around 0.3 TFLOPS while CUBLAS and GemmKernels reach tens of TFLOPS.
Here's an example plot for Matrix{Float32}=Matrix{Float16}×Matrix{Float16}, generated on a Titan V.

Here's the code used to generate it:
using BLASBenchmarksGPU
import CUDA
bench_result = BLASBenchmarksGPU.runbench(:CUDA, Float16, Float16, Float32)
import PyPlot
BLASBenchmarksGPU.plotbench(bench_result, "plot.png")
CUDA.versioninfo()
And here's the output, including all of the TFLOPS values (times in the Time column are in nanoseconds):
julia> using BLASBenchmarksGPU
julia> import CUDA
julia> bench_result = BLASBenchmarksGPU.runbench(:CUDA, Float16, Float16, Float32)
Progress: 100%|███████████████████████████████████████████████████| Time: 0:08:24
Size: 16384
CUBLAS: 62.9 TFLOPS
GemmKernels: 66.19 TFLOPS
Tullio: 0.29 TFLOPS
Benchmark Result of Matrix{Float32}=Matrix{Float16}×Matrix{Float16}
24×4 DataFrame
 Row │  Size  Library          TFLOPS        Time
     │ Int64  Symbol          Float64     Float64
─────┼───────────────────────────────────────────────
   1 │   128  CUBLAS         0.148099     28321.0
   2 │   128  GemmKernels   0.0332214    126253.0
   3 │   128  Tullio        0.0540789     77559.0
   4 │   256  CUBLAS         0.642928     52190.0
   5 │   256  GemmKernels    0.268756    124851.0
   6 │   256  Tullio         0.310169    108181.0
   7 │   512  CUBLAS          4.36686     61471.0
   8 │   512  GemmKernels     2.02769    132385.0
   9 │   512  Tullio         0.677953    395950.0
  10 │  1024  CUBLAS           25.168     85326.0
  11 │  1024  GemmKernels     14.2662    150529.0
  12 │  1024  Tullio         0.850125   2.52608e6
  13 │  2048  CUBLAS          59.5005    288735.0
  14 │  2048  GemmKernels     36.4345    471527.0
  15 │  2048  Tullio          0.76592   2.24304e7
  16 │  4096  CUBLAS          84.8186   1.62039e6
  17 │  4096  GemmKernels      57.559   2.38779e6
  18 │  4096  Tullio         0.393097   3.49631e8
  19 │  8192  CUBLAS          90.1617   1.21949e7
  20 │  8192  GemmKernels     59.9127   1.83519e7
  21 │  8192  Tullio         0.324503   3.38829e9
  22 │ 16384  CUBLAS          62.9003   1.39842e8
  23 │ 16384  GemmKernels     66.1909    1.3289e8
  24 │ 16384  Tullio         0.294184  2.98999e10
julia> import PyPlot
julia> BLASBenchmarksGPU.plotbench(bench_result, "plot.png")
julia> CUDA.versioninfo()
CUDA toolkit 11.1.1, artifact installation
CUDA driver 11.2.0
NVIDIA driver 460.27.4
Libraries:
- CUBLAS: 11.3.0
- CURAND: 10.2.2
- CUFFT: 10.3.0
- CUSOLVER: 11.0.1
- CUSPARSE: 11.3.0
- CUPTI: 14.0.0
- NVML: 11.0.0+460.27.4
- CUDNN: 8.0.4 (for CUDA 11.1.0)
- CUTENSOR: 1.2.1 (for CUDA 11.1.0)
Toolchain:
- Julia: 1.6.0-beta1
- LLVM: 11.0.0
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0
- Device support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80
Environment:
- JULIA_CUDA_VERBOSE: true
1 device:
0: TITAN V (sm_70, 658.500 MiB / 11.784 GiB available)
Here is a plot for Matrix{Float32}=Matrix{Float32}×Matrix{Float32}:

And here are the TFLOPS values:
14×4 DataFrame
 Row │  Size  Library     TFLOPS       Time
     │ Int64  Symbol     Float64    Float64
─────┼──────────────────────────────────────
   1 │   128  CUBLAS    0.189052    22186.0
   2 │   128  Tullio   0.0552115    75968.0
   3 │   256  CUBLAS     0.93772    35783.0
   4 │   256  Tullio    0.290567   115479.0
   5 │   512  CUBLAS      3.5438    75748.0
   6 │   512  Tullio    0.633555   423697.0
   7 │  1024  CUBLAS      19.232   111662.0
   8 │  1024  Tullio    0.745863   2.8792e6
   9 │  2048  CUBLAS     41.5728   413248.0
  10 │  2048  Tullio    0.535823  3.20626e7
  11 │  4096  CUBLAS     49.3898  2.78274e6
  12 │  4096  Tullio     0.31201  4.40496e8
  13 │  8192  CUBLAS     39.9035  2.75543e7
  14 │  8192  Tullio    0.298638  3.68176e9
Here is Matrix{Float32}=Matrix{Float32}×Matrix{Float32} at small matrix sizes:

And here are the TFLOPS values:
 Row │  Size  Library       TFLOPS      Time
     │ Int64  Symbol       Float64   Float64
─────┼──────────────────────────────────────
   1 │     1  CUBLAS    1.07614e-7   18585.0
   2 │     1  Tullio     1.1015e-7   18157.0
   3 │     2  CUBLAS    1.12859e-6   14177.0
   4 │     2  Tullio    8.46158e-7   18909.0
   5 │     3  CUBLAS    3.05603e-6   17670.0
   6 │     3  Tullio    2.58807e-6   20865.0
   7 │     4  CUBLAS    7.21411e-6   17743.0
   8 │     4  Tullio    6.12821e-6   20887.0
   9 │     5  CUBLAS    1.40473e-5   17797.0
  10 │     5  Tullio    1.19355e-5   20946.0
  11 │     6  CUBLAS    2.91695e-5   14810.0
  12 │     6  Tullio    2.05695e-5   21002.0
  13 │     7  CUBLAS    4.50575e-5   15225.0
  14 │     7  Tullio    3.25922e-5   21048.0
  15 │     8  CUBLAS    6.45772e-5   15857.0
  16 │     8  Tullio    4.75682e-5   21527.0
  17 │     9  CUBLAS    8.25875e-5   17654.0
  18 │     9  Tullio    6.71889e-5   21700.0
  19 │    10  CUBLAS   0.000111744   17898.0
  20 │    10  Tullio    9.05305e-5   22092.0
  21 │    30  CUBLAS    0.00295421   18279.0
  22 │    30  Tullio    0.00176655   30568.0
  23 │    50  CUBLAS      0.011872   21058.0
  24 │    50  Tullio    0.00448053   55797.0
  25 │    70  CUBLAS     0.0324304   21153.0
  26 │    70  Tullio     0.0105823   64825.0
  27 │    90  CUBLAS     0.0624679   23340.0
  28 │    90  Tullio      0.023251   62707.0
  29 │   110  CUBLAS      0.107313   24806.0
  30 │   110  Tullio     0.0386391   68894.0
  31 │   130  CUBLAS      0.194994   22534.0
  32 │   130  Tullio     0.0501592   87601.0
  33 │   150  CUBLAS      0.238196   28338.0
  34 │   150  Tullio     0.0735534   91770.0
  35 │   170  CUBLAS      0.395524   24843.0
  36 │   170  Tullio      0.104512   94018.0
  37 │   190  CUBLAS      0.437074   31386.0
  38 │   190  Tullio      0.140838   97403.0
  39 │   210  CUBLAS      0.548053   33796.0
  40 │   210  Tullio      0.177458  104374.0
  41 │   230  CUBLAS       0.76275   31903.0
  42 │   230  Tullio      0.223804  108729.0
  43 │   250  CUBLAS       1.06841   29249.0
  44 │   250  Tullio      0.273533  114246.0
Thanks for looking into all this. That's certainly not a graph which will impress anyone!
I'd call it disappointing but unsurprising. Tullio generates the most naive possible KernelAbstractions code, and I know from e.g. these transpose examples that there is much more that could be done. How easily and how well this can be done automatically, for arbitrary expressions, I don't know.
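For concreteness, here is roughly what such a naive kernel looks like; this is a hand-written sketch rather than Tullio's actual generated code, with one work-item per element of C, every read of A and B going straight to global memory, and no data reuse between neighbouring work-items:
using KernelAbstractions

# Sketch of a naive KernelAbstractions matmul (illustrative, not Tullio's output):
# each work-item computes one C[i, j], reading A and B directly from global memory.
@kernel function matmul_naive!(C, A, B)
    i, j = @index(Global, NTuple)
    acc = zero(eltype(C))
    for k in axes(A, 2)
        @inbounds acc += A[i, k] * B[k, j]
    end
    @inbounds C[i, j] = acc
end

# Launch (API details vary between KernelAbstractions versions):
# kernel! = matmul_naive!(CUDADevice(), (16, 16))
# wait(kernel!(C, A, B, ndrange = size(C)))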
I also don't know whether there are operations on which this naive code is more competitive. On the CPU, large matmul is a severe test of how you handle awkward memory access, but something like @tullio d[i] := p[i,k] * log(p[i,k]/q[i,k]) has much more work per element and a more obvious access pattern.
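For reference, that kind of reduction would be written as below; the array names and sizes are arbitrary, and the exact set of packages Tullio needs for GPU execution has varied between versions, so treat the using line as approximate:
using Tullio, CUDA, KernelAbstractions  # depending on versions, a CUDA glue package may also be needed

p = CUDA.rand(Float32, 4096, 4096)
q = CUDA.rand(Float32, 4096, 4096)

# One reduction per row: far more arithmetic per element than matmul,
# and each work-item simply streams through one row of p and q.
@tullio d[i] := p[i, k] * log(p[i, k] / q[i, k])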
Would be nice for Tullio to do better at these things, but I don't even have a working GPU at the moment, and I have many projects. If anyone wanted to try, that would be fantastic; I think the code that generates the loops can be played with without diving into the more tangled macro code for parsing etc.
For the purposes of BLASBenchmarksGPU, there may already exist much better hand-written KernelAbstractions matmul code; no need to auto-generate it. And it might be interesting to see how varying levels of cleverness affect things at different sizes.
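As a rough illustration of what more cleverness means here, a tiled KernelAbstractions matmul along the lines of the transpose examples stages blocks of A and B through shared memory before multiplying them. The kernel below is a hand-written sketch, not code from Tullio or BLASBenchmarksGPU, and it assumes square matrices with sizes divisible by TILE and a workgroup of TILE×TILE:
using KernelAbstractions

const TILE = 16  # workgroup is TILE×TILE; matrix sizes assumed divisible by TILE

@kernel function matmul_tiled!(C, A, B)
    i, j   = @index(Global, NTuple)   # element of C owned by this work-item
    li, lj = @index(Local, NTuple)    # position inside the TILE×TILE workgroup

    tileA = @localmem eltype(C) (TILE, TILE)
    tileB = @localmem eltype(C) (TILE, TILE)

    # accumulator kept in a @private slot so it survives the @synchronize points
    acc = @private eltype(C) 1
    @inbounds acc[1] = zero(eltype(C))

    for t in 0:div(size(A, 2), TILE) - 1
        # stage one TILE×TILE block of A and of B into shared memory
        @inbounds tileA[li, lj] = A[i, t * TILE + lj]
        @inbounds tileB[li, lj] = B[t * TILE + li, j]
        @synchronize
        # multiply the staged tiles; these reads hit shared memory, not DRAM
        for k in 1:TILE
            @inbounds acc[1] += tileA[li, k] * tileB[k, lj]
        end
        @synchronize
    end
    @inbounds C[i, j] = acc[1]
end

# Launch with workgroup (TILE, TILE), e.g. (API varies by version):
# wait(matmul_tiled!(CUDADevice(), (TILE, TILE))(C, A, B, ndrange = size(C)))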
Re the Float32 graph, in round numbers my CPU gets to about half a teraflop from matrix sizes in the low hundreds onwards, which appears to be faster than Tullio's GPU matmul over the range shown. My graph is here:
https://github.com/mcabbott/Tullio.jl/blob/master/benchmarks/02/matmul-0.2.11-Float32-1.6.0-dev.png
And xref also #30 about GPU vs CPU speed.
I would be curious to know whether #82 helps here. I am not expecting a huge difference, but it might help a little.