
GPU performance

DilumAluthge opened this issue 3 years ago · 5 comments

I wasn't sure whether to open this issue here or on KernelAbstractions, so I figured I'd open it here first for discussion, and then we can move it to KernelAbstractions later if need be.

Right now, the GPU performance doesn't seem to scale very well on large matrices.

Here's an example plot for Matrix{Float32}=Matrix{Float16}×Matrix{Float16}, generated on a Titan V.

[plot: TFLOPS vs. matrix size for CUBLAS, GemmKernels, and Tullio]

Here's the code used to generate it:

using BLASBenchmarksGPU
import CUDA
bench_result = BLASBenchmarksGPU.runbench(:CUDA, Float16, Float16, Float32)
import PyPlot
BLASBenchmarksGPU.plotbench(bench_result, "plot.png")
CUDA.versioninfo()

And here's the output, including all of the TFLOPS values (times are in nanoseconds):

julia> using BLASBenchmarksGPU

julia> import CUDA

julia> bench_result = BLASBenchmarksGPU.runbench(:CUDA, Float16, Float16, Float32)
Progress: 100%|███████████████████████████████████████████████████| Time: 0:08:24
  Size:         16384
  CUBLAS:       62.9 TFLOPS
  GemmKernels:  66.19 TFLOPS
  Tullio:       0.29 TFLOPS
Benchmark Result of Matrix{Float32}=Matrix{Float16}×Matrix{Float16}
24×4 DataFrame
 Row │ Size   Library      TFLOPS      Time
     │ Int64  Symbol       Float64     Float64
─────┼─────────────────────────────────────────────────
   1 │   128  CUBLAS        0.148099    28321.0
   2 │   128  GemmKernels   0.0332214  126253.0
   3 │   128  Tullio        0.0540789   77559.0
   4 │   256  CUBLAS        0.642928    52190.0
   5 │   256  GemmKernels   0.268756   124851.0
   6 │   256  Tullio        0.310169   108181.0
   7 │   512  CUBLAS        4.36686     61471.0
   8 │   512  GemmKernels   2.02769    132385.0
   9 │   512  Tullio        0.677953   395950.0
  10 │  1024  CUBLAS       25.168       85326.0
  11 │  1024  GemmKernels  14.2662     150529.0
  12 │  1024  Tullio        0.850125        2.52608e6
  13 │  2048  CUBLAS       59.5005     288735.0
  14 │  2048  GemmKernels  36.4345     471527.0
  15 │  2048  Tullio        0.76592         2.24304e7
  16 │  4096  CUBLAS       84.8186          1.62039e6
  17 │  4096  GemmKernels  57.559           2.38779e6
  18 │  4096  Tullio        0.393097        3.49631e8
  19 │  8192  CUBLAS       90.1617          1.21949e7
  20 │  8192  GemmKernels  59.9127          1.83519e7
  21 │  8192  Tullio        0.324503        3.38829e9
  22 │ 16384  CUBLAS       62.9003          1.39842e8
  23 │ 16384  GemmKernels  66.1909          1.3289e8
  24 │ 16384  Tullio        0.294184        2.98999e10


julia> import PyPlot

julia> BLASBenchmarksGPU.plotbench(bench_result, "plot.png")

julia> CUDA.versioninfo()
CUDA toolkit 11.1.1, artifact installation
CUDA driver 11.2.0
NVIDIA driver 460.27.4

Libraries:
- CUBLAS: 11.3.0
- CURAND: 10.2.2
- CUFFT: 10.3.0
- CUSOLVER: 11.0.1
- CUSPARSE: 11.3.0
- CUPTI: 14.0.0
- NVML: 11.0.0+460.27.4
- CUDNN: 8.0.4 (for CUDA 11.1.0)
- CUTENSOR: 1.2.1 (for CUDA 11.1.0)

Toolchain:
- Julia: 1.6.0-beta1
- LLVM: 11.0.0
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0
- Device support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80

Environment:
- JULIA_CUDA_VERBOSE: true

1 device:
  0: TITAN V (sm_70, 658.500 MiB / 11.784 GiB available)
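For reference, the Tullio side of this benchmark is essentially a plain einsum-style matmul on CuArrays. A minimal standalone check looks something like the sketch below (this is not the BLASBenchmarksGPU harness; it assumes CUDA, KernelAbstractions and CUDAKernels are loaded so that @tullio takes its GPU path, the package requirements have shifted across versions, and the size is illustrative):

using Tullio, CUDA, KernelAbstractions, CUDAKernels

# In-place matmul via Tullio; on CuArrays this runs through a KernelAbstractions kernel.
matmul!(C, A, B) = @tullio C[i, j] = A[i, k] * B[k, j]

N = 4096
A = CUDA.rand(Float32, N, N)
B = CUDA.rand(Float32, N, N)
C = CUDA.zeros(Float32, N, N)

CUDA.@sync matmul!(C, A, B)                # warm-up / compilation
t = @elapsed CUDA.@sync matmul!(C, A, B)   # wall time for one call, in seconds
println(round(2 * N^3 / t / 1e12, digits = 2), " TFLOPS")

A one-off timing like this won't match the BenchmarkTools-based numbers above exactly, but it should reproduce the order of magnitude.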

DilumAluthge · Feb 01 '21 05:02

Here is a plot for Matrix{Float32}=Matrix{Float32}×Matrix{Float32}:

[plot: TFLOPS vs. matrix size for CUBLAS and Tullio]

And here are the TFLOPS values:

14×4 DataFrame
 Row │ Size   Library  TFLOPS      Time
     │ Int64  Symbol   Float64     Float64
─────┼────────────────────────────────────────────
   1 │   128  CUBLAS    0.189052    22186.0
   2 │   128  Tullio    0.0552115   75968.0
   3 │   256  CUBLAS    0.93772     35783.0
   4 │   256  Tullio    0.290567   115479.0
   5 │   512  CUBLAS    3.5438      75748.0
   6 │   512  Tullio    0.633555   423697.0
   7 │  1024  CUBLAS   19.232      111662.0
   8 │  1024  Tullio    0.745863        2.8792e6
   9 │  2048  CUBLAS   41.5728     413248.0
  10 │  2048  Tullio    0.535823        3.20626e7
  11 │  4096  CUBLAS   49.3898          2.78274e6
  12 │  4096  Tullio    0.31201         4.40496e8
  13 │  8192  CUBLAS   39.9035          2.75543e7
  14 │  8192  Tullio    0.298638        3.68176e9

DilumAluthge · Feb 01 '21 06:02

Here is Matrix{Float32}=Matrix{Float32}×Matrix{Float32} at small matrix sizes:

[plot: TFLOPS vs. matrix size for CUBLAS and Tullio, small sizes]

And here are the TFLOPS values:

 Row │ Size   Library  TFLOPS       Time
     │ Int64  Symbol   Float64      Float64
─────┼───────────────────────────────────────
   1 │     1  CUBLAS   1.07614e-7    18585.0
   2 │     1  Tullio   1.1015e-7     18157.0
   3 │     2  CUBLAS   1.12859e-6    14177.0
   4 │     2  Tullio   8.46158e-7    18909.0
   5 │     3  CUBLAS   3.05603e-6    17670.0
   6 │     3  Tullio   2.58807e-6    20865.0
   7 │     4  CUBLAS   7.21411e-6    17743.0
   8 │     4  Tullio   6.12821e-6    20887.0
   9 │     5  CUBLAS   1.40473e-5    17797.0
  10 │     5  Tullio   1.19355e-5    20946.0
  11 │     6  CUBLAS   2.91695e-5    14810.0
  12 │     6  Tullio   2.05695e-5    21002.0
  13 │     7  CUBLAS   4.50575e-5    15225.0
  14 │     7  Tullio   3.25922e-5    21048.0
  15 │     8  CUBLAS   6.45772e-5    15857.0
  16 │     8  Tullio   4.75682e-5    21527.0
  17 │     9  CUBLAS   8.25875e-5    17654.0
  18 │     9  Tullio   6.71889e-5    21700.0
  19 │    10  CUBLAS   0.000111744   17898.0
  20 │    10  Tullio   9.05305e-5    22092.0
  21 │    30  CUBLAS   0.00295421    18279.0
  22 │    30  Tullio   0.00176655    30568.0
  23 │    50  CUBLAS   0.011872      21058.0
  24 │    50  Tullio   0.00448053    55797.0
  25 │    70  CUBLAS   0.0324304     21153.0
  26 │    70  Tullio   0.0105823     64825.0
  27 │    90  CUBLAS   0.0624679     23340.0
  28 │    90  Tullio   0.023251      62707.0
  29 │   110  CUBLAS   0.107313      24806.0
  30 │   110  Tullio   0.0386391     68894.0
  31 │   130  CUBLAS   0.194994      22534.0
  32 │   130  Tullio   0.0501592     87601.0
  33 │   150  CUBLAS   0.238196      28338.0
  34 │   150  Tullio   0.0735534     91770.0
  35 │   170  CUBLAS   0.395524      24843.0
  36 │   170  Tullio   0.104512      94018.0
  37 │   190  CUBLAS   0.437074      31386.0
  38 │   190  Tullio   0.140838      97403.0
  39 │   210  CUBLAS   0.548053      33796.0
  40 │   210  Tullio   0.177458     104374.0
  41 │   230  CUBLAS   0.76275       31903.0
  42 │   230  Tullio   0.223804     108729.0
  43 │   250  CUBLAS   1.06841       29249.0
  44 │   250  Tullio   0.273533     114246.0

DilumAluthge · Feb 01 '21 06:02

Thanks for looking into all this. That's certainly not a graph which will impress anyone!

I'd call it disappointing but unsurprising. Tullio generates the most naive possible KernelAbstractions code, and I know from e.g. these transpose examples that there is much more that could be done. How easily and how well this can be done automatically, for arbitrary expressions, I don't know.
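For concreteness, the naive approach is roughly one work-item per output element with a serial loop over the reduction index, along these lines (a sketch only, written against the KernelAbstractions + CUDAKernels API of that era; it is not the code Tullio actually generates):

using CUDA, KernelAbstractions, CUDAKernels

# One work-item per element of C; every A[i,k] and B[k,j] is read straight from
# global memory, with no tiling or shared-memory staging.
@kernel function naive_matmul!(C, @Const(A), @Const(B))
    i, j = @index(Global, NTuple)
    acc = zero(eltype(C))
    for k in axes(A, 2)
        acc += A[i, k] * B[k, j]
    end
    C[i, j] = acc
end

A, B = CUDA.rand(Float32, 1024, 1024), CUDA.rand(Float32, 1024, 1024)
C = CUDA.zeros(Float32, 1024, 1024)
kernel! = naive_matmul!(CUDADevice(), (16, 16))
wait(kernel!(C, A, B; ndrange = size(C)))

The usual next step in cleverness is to stage tiles of A and B through @localmem (shared memory) so that each global element is loaded far fewer times, which is where the large gap to CUBLAS and GemmKernels comes from.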

I also don't know whether there are operations on which this naive code is more competitive. On the CPU, large matmul is a severe test of how you handle awkward memory access, but something like @tullio d[i] := p[i,k] * log(p[i,k]/q[i,k]) has much more work per element and a more obvious access pattern.
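One rough way to check that would be to time such an expression directly on the GPU, along these lines (a sketch with illustrative sizes; the small offset just keeps the argument of log positive):

using Tullio, CUDA, KernelAbstractions, CUDAKernels

# Row-wise sum of p .* log.(p ./ q): the same reduction shape as a matrix-vector
# product, but with far more arithmetic per element loaded from memory.
kl_rows(p, q) = @tullio d[i] := p[i, k] * log(p[i, k] / q[i, k])

p = CUDA.rand(Float32, 4096, 4096) .+ 1f-6
q = CUDA.rand(Float32, 4096, 4096) .+ 1f-6

CUDA.@sync kl_rows(p, q)                  # warm-up / compilation
t = @elapsed CUDA.@sync kl_rows(p, q)     # wall time for one call, in seconds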

It would be nice for Tullio to do better at these things, but I don't even have a working GPU at the moment, and I have many projects. If anyone wanted to try, that would be fantastic; I think the code that generates the loops can be played with without diving into the more tangled macro code for parsing etc.

For the purposes of BLASBenchmarksGPU, there may already exist much better KernelAbstractions matmul code, so there's no need to auto-generate it. And it might be interesting to see how varying levels of cleverness affect things at different sizes.

mcabbott · Feb 01 '21 06:02

Re the Float32 graph: in round numbers, my CPU gets to about half a teraflop from matrix sizes in the low hundreds onwards, which appears to be faster than Tullio's GPU matmul across the range shown. My graph is here:

https://github.com/mcabbott/Tullio.jl/blob/master/benchmarks/02/matmul-0.2.11-Float32-1.6.0-dev.png

And xref also #30 about GPU vs CPU speed.

mcabbott · Feb 01 '21 07:02

I would be curious to know whether #82 helps here. I am not expecting a huge difference, but it might help a little.

simeonschaub · Feb 08 '21 22:02