Tullio.jl
Slower performance on GPU than CPU with Zygote
I'm experiencing slower performance when taking gradients on CuArrays than on normal CPU Arrays, using Zygote. Here's an MWE (the Tullio operations implement a mixture linear regression):
using Tullio
using Zygote
using CUDA, KernelAbstractions
CUDA.allowscalar(false)
function prediction(V, W, r)
    @tullio mixture[a, b, d] := V[a, b, c] * W[d, c]
    @tullio r̂[a, d] := mixture[a, b, d] * r[b, d]
    return r̂
end
### CuArray version
V = CuArray(randn(15, 15, 3))
W = CuArray(randn(150, 3))
@time begin
    grad = gradient(Params([V])) do
        l = 0
        for t = 1:5
            r = CuArray(randn(15, 150))
            r̂ = prediction(V, W, r)
            l += sum(r - r̂)
        end
        l
    end
end
# output: 0.304601 seconds (516.80 k allocations: 15.125 MiB)
If I replace all the arrays with CPU Arrays, I get 0.151464 seconds (209.86 k allocations: 13.321 MiB).
I'm wondering if this is expected? Or am I forgetting something when using CUDA arrays that slows down the performance?
I'm away from my GPU, but there are indeed some performance problems which I haven't got around to looking into. IIRC matmul!(a, b, c) from https://github.com/JuliaGPU/KernelAbstractions.jl/blob/master/examples/matmul.jl was faster than
t_mul!(a, b, c) = @tullio c[i,j] = a[i,k] * b[k, j]
although they should be more or less identical.
But it's also possible that randn(15, 15, 3) is just too small to benefit much. And finally, does CuArray(randn(15, 150)) make a CuArray{Float64}, or Float32?
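(For anyone following along, a quick sketch of how to check the element type, and two ways to get Float32, assuming current CUDA.jl defaults:)
using CUDA
eltype(CuArray(randn(15, 150)))           # Float64 — Base.randn defaults to Float64
eltype(CuArray(randn(Float32, 15, 150)))  # Float32
eltype(CUDA.randn(15, 150))               # Float32 — CUDA.randn defaults to Float32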
I tried to use a larger size for the tensors, but the CuArray version just takes too long to run, so it does look like the runtime gets worse with size. It was initialized as Float64, yes, but Float32 didn't make much of a difference in my testing env (Ubuntu 18.04, CUDA 10.2).
I can reproduce this on Google Colab now. And I ran Mason's test, but with a 3rd line for the KernelAbstractions matmul example linked above. Results are:
N = 100
0.000479 seconds (184 CPU allocations: 5.250 KiB) # @tullio
0.000186 seconds (14 CPU allocations: 400 bytes) # mul!, i.e. CuBLAS
0.000285 seconds (233 CPU allocations: 5.531 KiB) # KA's example code
N = 200
0.000818 seconds (698 CPU allocations: 13.281 KiB)
0.000214 seconds (14 CPU allocations: 400 bytes)
0.000654 seconds (743 CPU allocations: 13.500 KiB)
N = 500
0.005852 seconds (8.60 k CPU allocations: 136.688 KiB)
0.000415 seconds (14 CPU allocations: 400 bytes)
0.005665 seconds (8.51 k CPU allocations: 134.781 KiB)
N = 1000
0.044196 seconds (66.68 k CPU allocations: 1.020 MiB)
0.001492 seconds (14 CPU allocations: 400 bytes)
0.041200 seconds (57.60 k CPU allocations: 916.719 KiB)
N = 2000
0.319958 seconds (430.11 k CPU allocations: 6.573 MiB)
0.007682 seconds (14 CPU allocations: 400 bytes)
0.274934 seconds (384.18 k CPU allocations: 5.872 MiB)
N = 5000
4.505518 seconds (6.34 M CPU allocations: 96.697 MiB, 1.09% gc time)
0.084628 seconds (14 CPU allocations: 400 bytes)
4.367660 seconds (6.16 M CPU allocations: 93.929 MiB, 0.48% gc time)
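For reference, a minimal sketch of the kind of N×N comparison behind these numbers (not the exact script, and the third line, KA's example kernel, is omitted here; mul! dispatches to CuBLAS for CuArrays):
using Tullio, KernelAbstractions, CUDA, LinearAlgebra
CUDA.allowscalar(false)

t_mul!(c, a, b) = @tullio c[i, j] = a[i, k] * b[k, j]

for N in (100, 200, 500, 1000, 2000, 5000)
    a = CUDA.rand(N, N); b = CUDA.rand(N, N); c = similar(a)
    t_mul!(c, a, b); mul!(c, a, b)   # warm up / compile first
    println("N = ", N)
    CUDA.@time t_mul!(c, a, b)       # @tullio
    CUDA.@time mul!(c, a, b)         # mul!, i.e. CuBLAS
end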
So it looks like @tullio is correctly producing the most naive KA code (maybe with a few 100μs of overhead), but this is very slow right now compared to optimised ones. Slower than CPU, too. I don't really know anything about GPUs so can't promise to fix that soon. I guess the present state is a bit like using @einsum on the CPU.
In your example, mixture[a, b, d] := V[a, b, c] * W[d, c] is ordinary matrix multiplication, with W transposed & V reshaped. And r̂[a, d] := mixture[a, b, d] * r[b, d] is batched multiplication over d, so it can be written with NNlib.batched_mul after reshaping r. If this is the real operation, that's probably what you should do. (OMEinsum.jl might be able to write these reshapes for you, with the same notation.)
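For concreteness, a sketch of that rewrite (prediction_bmm is a hypothetical name; sizes as in the MWE above, and it assumes NNlib's batched_mul works for the array types you pass, dispatching to CuBLAS for CuArrays when GPU support is loaded):
using NNlib   # for batched_mul

function prediction_bmm(V, W, r)
    a, b, c = size(V)        # e.g. 15, 15, 3
    d = size(W, 1)           # e.g. 150
    # mixture[a, b, d] := V[a, b, c] * W[d, c] is one matmul after reshaping V:
    mixture = reshape(reshape(V, a * b, c) * transpose(W), a, b, d)
    # r̂[a, d] := mixture[a, b, d] * r[b, d] is a matrix-vector product for each d:
    r̂ = batched_mul(mixture, reshape(r, b, 1, d))   # size (a, 1, d)
    return reshape(r̂, a, d)
end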
Thanks for testing! I did rewrite the code with reshape and batched_mul, which is faster on the GPU. I guess I will leave this open for now :)
Is it possible to assume no aliasing? It would be good to add some @Const declarations to the generated kernel.
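For context, a rough sketch of what that means in a hand-written KernelAbstractions kernel (not Tullio's actual generated code): read-only arguments are wrapped in @Const, telling the compiler they don't alias the output.
using KernelAbstractions

@kernel function matmul_const!(c, @Const(a), @Const(b))
    i, j = @index(Global, NTuple)
    acc = zero(eltype(c))
    for k in axes(a, 2)
        acc += a[i, k] * b[k, j]
    end
    c[i, j] = acc
end
# (launch details omitted; see the KernelAbstractions examples linked above)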
Yes, I meant to do that. Any idea if this speeds up the examples/matmul.jl linked above?
I don't know about that example, but I know we about halved the time in DiffEqGPU by adding those designations.
OK, good to hear.
Some possible improvements in #32, but CI isn't happy, and my computer with a GPU isn't answering the phone.
You should join the JuliaGPU group, which has a bunch of GPU CI runners for the community.
Yes, I am a member; it's a great resource to have. https://gitlab.com/JuliaGPU/Tullio.jl/-/jobs/725368493 is my failed test. But further debugging may have to wait until I can try things locally.
More seems to be going on when using Zygote. In this example, Tullio on the GPU beats the CPU version in the forward pass, but the gradient calculation is much slower. Don't Tullio gradients come down to generating a different Tullio expression?
using Tullio, KernelAbstractions, Flux, BenchmarkTools, CUDA
a = rand(100, 200, 3); b = rand(200, 100, 3);
f(a, b) = begin
    @tullio out[m, n, batch] := a[m, i, batch] * b[i, n, batch]
    sum(out)
end

# CPU
@btime f(a, b) # 601.801 μs (201 allocations: 248.08 KiB)
@btime Flux.gradient(a) do a
    f(a, b)
end # 2.499 ms (515 allocations: 1.18 MiB)

# GPU
a = gpu(a); b = gpu(b);
@btime CUDA.@sync f(a, b) # 238.800 μs (224 allocations: 7.50 KiB)
@btime CUDA.@sync Flux.gradient(a) do a
    f(a, b)
end # 4.745 ms (527 allocations: 20.33 KiB)
(Transformer) pkg> st
Status `C:\Users\jules\OneDrive - UGent\Documenten\code\Transformer\Project.toml`
[fbb218c0] BSON v0.2.6
[a4280ba5] BytePairEncoding v0.1.1
[052768ef] CUDA v1.3.3
[864edb3b] DataStructures v0.17.20
[587475ba] Flux v0.11.1
[63c18a36] KernelAbstractions v0.3.3
[899adc3e] TensorBoardLogger v0.1.11
[bc48ee85] Tullio v0.2.5 `C:\Users\jules\.julia\dev\Tullio`
[e88e6eb3] Zygote v0.5.6
Yes, the gradients are computed by very similar loops, details from verbose=true:
┌ Info: symbolic gradients
│ inbody =
│ 2-element Array{Any,1}:
│ :(𝛥a[m, i, batch] = 𝛥a[m, i, batch] + conj(conj(𝛥ℛ[m, n, batch]) * b[i, n, batch]))
└ :(𝛥b[i, n, batch] = 𝛥b[i, n, batch] + conj(conj(𝛥ℛ[m, n, batch]) * a[m, i, batch]))
The whole gradient calculation is about 4x slower than forward alone for me too, both with & without LoopVectorization. That doesn't seem unreasonable. But 20x slower on the GPU isn't good.
Compared to:
julia> using NNlib
julia> @btime batched_mul($a, $b); # about the same as avx=true, 146 μs
137.310 μs (2 allocations: 234.45 KiB)
julia> fwd, back = Zygote.pullback(batched_mul, a, b);
julia> @btime back($fwd);
309.617 μs (5 allocations: 937.69 KiB)
julia> (309.617 + 137.310) / 137.310 # instead of 4x, roughly?
3.2548758284174495
julia> @btime Zygote.gradient(sum∘batched_mul, $a, $b); # FillArrays go to generic_matmul
10.543 ms (24 allocations: 1.15 MiB)
Adding nograd=b ought to speed up the calculation of the a gradient alone, if this is all you need. But here it appears to break @avx, I'm not sure why.
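A sketch of what that option looks like, reusing the f, a and b from the benchmark above (f_nograd is a hypothetical name, and it assumes the option applies unchanged in this case):
f_nograd(a, b) = begin
    # nograd=b tells Tullio not to build a gradient for b
    @tullio out[m, n, batch] := a[m, i, batch] * b[i, n, batch] nograd=b
    sum(out)
end

Flux.gradient(a -> f_nograd(a, b), a)   # only the a gradient is computed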
What actually causes the 4x difference between forward and backward? Is it 4x more computation, or is there a large overhead?
I don't really know what the variables in Tullio's grad expression mean or what their sizes are, but if I wrote these rules manually, would they also run 20x slower on the GPU, do you think? Or is there something else going on?
The naive expectation (for square matrices) would be 3x, as gradient first runs the forward pass, and then the reverse pass has two similar multiplications, one to make 𝛥a and one to make 𝛥b. Here they aren't square: the forward pass contracts the longer index i, so it allocates a smaller result, which maybe changes things a little.
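To make the "write these rules manually" question concrete: the two reverse-pass contractions printed by verbose=true above can themselves be written as @tullio calls (a sketch with the sizes from the benchmark; Δout is all ones because the loss is sum(out), and conj does nothing for real numbers):
using Tullio

a = rand(100, 200, 3); b = rand(200, 100, 3);
Δout = ones(100, 100, 3)   # gradient of sum(out) with respect to out

@tullio Δa[m, i, batch] := Δout[m, n, batch] * b[i, n, batch]   # same size as a
@tullio Δb[i, n, batch] := Δout[m, n, batch] * a[m, i, batch]   # same size as b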
I am not so sure why the GPU code is fast or slow in different cases. KernelAbstractions has a lot of black magic to map loops to GPU hardware, and maybe this case is genuinely harder, or maybe the forward pass is just closer to ordinary matrix multiplication, which was presumably used as a test case when writing it. The code which @tullio produces is like the simplest example here: https://github.com/JuliaGPU/KernelAbstractions.jl/blob/master/examples/performance.jl and perhaps it needs to learn to generate things like the more complicated versions there. I haven't looked into this at all. (On the CPU, it already does some kind of memory tiling.)
Doing these races against matrix multiplication is a good way to find performance bugs, and I remain amazed how close to BLAS it sometimes gets on the CPU. But the intention was more to cover cases for which there isn't a great library function. The CuBLAS routine called by batched_mul is pretty quick! It's possible that @tullio could end up faster for something like permutedims + batched_mul, because permutedims is expensive and it can fuse them; I think the readme has a CPU example of this.
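For illustration, a sketch of the kind of fused contraction meant here (not the readme's example; CPU arrays and hypothetical sizes):
using Tullio, NNlib

a = rand(100, 200, 3); b = rand(100, 200, 3);

# Fused: contract the first index of a against the first index of b, no explicit permutation
@tullio out[i, j, k] := a[s, i, k] * b[s, j, k]

# Same result via NNlib, paying for the permutedims copy first
out2 = batched_mul(permutedims(a, (2, 1, 3)), b)
out ≈ out2   # true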
Since the leading dimensions here are quite small, #82 should improve this a bit, since that enables threading over more than just one dimension. Will probably still be nowhere near batched_mul, though.