substantial performance issues when combining Tullio and Zygote
I'm comparing performance of the built-in Dense layer with an equivalent Linear layer and a Dist layer. There are a couple of problems.
Full code, output, Manifest, Project: https://gist.github.com/12b7cfc21482419aa39e7e9dfbb7cc32
The most important definitions are these:
function linear_kernel(w, b, x)
    @tullio result[i, j] := w[k, i] * x[k, j]
    return result .+ b
end
struct Linear
    w
    b
end
function Linear(in::Integer, out::Integer; scale=1.0)
    return Linear(scale*randn(in, out) |> fp32, scale*randn(out) |> fp32)
end
Flux.@functor Linear
function (self::Linear)(x::AbstractArray)
    return linear_kernel(self.w, self.b, x)
end
...
model = Linear(784, 1024) |> gpu
function loss(x)
    return Flux.Losses.mse(model(x), y)
end
CUDA.@time model(x);
CUDA.@time Zygote.pullback(model, x)[2](y);
CUDA.@time gradient(loss, x)
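For context, the Dense baseline and data these timings are compared against look roughly like this (a sketch: the batch size, the fp32 helper, and y here are placeholders standing in for the definitions in the gist):

using Flux, Zygote, CUDA, Tullio
fp32(a) = Float32.(a)                    # placeholder for the gist's fp32 helper
x = randn(Float32, 784, 256) |> gpu      # assumed input batch
y = randn(Float32, 1024, 256) |> gpu     # assumed target for the mse loss
dense = Dense(784, 1024) |> gpu          # built-in layer used as the baseline
CUDA.@time dense(x);
CUDA.@time Zygote.pullback(dense, x)[2](y);
CUDA.@time gradient(x -> Flux.Losses.mse(dense(x), y), x);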
PROBLEM 1 -- CPU PERFORMANCE
The Tullio-based Linear layer is significantly slower than the built-in Dense layer, but that is perhaps understandable, and it's fast enough to be useful.
But while the pullback/gradient are within a factor of two of the forward computation for the Dense layer, they are both 20x slower for the Tullio-based layers.
PROBLEM 2 -- GPU PERFORMANCE
On GPU, the pattern is different: the Zygote pullback computation is within a factor of two of the forward computation, but the Zygote gradient computation is three orders of magnitude slower than the forward computation. It is baffling to me that a fast pullback can result in such a slow gradient computation.
Unfortunately this is a known problem, e.g. https://github.com/mcabbott/Tullio.jl/issues/30 . I have not worked much on making this faster since then.
OK, thanks. It turns out I needed to use CUDA.@time; with that change the data makes more sense, and the pullback and gradient are about equally slow, roughly 20x-50x the time of the forward computation. I've updated the bug description.
For the forward pass it has a loop nest like this (from verbose=2 output):
for j = 𝒶𝓍j
    for i = 𝒶𝓍i
        𝒜𝒸𝒸 = ♻️ === nothing ? zero(𝒯) : ℛ[i, j]  # If it breaks the iteration on k, it re-starts with value.
        for k = 𝒶𝓍k  # Innermost iteration is along 1st index
            𝒜𝒸𝒸 = 𝒜𝒸𝒸 + w[k, i] * x[k, j]  # Does not write to output in the accumulation loop over k
        end
        ℛ[i, j] = 𝒜𝒸𝒸  # This is safe to multi-thread over i, j
...
# This is how it encodes threading over i, j but not k (262144 is a threshold for threading):
threader(𝒜𝒸𝓉!, storage_type(result, w, x), result, tuple(w, x), tuple(𝒶𝓍i, 𝒶𝓍j), tuple(𝒶𝓍k), +, 262144, nothing)
The gradient looks more like this:
for k = 𝒶𝓍k  # Outermost loop is the 1st index
    for j = 𝒶𝓍j
        for i = 𝒶𝓍i
            ℰ𝓍1 = conj(x[k, j])  # (CSE is generating these, although no help here)
            ℰ𝓍2 = 𝛥ℛ[i, j] * ℰ𝓍1
            ℰ𝓍3 = conj(w[k, i])
            ℰ𝓍4 = 𝛥ℛ[i, j] * ℰ𝓍3
            𝛥w[k, i] = 𝛥w[k, i] + ℰ𝓍2  # Accumulates directly into output arrays, at every step
            𝛥x[k, j] = 𝛥x[k, j] + ℰ𝓍4  # These are safe to multi-thread only along k, and it knows
        end
...
∇threader(∇𝒜𝒸𝓉!, storage_type(𝛥w, 𝛥x, w, x), tuple(𝛥w, 𝛥x, 𝛥ℛ, ℛ, w, x), tuple(𝒶𝓍k), tuple(𝒶𝓍i, 𝒶𝓍j), 262144)
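Both loop nests above come from what the macro prints about itself; to reproduce them, something like this (sizes are placeholders):

using Tullio
w = rand(Float32, 784, 1024); x = rand(Float32, 784, 256)
# verbose=2 makes the macro print everything it generates, including the actor
# functions, threader calls, and gradient code quoted above:
@tullio verbose=2 result[i, j] := w[k, i] * x[k, j]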
One thing which could be investigated here is how inefficient this write-to-two-arrays loop nest is. The same expression with nograd=x generates only 𝛥w, and it appears that it's smart enough to then thread both i and k.
for i = 𝒶𝓍i
    for k = 𝒶𝓍k  # Now first index is 2nd loop
        for j = 𝒶𝓍j  # reduction index
            ℰ𝓍1 = conj(x[k, j])
            ℰ𝓍2 = 𝛥ℛ[i, j] * ℰ𝓍1
            𝛥w[k, i] = 𝛥w[k, i] + ℰ𝓍2  # Still reads & writes 𝛥w at every j. But safe to thread i & k
        end
...
∇threader(∇𝒜𝒸𝓉!, storage_type(𝛥w, 𝛥x, w, x), tuple(𝛥w, 𝛥x, 𝛥ℛ, ℛ, w, x), tuple(𝒶𝓍k, 𝒶𝓍i), tuple(𝒶𝓍j), 262144)
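A rough way to time the two pullbacks against each other would be something like this (a sketch with placeholder sizes, not something I have run here):

using Tullio, Zygote, BenchmarkTools
w = rand(Float32, 784, 1024); x = rand(Float32, 784, 256)
both(w, x)   = @tullio r[i, j] := w[k, i] * x[k, j]            # generates 𝛥w and 𝛥x
only_w(w, x) = @tullio nograd=x r[i, j] := w[k, i] * x[k, j]   # generates only 𝛥w
Δ = ones(Float32, 1024, 256)
@btime Zygote.pullback(both, $w, $x)[2]($Δ)
@btime Zygote.pullback(only_w, $w, $x)[2]($Δ)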
If the pullback is only twice as fast that's no help, but if it's 10x faster that would motivate changing the macro to generate a separate loop nest per gradient. Not sure I ever did this comparison. Another step would be to see how much it matters to change the pattern to something like this; it's harder to check, as you'd have to edit the code it generates to try it:
for i = 𝒶𝓍i
    for k = 𝒶𝓍k
        acc = zero(T)
        for j = 𝒶𝓍j
            ℰ𝓍1 = conj(x[k, j])
            ℰ𝓍2 = 𝛥ℛ[i, j] * ℰ𝓍1
            acc += ℰ𝓍2
        end
        𝛥w[k, i] = 𝛥w[k, i] + acc  # read & write once
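Written out as a complete function, the pattern I mean would look roughly like this (a hand-written sketch of the 𝛥w part only, not code Tullio generates today):

function grad_w!(𝛥w, 𝛥ℛ, w, x)
    for i in axes(w, 2), k in axes(w, 1)
        acc = zero(eltype(𝛥w))
        for j in axes(x, 2)
            acc += 𝛥ℛ[i, j] * conj(x[k, j])
        end
        𝛥w[k, i] += acc   # the output array is read & written once per (k, i)
    end
    return 𝛥w
end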
Changing the macro to implement either of these is unfortunately not a small job. It's not as modular as it ought to be... I knew precisely how to do a better job around the time I stopped working much on this, of course.
Less ambitiously, notice that the order of loops changes in these expressions. There is some chance it's picking a bad one, and this would be relatively easy to solve. To time this you should probably @macroexpand1 @tullio ... and then just find the loops & edit them. Tullio does not know that arrays are column-major; the order is chosen based on which are reduction indices, etc. It's still possible that it makes a pessimal choice for the gradient. Whether KernelAbstractions cares about or fixes this, I don't know.
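Concretely, to try a different loop order, something like this (a sketch; MacroTools.prettify is optional, just to make the expansion readable):

using Tullio, MacroTools
ex = @macroexpand1 @tullio result[i, j] := w[k, i] * x[k, j]
println(MacroTools.prettify(ex))
# find the ∇... loops in the printed code, reorder them by hand, then eval the edited expression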
Tullio returns symbolic derivatives that look like they are almost directly usable as Tullio expressions; why aren't those used directly?
I seem to get decent performance now with a simple rrule using Tullio itself:
function dist_kernel(w::AbstractArray{Float32}, x::AbstractArray{Float32})
    @tullio grad=false result[i, j] := (w[k, i] - x[k, j])^2
end
function dist_kernel_pb(w, x)
    @tullio grad=false result[i, j] := (w[k, i] - x[k, j])^2
    return result, (Δresult) -> begin
        @tullio grad=false Δw[k, i] := + Δresult[i, j] * 2 * (w[k, i] - x[k, j])
        @tullio grad=false Δx[k, j] := - Δresult[i, j] * 2 * (w[k, i] - x[k, j])
        return (ChainRules.NoTangent(), Δw, Δx)
    end
end
ChainRules.rrule(::typeof(dist_kernel), w, x) = dist_kernel_pb(w, x)
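A quick sanity check that this rule is being picked up (a sketch with placeholder sizes; Zygote finds rrule methods via ChainRulesCore):

using Zygote
w = rand(Float32, 784, 1024); x = rand(Float32, 784, 256)
Δw, Δx = Zygote.gradient((w, x) -> sum(dist_kernel(w, x)), w, x)
@assert size(Δw) == size(w) && size(Δx) == size(x)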
I think the reason it doesn't directly write the gradients as separate calls is that it doesn't know how to solve for the shifts necessary for convolutions etc. The present rewriting just accumulates into arrays at weird indices, but it will tend to get things wrong if used more directly, as in https://github.com/mcabbott/Tullio.jl/issues/136. Maybe that could be solved somehow.
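To make the shift problem concrete, a small sketch (an illustration, not from the issue):

using Tullio
x = rand(10); kern = rand(3)
@tullio y[i] := x[i+j-1] * kern[j]   # i ranges over 1:8, inferred from the shift
# The reverse pass effectively does 𝛥x[i+j-1] += 𝛥y[i] * kern[j]: the write index is
# shifted, so emitting this as a standalone `@tullio 𝛥x[a] := ...` would require
# solving the shift for the left-hand index and getting the ranges right.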
Wouldn't that be fixable by either disallowing index computations on the LHS, or at least special-casing the situation where there are no index computations on the LHS and warning about the other case?
Is there some formal specification/analysis of the expressions that Tullio accepts? It's perhaps not surprising that it's difficult to reason about these expressions otherwise.
They should be disallowed on the LHS; that they aren't is a bug. But the gradient calculation always wants to move them to the left, since you're now writing into the gradient of an input array.
Agree you could special-case the easy cases. So far this package hasn't done that at all, just to avoid duplication... it's already far too complicated.
I have not tried to write anything up. The whole package was an exploration of what's possible / what's fast. Ideally v2 would have a more fixed scope, and could be much cleaner, but I'd need to clone myself first... I believe https://github.com/mcabbott/Tullio.jl/issues/70 is a logic mistake in how it handles min/max gradients, I had a fix but didn't get to implementing it. (Nor a gradient for reductions over *.)
I was able to speed up the CPU variants ~5x by enabling threading on the Julia side (i.e. julia -t auto) to match the BLAS threadpool that Dense is using. I also used BenchmarkTools.jl and CUDA.@sync for more reliable timings:
=== Dense cpu
4.393 ms (4 allocations: 8.00 MiB)
12.858 ms (16 allocations: 14.13 MiB)
18.108 ms (68 allocations: 46.13 MiB)
=== Dense gpu
1.725 ms (113 allocations: 5.61 KiB)
5.804 ms (214 allocations: 8.70 KiB)
7.904 ms (630 allocations: 33.77 KiB)
=== Linear cpu
22.505 ms (140 allocations: 8.01 MiB)
333.240 ms (388 allocations: 14.15 MiB)
336.375 ms (448 allocations: 46.15 MiB)
=== Linear gpu
141.146 ms (194 allocations: 9.50 KiB)
948.487 ms (454 allocations: 21.42 KiB)
950.427 ms (876 allocations: 46.17 KiB)
=== Dist cpu
24.443 ms (142 allocations: 12.01 MiB)
400.947 ms (396 allocations: 22.15 MiB)
403.849 ms (453 allocations: 54.15 MiB)
=== Dist gpu
142.127 ms (226 allocations: 11.23 KiB)
1.057 s (521 allocations: 24.95 KiB)
1.059 s (942 allocations: 49.67 KiB)
Also, perhaps this and/or #30 should be updated to reflect that the behaviour occurs with all reverse-mode ADs Tullio is compatible with? Then people who might be using e.g. ReverseDiff wouldn't accidentally miss it.
@ToucheSir Oops, sorry, I thought I had updated the bug report with CUDA.@sync etc. I have updated the GIST link now. In any case, I still get a 40x ratio of dist-gpu-gradient vs dist-gpu-forward on my A6000. It looks like you only get a 7.5x ratio on your GPU. I wonder whether that's due to performance differences of the GPUs.
Make sure you're using the macros from https://github.com/JuliaCI/BenchmarkTools.jl instead of @time to remove noise and get a better statistical estimate of the true time (whether that's the mean, median or minimum). CUDA.@time is still nice for measuring GPU-side memory allocations, though. But yes, my GPU is very underpowered.
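Concretely, the pattern I have in mind is something like this (a sketch, reusing model and x from above):

using BenchmarkTools, CUDA
@btime CUDA.@sync model($x);   # many samples, reports the minimum host-side time
CUDA.@time model(x);           # single run, but also reports GPU-side allocations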