FastBroadcast.jl
Threading kills performance when extracting values from a vector of dual numbers
This is strangely specific, but look, a factor ~200 slowdown when enabling multithreading:
julia> using BenchmarkTools, FastBroadcast, ForwardDiff
julia> N = 100; x = ForwardDiff.Dual.(randn(N), randn(N)); v = zeros(N);
julia> @btime @. $v = ForwardDiff.value($x); # Baseline
17.873 ns (0 allocations: 0 bytes)
julia> @btime @.. $v = ForwardDiff.value($x);
16.712 ns (0 allocations: 0 bytes)
julia> @btime @.. thread=true $v = ForwardDiff.value($x);
3.101 μs (1 allocation: 48 bytes)
It's the same when extracting partials. Is there something special about the Dual struct layout or how these functions are written that causes multithreaded broadcasting to go haywire here?
Looks like an overhead issue, because when the arrays get large enough the tables are turned:
julia> N = 40000; x = ForwardDiff.Dual.(randn(N), randn(N)); v = zeros(N);
julia> @btime @. $v = ForwardDiff.value($x); # Baseline
13.449 μs (0 allocations: 0 bytes)
julia> @btime @.. $v = ForwardDiff.value($x);
13.368 μs (0 allocations: 0 bytes)
julia> @btime @.. thread=true $v = ForwardDiff.value($x);
4.115 μs (1 allocation: 48 bytes)
But 3 μs is some overhead.
In these examples, julia is running with 28 threads on a Xeon with 28 cores. The overhead goes down if I reduce the number of julia threads, but is still almost 1.4 μs at 4 threads.
Well, it's doing a lot better than Base Julia in this respect.
julia> @benchmark @.. thread=true $v = ForwardDiff.value($x)
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
Range (min … max): 1.520 μs … 1.780 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 1.635 μs ┊ GC (median): 0.00%
Time (mean ± σ): 1.971 μs ± 19.927 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▂▄▆▇██▇▅▄▃
▂▂▂▂▂▂▃▄▄▅▇████████████▆▅▄▃▃▃▂▂▂▂▂▁▂▁▁▁▁▁▁▁▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▄
1.52 μs Histogram: frequency by time 1.91 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark fetch(Threads.@spawn nothing)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 49.908 μs … 84.111 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 54.827 μs ┊ GC (median): 0.00%
Time (mean ± σ): 55.368 μs ± 3.337 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▄▇█▇▄
▁▁▁▂▅███████▆▄▄▆███▇▇█▇▇▆▆▆▅▅▅▆▅▅▆▅█▆▅▄▃▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁ ▃
49.9 μs Histogram: frequency by time 65.7 μs <
Memory estimate: 512 bytes, allocs estimate: 5.
julia> Threads.nthreads()
28
We could add a minimum batch size argument.
LoopVectorization won't support dual numbers until the rewrite, unfortunately, and the rewrite won't support threading for some time. But it does try to use cost modeling to pick a reasonable size threshold automatically. So, you could:
using LoopVectorization, ForwardDiff
function turbo_value!(v::AbstractVector{Float64}, x::AbstractVector{<:ForwardDiff.Dual{<:Any,Float64}})
    @assert axes(v, 1) == axes(x, 1) == Base.OneTo(length(v))
    # Reinterpret the Duals' underlying Float64 data as a matrix:
    # row 1 holds the value, the remaining rows the partials.
    xd = reinterpret(reshape, Float64, x)
    @tturbo for i in eachindex(v)
        v[i] = xd[1, i]
    end
    return v
end
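As a quick sanity check (assuming the N = 100 setup from above): a Dual stores its value first, then its partials, so the reinterpreted matrix has the values in row 1, and the loop reproduces the plain broadcast:
xd = reinterpret(reshape, Float64, x);
size(xd)                                       # (2, 100): row 1 = values, row 2 = partials
turbo_value!(v, x) == ForwardDiff.value.(x)    # true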
On an 18 core computer, I get:
julia> N = 100; x = ForwardDiff.Dual.(randn(N), randn(N)); v = zeros(N);
julia> @btime @. $v = ForwardDiff.value($x); # Baseline
9.115 ns (0 allocations: 0 bytes)
julia> @btime @.. $v = ForwardDiff.value($x);
8.427 ns (0 allocations: 0 bytes)
julia> @btime @.. thread=true $v = ForwardDiff.value($x);
2.104 μs (0 allocations: 0 bytes)
julia> @btime turbo_value!($v, $x);
20.477 ns (0 allocations: 0 bytes)
julia> N = 40000; x = ForwardDiff.Dual.(randn(N), randn(N)); v = zeros(N);
julia> @btime @. $v = ForwardDiff.value($x); # Baseline
14.981 μs (0 allocations: 0 bytes)
julia> @btime @.. $v = ForwardDiff.value($x);
14.927 μs (0 allocations: 0 bytes)
julia> @btime @.. thread=true $v = ForwardDiff.value($x);
2.727 μs (0 allocations: 0 bytes)
julia> @btime turbo_value!($v, $x);
1.863 μs (0 allocations: 0 bytes)
julia> Threads.nthreads()
36
So is the issue just that threading always has this overhead and I haven't noticed before because I've been using it on more computationally heavy workloads? Notice that in my case there's an allocation that you don't seem to get.
I fixed it after seeing you report the allocation and before running my own benchmarks.
Yes, threading always has large overhead. Perhaps it's larger than it should be because Polyester fires up threads sequentially. Maybe a strategy of doing so with a bifurcating tree would scale better, i.e. O(log2(nthreads())) instead of O(nthreads()).
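To sketch that idea (purely illustrative, not how Polyester actually wakes its threads): each task hands half of the remaining range to a freshly spawned task before recursing into its own half, so the longest wake-up chain has depth O(log2(n)) rather than O(n):
using Base.Threads: @spawn, nthreads

function tree_launch!(work!, lo::Int, hi::Int)
    if lo == hi
        work!(lo)               # leaf: run this task's own chunk
        return nothing
    end
    mid = (lo + hi) >>> 1
    t = @spawn tree_launch!(work!, mid + 1, hi)   # upper half keeps splitting itself
    tree_launch!(work!, lo, mid)                  # lower half handled by this task
    wait(t)
    return nothing
end

tree_launch!(i -> println("chunk ", i), 1, nthreads())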
Polyester.jl itself supports a minimum batch size argument. A solution would be to allow passing it through FastBroadcast, maybe even with a default option.
I see, thanks for clearing that up. My arrays are usually large enough that threading is almost always beneficial (much larger than the N = 100 I used in the MWE), so I just expected that to be the case here too. But I'm usually doing serious math on the elements, while in this case, I was just copying over half the data, which is obviously very different. Feel free to close this issue unless you want to keep it for the potential thread scheduling improvements.
Is there an option for Polyester here?
@batch minbatch= is the easiest way when using Polyester directly. @batch sets up a call to batch. But FastBroadcast calls batch directly. batch accepts a tuple (len, nbatches), where len is the number of iterations and nbatches is the number of batches to subdivide len into.
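For reference, a minimal sketch of that direct-Polyester route for the value-extraction example (the function name and the minbatch=2000 threshold are arbitrary choices here, not recommendations):
using Polyester, ForwardDiff

function batch_value!(v, x)
    # Only split the loop across threads when each batch gets at least
    # 2000 iterations, so small inputs skip the threading overhead entirely.
    @batch minbatch=2000 for i in eachindex(v, x)
        v[i] = ForwardDiff.value(x[i])
    end
    return v
end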
FastBroadcast currently divides into Threads.nthreads() batches:
https://github.com/YingboMa/FastBroadcast.jl/blob/fd6721234d1b81f18a8fd023ad89abb89cf2e7f9/src/FastBroadcast.jl#L62
So we could instead be cleverer about picking nbatches. One approach is to just let people pass an argument manually, like @batch does. We could also come up with some heuristics. But someone could be broadcasting an ODE solver, where they probably do want one thread per iteration, so I'm not sure that just picking a default of minbatch=100 is what most people would want. At that point, they're probably better off with a scheduler that tries to balance loads than with a static scheduler, so they should probably just stick with @spawn or @threads :dynamic for.
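To make the heuristic idea concrete, a hypothetical helper (not part of FastBroadcast's API) could map a user-supplied minbatch onto the nbatches that batch expects, something like:
# Hypothetical sketch: give each batch at least `minbatch` iterations,
# and never ask for more batches than there are threads.
choose_nbatches(len, minbatch, nthreads = Threads.nthreads()) =
    clamp(len ÷ max(minbatch, 1), 1, nthreads)

choose_nbatches(100, 100)      # 1: the N = 100 case stays single-threaded
choose_nbatches(40_000, 100)   # min(400, Threads.nthreads())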
If we're being overly clever, we're reinventing LoopVectorization.jl (which comes up with those costs automatically, which is why it only had around 10 ns of overhead instead of >1 μs for the N = 100 case).