FastBroadcast.jl
Threading kills performance when extracting values from a vector of dual numbers
This is strangely specific, but look, a factor ~200 slowdown when enabling multithreading:
julia> using BenchmarkTools, FastBroadcast, ForwardDiff
julia> N = 100; x = ForwardDiff.Dual.(randn(N), randn(N)); v = zeros(N);
julia> @btime @. $v = ForwardDiff.value($x); # Baseline
17.873 ns (0 allocations: 0 bytes)
julia> @btime @.. $v = ForwardDiff.value($x);
16.712 ns (0 allocations: 0 bytes)
julia> @btime @.. thread=true $v = ForwardDiff.value($x);
3.101 μs (1 allocation: 48 bytes)
It's the same when extracting partials. Is there something special about the Dual struct layout or how these functions are written that causes multithreaded broadcasting to go haywire here?
Looks like an overhead issue, because when the arrays get large enough the tables are turned:
julia> N = 40000; x = ForwardDiff.Dual.(randn(N), randn(N)); v = zeros(N);
julia> @btime @. $v = ForwardDiff.value($x); # Baseline
13.449 μs (0 allocations: 0 bytes)
julia> @btime @.. $v = ForwardDiff.value($x);
13.368 μs (0 allocations: 0 bytes)
julia> @btime @.. thread=true $v = ForwardDiff.value($x);
4.115 μs (1 allocation: 48 bytes)
But 3 μs is some overhead.
In these examples, julia is running with 28 threads on a Xeon with 28 cores. The overhead goes down if I reduce the number of julia threads, but is still almost 1.4 μs at 4 threads.
Well, it's doing a lot better than Base Julia in this respect.
julia> @benchmark @.. thread=true $v = ForwardDiff.value($x)
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
Range (min … max): 1.520 μs … 1.780 ms ┊ GC (min … max): 0.00% … 0.00%
Time (median): 1.635 μs ┊ GC (median): 0.00%
Time (mean ± σ): 1.971 μs ± 19.927 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▂▄▆▇██▇▅▄▃
▂▂▂▂▂▂▃▄▄▅▇████████████▆▅▄▃▃▃▂▂▂▂▂▁▂▁▁▁▁▁▁▁▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▄
1.52 μs Histogram: frequency by time 1.91 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark fetch(Threads.@spawn nothing)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 49.908 μs … 84.111 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 54.827 μs ┊ GC (median): 0.00%
Time (mean ± σ): 55.368 μs ± 3.337 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▄▇█▇▄
▁▁▁▂▅███████▆▄▄▆███▇▇█▇▇▆▆▆▅▅▅▆▅▅▆▅█▆▅▄▃▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁ ▃
49.9 μs Histogram: frequency by time 65.7 μs <
Memory estimate: 512 bytes, allocs estimate: 5.
julia> Threads.nthreads()
28
We could add a minimum batch size argument.
LoopVectorization won't support dual numbers until the rewrite, unfortunately, and the rewrite won't support threading for some time. But it does try to use cost modeling to pick a reasonable size threshold automatically. So, you could:
using LoopVectorization, ForwardDiff
function turbo_value!(v::AbstractVector{Float64}, x::AbstractVector{<:ForwardDiff.Dual{<:Any,Float64}})
    @assert axes(v, 1) == axes(x, 1) == Base.OneTo(length(v))
    # Reinterpret the Duals' underlying Float64 data as a matrix:
    # row 1 holds the value, the remaining rows the partials.
    xd = reinterpret(reshape, Float64, x)
    @tturbo for i in eachindex(v)
        v[i] = xd[1, i]
    end
    return v
end
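As a quick sanity check (assuming the N = 100 setup from above): a Dual stores its value first, then its partials, so the reinterpreted matrix has the values in row 1, and the loop reproduces the plain broadcast:
xd = reinterpret(reshape, Float64, x);
size(xd)                                       # (2, 100): row 1 = values, row 2 = partials
turbo_value!(v, x) == ForwardDiff.value.(x)    # true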
On an 18 core computer, I get:
julia> N = 100; x = ForwardDiff.Dual.(randn(N), randn(N)); v = zeros(N);
julia> @btime @. $v = ForwardDiff.value($x); # Baseline
9.115 ns (0 allocations: 0 bytes)
julia> @btime @.. $v = ForwardDiff.value($x);
8.427 ns (0 allocations: 0 bytes)
julia> @btime @.. thread=true $v = ForwardDiff.value($x);
2.104 μs (0 allocations: 0 bytes)
julia> @btime turbo_value!($v, $x);
20.477 ns (0 allocations: 0 bytes)
julia> N = 40000; x = ForwardDiff.Dual.(randn(N), randn(N)); v = zeros(N);
julia> @btime @. $v = ForwardDiff.value($x); # Baseline
14.981 μs (0 allocations: 0 bytes)
julia> @btime @.. $v = ForwardDiff.value($x);
14.927 μs (0 allocations: 0 bytes)
julia> @btime @.. thread=true $v = ForwardDiff.value($x);
2.727 μs (0 allocations: 0 bytes)
julia> @btime turbo_value!($v, $x);
1.863 μs (0 allocations: 0 bytes)
julia> Threads.nthreads()
36
So is the issue just that threading always has this overhead and I haven't noticed before because I've been using it on more computationally heavy workloads? Notice that in my case there's an allocation that you don't seem to get.
I fixed it after seeing you report the allocation and before running my own benchmarks.
Yes, threading always has large overhead. Perhaps it's larger than it should be because Polyester fires up threads sequentially. Maybe a strategy of doing so with a bifurcating tree would scale better, i.e. O(log2(nthreads())) instead of O(nthreads()).
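To sketch that idea (purely illustrative, not how Polyester actually wakes its threads): each task hands half of the remaining range to a freshly spawned task before recursing into its own half, so the longest wake-up chain has depth O(log2(n)) rather than O(n):
using Base.Threads: @spawn, nthreads

function tree_launch!(work!, lo::Int, hi::Int)
    if lo == hi
        work!(lo)               # leaf: run this task's own chunk
        return nothing
    end
    mid = (lo + hi) >>> 1
    t = @spawn tree_launch!(work!, mid + 1, hi)   # upper half keeps splitting itself
    tree_launch!(work!, lo, mid)                  # lower half handled by this task
    wait(t)
    return nothing
end

tree_launch!(i -> println("chunk ", i), 1, nthreads())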
Polyester.jl itself supports a minimum batch size argument. A solution would be to allow passing it through FastBroadcast, maybe even with a default option.
I see, thanks for clearing that up. My arrays are usually large enough that threading is almost always beneficial (much larger than the N = 100 I used in the MWE), so I just expected that to be the case here too. But I'm usually doing serious math on the elements, while in this case, I was just copying over half the data, which is obviously very different. Feel free to close this issue unless you want to keep it for the potential thread scheduling improvements.
Is there an option for Polyester here?
@batch minbatch= is the easiest way when using Polyester directly. @batch sets up a call to batch. But FastBroadcast calls batch directly. batch accepts a tuple (len, nbatches), where len is the number of iterations and nbatches is the number of batches to subdivide len into.
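For reference, a minimal sketch of that direct-Polyester route for the value-extraction example (the function name and the minbatch=2000 threshold are arbitrary choices here, not recommendations):
using Polyester, ForwardDiff

function batch_value!(v, x)
    # Only split the loop across threads when each batch gets at least
    # 2000 iterations, so small inputs skip the threading overhead entirely.
    @batch minbatch=2000 for i in eachindex(v, x)
        v[i] = ForwardDiff.value(x[i])
    end
    return v
end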
FastBroadcast currently divides into Threads.nthreads() batches:
https://github.com/YingboMa/FastBroadcast.jl/blob/fd6721234d1b81f18a8fd023ad89abb89cf2e7f9/src/FastBroadcast.jl#L62
So we could instead be cleverer about picking nbatches. One approach is to just let people pass an argument manually, like @batch does. We could also come up with some heuristics. But someone could be broadcasting an ODE solver, where they probably do want one thread per iteration, so I'm not sure that just picking a default of minbatch=100 is what most people would want. At that point, they're probably better off with a scheduler that tries to balance loads than with a static scheduler, so they should probably just stick with @spawn or @threads :dynamic for.
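To make the heuristic idea concrete, a hypothetical helper (not part of FastBroadcast's API) could map a user-supplied minbatch onto the nbatches that batch expects, something like:
# Hypothetical sketch: give each batch at least `minbatch` iterations,
# and never ask for more batches than there are threads.
choose_nbatches(len, minbatch, nthreads = Threads.nthreads()) =
    clamp(len ÷ max(minbatch, 1), 1, nthreads)

choose_nbatches(100, 100)      # 1: the N = 100 case stays single-threaded
choose_nbatches(40_000, 100)   # min(400, Threads.nthreads())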
If we're being overly clever, we're reinventing LoopVectorization.jl (which comes up with those costs automatically, which is why it only had around 10 ns of overhead instead of >1 μs for the N = 100 case).