Slowdown when starting Julia with threads even when we do not use them
I am seeing a significant slowdown in the reverse pass of my code when running Julia with multiple threads, even though the code doesn't use any of Julia's threads.
After chatting with @vchuravy, this appears to be somewhat intentional, based on these lines https://github.com/EnzymeAD/Enzyme.jl/blob/5b5623284892abc2de0b7043be8f228ba32b8511/src/compiler.jl#L4732-L4734 where Enzyme turns on atomics if there is more than one thread in the Julia runtime.
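In rough terms the gate amounts to something like the following (a paraphrase for illustration, not the actual source at that link):

# Paraphrase only: if Julia was started with more than one thread,
# Enzyme assumes the reverse pass may need atomic accumulation.
needs_atomics = Threads.nthreads() > 1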
To see the impact of this, here is an MWE:
using Enzyme
using BenchmarkTools
# Broadcasted version
function mwe(y, x)
    y .= x .+ x
    return sum(abs2, y)
end

# Explicit-loop version of the same computation
function mwe_loop2(y, x)
    for i in eachindex(y, x)
        y[i] = x[i] + x[i]
    end
    s = zero(eltype(y))
    for i in eachindex(y)
        s += abs2(y[i])
    end
    return s
end
x = ones(100)
dx = zero(x)
y = zero(x)
dy = zero(y)
@info "Benchmarking BC"
bf = @benchmark mwe($y, $x)
display(bf)
br = @benchmark autodiff($Reverse, $mwe, $Active, Duplicated($y, fill!($dy, 0.0)), Duplicated($x, fill!($dx, 0.0)))
display(br)
@info "Benchmarking Loop2"
bf = @benchmark mwe_loop2($y, $x)
display(bf)
br = @benchmark autodiff($Reverse, $mwe_loop2, $Active, Duplicated($y, fill!($dy, 0.0)), Duplicated($x, fill!($dx, 0.0)))
display(br)
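Both sets of results below come from this exact script; the only difference is how many threads Julia is started with (the code itself never spawns tasks or uses Threads.@threads). Assuming the MWE is saved as mwe.jl (file name chosen here for illustration), the two runs correspond to:

julia -t 1 mwe.jl   # "Without threads" results below
julia -t 2 mwe.jl   # "With 2 Threads" results below

Inside the script, Threads.nthreads() reports 1 and 2 respectively.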
Without threads
[ Info: Benchmarking BC
BenchmarkTools.Trial: 10000 samples with 994 evaluations per sample.
Range (min … max): 31.761 ns … 264.089 ns ┊ GC (min … max): 0.00% … 0.00%
Time (median): 34.738 ns ┊ GC (median): 0.00%
Time (mean ± σ): 35.813 ns ± 9.989 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▄██▆▁ ▂▁
▁▂▅█████▇▅▃▂▄▆███▇▄▄▃▂▃▄▄▄▄▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▃
31.8 ns Histogram: frequency by time 46.3 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
BenchmarkTools.Trial: 10000 samples with 192 evaluations per sample.
Range (min … max): 510.339 ns … 3.660 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 555.266 ns ┊ GC (median): 0.00%
Time (mean ± σ): 563.541 ns ± 91.372 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▂▂▂▄▄▄▆█▇▆▅▅▃▂▂
▁▁▁▁▁▁▂▂▂▄▆█████████████████▇▆▅▅▅▄▄▃▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂▁▂▂▂▁▁▁▁▁ ▄
510 ns Histogram: frequency by time 637 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
[ Info: Benchmarking Loop2
BenchmarkTools.Trial: 10000 samples with 979 evaluations per sample.
Range (min … max): 65.220 ns … 466.282 ns ┊ GC (min … max): 0.00% … 0.00%
Time (median): 68.351 ns ┊ GC (median): 0.00%
Time (mean ± σ): 69.736 ns ± 8.729 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▅█▆▅▂ ▇▇▅▄ ▁▆▆▄▃▁ ▅▄▃▁ ▁▁ ▁ ▁▁▁ ▂
██████▅██████▆██████▇█████▇▇▇████▆███▇█▇█████▇▇████▇▅▄▆▆▆▆▅▅ █
65.2 ns Histogram: log(frequency) by time 85.2 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
BenchmarkTools.Trial: 10000 samples with 199 evaluations per sample.
Range (min … max): 421.548 ns … 3.349 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 458.854 ns ┊ GC (median): 0.00%
Time (mean ± σ): 464.884 ns ± 81.191 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▄▇▅▁ ▂██▅ ▂▄▁
▁▁▃▄▇█▇▆▄▄▇████▇▇█████▆▇███▆▄▄▅▆▅▄▄▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▁▁▁▂▁▁▁▁▁▁ ▃
422 ns Histogram: frequency by time 538 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
With 2 Threads
[ Info: Benchmarking BC
BenchmarkTools.Trial: 10000 samples with 994 evaluations per sample.
Range (min … max): 31.810 ns … 218.409 ns ┊ GC (min … max): 0.00% … 0.00%
Time (median): 34.410 ns ┊ GC (median): 0.00%
Time (mean ± σ): 35.137 ns ± 3.610 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▁▆▇█▆▂▁
▁▂▅███████▅▃▃▂▃▄▇███▇▅▄▃▂▂▂▄▄▅▄▄▄▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁ ▃
31.8 ns Histogram: frequency by time 43.3 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
BenchmarkTools.Trial: 10000 samples with 10 evaluations per sample.
Range (min … max): 1.084 μs … 12.376 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 1.141 μs ┊ GC (median): 0.00%
Time (mean ± σ): 1.233 μs ± 332.276 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▃▇██▆▅▃▂▂▂▂▂▁▁▁▁▁▃▄▅▄▃▂ ▁▁▁ ▂
████████████████████████▇▆▇█▇▆▇▇██████▄▆▅▅▄▅▄▃▄▆▅▆▆▇▅▆▄▅▅▂▂ █
1.08 μs Histogram: log(frequency) by time 2 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
[ Info: Benchmarking Loop2
BenchmarkTools.Trial: 10000 samples with 979 evaluations per sample.
Range (min … max): 65.220 ns … 480.773 ns ┊ GC (min … max): 0.00% … 0.00%
Time (median): 68.362 ns ┊ GC (median): 0.00%
Time (mean ± σ): 69.613 ns ± 11.946 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▆▆▅▃▂ ▄██▇▅▄▂ ▂▆▆▅▄▂▁ ▃▃▃▂▁▁▁ ▁ ▁ ▁ ▂
███████▆▄▃███████▇▆███████▇▅█████████▇██████████▆▇▆▇▆█▅▆▅▅▆▅ █
65.2 ns Histogram: log(frequency) by time 79.8 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
BenchmarkTools.Trial: 10000 samples with 114 evaluations per sample.
Range (min … max): 754.316 ns … 5.518 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 813.719 ns ┊ GC (median): 0.00%
Time (mean ± σ): 830.276 ns ± 218.468 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▄▄ ▆█▃
▁▁▁▂▃▄▄▂▂▁▂▃▇██▇▃▂▂▆███▅▂▂▃▇█▇▃▂▂▂▃▄▃▂▂▂▁▂▂▂▂▂▂▁▁▁▂▂▂▁▁▁▁▁▁▁▁ ▂
754 ns Histogram: frequency by time 925 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
So while the forward pass is identical in time, the reverse pass is now roughly 2x slower for the broadcasted version (median 555 ns → 1.14 μs) and roughly 1.8x slower for the loop version (median 459 ns → 814 ns).
Yes, this is intentional, and required if Enzyme autodiff is itself called from multiple threads with the same memory, which requires atomic updates.
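To make that concrete, here is an illustrative sketch (not from the original post) of the kind of usage that needs the atomics: two tasks call autodiff while sharing one shadow buffer, so the reverse-pass accumulation into that buffer could race without atomic updates. It assumes mwe, x, and dx from the MWE above.

# Illustrative only: both tasks accumulate gradients into the *same* shadow
# array `dx`. Without atomic accumulation in the generated reverse pass,
# these concurrent `dx[i] += ...` updates could race.
function shared_shadow_gradient!(dx, x, f)
    Threads.@threads for task in 1:2
        y  = zero(x)
        dy = zero(x)
        autodiff(Reverse, f, Active,
                 Duplicated(y, dy),
                 Duplicated(x, dx))  # `dx` is shared across both tasks
    end
    return dx
end

# e.g. shared_shadow_gradient!(fill!(dx, 0.0), x, mwe)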
For sure. The problem is that for my use case, it is ruining any performance gains from threading.
Looking at the IR and the native code with Paul, the crux is that atomicrmw lowers to a cmpxchg loop for floating-point operations. On the GPU we have dedicated instructions for this.
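The cost of that lowering is easy to see outside of Enzyme as well. This is only a stand-in for what Enzyme actually generates, but Base's Threads.atomic_add! on a Float64 goes through the same kind of compare-and-swap retry loop on CPUs:

using BenchmarkTools

# Plain accumulation: one floating-point add per element.
function plain_sum(xs)
    s = 0.0
    for v in xs
        s += v
    end
    return s
end

# Atomic accumulation: each add is an atomic read-modify-write, which CPUs
# implement with a compare-and-swap retry loop (x86 has no native atomic
# float add instruction).
function atomic_sum(xs)
    s = Threads.Atomic{Float64}(0.0)
    for v in xs
        Threads.atomic_add!(s, v)
    end
    return s[]
end

xs = ones(100)
@btime plain_sum($xs)
@btime atomic_sum($xs)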
Of course, it is still a bit of a mystery why the broadcast version slows down more than the for-loop version.
Possibly aliasing or other allocation info?
If we can prove there would be no race, we won't emit atomics. But the analysis for that is very conservative at the moment.
Honestly, this is where I'd recommend Reactant, to have the whole-program view for parallelization scheduling.
We could add a flag to the mode for thread-unsafe vs thread-safe vs auto, and then set the flag optionally.
If you know your autodiff calls are thread safe, then you could disable the atomics (see the sketch below).
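Purely as an illustration of that idea, a call site could look like the following; thread_safety is a hypothetical name for the proposed flag, not an existing Enzyme.jl option:

# Hypothetical sketch of the proposed flag -- none of this exists today.
#   :auto   -> current behaviour: emit atomics whenever Threads.nthreads() > 1
#   :safe   -> always emit atomic accumulation in the reverse pass
#   :unsafe -> never emit atomics; the caller promises that no concurrent
#              autodiff calls share shadow memory
autodiff(Reverse, mwe, Active,
         Duplicated(y, dy), Duplicated(x, dx);
         thread_safety = :unsafe)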