
Slowdown when starting Julia with threads even if we do not use them

Open ptiede opened this issue 2 months ago • 5 comments

I am seeing a significant slowdown in the reverse pass of my code when running Julia with multiple threads even if the code doesn't utilize any of Julia's threads.

After chatting with @vchuravy, this appears to be somewhat intentional, based on these lines https://github.com/EnzymeAD/Enzyme.jl/blob/5b5623284892abc2de0b7043be8f228ba32b8511/src/compiler.jl#L4732-L4734 where Enzyme turns on atomics if there is more than one thread in the Julia runtime.
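
Paraphrasing what those lines do (a sketch of the gate, not the exact source):

if Base.Threads.nthreads() > 1
    # compile the reverse pass with atomic gradient accumulation
end

In other words, the decision depends only on how many threads the Julia session was started with, not on whether the differentiated code ever uses them.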

To see the impact of this, here is an MWE:

using Enzyme
using BenchmarkTools

function mwe(y, x)
    y .= x .+ x
    return sum(abs2, y)
end

function mwe_loop2(y, x)
    for i in eachindex(y, x)
        y[i] = x[i] + x[i]
    end
    
    s = zero(eltype(y))
    for i in eachindex(y)
        s += abs2(y[i])
    end
    return s
end



x = ones(100)
dx = zero(x)

y = zero(x)
dy = zero(y)

@info "Benchmarking BC"
bf = @benchmark mwe($y, $x)
display(bf)
br = @benchmark autodiff($Reverse, $mwe, $Active, Duplicated($y, fill!($dy, 0.)), Duplicated($x, fill!($dx, 0)))
display(br)

@info "Benchmarking Loop2"
bf = @benchmark mwe_loop2($y, $x)
display(bf)
br = @benchmark autodiff($Reverse, $mwe_loop2, $Active, Duplicated($y, fill!($dy, 0.)), Duplicated($x, fill!($dx, 0)))
display(br)
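
The script above is run twice, changing only the -t flag (the file name mwe.jl is just a placeholder for the script):

julia -t 1 mwe.jl
julia -t 2 mwe.jl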

Without threads

[ Info: Benchmarking BC
BenchmarkTools.Trial: 10000 samples with 994 evaluations per sample.
 Range (min … max):  31.761 ns … 264.089 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     34.738 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   35.813 ns ±   9.989 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

     ▄██▆▁      ▂▁                                              
  ▁▂▅█████▇▅▃▂▄▆███▇▄▄▃▂▃▄▄▄▄▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▃
  31.8 ns         Histogram: frequency by time         46.3 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
BenchmarkTools.Trial: 10000 samples with 192 evaluations per sample.
 Range (min … max):  510.339 ns …  3.660 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     555.266 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   563.541 ns ± 91.372 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

              ▂▂▂▄▄▄▆█▇▆▅▅▃▂▂                                   
  ▁▁▁▁▁▁▂▂▂▄▆█████████████████▇▆▅▅▅▄▄▃▃▃▃▃▃▃▃▂▂▂▂▂▂▂▂▁▂▂▂▁▁▁▁▁ ▄
  510 ns          Histogram: frequency by time          637 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
[ Info: Benchmarking Loop2
BenchmarkTools.Trial: 10000 samples with 979 evaluations per sample.
 Range (min … max):  65.220 ns … 466.282 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     68.351 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   69.736 ns ±   8.729 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▅█▆▅▂   ▇▇▅▄  ▁▆▆▄▃▁  ▅▄▃▁   ▁▁    ▁     ▁▁▁                 ▂
  ██████▅██████▆██████▇█████▇▇▇████▆███▇█▇█████▇▇████▇▅▄▆▆▆▆▅▅ █
  65.2 ns       Histogram: log(frequency) by time      85.2 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
BenchmarkTools.Trial: 10000 samples with 199 evaluations per sample.
 Range (min … max):  421.548 ns …  3.349 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     458.854 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   464.884 ns ± 81.191 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

             ▄▇▅▁  ▂██▅   ▂▄▁                                   
  ▁▁▃▄▇█▇▆▄▄▇████▇▇█████▆▇███▆▄▄▅▆▅▄▄▃▃▃▃▃▃▂▂▂▂▂▂▂▂▂▁▁▁▂▁▁▁▁▁▁ ▃
  422 ns          Histogram: frequency by time          538 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

With 2 threads

[ Info: Benchmarking BC
BenchmarkTools.Trial: 10000 samples with 994 evaluations per sample.
 Range (min … max):  31.810 ns … 218.409 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     34.410 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   35.137 ns ±   3.610 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

     ▁▆▇█▆▂▁                                                    
  ▁▂▅███████▅▃▃▂▃▄▇███▇▅▄▃▂▂▂▄▄▅▄▄▄▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁ ▃
  31.8 ns         Histogram: frequency by time         43.3 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
BenchmarkTools.Trial: 10000 samples with 10 evaluations per sample.
 Range (min … max):  1.084 μs …  12.376 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.141 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.233 μs ± 332.276 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▃▇██▆▅▃▂▂▂▂▂▁▁▁▁▁▃▄▅▄▃▂          ▁▁▁                        ▂
  ████████████████████████▇▆▇█▇▆▇▇██████▄▆▅▅▄▅▄▃▄▆▅▆▆▇▅▆▄▅▅▂▂ █
  1.08 μs      Histogram: log(frequency) by time         2 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
[ Info: Benchmarking Loop2
BenchmarkTools.Trial: 10000 samples with 979 evaluations per sample.
 Range (min … max):  65.220 ns … 480.773 ns  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     68.362 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   69.613 ns ±  11.946 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

   ▆▆▅▃▂    ▄██▇▅▄▂  ▂▆▆▅▄▂▁   ▃▃▃▂▁▁▁   ▁   ▁ ▁               ▂
  ███████▆▄▃███████▇▆███████▇▅█████████▇██████████▆▇▆▇▆█▅▆▅▅▆▅ █
  65.2 ns       Histogram: log(frequency) by time      79.8 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.
BenchmarkTools.Trial: 10000 samples with 114 evaluations per sample.
 Range (min … max):  754.316 ns …   5.518 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     813.719 ns               ┊ GC (median):    0.00%
 Time  (mean ± σ):   830.276 ns ± 218.468 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

               ▄▄     ▆█▃                                        
  ▁▁▁▂▃▄▄▂▂▁▂▃▇██▇▃▂▂▆███▅▂▂▃▇█▇▃▂▂▂▃▄▃▂▂▂▁▂▂▂▂▂▂▁▁▁▂▂▂▁▁▁▁▁▁▁▁ ▂
  754 ns           Histogram: frequency by time          925 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

So while the forward pass takes the same time with and without threads, the reverse pass is now about 2x slower for the broadcasted version and about 1.8x slower for the loopy version.

ptiede avatar Oct 07 '25 19:10 ptiede

Yes, this is intentional, and it is required if Enzyme's autodiff is itself called from multiple threads with the same memory, which requires atomic updates.
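
For illustration, here is a minimal sketch (not code from this issue) of the situation the atomics guard against: several threads differentiate the same function while sharing one shadow buffer, so every task's gradient accumulation into dx must be atomic.

using Enzyme

f(x) = sum(abs2, x)

x  = ones(100)
dx = zeros(100)            # one shadow buffer shared by every task

Threads.@threads for i in 1:4
    # all tasks accumulate their gradient into the same dx;
    # without atomic updates these writes would race
    autodiff(Reverse, f, Active, Duplicated(x, dx))
end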

wsmoses avatar Oct 07 '25 19:10 wsmoses

For sure. The problem is that for my use case, it is ruining any performance gains from threading.

ptiede avatar Oct 07 '25 20:10 ptiede

Looking at the IR and the native code with Paul, the crux is that atomicrmw lowers to a cmpxchg loop for floating-point operations. On the GPU we have dedicated instructions for this.
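
As a stand-alone illustration on the CPU, using Base's atomics as a proxy for the code Enzyme generates:

using InteractiveUtils   # @code_llvm / @code_native outside the REPL

a = Threads.Atomic{Float64}(0.0)
@code_llvm Threads.atomic_add!(a, 1.0)    # the atomic floating-point add at the IR level
@code_native Threads.atomic_add!(a, 1.0)  # on x86 this typically expands to a lock cmpxchg retry loop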

Of course, it is still a bit of a mystery why the broadcast version slows down more than the for-loop version.

vchuravy avatar Oct 07 '25 21:10 vchuravy

Possibly aliasing or other allocation info?

If we can prove there would be no race, we won't emit atomics. But the analysis for that is very conservative at the moment.

Honestly, this is where I'd recommend Reactant, to have the whole-program view for parallelization scheduling.

wsmoses avatar Oct 07 '25 21:10 wsmoses

We could add a flag to the mode for thread-unsafe vs. thread-safe vs. auto, and then set the flag optionally.

If you know your autodiff calls are thread-safe, then you could disable the atomics.
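
Purely as a sketch of the proposal (none of these names exist in Enzyme.jl today):

# hypothetical thread-safety hint attached to the mode
# autodiff(Reverse, f, ...)                # auto: atomics whenever Threads.nthreads() > 1
# autodiff(ThreadSafe(Reverse), f, ...)    # always emit atomics; safe for concurrent calls on shared shadows
# autodiff(ThreadUnsafe(Reverse), f, ...)  # never emit atomics; caller guarantees no such concurrency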

wsmoses avatar Oct 07 '25 21:10 wsmoses