
`@batch` slows down other non-@batched code on x86

Open · efaulhaber opened this issue 2 years ago · 8 comments

Consider the following functions.

using Polyester

function reset_x!(x)
    x .= 0
end

function without_batch(x)
    for i in 1:100_000
        if x[i] != 0
            # This is never executed, x contains only zeros
            sin(x[i])
        end
    end
end

function with_batch(x)
    @batch for i in 1:100_000
        if x[i] != 0
            # This is never executed, x contains only zeros
            sin(x[i])
        end
    end
end

Running with_batch between reset_x! calls seems to somehow slow down reset_x! significantly:

julia> x = zeros(Int, 1_000_000);

julia> using BenchmarkTools

julia> @benchmark reset_x!($x) setup=without_batch(x)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  92.173 μs … 426.870 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     94.362 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   94.776 μs ±   3.789 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

       ▂▄▅▆▇██████▇▇▆▅▅▄▃▂▂▂ ▁▁▁  ▁     ▁▁▁▁▁▁▁                ▃
  ▄▄▄▆███████████████████████████▇██▇▇█▇████████████▇▆▆▇▇▆▆▆▆▆ █
  92.2 μs       Histogram: log(frequency) by time       101 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark reset_x!($x) setup=with_batch(x)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  185.404 μs … 518.292 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     188.394 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   189.119 μs ±   4.086 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

          ▅▇█▇▇▇▅▅▃▂                                             
  ▁▁▁▂▃▆▇█████████████▇▆▅▅▄▄▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁ ▃
  185 μs           Histogram: frequency by time          198 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> versioninfo()
Julia Version 1.9.0
Commit 8e630552924 (2023-05-07 11:25 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 128 × AMD Ryzen Threadripper 3990X 64-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, znver2)
  Threads: 8 on 128 virtual cores

I also tested and reproduced this on an Intel i9-10980XE, but the difference there was only ~10% on 8 threads.

CC: @sloede @ranocha @LasNikas.

efaulhaber · May 30 '23 13:05

I see the same thing with Threads.@threads.

julia> using Polyester

julia> function reset_x!(x)
           x .= 0
       end
reset_x! (generic function with 1 method)

julia> function without_batch(x)
           for i in 1:100_000
               if x[i] != 0
                   # This is never executed, x contains only zeros
                   sin(x[i])
               end
           end
       end
without_batch (generic function with 1 method)

julia> function with_batch(x)
           @batch for i in 1:100_000
               if x[i] != 0
                   # This is never executed, x contains only zeros
                   sin(x[i])
               end
           end
       end
with_batch (generic function with 1 method)

julia> function with_thread(x)
           Threads.@threads for i in 1:100_000
               if x[i] != 0
                   # This is never executed, x contains only zeros
                   sin(x[i])
               end
           end
       end
with_thread (generic function with 1 method)

julia> 

julia> x = zeros(Int, 1_000_000);

julia> using BenchmarkTools

julia> @benchmark reset_x!($x) setup=without_batch(x)
BenchmarkTools.Trial: 2881 samples with 1 evaluation.
 Range (min … max):  272.081 μs … 319.111 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     273.518 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   273.863 μs ±   2.094 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

   ▅▇▂▄▄██▄▅▅▇▅▁▃▃▂                                              
  ▆████████████████▆▆▇▆▇▇▆▅█▇▅▄▄▃▃▂▂▁▂▁▁▁▂▂▁▂▁▁▁▂▂▁▂▂▁▁▁▁▁▂▂▂▁▂ ▄
  272 μs           Histogram: frequency by time          280 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark reset_x!($x) setup=with_batch(x)
BenchmarkTools.Trial: 2809 samples with 1 evaluation.
 Range (min … max):  312.843 μs … 350.793 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     316.288 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   316.484 μs ±   2.223 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

    ▁▂      ▂▄▄▆▅▅▄▅▅▅▅▇▇▅▇███▆█▆▄▁▁▃▁▂▂▁▂ ▂ ▁ ▁ ▁               
  ▃███▆▅▆▆▄▆██████████████████████████████████████▆█▇▄▄▃▃▃▂▃▄▄▂ ▆
  313 μs           Histogram: frequency by time          321 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark reset_x!($x) setup=with_thread(x)
BenchmarkTools.Trial: 2063 samples with 1 evaluation.
 Range (min … max):  313.300 μs … 392.157 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     359.390 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   358.813 μs ±  10.205 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                                    ▂▄▄▄█▄▅▄▇▅▅▄▃▂▂ ▁            
  ▂▃▁▁▂▂▂▂▁▃▂▁▂▂▂▂▃▂▂▃▂▃▃▃▄▃▃▄▄▅▅▇███████████████████▆▆▅▄▄▃▃▃▃▃ ▄
  313 μs           Histogram: frequency by time          381 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

chriselrod · May 30 '23 14:05

You're right. It's even worse for me:

julia> @benchmark reset_x!($x) setup=without_batch(x)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  91.472 μs … 414.889 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     92.852 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   93.118 μs ±   3.508 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

    ▁▃▃▅▅▆▆▇▇████▇▇▆▅▅▄▃▃▂▂▁▁                            ▁     ▃
  ▆████████████████████████████▇▆▆█▇█▇█▇▇█▆▆▆▇▆▆▆▆▇▇███████▇█▇ █
  91.5 μs       Histogram: log(frequency) by time      97.7 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark reset_x!($x) setup=with_batch(x)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  191.675 μs … 533.093 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     194.245 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   194.754 μs ±   3.916 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

           ▃▅▇███▅▄▂                                             
  ▁▁▁▁▂▃▄▅███████████▆▅▄▃▂▂▂▂▂▂▂▂▁▂▂▂▂▂▁▂▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▃
  192 μs           Histogram: frequency by time          203 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark reset_x!($x) setup=with_thread(x)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  201.894 μs … 557.973 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     205.595 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   206.066 μs ±   6.377 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

              ▁▃▅▆▆█▇██▅▆▄▂                                      
  ▁▁▁▁▁▂▂▃▄▅▆███████████████▇▅▅▄▄▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▂▁▁▁▁▁▁▁ ▃
  202 μs           Histogram: frequency by time          213 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Where should I report this then? JuliaLang/julia?

efaulhaber · May 30 '23 14:05

It may be interesting to uncover why. Possible ideas:

  1. Threads take a while to go to sleep; they can hog system resources until then.
  2. Clock speeds. You could mess with BIOS settings; I used to run the same clock speed regardless of the number of active cores and disable boosting to mitigate this, but I believe I have since reverted to stock settings.
  3. Even though the code is not executed, perhaps prefetchers still fetch the memory, and this causes some synchronization overhead?

You could experiment by playing with https://github.com/JuliaSIMD/ThreadingUtilities.jl/blob/e7f2f4ba725cf8862f42cb34b83916e3561c15f8/src/threadtasks.jl#L24 for Polyester, or https://docs.julialang.org/en/v1/manual/environment-variables/#JULIA_THREAD_SLEEP_THRESHOLD for Threads.@threads. I tried setting it to 1 or 0, but this did not help.
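
Concretely, a minimal sketch of that experiment for Threads.@threads (the thread count and value here are only illustrative):

# JULIA_THREAD_SLEEP_THRESHOLD is read once at startup, so it has to be set
# in the shell before launching Julia, e.g.
#   JULIA_THREAD_SLEEP_THRESHOLD=0 julia --threads=8
# The value is the number of nanoseconds a spinning idle thread waits before
# going to sleep ("infinite" keeps threads spinning forever).
# Then re-run the comparison from above:

julia> @benchmark reset_x!($x) setup=with_thread(x)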

chriselrod · May 30 '23 14:05

I tried this as well.

julia> @benchmark reset_x!($x) setup=(with_batch(x); ThreadingUtilities.sleep_all_tasks())
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  197.295 μs … 540.873 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     200.054 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   200.487 μs ±   3.949 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

             ▂▅▆███▅▃▁                                           
  ▁▁▁▁▂▂▃▄▅▇███████████▅▅▄▄▃▃▂▂▂▂▂▂▂▂▂▂▂▁▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▃
  197 μs           Histogram: frequency by time          208 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Wouldn't that rule out your first idea?

efaulhaber · May 30 '23 14:05

> Wouldn't that rule out your first idea?

Yes.

chriselrod · May 30 '23 14:05

You could try counting CPU cycles with cpucycle.
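
A minimal sketch of what that could look like, assuming cpucycle refers to cpucycle() from CpuId.jl (the count_cycles helper is just for illustration):

using CpuId  # cpucycle() reads the x86 time-stamp counter (TSC)

function count_cycles(f, x)
    start = cpucycle()          # TSC reading before the call
    f(x)
    return cpucycle() - start   # elapsed TSC ticks
end

without_batch(x); count_cycles(reset_x!, x)
with_batch(x);    count_cycles(reset_x!, x)

Note that on recent x86 CPUs the TSC ticks at a constant reference rate rather than at the current core clock, so this behaves more like a fine-grained timer than a count of executed core cycles.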

chriselrod · May 30 '23 14:05

I'll come back to this when I have more time. For now, I opened an issue in JuliaLang/julia.

efaulhaber · May 31 '23 12:05

Note that by default, Polyester only uses one thread per physical core. You can change this (@batch per=thread). You could also get @threads to use only one thread per core by starting Julia with one thread per physical core.

Those experiments could tell you if the Polyester.@batch vs Threads.@threads difference is only a function of the number of threads they're using.
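
A minimal sketch of the per=thread variant (the function name is only for illustration):

function with_batch_per_thread(x)
    # per=thread makes Polyester use every Julia thread instead of the
    # default of one thread per physical core (per=core)
    @batch per=thread for i in 1:100_000
        if x[i] != 0
            # This is never executed, x contains only zeros
            sin(x[i])
        end
    end
end

@benchmark reset_x!($x) setup=with_batch_per_thread(x)

If this lands close to the Threads.@threads numbers above, the @batch vs @threads gap is probably just the thread count.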

chriselrod · May 31 '23 13:05