
`@batch` slows down other non-@batched code on x86

Open · efaulhaber opened this issue 2 years ago · 8 comments

Consider the following functions.

using Polyester

function reset_x!(x)
    x .= 0
end

function without_batch(x)
    for i in 1:100_000
        if x[i] != 0
            # This is never executed, x contains only zeros
            sin(x[i])
        end
    end
end

function with_batch(x)
    @batch for i in 1:100_000
        if x[i] != 0
            # This is never executed, x contains only zeros
            sin(x[i])
        end
    end
end

Running with_batch between reset_x! calls seems to somehow slow down reset_x! significantly:

julia> x = zeros(Int, 1_000_000);

julia> using BenchmarkTools

julia> @benchmark reset_x!($x) setup=without_batch(x)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  92.173 μs … 426.870 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     94.362 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   94.776 μs ±   3.789 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

       ▂▄▅▆▇██████▇▇▆▅▅▄▃▂▂▂ ▁▁▁  ▁     ▁▁▁▁▁▁▁                ▃
  ▄▄▄▆███████████████████████████▇██▇▇█▇████████████▇▆▆▇▇▆▆▆▆▆ █
  92.2 μs       Histogram: log(frequency) by time       101 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark reset_x!($x) setup=with_batch(x)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  185.404 μs … 518.292 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     188.394 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   189.119 μs ±   4.086 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

          ▅▇█▇▇▇▅▅▃▂                                             
  ▁▁▁▂▃▆▇█████████████▇▆▅▅▄▄▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁ ▃
  185 μs           Histogram: frequency by time          198 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> versioninfo()
Julia Version 1.9.0
Commit 8e630552924 (2023-05-07 11:25 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 128 × AMD Ryzen Threadripper 3990X 64-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, znver2)
  Threads: 8 on 128 virtual cores

I also tested and reproduced this on an Intel i9-10980XE, but the difference there was only ~10% on 8 threads.

CC: @sloede @ranocha @LasNikas.

efaulhaber · May 30 '23 13:05

I see the same thing with Threads.@threads.

julia> using Polyester

julia> function reset_x!(x)
           x .= 0
       end
reset_x! (generic function with 1 method)

julia> function without_batch(x)
           for i in 1:100_000
               if x[i] != 0
                   # This is never executed, x contains only zeros
                   sin(x[i])
               end
           end
       end
without_batch (generic function with 1 method)

julia> function with_batch(x)
           @batch for i in 1:100_000
               if x[i] != 0
                   # This is never executed, x contains only zeros
                   sin(x[i])
               end
           end
       end
with_batch (generic function with 1 method)

julia> function with_thread(x)
           Threads.@threads for i in 1:100_000
               if x[i] != 0
                   # This is never executed, x contains only zeros
                   sin(x[i])
               end
           end
       end
with_thread (generic function with 1 method)

julia> 

julia> x = zeros(Int, 1_000_000);

julia> using BenchmarkTools

julia> @benchmark reset_x!($x) setup=without_batch(x)
BenchmarkTools.Trial: 2881 samples with 1 evaluation.
 Range (min … max):  272.081 μs … 319.111 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     273.518 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   273.863 μs ±   2.094 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

   ▅▇▂▄▄██▄▅▅▇▅▁▃▃▂                                              
  ▆████████████████▆▆▇▆▇▇▆▅█▇▅▄▄▃▃▂▂▁▂▁▁▁▂▂▁▂▁▁▁▂▂▁▂▂▁▁▁▁▁▂▂▂▁▂ ▄
  272 μs           Histogram: frequency by time          280 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark reset_x!($x) setup=with_batch(x)
BenchmarkTools.Trial: 2809 samples with 1 evaluation.
 Range (min … max):  312.843 μs … 350.793 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     316.288 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   316.484 μs ±   2.223 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

    ▁▂      ▂▄▄▆▅▅▄▅▅▅▅▇▇▅▇███▆█▆▄▁▁▃▁▂▂▁▂ ▂ ▁ ▁ ▁               
  ▃███▆▅▆▆▄▆██████████████████████████████████████▆█▇▄▄▃▃▃▂▃▄▄▂ ▆
  313 μs           Histogram: frequency by time          321 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark reset_x!($x) setup=with_thread(x)
BenchmarkTools.Trial: 2063 samples with 1 evaluation.
 Range (min … max):  313.300 μs … 392.157 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     359.390 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   358.813 μs ±  10.205 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

                                    ▂▄▄▄█▄▅▄▇▅▅▄▃▂▂ ▁            
  ▂▃▁▁▂▂▂▂▁▃▂▁▂▂▂▂▃▂▂▃▂▃▃▃▄▃▃▄▄▅▅▇███████████████████▆▆▅▄▄▃▃▃▃▃ ▄
  313 μs           Histogram: frequency by time          381 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

chriselrod · May 30 '23 14:05

You're right. It's even worse for me:

julia> @benchmark reset_x!($x) setup=without_batch(x)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  91.472 μs … 414.889 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     92.852 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   93.118 μs ±   3.508 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

    ▁▃▃▅▅▆▆▇▇████▇▇▆▅▅▄▃▃▂▂▁▁                            ▁     ▃
  ▆████████████████████████████▇▆▆█▇█▇█▇▇█▆▆▆▇▆▆▆▆▇▇███████▇█▇ █
  91.5 μs       Histogram: log(frequency) by time      97.7 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark reset_x!($x) setup=with_batch(x)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  191.675 μs … 533.093 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     194.245 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   194.754 μs ±   3.916 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

           ▃▅▇███▅▄▂                                             
  ▁▁▁▁▂▃▄▅███████████▆▅▄▃▂▂▂▂▂▂▂▂▁▂▂▂▂▂▁▂▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▃
  192 μs           Histogram: frequency by time          203 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark reset_x!($x) setup=with_thread(x)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  201.894 μs … 557.973 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     205.595 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   206.066 μs ±   6.377 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

              ▁▃▅▆▆█▇██▅▆▄▂                                      
  ▁▁▁▁▁▂▂▃▄▅▆███████████████▇▅▅▄▄▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▂▁▁▁▁▁▁▁ ▃
  202 μs           Histogram: frequency by time          213 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Where should I report this then? JuliaLang/julia?

efaulhaber · May 30 '23 14:05

It may be interesting to uncover why. Possible ideas:

  1. Threads take a while to go to sleep; they can hog system resources until then.
  2. Clock speeds. You could mess with BIOS settings; I used to run the same clock speed regardless of the number of active cores and disable boosting to mitigate this, but I believe I have since reverted to stock settings.
  3. Even though the code is not executed, perhaps prefetchers still fetch the memory, and this causes some synchronization overhead?

You could experiment by playing with https://github.com/JuliaSIMD/ThreadingUtilities.jl/blob/e7f2f4ba725cf8862f42cb34b83916e3561c15f8/src/threadtasks.jl#L24 for Polyester, or https://docs.julialang.org/en/v1/manual/environment-variables/#JULIA_THREAD_SLEEP_THRESHOLD for Threads.@threads. I tried setting it to 1 or 0, but this did not help.
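
Concretely, a minimal sketch of that experiment for Threads.@threads (the thread count and value here are only illustrative):

# JULIA_THREAD_SLEEP_THRESHOLD is read once at startup, so it has to be set
# in the shell before launching Julia, e.g.
#   JULIA_THREAD_SLEEP_THRESHOLD=0 julia --threads=8
# The value is the number of nanoseconds a spinning idle thread waits before
# going to sleep ("infinite" keeps threads spinning forever).
# Then re-run the comparison from above:

julia> @benchmark reset_x!($x) setup=with_thread(x)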

chriselrod · May 30 '23 14:05

I tried this as well.

julia> @benchmark reset_x!($x) setup=(with_batch(x); ThreadingUtilities.sleep_all_tasks())
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  197.295 μs … 540.873 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     200.054 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   200.487 μs ±   3.949 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

             ▂▅▆███▅▃▁                                           
  ▁▁▁▁▂▂▃▄▅▇███████████▅▅▄▄▃▃▂▂▂▂▂▂▂▂▂▂▂▁▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▃
  197 μs           Histogram: frequency by time          208 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Wouldn't that rule out your first idea?

efaulhaber · May 30 '23 14:05

> Wouldn't that rule out your first idea?

Yes.

chriselrod · May 30 '23 14:05

You could try counting CPU cycles with cpucycle.
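
A minimal sketch of what that could look like, assuming cpucycle refers to cpucycle() from CpuId.jl (the count_cycles helper is just for illustration):

using CpuId  # cpucycle() reads the x86 time-stamp counter (TSC)

function count_cycles(f, x)
    start = cpucycle()          # TSC reading before the call
    f(x)
    return cpucycle() - start   # elapsed TSC ticks
end

without_batch(x); count_cycles(reset_x!, x)
with_batch(x);    count_cycles(reset_x!, x)

Note that on recent x86 CPUs the TSC ticks at a constant reference rate rather than at the current core clock, so this behaves more like a fine-grained timer than a count of executed core cycles.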

chriselrod · May 30 '23 14:05

I'll come back to this when I have more time. For now, I opened an issue in JuliaLang/julia.

efaulhaber · May 31 '23 12:05

Note that by default, Polyester only uses one thread per physical core. You can change this (@batch per=thread). You could also get @threads to use only one thread per core by starting Julia with one thread per physical core.

Those experiments could tell you if the Polyester.@batch vs Threads.@threads difference is only a function of the number of threads they're using.
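
A minimal sketch of the per=thread variant (the function name is only for illustration):

function with_batch_per_thread(x)
    # per=thread makes Polyester use every Julia thread instead of the
    # default of one thread per physical core (per=core)
    @batch per=thread for i in 1:100_000
        if x[i] != 0
            # This is never executed, x contains only zeros
            sin(x[i])
        end
    end
end

@benchmark reset_x!($x) setup=with_batch_per_thread(x)

If this lands close to the Threads.@threads numbers above, the @batch vs @threads gap is probably just the thread count.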

chriselrod · May 31 '23 13:05