Polyester.jl
`@batch` slows down other non-@batched code on x86
Consider the following functions.
using Polyester

function reset_x!(x)
    x .= 0
end

function without_batch(x)
    for i in 1:100_000
        if x[i] != 0
            # This is never executed, x contains only zeros
            sin(x[i])
        end
    end
end

function with_batch(x)
    @batch for i in 1:100_000
        if x[i] != 0
            # This is never executed, x contains only zeros
            sin(x[i])
        end
    end
end
Running `with_batch` between `reset_x!` calls somehow slows down `reset_x!` significantly:
julia> x = zeros(Int, 1_000_000);
julia> using BenchmarkTools
julia> @benchmark reset_x!($x) setup=without_batch(x)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 92.173 μs … 426.870 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 94.362 μs ┊ GC (median): 0.00%
Time (mean ± σ): 94.776 μs ± 3.789 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▂▄▅▆▇██████▇▇▆▅▅▄▃▂▂▂ ▁▁▁ ▁ ▁▁▁▁▁▁▁ ▃
▄▄▄▆███████████████████████████▇██▇▇█▇████████████▇▆▆▇▇▆▆▆▆▆ █
92.2 μs Histogram: log(frequency) by time 101 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark reset_x!($x) setup=with_batch(x)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 185.404 μs … 518.292 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 188.394 μs ┊ GC (median): 0.00%
Time (mean ± σ): 189.119 μs ± 4.086 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▅▇█▇▇▇▅▅▃▂
▁▁▁▂▃▆▇█████████████▇▆▅▅▄▄▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁ ▃
185 μs Histogram: frequency by time 198 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> versioninfo()
Julia Version 1.9.0
Commit 8e630552924 (2023-05-07 11:25 UTC)
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: 128 × AMD Ryzen Threadripper 3990X 64-Core Processor
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-14.0.6 (ORCJIT, znver2)
Threads: 8 on 128 virtual cores
I also tested and reproduced this on an Intel i9-10980XE, but the difference there was only ~10% on 8 threads.
CC: @sloede @ranocha @LasNikas.
I see the same thing with `Threads.@threads`.
julia> using Polyester
julia> function reset_x!(x)
           x .= 0
       end
reset_x! (generic function with 1 method)

julia> function without_batch(x)
           for i in 1:100_000
               if x[i] != 0
                   # This is never executed, x contains only zeros
                   sin(x[i])
               end
           end
       end
without_batch (generic function with 1 method)

julia> function with_batch(x)
           @batch for i in 1:100_000
               if x[i] != 0
                   # This is never executed, x contains only zeros
                   sin(x[i])
               end
           end
       end
with_batch (generic function with 1 method)

julia> function with_thread(x)
           Threads.@threads for i in 1:100_000
               if x[i] != 0
                   # This is never executed, x contains only zeros
                   sin(x[i])
               end
           end
       end
with_thread (generic function with 1 method)
julia> x = zeros(Int, 1_000_000);
julia> using BenchmarkTools
julia> @benchmark reset_x!($x) setup=without_batch(x)
BenchmarkTools.Trial: 2881 samples with 1 evaluation.
Range (min … max): 272.081 μs … 319.111 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 273.518 μs ┊ GC (median): 0.00%
Time (mean ± σ): 273.863 μs ± 2.094 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▅▇▂▄▄██▄▅▅▇▅▁▃▃▂
▆████████████████▆▆▇▆▇▇▆▅█▇▅▄▄▃▃▂▂▁▂▁▁▁▂▂▁▂▁▁▁▂▂▁▂▂▁▁▁▁▁▂▂▂▁▂ ▄
272 μs Histogram: frequency by time 280 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark reset_x!($x) setup=with_batch(x)
BenchmarkTools.Trial: 2809 samples with 1 evaluation.
Range (min … max): 312.843 μs … 350.793 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 316.288 μs ┊ GC (median): 0.00%
Time (mean ± σ): 316.484 μs ± 2.223 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▁▂ ▂▄▄▆▅▅▄▅▅▅▅▇▇▅▇███▆█▆▄▁▁▃▁▂▂▁▂ ▂ ▁ ▁ ▁
▃███▆▅▆▆▄▆██████████████████████████████████████▆█▇▄▄▃▃▃▂▃▄▄▂ ▆
313 μs Histogram: frequency by time 321 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark reset_x!($x) setup=with_thread(x)
BenchmarkTools.Trial: 2063 samples with 1 evaluation.
Range (min … max): 313.300 μs … 392.157 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 359.390 μs ┊ GC (median): 0.00%
Time (mean ± σ): 358.813 μs ± 10.205 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▂▄▄▄█▄▅▄▇▅▅▄▃▂▂ ▁
▂▃▁▁▂▂▂▂▁▃▂▁▂▂▂▂▃▂▂▃▂▃▃▃▄▃▃▄▄▅▅▇███████████████████▆▆▅▄▄▃▃▃▃▃ ▄
313 μs Histogram: frequency by time 381 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
You're right. It's even worse for me:
julia> @benchmark reset_x!($x) setup=without_batch(x)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 91.472 μs … 414.889 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 92.852 μs ┊ GC (median): 0.00%
Time (mean ± σ): 93.118 μs ± 3.508 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▁▃▃▅▅▆▆▇▇████▇▇▆▅▅▄▃▃▂▂▁▁ ▁ ▃
▆████████████████████████████▇▆▆█▇█▇█▇▇█▆▆▆▇▆▆▆▆▇▇███████▇█▇ █
91.5 μs Histogram: log(frequency) by time 97.7 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark reset_x!($x) setup=with_batch(x)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 191.675 μs … 533.093 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 194.245 μs ┊ GC (median): 0.00%
Time (mean ± σ): 194.754 μs ± 3.916 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▃▅▇███▅▄▂
▁▁▁▁▂▃▄▅███████████▆▅▄▃▂▂▂▂▂▂▂▂▁▂▂▂▂▂▁▂▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▃
192 μs Histogram: frequency by time 203 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark reset_x!($x) setup=with_thread(x)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 201.894 μs … 557.973 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 205.595 μs ┊ GC (median): 0.00%
Time (mean ± σ): 206.066 μs ± 6.377 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▁▃▅▆▆█▇██▅▆▄▂
▁▁▁▁▁▂▂▃▄▅▆███████████████▇▅▅▄▄▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▂▁▁▁▁▁▁▁ ▃
202 μs Histogram: frequency by time 213 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
Where should I report this, then? JuliaLang/julia?
It may be interesting to uncover why. Possible ideas:
- Threads take a while to go to sleep; they can hog system resources until then.
- Clock speeds. You could experiment with BIOS settings; I used to lock the clocks to the same speed independent of the number of active cores and disable boosting to mitigate this, but I believe I have since reverted to stock settings.
- Even though the code is never executed, perhaps the prefetchers still fetch the memory, and this causes some synchronization overhead?
You could experiment by playing with https://github.com/JuliaSIMD/ThreadingUtilities.jl/blob/e7f2f4ba725cf8862f42cb34b83916e3561c15f8/src/threadtasks.jl#L24
for Polyester, or
https://docs.julialang.org/en/v1/manual/environment-variables/#JULIA_THREAD_SLEEP_THRESHOLD
for Threads.@threads.
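To make the second suggestion concrete, here is a sketch of how the sleep threshold could be exercised. The values below follow the linked Julia documentation (the variable is read in nanoseconds, or the string `infinite` to keep spinning threads awake forever) and must be set before Julia starts; the example only confirms what was picked up:

```julia
# JULIA_THREAD_SLEEP_THRESHOLD must be set before Julia starts, e.g. from a shell:
#
#   JULIA_THREAD_SLEEP_THRESHOLD=0 julia -t 8         # idle threads sleep immediately
#   JULIA_THREAD_SLEEP_THRESHOLD=infinite julia -t 8  # idle threads spin, never sleep
#
# (Numeric values are interpreted as nanoseconds of spinning before sleeping.)
# Inside Julia, you can only inspect what was set at startup:
println(get(ENV, "JULIA_THREAD_SLEEP_THRESHOLD", "unset"))
```

Comparing the `reset_x!` timings under both extremes should show whether thread wake/sleep behavior is involved at all.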
I tried setting the sleep threshold to 1 and to 0, but it did not help.
I tried this as well.
julia> @benchmark reset_x!($x) setup=(with_batch(x); ThreadingUtilities.sleep_all_tasks())
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 197.295 μs … 540.873 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 200.054 μs ┊ GC (median): 0.00%
Time (mean ± σ): 200.487 μs ± 3.949 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▂▅▆███▅▃▁
▁▁▁▁▂▂▃▄▅▇███████████▅▅▄▄▃▃▂▂▂▂▂▂▂▂▂▂▂▁▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▃
197 μs Histogram: frequency by time 208 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
Wouldn't that rule out your first idea?
> Wouldn't that rule out your first idea?
Yes.
You could try counting CPU cycles with `cpucycle`.
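For example, a sketch assuming the `cpucycle()` function from the CpuId.jl package (which reads the x86 time-stamp counter; the package choice is my assumption, not stated above). If clock frequency drops after a threaded region, wall times and cycle counts for the same work will diverge:

```julia
using CpuId  # assumed dependency providing cpucycle() (RDTSC-based counter)

reset_x!(x) = (x .= 0)

# Measure TSC cycles instead of wall time, so clock-frequency changes
# between runs show up as a cycles-vs-time mismatch.
function cycles(f, args...)
    c0 = cpucycle()
    f(args...)
    return cpucycle() - c0
end

x = zeros(Int, 1_000_000)
@show cycles(reset_x!, x)
```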
I'll come back to this when I have more time. For now, I opened an issue in JuliaLang/julia.
Note that, by default, Polyester uses only one thread per physical core.
You can change this with `@batch per=thread`.
You could also get `Threads.@threads` to use only one thread per physical core by starting Julia with that many threads.
Those experiments would tell you whether the `Polyester.@batch` vs. `Threads.@threads` difference is only a function of the number of threads each uses.
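A minimal sketch of the first experiment, using the `per=thread` option and the same toy loop from earlier in the thread:

```julia
using Polyester

# Same loop as with_batch above, but asking Polyester for one task per
# hardware thread instead of the default one per physical core.
function with_batch_per_thread(x)
    @batch per=thread for i in 1:100_000
        if x[i] != 0
            sin(x[i])  # never executed; x contains only zeros
        end
    end
end

x = zeros(Int, 1_000_000)
with_batch_per_thread(x)  # use as the benchmark `setup`, as with with_batch
```

If the `reset_x!` slowdown with this variant matches the `Threads.@threads` numbers, thread count rather than the threading implementation would be the main variable.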