Polyester.jl
`@batch` slows down other non-`@batch`ed loops with allocations on macOS ARM
Some of my simulations are regularly stopping for about a second when using @batch on macOS ARM.
I could reduce this problem to the following minimal example, but I am now clueless as to how to continue.
using Polyester

function with_batch()
    # Just some loop with @batch with basically no runtime
    @batch for i in 1:2
        nothing
    end

    # This is just to make sure that the allocation in the next loop is not optimized away
    v = [[]]

    # Note that there is no @batch here
    for i in 1:1000
        # Just an allocation
        v[1] = []
    end
end

function without_batch()
    for i in 1:2
        nothing
    end

    v = [[]]

    for i in 1:1000
        v[1] = []
    end
end
Benchmarking yields the following:
julia> @benchmark with_batch()
BenchmarkTools.Trial: 8709 samples with 1 evaluation.
Range (min … max): 16.416 μs … 1.404 s ┊ GC (min … max): 0.00% … 0.47%
Time (median): 18.041 μs ┊ GC (median): 0.00%
Time (mean ± σ): 663.460 μs ± 30.068 ms ┊ GC (mean ± σ): 0.41% ± 0.01%
▁▁▄▇█▇▅▃▁▁ ▁▂▂▃▄▄▄▂▃▃▃▃▃▃▄▄▃▄▃▂▁ ▁ ▂
▂▂▃▅▅▇███████████████████████████████████████████████████▇▆▆ █
16.4 μs Histogram: log(frequency) by time 23.6 μs <
Memory estimate: 46.98 KiB, allocs estimate: 1002.
julia> @benchmark without_batch()
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 14.625 μs … 5.596 ms ┊ GC (min … max): 0.00% … 99.31%
Time (median): 15.166 μs ┊ GC (median): 0.00%
Time (mean ± σ): 18.275 μs ± 110.414 μs ┊ GC (mean ± σ): 12.03% ± 1.98%
█▇▃
▂▂▅███▆▃▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▅▃▂▂▃▂▂▂▂▂▂▃▅▆▄▃▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
14.6 μs Histogram: frequency by time 19.8 μs <
Memory estimate: 46.98 KiB, allocs estimate: 1002.
About one execution out of 2000 takes over one second, which causes the mean to be 30x higher than without any @batch loops. This is consistent with what I see in simulations, where most time steps are fast, but then some take over a second.
This problem is specific to macOS ARM. The same Julia version on an x86 machine works as expected.
On an Intel laptop:
julia> @benchmark with_batch()
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 17.436 μs … 73.138 ms ┊ GC (min … max): 0.00% … 2.37%
Time (median): 20.561 μs ┊ GC (median): 0.00%
Time (mean ± σ): 92.966 μs ± 2.244 ms ┊ GC (mean ± σ): 1.94% ± 0.08%
▂█▄▃▃▃▃▄▁
▁▁▆█████████▆▄▃▃▂▂▂▁▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁ ▂
17.4 μs Histogram: frequency by time 39.1 μs <
Memory estimate: 46.98 KiB, allocs estimate: 1002.
julia> @benchmark without_batch()
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 16.267 μs … 2.121 ms ┊ GC (min … max): 0.00% … 96.71%
Time (median): 19.250 μs ┊ GC (median): 0.00%
Time (mean ± σ): 22.940 μs ± 62.979 μs ┊ GC (mean ± σ): 8.50% ± 3.09%
▁▂▆██▇▇▇▇▇▆▅▄▃▃▃▂▁▁▁▁▁ ▁▂▁▂▂▁ ▁ ▁ ▁▁▁ ▃
███████████████████████▇███████████████████▇██▇▆▆▇▇▇▇▇▅▆▆▆▇ █
16.3 μs Histogram: log(frequency) by time 40.2 μs <
Memory estimate: 46.98 KiB, allocs estimate: 1002.
Not as extreme, but the problem still exists.
One workaround is to set a minbatch size:
julia> function with_minbatch()
           # Just some loop with @batch with basically no runtime
           @batch minbatch=100 for i in 1:2
               nothing
           end

           # This is just to make sure that the allocation in the next loop is not optimized away
           v = [[]]

           # Note that there is no @batch here
           for i in 1:1000
               # Just an allocation
               v[1] = []
           end
       end
with_minbatch (generic function with 1 method)
julia> @benchmark with_minbatch()
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 16.096 μs … 2.231 ms ┊ GC (min … max): 0.00% … 98.34%
Time (median): 17.549 μs ┊ GC (median): 0.00%
Time (mean ± σ): 20.675 μs ± 63.241 μs ┊ GC (mean ± σ): 9.52% ± 3.09%
▁▃▅▇█▇▆▅▅▃▃▂▁▁ ▁ ▁▁▂▂▁ ▂
████████████████▇▆▆▆▆▄▂▅▂▄▅▅▆▄▄▅▄▃▄▂▆▇▇███████▇▆▆▅▄▄▅▃▅▇▆▆▅ █
16.1 μs Histogram: log(frequency) by time 33.2 μs <
Memory estimate: 46.98 KiB, allocs estimate: 1002.
This means we'd need at least 100 iterations per thread before `@batch` actually splits the loop across threads; with only 2 iterations, the loop above runs serially.
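To check that the tiny loop really stays serial under `minbatch`, one can record the thread id per iteration. This is my own sketch; the expectation in the comment is my reading of the `minbatch` docs, not taken from Polyester's internals:

```julia
using Polyester

# Record which thread runs each iteration. With minbatch=100 and only
# two iterations, Polyester should keep the whole loop on the calling
# thread, so no worker tasks are woken at all.
function who_runs()
    tids = zeros(Int, 2)
    @batch minbatch=100 for i in 1:2
        tids[i] = Threads.threadid()
    end
    tids
end
```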
Thanks for the quick reply.
Unfortunately, this workaround does not work for me. I use @batch in the main loop of the simulation, where I loop over thousands of particles. Then another, smaller loop (which doesn't even use @batch, because its performance is irrelevant compared to the main loop) suddenly slows the whole simulation down significantly.
function with_batch_sleep()
    # Just some loop with @batch with basically no runtime
    @batch for i in 1:2
        nothing
    end

    ThreadingUtilities.sleep_all_tasks()

    # This is just to make sure that the allocation in the next loop is not optimized away
    v = [[]]

    # Note that there is no @batch here
    for i in 1:1000
        # Just an allocation
        v[1] = []
    end
end
I get
julia> @benchmark with_batch_sleep()
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 16.542 μs … 1.063 ms ┊ GC (min … max): 0.00% … 90.75%
Time (median): 18.041 μs ┊ GC (median): 0.00%
Time (mean ± σ): 19.843 μs ± 31.948 μs ┊ GC (mean ± σ): 4.85% ± 2.99%
█▇ ▂▁▃▁ ▁
▆████████▇▇▆▆▃▅▅▅█▅▅▄▃▁▁▃▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▁▁▄▄▃▃▁▁▄▅ █
16.5 μs Histogram: log(frequency) by time 61.4 μs <
Memory estimate: 47.02 KiB, allocs estimate: 1003.
julia> versioninfo()
Julia Version 1.9.0-DEV.1073
Commit 0b9eda116d* (2022-08-01 14:27 UTC)
Platform Info:
OS: macOS (arm64-apple-darwin21.5.0)
CPU: 8 × Apple M1
It's ridiculous that this is slow:
julia> function with_thread()
           Threads.@threads for i in 1:2
               nothing
           end

           # This is just to make sure that the allocation in the next loop is not optimized away
           v = [[]]

           # Note that there is no @batch here
           for i in 1:1000
               # Just an allocation
               v[1] = []
           end
       end
with_thread (generic function with 1 method)
julia> @benchmark with_thread()
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 20.250 μs … 1.244 ms ┊ GC (min … max): 0.00% … 90.11%
Time (median): 52.875 μs ┊ GC (median): 0.00%
Time (mean ± σ): 55.113 μs ± 38.817 μs ┊ GC (mean ± σ): 2.25% ± 3.08%
▂ ▄█
▂▁▁▁▁▁▂▂▁▁▁▁▁▂▁▂▂▂▂▂▂▂▂▂▂▂▂▂▂▄▄▃▅█▇██▇▆▅▄▃▃▃▃▄▄▃▃▃▂▂▂▂▂▂▂▂▂ ▃
20.2 μs Histogram: frequency by time 73.5 μs <
Memory estimate: 49.17 KiB, allocs estimate: 1025.
=/
I think ThreadingUtilities.sleep_all_tasks() should be exported by Polyester, and mentioned prominently in the README as the likely fix to any unexpected slowdowns.
Amazing, thank you! ThreadingUtilities.sleep_all_tasks() seems to be working on my system with ARM.
julia> @benchmark with_batch()
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 16.417 μs … 5.805 ms ┊ GC (min … max): 0.00% … 99.06%
Time (median): 18.000 μs ┊ GC (median): 0.00%
Time (mean ± σ): 26.120 μs ± 114.690 μs ┊ GC (mean ± σ): 8.66% ± 1.98%
█▅▃▄▄▅▄▄▂▁ ▁ ▁▂▄▃▁ ▂
▇███████████▆▆▆▆█▇▇▇▅▅▅▄▃▃▃▄▁▃▄▃▃▄▄▄▄▅▅▄▃▃▄▄▁▄▃▃▃▄▄▅▅▆▆█████ █
16.4 μs Histogram: log(frequency) by time 64.2 μs <
Memory estimate: 47.02 KiB, allocs estimate: 1003.
Interestingly, with Threads.@threads I only get the same ~3x slowdown that you see, not the ridiculous factor of 30 that I get with @batch.
julia> @benchmark with_thread()
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 23.625 μs … 6.863 ms ┊ GC (min … max): 0.00% … 97.57%
Time (median): 65.500 μs ┊ GC (median): 0.00%
Time (mean ± σ): 67.272 μs ± 135.149 μs ┊ GC (mean ± σ): 3.98% ± 1.97%
▄▆██▃
▂▁▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▃▃▄▄▄▅▅▅▅▅▄▄▄▆██████▇▆▅▅▄▅▄▄▃▃▃▃▃▃▃▂ ▃
23.6 μs Histogram: frequency by time 84.6 μs <
Memory estimate: 51.09 KiB, allocs estimate: 1051.
Polyester/ThreadingUtilities block excess threads for a few milliseconds while looking for work to do.
sleep_all_tasks makes them go to sleep.
Base threading does as well, but for not as long: 16 microseconds (https://github.com/JuliaLang/julia/blob/e1fa6a51e4142fbf25019b1c95ebc3b5b7f4e8a1/src/options.h#L129). That is actually fast enough that, on systems with many threads, waking them all at once means the first woken threads are already falling asleep again by the time you get to the last ones.
But going to sleep more quickly can help other things, like here.
Presumably, something wants to run on these threads periodically.
You can change ThreadingUtilities' default behavior here: https://github.com/JuliaSIMD/ThreadingUtilities.jl/blob/3991a7e80781dafb9f1f77bf169c7da7a5d89981/src/threadtasks.jl#L24
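As an aside on the Base side: Julia documents an environment variable for tuning how long its own threads spin before sleeping (a value in nanoseconds, or "infinite" to never sleep). To my knowledge it only affects Base's threads, not ThreadingUtilities' tasks:

```shell
# Spin for 1 ms before sleeping; affects Threads.@threads workers only,
# not the tasks managed by Polyester/ThreadingUtilities.
JULIA_THREAD_SLEEP_THRESHOLD=1000000 julia --threads=8
```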
I think we can close this issue once someone adds a section on the README (preferably close to the top, as it's an important gotcha).
PRs welcome :).
I would create a PR, but I still don't fully understand the problem that you explained in your last comment. How is the longer sleep threshold of Polyester problematic here? What is the consequence of threads falling asleep with a shorter threshold? Why is Polyester/ThreadingUtilities not doing that by default? And how can ThreadingUtilities' default behaviour be changed without modifying its code?
It could be simple and merely suggest trying it when you see unexpected regressions.
How is the longer sleep threshold of Polyester problematic here?
I am not sure why. This does make me think I perhaps need to decrease the threshold. The pattern is also interesting, because the median time seems fine. It's only occasionally extremely problematic.
This suggests that maybe only occasionally the loop wants to use another thread, perhaps related to GC, and when this happens, it has to wait for ThreadingUtilities' tasks to go to sleep.
What is the consequence of threads falling asleep with a shorter threshold?
If the threads are awake when you assign them work, e.g. through @batch, @tturbo, Octavian.matmul, or any code using these, they can start work immediately rather than having to wait to be woken up and scheduled.
Consider these benchmarks on an Intel (Cascadelake [i.e., Skylake-AVX512 clone]) CPU:
julia> function batch()
           # Just some loop with @batch with basically no runtime
           @batch for i in 1:2
               nothing
           end
       end
batch (generic function with 1 method)

julia> function batch_sleep()
           # Just some loop with @batch with basically no runtime
           @batch for i in 1:2
               nothing
           end

           ThreadingUtilities.sleep_all_tasks()
       end
batch_sleep (generic function with 1 method)
julia> @benchmark batch()
BenchmarkTools.Trial: 10000 samples with 656 evaluations.
Range (min … max): 183.107 ns … 242.933 ns ┊ GC (min … max): 0.00% … 0.00%
Time (median): 185.463 ns ┊ GC (median): 0.00%
Time (mean ± σ): 188.037 ns ± 7.939 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▁▆█▇▄▅▄▄▅▄▂▂ ▁ ▁ ▂
█████████████▇▆▇▇▇▇▆▇▅▅▆▇▇▆▇▆▅▅▆▇▇▆▅▆▄▄▃▄▁▅▆▃▃▁▃▄▃▁▁▄▁▃▃███▇█ █
183 ns Histogram: log(frequency) by time 229 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark batch_sleep()
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
Range (min … max): 1.652 μs … 4.131 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 1.772 μs ┊ GC (median): 0.00%
Time (mean ± σ): 1.792 μs ± 62.787 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▃█▆▁
▂▁▁▁▁▂▂▂▂▂▂▂▃▃▃▃▄▆████▇▅▅▄▃▃▄▆██▆▆▅▄▃▃▂▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▃
1.65 μs Histogram: frequency by time 1.96 μs <
Memory estimate: 28 bytes, allocs estimate: 0.
On a Zen3 CPU:
julia> @benchmark batch()
BenchmarkTools.Trial: 10000 samples with 38 evaluations.
Range (min … max): 880.237 ns … 10.576 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 920.763 ns ┊ GC (median): 0.00%
Time (mean ± σ): 927.305 ns ± 100.294 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▆█▅▃▂▁▁ ▂▁ ▁
▃▃▃▁▁▁▁▁▁▄▃▁▅▆▇██████████▇▇▇▇▇▇▆▇▆▆▆▅▅▅▆▅▄▆▄▇██▅▅▃▄▅▄▅▁▅▄▃▄▃▃ █
880 ns Histogram: log(frequency) by time 1.03 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark batch_sleep()
BenchmarkTools.Trial: 10000 samples with 9 evaluations.
Range (min … max): 2.543 μs … 6.771 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 3.343 μs ┊ GC (median): 0.00%
Time (mean ± σ): 3.195 μs ± 329.152 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▂ ▁▃▆▇█▆▄▂
▁▂▂███▅▃▄▄▄▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▂▂▂▂▂▂▂▃▃▄▅████████▇▆▄▃▂▂▂▁▁▁▁ ▃
2.54 μs Histogram: frequency by time 3.65 μs <
Memory estimate: 24 bytes, allocs estimate: 0.
And finally, on my M1:
julia> @benchmark batch()
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
Range (min … max): 1.796 μs … 6.125 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 2.604 μs ┊ GC (median): 0.00%
Time (mean ± σ): 2.610 μs ± 70.689 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▂▃█▃ ▁
▃▁▁▁▁▁▁▃▁▁▁▁▃▁▃▁▁▃▁▁▁▁▃▁▁▁▁▁▁▄▁▁▁▁▁▄▆▁▁▁▆████▄▆▄▃▃▆▃▁▁▃▃▁█ █
1.8 μs Histogram: log(frequency) by time 2.87 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark batch_sleep()
BenchmarkTools.Trial: 10000 samples with 9 evaluations.
Range (min … max): 2.218 μs … 8.690 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 2.597 μs ┊ GC (median): 0.00%
Time (mean ± σ): 2.606 μs ± 93.115 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▁▂█▂ ▁ ▁
▃▁▁▃▁▃▃▄▁▁▃▃▃▃▁▃▁▄▄█▅▄▁▅▃▅▅▄▄▆████▆▅▄▃▅▇███▇▆█▄▄▅▄▄▅█▃▄▁▁▄ █
2.22 μs Histogram: log(frequency) by time 2.89 μs <
Memory estimate: 28 bytes, allocs estimate: 0.
The M1 is much slower than the x86 CPUs here. I don't know if it's a problem with how ThreadingUtilities works on the M1, but I have known for a while that threading has substantially higher overhead on it than on my x86 CPUs. So perhaps I should make it sleep far more quickly: sleeping and not sleeping both took about 2.6 microseconds median, so there seems to be little benefit to staying awake. That is unlike the Intel and AMD CPUs, which can shave off a good chunk of overhead if the pause between repeated threaded regions is brief.
And how can ThreadingUtilities' default behaviour be changed without modifying its code?
It cannot currently.
It's interesting that some evaluations take over a second longer in my initial example, even though the sleep timeout is just a millisecond. It seems like there is something preventing the threads from going to sleep, right?
Has anyone posted this to JuliaLang/julia yet, since it affects Threads.@threads as well?
Unfortunately, there still doesn't seem to be a good solution after over a year.
I tried integrating the sleep_all_tasks workaround into our code, but I wasn't really successful. The only way to really get rid of the lagging was to define a macro that calls ThreadingUtilities.sleep_all_tasks() after EVERY @batch loop. But calling it this often slows down the @batch loops themselves.
using Polyester
using ThreadingUtilities

function with_batch()
    # Just some loop with @batch with basically no runtime
    @batch for i in 1:2
        nothing
    end

    # This is just to make sure that the allocation in the next loop is not optimized away
    v = [[]]

    # Note that there is no @batch here
    for i in 1:1000
        # Just an allocation
        v[1] = []
    end
end

function with_batch_sleep()
    @batch for i in 1:2
        nothing
    end

    ThreadingUtilities.sleep_all_tasks()

    v = [[]]
    for i in 1:1000
        v[1] = []
    end
end

function batch_without_allocations()
    @batch for i in 1:1000
        i^3
    end
end

function batch_sleep_without_allocations()
    @batch for i in 1:1000
        i^3
    end
    ThreadingUtilities.sleep_all_tasks()
end
julia> @benchmark with_batch()
BenchmarkTools.Trial: 3768 samples with 1 evaluation.
Range (min … max): 15.250 μs … 1.401 s ┊ GC (min … max): 0.00% … 0.25%
Time (median): 16.667 μs ┊ GC (median): 0.00%
Time (mean ± σ): 1.501 ms ± 45.537 ms ┊ GC (mean ± σ): 0.25% ± 0.01%
▃▅▆█▇▅▄▂ ▁
▃▄▂▄▆▄▆███████████▇▇▇▆▅▆▇▇▇▆▅▄▅▆▅▅▇▇▆▆▆▆▄▆▂▅▅▅▅▄▄▆▄▅▄▅▄▄▄▄▂ █
15.2 μs Histogram: log(frequency) by time 21.8 μs <
Memory estimate: 46.98 KiB, allocs estimate: 1002.
julia> @benchmark with_batch_sleep()
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 15.375 μs … 3.893 ms ┊ GC (min … max): 0.00% … 97.67%
Time (median): 18.000 μs ┊ GC (median): 0.00%
Time (mean ± σ): 21.760 μs ± 108.743 μs ┊ GC (mean ± σ): 15.47% ± 3.08%
▁▂▁ ▁▁▁▃▅▇█▆▄▂▂▁▁▂▂▂▁ ▁▁ ▁ ▁▁ ▁ ▂
▄▁▃▅▆▆█████████████████████████████████▇█▇█▇▇▇▇▇▆▆▆▆▆▆▅▆▆▆▃▃ █
15.4 μs Histogram: log(frequency) by time 24.4 μs <
Memory estimate: 47.02 KiB, allocs estimate: 1003.
julia> @benchmark batch_without_allocations() evals=100
BenchmarkTools.Trial: 10000 samples with 100 evaluations.
Range (min … max): 1.406 μs … 255.950 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 2.660 μs ┊ GC (median): 0.00%
Time (mean ± σ): 2.743 μs ± 2.831 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▁▁▁▂▃▄█▆▅▅▃▂▁▁ ▂
▅▅▄▆▇▇▇▆▇▇▆▆▆▇▆▇▆▇▇▆▇▇█▇▇█████████████████▇▇▆▅▅▄▃▅▄▄▁▅▄▄▄▃▄ █
1.41 μs Histogram: log(frequency) by time 3.61 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark batch_sleep_without_allocations() evals=100
BenchmarkTools.Trial: 9092 samples with 100 evaluations.
Range (min … max): 4.000 μs … 65.567 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 5.366 μs ┊ GC (median): 0.00%
Time (mean ± σ): 5.492 μs ± 1.186 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
█
▂▅▇▆▆▆▅▄▃▄▄▃▄▄▄▅▄▅▅▅▄▃▂▂▂▃▆▅▃▃▂▂▂▂▂▁▂▂▁▂▂▂▂▂▃▄▄▄▇█▆▅▄▃▂▂▂▂ ▃
4 μs Histogram: frequency by time 7.13 μs <
Memory estimate: 159 bytes, allocs estimate: 4.
While sleep_all_tasks removes the ~1 s runs that destroy the mean and cause lagging in the simulations, it also slows down other threaded loops significantly.
Is there any better way by now?
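For reference, the wrapper macro I mean looks roughly like this. It is my own sketch, not something Polyester provides:

```julia
using Polyester, ThreadingUtilities

# Run a loop with @batch, then immediately put the worker tasks back to
# sleep so they don't spin-wait while the serial code that follows runs.
macro batch_sleep(loop)
    esc(quote
        Polyester.@batch $loop
        ThreadingUtilities.sleep_all_tasks()
    end)
end

# Usage:
# @batch_sleep for i in eachindex(x)
#     x[i] = f(i)
# end
```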
It seems that this is fixed in 1.10?
julia> versioninfo()
Julia Version 1.10.0
Commit 3120989f39b (2023-12-25 18:01 UTC)
Build Info:
Official https://julialang.org/ release
Platform Info:
OS: macOS (arm64-apple-darwin22.4.0)
CPU: 10 × Apple M2 Pro
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-15.0.7 (ORCJIT, apple-m1)
Threads: 8 on 6 virtual cores
julia> @benchmark with_batch()
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 13.541 μs … 11.404 ms ┊ GC (min … max): 0.00% … 4.58%
Time (median): 14.000 μs ┊ GC (median): 0.00%
Time (mean ± σ): 25.751 μs ± 353.858 μs ┊ GC (mean ± σ): 1.78% ± 0.13%
▃▆█▆▆▅▂ ▁ ▁▁ ▁ ▁
█████████▇▆▇▇███████▇▇████████▇▆▆▅▆▄▆▅▅▅▄▅▄▅▄▅▅▅▅▃▄▅▅▅▄▄▆▆▅▃ █
13.5 μs Histogram: log(frequency) by time 23.4 μs <
Memory estimate: 46.98 KiB, allocs estimate: 1002.
julia> @benchmark without_batch()
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 12.875 μs … 682.167 μs ┊ GC (min … max): 0.00% … 95.69%
Time (median): 13.333 μs ┊ GC (median): 0.00%
Time (mean ± σ): 14.370 μs ± 20.023 μs ┊ GC (mean ± σ): 4.31% ± 3.04%
▃▅█▇▆▄▂ ▁
█████████▇▅▆▆▆▇█▇▇▇██████▇█▇▇▇▇▇▆▅▆▆▅▅▂▅▄▅▆▄▅▅▄▄▄▅▅▅▄▅▅▅▄▄▂▄ █
12.9 μs Histogram: log(frequency) by time 21.5 μs <
Memory estimate: 46.98 KiB, allocs estimate: 1002.
julia> versioninfo()
Julia Version 1.9.4
Commit 8e5136fa297 (2023-11-14 08:46 UTC)
Build Info:
Official https://julialang.org/ release
Platform Info:
OS: macOS (arm64-apple-darwin22.4.0)
CPU: 10 × Apple M2 Pro
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-14.0.6 (ORCJIT, apple-m1)
Threads: 6 on 6 virtual cores
julia> @benchmark with_batch()
BenchmarkTools.Trial: 3768 samples with 1 evaluation.
Range (min … max): 14.125 μs … 1.404 s ┊ GC (min … max): 0.00% … 0.05%
Time (median): 16.708 μs ┊ GC (median): 0.00%
Time (mean ± σ): 1.474 ms ± 44.685 ms ┊ GC (mean ± σ): 0.06% ± 0.00%
▄█▅▁▁▃▆█▂ ▁▁▂ ▁▂▂▁▁ ▂▁▁▁▃▃▄▃▃▁ ▁
▄▁▁▃█████████▅▆███████████████████████▇▆▄▅▅▆▃▆▅▅▆▇▆▅▅▅▇▆▆▅▆ █
14.1 μs Histogram: log(frequency) by time 28 μs <
Memory estimate: 46.98 KiB, allocs estimate: 1002.
julia> @benchmark without_batch()
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 13.458 μs … 738.375 μs ┊ GC (min … max): 0.00% … 96.14%
Time (median): 13.750 μs ┊ GC (median): 0.00%
Time (mean ± σ): 14.783 μs ± 19.694 μs ┊ GC (mean ± σ): 4.10% ± 3.02%
▅██▄▂▁▁ ▃▃▃▁ ▂
████████▆██▇▇▆██████▆▇▆▅▅██▇███▇▆█▆▆▅▅▆▅▆▃▄▅▅▄▄▅▄▅▆▄▄▅▅▄▁▄▃▄ █
13.5 μs Histogram: log(frequency) by time 22 μs <
Memory estimate: 46.98 KiB, allocs estimate: 1002.
I ran the code in the very first post.