Polyester.jl
`@batch` slows down other non-`@batch`ed loops with allocations on macOS ARM
Some of my simulations are regularly stopping for about a second when using @batch on macOS ARM.
I could reduce this problem to the following minimal example, but I am now clueless as to how to continue.
using Polyester

function with_batch()
    # Just some loop with @batch with basically no runtime
    @batch for i in 1:2
        nothing
    end

    # This is just to make sure that the allocation in the next loop is not optimized away
    v = [[]]

    # Note that there is no @batch here
    for i in 1:1000
        # Just an allocation
        v[1] = []
    end
end

function without_batch()
    for i in 1:2
        nothing
    end

    v = [[]]

    for i in 1:1000
        v[1] = []
    end
end
Benchmarking yields the following:
julia> @benchmark with_batch()
BenchmarkTools.Trial: 8709 samples with 1 evaluation.
Range (min … max): 16.416 μs … 1.404 s ┊ GC (min … max): 0.00% … 0.47%
Time (median): 18.041 μs ┊ GC (median): 0.00%
Time (mean ± σ): 663.460 μs ± 30.068 ms ┊ GC (mean ± σ): 0.41% ± 0.01%
▁▁▄▇█▇▅▃▁▁ ▁▂▂▃▄▄▄▂▃▃▃▃▃▃▄▄▃▄▃▂▁ ▁ ▂
▂▂▃▅▅▇███████████████████████████████████████████████████▇▆▆ █
16.4 μs Histogram: log(frequency) by time 23.6 μs <
Memory estimate: 46.98 KiB, allocs estimate: 1002.
julia> @benchmark without_batch()
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 14.625 μs … 5.596 ms ┊ GC (min … max): 0.00% … 99.31%
Time (median): 15.166 μs ┊ GC (median): 0.00%
Time (mean ± σ): 18.275 μs ± 110.414 μs ┊ GC (mean ± σ): 12.03% ± 1.98%
█▇▃
▂▂▅███▆▃▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▅▃▂▂▃▂▂▂▂▂▂▃▅▆▄▃▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
14.6 μs Histogram: frequency by time 19.8 μs <
Memory estimate: 46.98 KiB, allocs estimate: 1002.
About one execution out of 2000 takes over one second, which causes the mean to be 30x higher than without any @batch loops. This is consistent with what I see in simulations, where most time steps are fast, but then some take over a second.
This problem is specific to macOS ARM. The same Julia version on an x86 machine works as expected.
On an Intel laptop:
julia> @benchmark with_batch()
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 17.436 μs … 73.138 ms ┊ GC (min … max): 0.00% … 2.37%
Time (median): 20.561 μs ┊ GC (median): 0.00%
Time (mean ± σ): 92.966 μs ± 2.244 ms ┊ GC (mean ± σ): 1.94% ± 0.08%
▂█▄▃▃▃▃▄▁
▁▁▆█████████▆▄▃▃▂▂▂▁▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁ ▂
17.4 μs Histogram: frequency by time 39.1 μs <
Memory estimate: 46.98 KiB, allocs estimate: 1002.
julia> @benchmark without_batch()
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 16.267 μs … 2.121 ms ┊ GC (min … max): 0.00% … 96.71%
Time (median): 19.250 μs ┊ GC (median): 0.00%
Time (mean ± σ): 22.940 μs ± 62.979 μs ┊ GC (mean ± σ): 8.50% ± 3.09%
▁▂▆██▇▇▇▇▇▆▅▄▃▃▃▂▁▁▁▁▁ ▁▂▁▂▂▁ ▁ ▁ ▁▁▁ ▃
███████████████████████▇███████████████████▇██▇▆▆▇▇▇▇▇▅▆▆▆▇ █
16.3 μs Histogram: log(frequency) by time 40.2 μs <
Memory estimate: 46.98 KiB, allocs estimate: 1002.
Not as extreme, but the problem still exists.
One workaround is to set a minbatch size:
julia> function with_minbatch()
           # Just some loop with @batch with basically no runtime
           @batch minbatch=100 for i in 1:2
               nothing
           end

           # This is just to make sure that the allocation in the next loop is not optimized away
           v = [[]]

           # Note that there is no @batch here
           for i in 1:1000
               # Just an allocation
               v[1] = []
           end
       end
with_minbatch (generic function with 1 method)
julia> @benchmark with_minbatch()
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 16.096 μs … 2.231 ms ┊ GC (min … max): 0.00% … 98.34%
Time (median): 17.549 μs ┊ GC (median): 0.00%
Time (mean ± σ): 20.675 μs ± 63.241 μs ┊ GC (mean ± σ): 9.52% ± 3.09%
▁▃▅▇█▇▆▅▅▃▃▂▁▁ ▁ ▁▁▂▂▁ ▂
████████████████▇▆▆▆▆▄▂▅▂▄▅▅▆▄▄▅▄▃▄▂▆▇▇███████▇▆▆▅▄▄▅▃▅▇▆▆▅ █
16.1 μs Histogram: log(frequency) by time 33.2 μs <
Memory estimate: 46.98 KiB, allocs estimate: 1002.
This means we'd need at least 100 iterations per thread before `@batch` actually splits the loop across threads; with only 2 iterations, the loop above runs serially.
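To check that the tiny loop really stays serial under `minbatch`, one can record the thread id per iteration. This is my own sketch; the expectation in the comment is my reading of the `minbatch` docs, not taken from Polyester's internals:

```julia
using Polyester

# Record which thread runs each iteration. With minbatch=100 and only
# two iterations, Polyester should keep the whole loop on the calling
# thread, so no worker tasks are woken at all.
function who_runs()
    tids = zeros(Int, 2)
    @batch minbatch=100 for i in 1:2
        tids[i] = Threads.threadid()
    end
    tids
end
```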
Thanks for the quick reply.
Unfortunately, this workaround does not work for me. I use @batch in the main loop of the simulation, where I loop over thousands of particles. Then another, smaller loop (which doesn't even use @batch, because its performance is irrelevant compared to the main loop) suddenly slows the whole simulation down significantly.
function with_batch_sleep()
    # Just some loop with @batch with basically no runtime
    @batch for i in 1:2
        nothing
    end

    ThreadingUtilities.sleep_all_tasks()

    # This is just to make sure that the allocation in the next loop is not optimized away
    v = [[]]

    # Note that there is no @batch here
    for i in 1:1000
        # Just an allocation
        v[1] = []
    end
end
I get
julia> @benchmark with_batch_sleep()
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 16.542 μs … 1.063 ms ┊ GC (min … max): 0.00% … 90.75%
Time (median): 18.041 μs ┊ GC (median): 0.00%
Time (mean ± σ): 19.843 μs ± 31.948 μs ┊ GC (mean ± σ): 4.85% ± 2.99%
█▇ ▂▁▃▁ ▁
▆████████▇▇▆▆▃▅▅▅█▅▅▄▃▁▁▃▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▁▁▄▄▃▃▁▁▄▅ █
16.5 μs Histogram: log(frequency) by time 61.4 μs <
Memory estimate: 47.02 KiB, allocs estimate: 1003.
julia> versioninfo()
Julia Version 1.9.0-DEV.1073
Commit 0b9eda116d* (2022-08-01 14:27 UTC)
Platform Info:
OS: macOS (arm64-apple-darwin21.5.0)
CPU: 8 × Apple M1
It's ridiculous that this is slow:
julia> function with_thread()
           Threads.@threads for i in 1:2
               nothing
           end

           # This is just to make sure that the allocation in the next loop is not optimized away
           v = [[]]

           # Note that there is no @batch here
           for i in 1:1000
               # Just an allocation
               v[1] = []
           end
       end
with_thread (generic function with 1 method)
julia> @benchmark with_thread()
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 20.250 μs … 1.244 ms ┊ GC (min … max): 0.00% … 90.11%
Time (median): 52.875 μs ┊ GC (median): 0.00%
Time (mean ± σ): 55.113 μs ± 38.817 μs ┊ GC (mean ± σ): 2.25% ± 3.08%
▂ ▄█
▂▁▁▁▁▁▂▂▁▁▁▁▁▂▁▂▂▂▂▂▂▂▂▂▂▂▂▂▂▄▄▃▅█▇██▇▆▅▄▃▃▃▃▄▄▃▃▃▂▂▂▂▂▂▂▂▂ ▃
20.2 μs Histogram: frequency by time 73.5 μs <
Memory estimate: 49.17 KiB, allocs estimate: 1025.
=/
I think ThreadingUtilities.sleep_all_tasks() should be exported by Polyester, and mentioned prominently in the README as the likely fix to any unexpected slowdowns.
Amazing, thank you! ThreadingUtilities.sleep_all_tasks() seems to be working on my system with ARM.
julia> @benchmark with_batch()
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 16.417 μs … 5.805 ms ┊ GC (min … max): 0.00% … 99.06%
Time (median): 18.000 μs ┊ GC (median): 0.00%
Time (mean ± σ): 26.120 μs ± 114.690 μs ┊ GC (mean ± σ): 8.66% ± 1.98%
█▅▃▄▄▅▄▄▂▁ ▁ ▁▂▄▃▁ ▂
▇███████████▆▆▆▆█▇▇▇▅▅▅▄▃▃▃▄▁▃▄▃▃▄▄▄▄▅▅▄▃▃▄▄▁▄▃▃▃▄▄▅▅▆▆█████ █
16.4 μs Histogram: log(frequency) by time 64.2 μs <
Memory estimate: 47.02 KiB, allocs estimate: 1003.
Interestingly, with Threads.@threads I only get the same ~3x slowdown that you see, not the ridiculous factor of 30 that I get with @batch.
julia> @benchmark with_thread()
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 23.625 μs … 6.863 ms ┊ GC (min … max): 0.00% … 97.57%
Time (median): 65.500 μs ┊ GC (median): 0.00%
Time (mean ± σ): 67.272 μs ± 135.149 μs ┊ GC (mean ± σ): 3.98% ± 1.97%
▄▆██▃
▂▁▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▃▃▄▄▄▅▅▅▅▅▄▄▄▆██████▇▆▅▅▄▅▄▄▃▃▃▃▃▃▃▂ ▃
23.6 μs Histogram: frequency by time 84.6 μs <
Memory estimate: 51.09 KiB, allocs estimate: 1051.
Polyester/ThreadingUtilities block excess threads for a few milliseconds while looking for work to do.
sleep_all_tasks makes them go to sleep.
Base threading does as well, but for not as long: 16 microseconds (https://github.com/JuliaLang/julia/blob/e1fa6a51e4142fbf25019b1c95ebc3b5b7f4e8a1/src/options.h#L129). That is actually fast enough that, on systems with many threads, waking them all at once means the first woken threads are already falling asleep again by the time you get to the last ones.
But going to sleep more quickly can help other things, like here.
Presumably, something wants to run on these threads periodically.
You can change ThreadingUtilities' default behavior here: https://github.com/JuliaSIMD/ThreadingUtilities.jl/blob/3991a7e80781dafb9f1f77bf169c7da7a5d89981/src/threadtasks.jl#L24
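As an aside on the Base side: Julia documents an environment variable for tuning how long its own threads spin before sleeping (a value in nanoseconds, or "infinite" to never sleep). To my knowledge it only affects Base's threads, not ThreadingUtilities' tasks:

```shell
# Spin for 1 ms before sleeping; affects Threads.@threads workers only,
# not the tasks managed by Polyester/ThreadingUtilities.
JULIA_THREAD_SLEEP_THRESHOLD=1000000 julia --threads=8
```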
I think we can close this issue once someone adds a section on the README (preferably close to the top, as it's an important gotcha).
PRs welcome :).
I would create a PR, but I still don't fully understand the problem that you explained in your last comment. How is the longer sleep threshold of Polyester problematic here? What is the consequence of threads falling asleep with a shorter threshold? Why is Polyester/ThreadingUtilities not doing that by default? And how can ThreadingUtilities' default behaviour be changed without modifying its code?
It could be simple and merely suggest trying it when you see unexpected regressions.
How is the longer sleep threshold of Polyester problematic here?
I am not sure why. This does make me think I perhaps need to decrease the threshold. The pattern is also interesting, because the median time seems fine. It's only occasionally extremely problematic.
This suggests that maybe only occasionally the loop wants to use another thread, perhaps related to GC, and when this happens, it has to wait for ThreadingUtilities' tasks to go to sleep.
What is the consequence of threads falling asleep with a shorter threshold?
If the threads are awake when you assign them work, e.g. through @batch, @tturbo, Octavian.matmul, or any code using these, they can start work immediately rather than having to wait to be woken up and scheduled.
Consider these benchmarks on an Intel (Cascadelake [i.e., Skylake-AVX512 clone]) CPU:
julia> function batch()
           # Just some loop with @batch with basically no runtime
           @batch for i in 1:2
               nothing
           end
       end
batch (generic function with 1 method)

julia> function batch_sleep()
           # Just some loop with @batch with basically no runtime
           @batch for i in 1:2
               nothing
           end

           ThreadingUtilities.sleep_all_tasks()
       end
batch_sleep (generic function with 1 method)
julia> @benchmark batch()
BenchmarkTools.Trial: 10000 samples with 656 evaluations.
Range (min … max): 183.107 ns … 242.933 ns ┊ GC (min … max): 0.00% … 0.00%
Time (median): 185.463 ns ┊ GC (median): 0.00%
Time (mean ± σ): 188.037 ns ± 7.939 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▁▆█▇▄▅▄▄▅▄▂▂ ▁ ▁ ▂
█████████████▇▆▇▇▇▇▆▇▅▅▆▇▇▆▇▆▅▅▆▇▇▆▅▆▄▄▃▄▁▅▆▃▃▁▃▄▃▁▁▄▁▃▃███▇█ █
183 ns Histogram: log(frequency) by time 229 ns <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark batch_sleep()
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
Range (min … max): 1.652 μs … 4.131 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 1.772 μs ┊ GC (median): 0.00%
Time (mean ± σ): 1.792 μs ± 62.787 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▃█▆▁
▂▁▁▁▁▂▂▂▂▂▂▂▃▃▃▃▄▆████▇▅▅▄▃▃▄▆██▆▆▅▄▃▃▂▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂ ▃
1.65 μs Histogram: frequency by time 1.96 μs <
Memory estimate: 28 bytes, allocs estimate: 0.
On a Zen3 CPU:
julia> @benchmark batch()
BenchmarkTools.Trial: 10000 samples with 38 evaluations.
Range (min … max): 880.237 ns … 10.576 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 920.763 ns ┊ GC (median): 0.00%
Time (mean ± σ): 927.305 ns ± 100.294 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▆█▅▃▂▁▁ ▂▁ ▁
▃▃▃▁▁▁▁▁▁▄▃▁▅▆▇██████████▇▇▇▇▇▇▆▇▆▆▆▅▅▅▆▅▄▆▄▇██▅▅▃▄▅▄▅▁▅▄▃▄▃▃ █
880 ns Histogram: log(frequency) by time 1.03 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark batch_sleep()
BenchmarkTools.Trial: 10000 samples with 9 evaluations.
Range (min … max): 2.543 μs … 6.771 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 3.343 μs ┊ GC (median): 0.00%
Time (mean ± σ): 3.195 μs ± 329.152 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▂ ▁▃▆▇█▆▄▂
▁▂▂███▅▃▄▄▄▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▂▂▂▂▂▂▂▃▃▄▅████████▇▆▄▃▂▂▂▁▁▁▁ ▃
2.54 μs Histogram: frequency by time 3.65 μs <
Memory estimate: 24 bytes, allocs estimate: 0.
And finally, on my M1:
julia> @benchmark batch()
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
Range (min … max): 1.796 μs … 6.125 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 2.604 μs ┊ GC (median): 0.00%
Time (mean ± σ): 2.610 μs ± 70.689 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▂▃█▃ ▁
▃▁▁▁▁▁▁▃▁▁▁▁▃▁▃▁▁▃▁▁▁▁▃▁▁▁▁▁▁▄▁▁▁▁▁▄▆▁▁▁▆████▄▆▄▃▃▆▃▁▁▃▃▁█ █
1.8 μs Histogram: log(frequency) by time 2.87 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark batch_sleep()
BenchmarkTools.Trial: 10000 samples with 9 evaluations.
Range (min … max): 2.218 μs … 8.690 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 2.597 μs ┊ GC (median): 0.00%
Time (mean ± σ): 2.606 μs ± 93.115 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
▁▂█▂ ▁ ▁
▃▁▁▃▁▃▃▄▁▁▃▃▃▃▁▃▁▄▄█▅▄▁▅▃▅▅▄▄▆████▆▅▄▃▅▇███▇▆█▄▄▅▄▄▅█▃▄▁▁▄ █
2.22 μs Histogram: log(frequency) by time 2.89 μs <
Memory estimate: 28 bytes, allocs estimate: 0.
The M1 is much slower than the x86 CPUs here. I don't know if it's a problem with how ThreadingUtilities works on the M1, but I have known for a while that threading has substantially higher overhead on it than on my x86 CPUs. So perhaps I should make it sleep far more quickly: sleeping and not sleeping both took about 2.6 microseconds median, so there seems to be little benefit to staying awake. That is unlike the Intel and AMD CPUs, which can shave off a good chunk of overhead if the pause between repeated threaded regions is brief.
And how can ThreadingUtilities' default behaviour be changed without modifying its code?
It cannot currently.
It's interesting that some evaluations take over a second longer in my initial example, even though the sleep timeout is just a millisecond. It seems like there is something preventing the threads from going to sleep, right?
Has anyone posted this to JuliaLang/julia yet, since it affects Threads.@threads as well?
Unfortunately, there still doesn't seem to be a good solution after over a year.
I tried integrating the sleep_all_tasks workaround into our code, but I wasn't really successful. The only way to really get rid of the lagging was to define a macro that calls ThreadingUtilities.sleep_all_tasks() after EVERY @batch loop. But calling it this often slows down the @batch loops themselves.
using Polyester
using ThreadingUtilities

function with_batch()
    # Just some loop with @batch with basically no runtime
    @batch for i in 1:2
        nothing
    end

    # This is just to make sure that the allocation in the next loop is not optimized away
    v = [[]]

    # Note that there is no @batch here
    for i in 1:1000
        # Just an allocation
        v[1] = []
    end
end

function with_batch_sleep()
    @batch for i in 1:2
        nothing
    end

    ThreadingUtilities.sleep_all_tasks()

    v = [[]]
    for i in 1:1000
        v[1] = []
    end
end

function batch_without_allocations()
    @batch for i in 1:1000
        i^3
    end
end

function batch_sleep_without_allocations()
    @batch for i in 1:1000
        i^3
    end
    ThreadingUtilities.sleep_all_tasks()
end
julia> @benchmark with_batch()
BenchmarkTools.Trial: 3768 samples with 1 evaluation.
Range (min … max): 15.250 μs … 1.401 s ┊ GC (min … max): 0.00% … 0.25%
Time (median): 16.667 μs ┊ GC (median): 0.00%
Time (mean ± σ): 1.501 ms ± 45.537 ms ┊ GC (mean ± σ): 0.25% ± 0.01%
▃▅▆█▇▅▄▂ ▁
▃▄▂▄▆▄▆███████████▇▇▇▆▅▆▇▇▇▆▅▄▅▆▅▅▇▇▆▆▆▆▄▆▂▅▅▅▅▄▄▆▄▅▄▅▄▄▄▄▂ █
15.2 μs Histogram: log(frequency) by time 21.8 μs <
Memory estimate: 46.98 KiB, allocs estimate: 1002.
julia> @benchmark with_batch_sleep()
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 15.375 μs … 3.893 ms ┊ GC (min … max): 0.00% … 97.67%
Time (median): 18.000 μs ┊ GC (median): 0.00%
Time (mean ± σ): 21.760 μs ± 108.743 μs ┊ GC (mean ± σ): 15.47% ± 3.08%
▁▂▁ ▁▁▁▃▅▇█▆▄▂▂▁▁▂▂▂▁ ▁▁ ▁ ▁▁ ▁ ▂
▄▁▃▅▆▆█████████████████████████████████▇█▇█▇▇▇▇▇▆▆▆▆▆▆▅▆▆▆▃▃ █
15.4 μs Histogram: log(frequency) by time 24.4 μs <
Memory estimate: 47.02 KiB, allocs estimate: 1003.
julia> @benchmark batch_without_allocations() evals=100
BenchmarkTools.Trial: 10000 samples with 100 evaluations.
Range (min … max): 1.406 μs … 255.950 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 2.660 μs ┊ GC (median): 0.00%
Time (mean ± σ): 2.743 μs ± 2.831 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
▁▁▁▂▃▄█▆▅▅▃▂▁▁ ▂
▅▅▄▆▇▇▇▆▇▇▆▆▆▇▆▇▆▇▇▆▇▇█▇▇█████████████████▇▇▆▅▅▄▃▅▄▄▁▅▄▄▄▃▄ █
1.41 μs Histogram: log(frequency) by time 3.61 μs <
Memory estimate: 0 bytes, allocs estimate: 0.
julia> @benchmark batch_sleep_without_allocations() evals=100
BenchmarkTools.Trial: 9092 samples with 100 evaluations.
Range (min … max): 4.000 μs … 65.567 μs ┊ GC (min … max): 0.00% … 0.00%
Time (median): 5.366 μs ┊ GC (median): 0.00%
Time (mean ± σ): 5.492 μs ± 1.186 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
█
▂▅▇▆▆▆▅▄▃▄▄▃▄▄▄▅▄▅▅▅▄▃▂▂▂▃▆▅▃▃▂▂▂▂▂▁▂▂▁▂▂▂▂▂▃▄▄▄▇█▆▅▄▃▂▂▂▂ ▃
4 μs Histogram: frequency by time 7.13 μs <
Memory estimate: 159 bytes, allocs estimate: 4.
While sleep_all_tasks removes the ~1 s runs that destroy the mean and cause lagging in the simulations, it also slows down other threaded loops significantly.
Is there any better way by now?
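For reference, the wrapper macro I mean looks roughly like this. It is my own sketch, not something Polyester provides:

```julia
using Polyester, ThreadingUtilities

# Run a loop with @batch, then immediately put the worker tasks back to
# sleep so they don't spin-wait while the serial code that follows runs.
macro batch_sleep(loop)
    esc(quote
        Polyester.@batch $loop
        ThreadingUtilities.sleep_all_tasks()
    end)
end

# Usage:
# @batch_sleep for i in eachindex(x)
#     x[i] = f(i)
# end
```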
It seems that this is fixed in 1.10?
julia> versioninfo()
Julia Version 1.10.0
Commit 3120989f39b (2023-12-25 18:01 UTC)
Build Info:
Official https://julialang.org/ release
Platform Info:
OS: macOS (arm64-apple-darwin22.4.0)
CPU: 10 × Apple M2 Pro
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-15.0.7 (ORCJIT, apple-m1)
Threads: 8 on 6 virtual cores
julia> @benchmark with_batch()
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 13.541 μs … 11.404 ms ┊ GC (min … max): 0.00% … 4.58%
Time (median): 14.000 μs ┊ GC (median): 0.00%
Time (mean ± σ): 25.751 μs ± 353.858 μs ┊ GC (mean ± σ): 1.78% ± 0.13%
▃▆█▆▆▅▂ ▁ ▁▁ ▁ ▁
█████████▇▆▇▇███████▇▇████████▇▆▆▅▆▄▆▅▅▅▄▅▄▅▄▅▅▅▅▃▄▅▅▅▄▄▆▆▅▃ █
13.5 μs Histogram: log(frequency) by time 23.4 μs <
Memory estimate: 46.98 KiB, allocs estimate: 1002.
julia> @benchmark without_batch()
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 12.875 μs … 682.167 μs ┊ GC (min … max): 0.00% … 95.69%
Time (median): 13.333 μs ┊ GC (median): 0.00%
Time (mean ± σ): 14.370 μs ± 20.023 μs ┊ GC (mean ± σ): 4.31% ± 3.04%
▃▅█▇▆▄▂ ▁
█████████▇▅▆▆▆▇█▇▇▇██████▇█▇▇▇▇▇▆▅▆▆▅▅▂▅▄▅▆▄▅▅▄▄▄▅▅▅▄▅▅▅▄▄▂▄ █
12.9 μs Histogram: log(frequency) by time 21.5 μs <
Memory estimate: 46.98 KiB, allocs estimate: 1002.
julia> versioninfo()
Julia Version 1.9.4
Commit 8e5136fa297 (2023-11-14 08:46 UTC)
Build Info:
Official https://julialang.org/ release
Platform Info:
OS: macOS (arm64-apple-darwin22.4.0)
CPU: 10 × Apple M2 Pro
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-14.0.6 (ORCJIT, apple-m1)
Threads: 6 on 6 virtual cores
julia> @benchmark with_batch()
BenchmarkTools.Trial: 3768 samples with 1 evaluation.
Range (min … max): 14.125 μs … 1.404 s ┊ GC (min … max): 0.00% … 0.05%
Time (median): 16.708 μs ┊ GC (median): 0.00%
Time (mean ± σ): 1.474 ms ± 44.685 ms ┊ GC (mean ± σ): 0.06% ± 0.00%
▄█▅▁▁▃▆█▂ ▁▁▂ ▁▂▂▁▁ ▂▁▁▁▃▃▄▃▃▁ ▁
▄▁▁▃█████████▅▆███████████████████████▇▆▄▅▅▆▃▆▅▅▆▇▆▅▅▅▇▆▆▅▆ █
14.1 μs Histogram: log(frequency) by time 28 μs <
Memory estimate: 46.98 KiB, allocs estimate: 1002.
julia> @benchmark without_batch()
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 13.458 μs … 738.375 μs ┊ GC (min … max): 0.00% … 96.14%
Time (median): 13.750 μs ┊ GC (median): 0.00%
Time (mean ± σ): 14.783 μs ± 19.694 μs ┊ GC (mean ± σ): 4.10% ± 3.02%
▅██▄▂▁▁ ▃▃▃▁ ▂
████████▆██▇▇▆██████▆▇▆▅▅██▇███▇▆█▆▆▅▅▆▅▆▃▄▅▅▄▄▅▄▅▆▄▄▅▅▄▁▄▃▄ █
13.5 μs Histogram: log(frequency) by time 22 μs <
Memory estimate: 46.98 KiB, allocs estimate: 1002.
I ran the code in the very first post.