Trixi.jl
`AMRCallback` non-deterministically slow with multiple threads on macOS ARM
```julia
julia> trixi_include("examples/tree_2d_dgsem/elixir_euler_blob_amr.jl", tspan=(0.0, 0.1))
[...]

julia> u_ode = copy(sol[end]);

julia> @benchmark $(amr_callback.affect!)($u_ode, $semi, $0.0, $100)
BenchmarkTools.Trial: 950 samples with 1 evaluation.
 Range (min … max):  246.334 μs … 1.404 s  ┊ GC (min … max): 0.00% … 0.74%
 Time  (median):     279.062 μs            ┊ GC (median):    0.00%
 Time  (mean ± σ):     6.163 ms ± 90.551 ms ┊ GC (mean ± σ): 0.58% ± 0.04%

  ▁▁ ▅▁ ▁▂▅▆█▇▄▃
  ▄██████▇▆▄▄▄▃▃▃▄▇████████▆▆▆▆█▆▅▅▄▄▄▄▄▃▃▂▃▃▃▂▂▂▂▂▂▁▁▂▁▂▁▁▁▁▂ ▄
  246 μs          Histogram: frequency by time          342 μs <

 Memory estimate: 640.44 KiB, allocs estimate: 117.
```
About 4 out of 100 executions are extremely slow (over 1 s vs. roughly 250 μs). This matches what I see in the simulation: it runs for about 200 time steps, then freezes for a second or so, which causes AMR to take over 80% of the simulation time.
One thread:
```
trixi_include("examples/tree_2d_dgsem/elixir_euler_blob_amr.jl", tspan=(0.0, 0.5))
[...]
────────────────────────────────────────────────────────────────────────────────────────────────────
Trixi.jl simulation finished.  Final time: 0.5  Time steps: 644 (accepted), 644 (total)
────────────────────────────────────────────────────────────────────────────────────────────────────

──────────────────────────────────────────────────────────────────────────────────────
            Trixi.jl                          Time                    Allocations
                                     ───────────────────────   ────────────────────────
      Tot / % measured:                   3.35s /  97.1%            512MiB /  99.9%

Section                      ncalls     time    %tot     avg     alloc    %tot      avg
──────────────────────────────────────────────────────────────────────────────────────
rhs!                          3.22k    2.80s   86.1%   870μs   11.5KiB    0.0%    3.67B
  volume integral             3.22k    2.58s   79.2%   800μs   2.20KiB    0.0%    0.70B
    blended DG-FV             3.22k    2.45s   75.2%   760μs     0.00B    0.0%    0.00B
    blending factors          3.22k    123ms    3.8%  38.0μs     0.00B    0.0%    0.00B
    ~volume integral~         3.22k   6.49ms    0.2%  2.01μs   2.20KiB    0.0%    0.70B
    pure DG                   3.22k   40.2μs    0.0%  12.5ns     0.00B    0.0%    0.00B
  interface flux              3.22k   97.4ms    3.0%  30.2μs     0.00B    0.0%    0.00B
  surface integral            3.22k   34.9ms    1.1%  10.8μs     0.00B    0.0%    0.00B
  mortar flux                 3.22k   27.8ms    0.9%  8.64μs     0.00B    0.0%    0.00B
  prolong2interfaces          3.22k   27.4ms    0.8%  8.51μs     0.00B    0.0%    0.00B
  Jacobian                    3.22k   17.4ms    0.5%  5.41μs     0.00B    0.0%    0.00B
  prolong2mortars             3.22k   12.1ms    0.4%  3.76μs     0.00B    0.0%    0.00B
  reset ∂u/∂t                 3.22k   5.41ms    0.2%  1.68μs     0.00B    0.0%    0.00B
  ~rhs!~                      3.22k   2.56ms    0.1%   795ns   9.33KiB    0.0%    2.97B
  prolong2boundaries          3.22k   81.9μs    0.0%  25.4ns     0.00B    0.0%    0.00B
  source terms                3.22k   42.5μs    0.0%  13.2ns     0.00B    0.0%    0.00B
  boundary flux               3.22k   32.9μs    0.0%  10.2ns     0.00B    0.0%    0.00B
AMR                             643    310ms    9.5%   483μs    451MiB   88.2%   719KiB
  coarsen                       643    157ms    4.8%   244μs    236MiB   46.1%   375KiB
    solver                      643   96.1ms    3.0%   150μs    216MiB   42.2%   343KiB
    mesh                        643   53.3ms    1.6%  82.9μs   3.08MiB    0.6%  4.91KiB
    ~coarsen~                   643   7.14ms    0.2%  11.1μs   17.1MiB    3.3%  27.2KiB
  refine                        643    103ms    3.2%   161μs    207MiB   40.4%   329KiB
    solver                      639   62.7ms    1.9%  98.1μs    202MiB   39.5%   324KiB
    mesh                        639   40.4ms    1.2%  63.2μs   4.61MiB    0.9%  7.39KiB
      refine_unbalanced!        639   30.7ms    0.9%  48.1μs   68.6KiB    0.0%     110B
      rebalance!                924   8.26ms    0.3%  8.94μs   1.25MiB    0.2%  1.39KiB
      ~mesh~                    639   1.40ms    0.0%  2.19μs   3.29MiB    0.6%  5.27KiB
    ~refine~                    643    422μs    0.0%   656ns    148KiB    0.0%     236B
  indicator                     643   47.6ms    1.5%  74.0μs   4.08MiB    0.8%  6.50KiB
  ~AMR~                         643   2.72ms    0.1%  4.23μs   4.73MiB    0.9%  7.53KiB
analyze solution                  8   97.1ms    3.0%  12.1ms   43.2MiB    8.4%  5.40MiB
I/O                               9   24.1ms    0.7%  2.68ms   16.7MiB    3.3%  1.85MiB
  save solution                   8   13.7ms    0.4%  1.71ms   16.0MiB    3.1%  2.00MiB
  save mesh                       8   7.88ms    0.2%   985μs    289KiB    0.1%  36.2KiB
  get element variables           8   1.38ms    0.0%   173μs    118KiB    0.0%  14.8KiB
  ~I/O~                           9   1.14ms    0.0%   127μs    306KiB    0.1%  34.0KiB
calculate dt                    645   19.9ms    0.6%  30.9μs     0.00B    0.0%    0.00B
initial condition AMR             1    863μs    0.0%   863μs    264KiB    0.1%   264KiB
  AMR                             1    860μs    0.0%   860μs    263KiB    0.1%   263KiB
    indicator                     1    819μs    0.0%   819μs   64.1KiB    0.0%  64.1KiB
    ~AMR~                         1   41.0μs    0.0%  41.0μs    199KiB    0.0%   199KiB
    refine                        1    209ns    0.0%   209ns     64.0B    0.0%    64.0B
    coarsen                       1    125ns    0.0%   125ns     64.0B    0.0%    64.0B
  ~initial condition AMR~         1   2.54μs    0.0%  2.54μs      752B    0.0%     752B
──────────────────────────────────────────────────────────────────────────────────────
```
Multiple threads:
```
julia> trixi_include("examples/tree_2d_dgsem/elixir_euler_blob_amr.jl", tspan=(0.0, 0.5))
[...]
────────────────────────────────────────────────────────────────────────────────────────────────────
Trixi.jl simulation finished.  Final time: 0.5  Time steps: 644 (accepted), 644 (total)
────────────────────────────────────────────────────────────────────────────────────────────────────

──────────────────────────────────────────────────────────────────────────────────────
            Trixi.jl                          Time                    Allocations
                                     ───────────────────────   ────────────────────────
      Tot / % measured:                   5.89s /  93.1%            524MiB /  99.9%

Section                      ncalls     time    %tot     avg     alloc    %tot      avg
──────────────────────────────────────────────────────────────────────────────────────
AMR                             643    4.44s   81.0%  6.91ms    452MiB   86.4%   720KiB
  coarsen                       643    2.89s   52.7%  4.49ms    236MiB   45.1%   375KiB
    solver                      643    1.45s   26.5%  2.26ms    216MiB   41.2%   343KiB
    mesh                        643    1.43s   26.1%  2.23ms   3.09MiB    0.6%  4.91KiB
    ~coarsen~                   643   4.64ms    0.1%  7.22μs   17.1MiB    3.3%  27.2KiB
  refine                        643    1.52s   27.8%  2.37ms    207MiB   39.5%   329KiB
    solver                      639    1.48s   26.9%  2.31ms    202MiB   38.6%   324KiB
    mesh                        639   46.4ms    0.8%  72.6μs   4.61MiB    0.9%  7.39KiB
      refine_unbalanced!        639   35.3ms    0.6%  55.3μs   68.6KiB    0.0%     110B
      rebalance!                924   8.94ms    0.2%  9.68μs   1.25MiB    0.2%  1.39KiB
      ~mesh~                    639   2.12ms    0.0%  3.31μs   3.29MiB    0.6%  5.27KiB
    ~refine~                    643    500μs    0.0%   778ns    148KiB    0.0%     236B
  indicator                     643   27.8ms    0.5%  43.2μs   4.85MiB    0.9%  7.72KiB
  ~AMR~                         643   3.43ms    0.1%  5.33μs   4.73MiB    0.9%  7.53KiB
rhs!                          3.22k    925ms   16.9%   287μs   11.0MiB    2.1%  3.48KiB
  volume integral             3.22k    703ms   12.8%   218μs   3.05MiB    0.6%     993B
    blended DG-FV             3.22k    634ms   11.6%   197μs   1.52MiB    0.3%     496B
    blending factors          3.22k   61.1ms    1.1%  19.0μs   1.52MiB    0.3%     496B
    ~volume integral~         3.22k   7.40ms    0.1%  2.30μs   2.20KiB    0.0%    0.70B
    pure DG                   3.22k   57.1μs    0.0%  17.7ns     0.00B    0.0%    0.00B
  reset ∂u/∂t                 3.22k    101ms    1.8%  31.4μs     0.00B    0.0%    0.00B
  interface flux              3.22k   37.0ms    0.7%  11.5μs   1.38MiB    0.3%     448B
  prolong2interfaces          3.22k   22.9ms    0.4%  7.12μs   1.03MiB    0.2%     336B
  mortar flux                 3.22k   18.5ms    0.3%  5.74μs   1.76MiB    0.3%     574B
  surface integral            3.22k   16.1ms    0.3%  4.99μs   1.13MiB    0.2%     368B
  prolong2mortars             3.22k   13.7ms    0.2%  4.25μs   1.37MiB    0.3%     446B
  Jacobian                    3.22k   10.0ms    0.2%  3.09μs   1.23MiB    0.2%     400B
  ~rhs!~                      3.22k   2.63ms    0.0%   817ns   9.33KiB    0.0%    2.97B
  prolong2boundaries          3.22k   74.1μs    0.0%  23.0ns     0.00B    0.0%    0.00B
  boundary flux               3.22k   27.7μs    0.0%  8.59ns     0.00B    0.0%    0.00B
  source terms                3.22k   26.6μs    0.0%  8.27ns     0.00B    0.0%    0.00B
analyze solution                  8   65.9ms    1.2%  8.23ms   43.2MiB    8.3%  5.40MiB
I/O                               9   28.6ms    0.5%  3.18ms   16.7MiB    3.2%  1.86MiB
  save solution                   8   16.2ms    0.3%  2.02ms   16.0MiB    3.1%  2.00MiB
  save mesh                       8   10.0ms    0.2%  1.25ms    289KiB    0.1%  36.1KiB
  ~I/O~                           9   1.73ms    0.0%   192μs    306KiB    0.1%  34.0KiB
  get element variables           8    725μs    0.0%  90.6μs    126KiB    0.0%  15.8KiB
calculate dt                    645   21.4ms    0.4%  33.1μs     0.00B    0.0%    0.00B
initial condition AMR             1    607μs    0.0%   607μs    265KiB    0.0%   265KiB
  AMR                             1    599μs    0.0%   599μs    264KiB    0.0%   264KiB
    indicator                     1    503μs    0.0%   503μs   65.3KiB    0.0%  65.3KiB
    ~AMR~                         1   95.5μs    0.0%  95.5μs    199KiB    0.0%   199KiB
    refine                        1    167ns    0.0%   167ns     64.0B    0.0%    64.0B
    coarsen                       1   41.0ns    0.0%  41.0ns     64.0B    0.0%    64.0B
  ~initial condition AMR~         1   7.79μs    0.0%  7.79μs      752B    0.0%     752B
──────────────────────────────────────────────────────────────────────────────────────
```
This is most likely caused by https://github.com/JuliaSIMD/Polyester.jl/issues/89: some non-allocating `@batch` loop causes allocating, non-threaded loops elsewhere to freeze for some reason.
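For illustration, this is a minimal sketch of that pattern (not Trixi.jl code; it assumes Polyester.jl and ThreadingUtilities.jl are installed, and the function names `threaded_kernel!` and `allocating_work` are made up):

```julia
using Polyester: @batch
using ThreadingUtilities

# Non-allocating threaded loop. After it returns, Polyester's worker tasks
# remain busy-waiting for the next batch of work.
function threaded_kernel!(du, u)
    @batch for i in eachindex(du)
        du[i] = 2 * u[i]
    end
    return nothing
end

# Completely unrelated serial loop that allocates. With the worker tasks
# still spinning, GC pauses in here can intermittently stall for a second.
function allocating_work(n)
    acc = Vector{Float64}[]
    for _ in 1:n
        push!(acc, rand(64))
    end
    return acc
end

u = rand(1024); du = similar(u)
threaded_kernel!(du, u)
allocating_work(10_000)  # this serial loop is the one that stalls
```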
Note that I had to use #1462 for these benchmarks, since the indicators don't work on macOS ARM otherwise.
A workaround would be to identify the slow non-threaded loops and add `ThreadingUtilities.sleep_all_tasks()` before them.
A better option would obviously be to find an upstream fix for https://github.com/JuliaSIMD/Polyester.jl/issues/89.
Do you know why it only seems to affect the `AMRCallback`?
I haven't tested much; I just noticed this while making #1462, so other parts of the code may be affected as well. But the AMR callback is particularly prone to this bug because that's where the allocations happen. As I said, using `@batch` somewhere can cause completely unrelated allocating loops to become slow.
So the best workaround seems to be something like
```julia
ThreadingUtilities.sleep_all_tasks()
# serial, allocating stuff
GC.gc()
```
I found that this affects pretty much all elixirs. Even the default example lags, hanging for about a second every few thousand time steps. Interestingly, I can't see this when I benchmark a single call to `rhs!`, only when each sample runs it multiple times:
```julia
julia> @benchmark Trixi.rhs!($du_ode, $u_ode, $semi, $0.0)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  16.708 μs … 80.750 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     19.375 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   19.807 μs ±  1.733 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

      ▃  █▇ ▁▁▃▅  ▂    ▂       ▁        ▂
  ▅▁▄▃▃▃▁▇█▁▇█▇▆▃▅██▅▆▇█████▄▁▅▄▅▄██▆▃▃▁▁▁▅██▃▁▄▃▁▃▅█▆▁▁▁▁▁▃▆ █
  16.7 μs       Histogram: log(frequency) by time       26 μs <

 Memory estimate: 1.41 KiB, allocs estimate: 5.
```
```julia
julia> @benchmark for i in 1:10 Trixi.rhs!($du_ode, $u_ode, $semi, $0.0) end
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  154.917 μs …   1.406 s  ┊ GC (min … max): 0.00% … 0.22%
 Time  (median):     202.605 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   345.450 μs ± 14.058 ms  ┊ GC (mean ± σ):  0.09% ± 0.00%

                                          █▆▂
  ▂▁▁▁▁▁▂▁▂▁▁▁▁▂▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▂▂▂▃▇███▄▆▆▅▅▄▄▃▃▂▃▃▂▂▂▂▂▂▂ ▃
  155 μs          Histogram: frequency by time          226 μs <

 Memory estimate: 14.06 KiB, allocs estimate: 50.
```
Since https://github.com/JuliaSIMD/Polyester.jl/issues/89 still doesn't have any solution other than `sleep_all_tasks()`, I played around with this a bit more. I changed the `@threaded` macro to
```julia
macro threaded(expr)
    return esc(quote
        Trixi.@batch $(expr)
        ThreadingUtilities.sleep_all_tasks()
    end)
end
```
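As a quick sanity check (not from the original report, and assuming a session where Trixi.jl, Polyester.jl, and ThreadingUtilities.jl are loaded so the macro above compiles), the modified macro can be exercised on a toy loop:

```julia
a = zeros(100)
@threaded for i in eachindex(a)
    a[i] = sqrt(i)
end
# The loop body ran via Polyester, and the worker tasks are now parked,
# so subsequent allocating serial code should not stall.
@assert a[100] == 10.0
```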
This removed all lagging and took my example from 15 s to 10 s. However, it makes each threaded loop take longer, because the threads have to wake up again first, resulting in a 2x slowdown of `rhs!`:
```julia
julia> @benchmark Trixi.rhs!($du_ode, $u_ode, $semi, $0.0)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  32.875 μs … 191.417 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     38.083 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   38.593 μs ±   3.385 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

   ▂  ▁▆ ▃█ ▁▆  ▄   ▃    ▂    ▁    ▁                           ▂
  ▃▄▄▁▆▃▄▅▆█▅▄▅██▆▅▆██▆▄▄██▅▅▄▇█▃▁▃▅█▄▁▁▅█▅▃▁▄█▆▃▁▁█▆▁▁▃█▄▁▁▄█ █
  32.9 μs       Histogram: log(frequency) by time       48.8 μs <

 Memory estimate: 2.31 KiB, allocs estimate: 34.
```
I guess that without this bug, the simulation would go down to about 5-6 s, so the bug makes essentially everything about 2x slower on macOS ARM.