
`AMRCallback` non-deterministically slow with multiple threads on macOS ARM


julia> trixi_include("examples/tree_2d_dgsem/elixir_euler_blob_amr.jl", tspan=(0.0, 0.1))
[...]

julia> u_ode = copy(sol[end]);

julia> @benchmark $(amr_callback.affect!)($u_ode, $semi, $0.0, $100)
BenchmarkTools.Trial: 950 samples with 1 evaluation.
 Range (min … max):  246.334 μs …   1.404 s  ┊ GC (min … max): 0.00% … 0.74%
 Time  (median):     279.062 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):     6.163 ms ± 90.551 ms  ┊ GC (mean ± σ):  0.58% ± 0.04%

    ▁▁ ▅▁          ▁▂▅▆█▇▄▃                                     
  ▄██████▇▆▄▄▄▃▃▃▄▇████████▆▆▆▆█▆▅▅▄▄▄▄▄▃▃▂▃▃▃▂▂▂▂▂▂▁▁▂▁▂▁▁▁▁▂ ▄
  246 μs          Histogram: frequency by time          342 μs <

 Memory estimate: 640.44 KiB, allocs estimate: 117.

About 4 executions out of 100 are extremely slow (over 1 s vs. 250 µs). This is consistent with what I see in the simulation: it runs for about 200 time steps, then freezes for a second or so. That causes AMR to take over 80% of the simulation time.
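
For reference, one way to count these outliers directly from the trial object (a quick check, not from the original issue; BenchmarkTools stores the per-sample times in nanoseconds in the times field):

julia> trial = @benchmark $(amr_callback.affect!)($u_ode, $semi, $0.0, $100);

julia> count(>(1.0e9), trial.times)  # samples slower than 1 s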

One thread:
trixi_include("examples/tree_2d_dgsem/elixir_euler_blob_amr.jl", tspan=(0.0, 0.5))
[...]

────────────────────────────────────────────────────────────────────────────────────────────────────
Trixi.jl simulation finished.  Final time: 0.5  Time steps: 644 (accepted), 644 (total)
────────────────────────────────────────────────────────────────────────────────────────────────────

 ──────────────────────────────────────────────────────────────────────────────────────
               Trixi.jl                       Time                    Allocations      
                                     ───────────────────────   ────────────────────────
          Tot / % measured:               3.35s /  97.1%            512MiB /  99.9%    

 Section                     ncalls     time    %tot     avg     alloc    %tot      avg
 ──────────────────────────────────────────────────────────────────────────────────────
 rhs!                         3.22k    2.80s   86.1%   870μs   11.5KiB    0.0%    3.67B
   volume integral            3.22k    2.58s   79.2%   800μs   2.20KiB    0.0%    0.70B
     blended DG-FV            3.22k    2.45s   75.2%   760μs     0.00B    0.0%    0.00B
     blending factors         3.22k    123ms    3.8%  38.0μs     0.00B    0.0%    0.00B
     ~volume integral~        3.22k   6.49ms    0.2%  2.01μs   2.20KiB    0.0%    0.70B
     pure DG                  3.22k   40.2μs    0.0%  12.5ns     0.00B    0.0%    0.00B
   interface flux             3.22k   97.4ms    3.0%  30.2μs     0.00B    0.0%    0.00B
   surface integral           3.22k   34.9ms    1.1%  10.8μs     0.00B    0.0%    0.00B
   mortar flux                3.22k   27.8ms    0.9%  8.64μs     0.00B    0.0%    0.00B
   prolong2interfaces         3.22k   27.4ms    0.8%  8.51μs     0.00B    0.0%    0.00B
   Jacobian                   3.22k   17.4ms    0.5%  5.41μs     0.00B    0.0%    0.00B
   prolong2mortars            3.22k   12.1ms    0.4%  3.76μs     0.00B    0.0%    0.00B
   reset ∂u/∂t                3.22k   5.41ms    0.2%  1.68μs     0.00B    0.0%    0.00B
   ~rhs!~                     3.22k   2.56ms    0.1%   795ns   9.33KiB    0.0%    2.97B
   prolong2boundaries         3.22k   81.9μs    0.0%  25.4ns     0.00B    0.0%    0.00B
   source terms               3.22k   42.5μs    0.0%  13.2ns     0.00B    0.0%    0.00B
   boundary flux              3.22k   32.9μs    0.0%  10.2ns     0.00B    0.0%    0.00B
 AMR                            643    310ms    9.5%   483μs    451MiB   88.2%   719KiB
   coarsen                      643    157ms    4.8%   244μs    236MiB   46.1%   375KiB
     solver                     643   96.1ms    3.0%   150μs    216MiB   42.2%   343KiB
     mesh                       643   53.3ms    1.6%  82.9μs   3.08MiB    0.6%  4.91KiB
     ~coarsen~                  643   7.14ms    0.2%  11.1μs   17.1MiB    3.3%  27.2KiB
   refine                       643    103ms    3.2%   161μs    207MiB   40.4%   329KiB
     solver                     639   62.7ms    1.9%  98.1μs    202MiB   39.5%   324KiB
     mesh                       639   40.4ms    1.2%  63.2μs   4.61MiB    0.9%  7.39KiB
       refine_unbalanced!       639   30.7ms    0.9%  48.1μs   68.6KiB    0.0%     110B
       rebalance!               924   8.26ms    0.3%  8.94μs   1.25MiB    0.2%  1.39KiB
       ~mesh~                   639   1.40ms    0.0%  2.19μs   3.29MiB    0.6%  5.27KiB
     ~refine~                   643    422μs    0.0%   656ns    148KiB    0.0%     236B
   indicator                    643   47.6ms    1.5%  74.0μs   4.08MiB    0.8%  6.50KiB
   ~AMR~                        643   2.72ms    0.1%  4.23μs   4.73MiB    0.9%  7.53KiB
 analyze solution                 8   97.1ms    3.0%  12.1ms   43.2MiB    8.4%  5.40MiB
 I/O                              9   24.1ms    0.7%  2.68ms   16.7MiB    3.3%  1.85MiB
   save solution                  8   13.7ms    0.4%  1.71ms   16.0MiB    3.1%  2.00MiB
   save mesh                      8   7.88ms    0.2%   985μs    289KiB    0.1%  36.2KiB
   get element variables          8   1.38ms    0.0%   173μs    118KiB    0.0%  14.8KiB
   ~I/O~                          9   1.14ms    0.0%   127μs    306KiB    0.1%  34.0KiB
 calculate dt                   645   19.9ms    0.6%  30.9μs     0.00B    0.0%    0.00B
 initial condition AMR            1    863μs    0.0%   863μs    264KiB    0.1%   264KiB
   AMR                            1    860μs    0.0%   860μs    263KiB    0.1%   263KiB
     indicator                    1    819μs    0.0%   819μs   64.1KiB    0.0%  64.1KiB
     ~AMR~                        1   41.0μs    0.0%  41.0μs    199KiB    0.0%   199KiB
     refine                       1    209ns    0.0%   209ns     64.0B    0.0%    64.0B
     coarsen                      1    125ns    0.0%   125ns     64.0B    0.0%    64.0B
   ~initial condition AMR~        1   2.54μs    0.0%  2.54μs      752B    0.0%     752B
 ──────────────────────────────────────────────────────────────────────────────────────

Multiple threads:
julia> trixi_include("examples/tree_2d_dgsem/elixir_euler_blob_amr.jl", tspan=(0.0, 0.5))
[...]

────────────────────────────────────────────────────────────────────────────────────────────────────
Trixi.jl simulation finished.  Final time: 0.5  Time steps: 644 (accepted), 644 (total)
────────────────────────────────────────────────────────────────────────────────────────────────────

 ──────────────────────────────────────────────────────────────────────────────────────
               Trixi.jl                       Time                    Allocations      
                                     ───────────────────────   ────────────────────────
          Tot / % measured:               5.89s /  93.1%            524MiB /  99.9%    

 Section                     ncalls     time    %tot     avg     alloc    %tot      avg
 ──────────────────────────────────────────────────────────────────────────────────────
 AMR                            643    4.44s   81.0%  6.91ms    452MiB   86.4%   720KiB
   coarsen                      643    2.89s   52.7%  4.49ms    236MiB   45.1%   375KiB
     solver                     643    1.45s   26.5%  2.26ms    216MiB   41.2%   343KiB
     mesh                       643    1.43s   26.1%  2.23ms   3.09MiB    0.6%  4.91KiB
     ~coarsen~                  643   4.64ms    0.1%  7.22μs   17.1MiB    3.3%  27.2KiB
   refine                       643    1.52s   27.8%  2.37ms    207MiB   39.5%   329KiB
     solver                     639    1.48s   26.9%  2.31ms    202MiB   38.6%   324KiB
     mesh                       639   46.4ms    0.8%  72.6μs   4.61MiB    0.9%  7.39KiB
       refine_unbalanced!       639   35.3ms    0.6%  55.3μs   68.6KiB    0.0%     110B
       rebalance!               924   8.94ms    0.2%  9.68μs   1.25MiB    0.2%  1.39KiB
       ~mesh~                   639   2.12ms    0.0%  3.31μs   3.29MiB    0.6%  5.27KiB
     ~refine~                   643    500μs    0.0%   778ns    148KiB    0.0%     236B
   indicator                    643   27.8ms    0.5%  43.2μs   4.85MiB    0.9%  7.72KiB
   ~AMR~                        643   3.43ms    0.1%  5.33μs   4.73MiB    0.9%  7.53KiB
 rhs!                         3.22k    925ms   16.9%   287μs   11.0MiB    2.1%  3.48KiB
   volume integral            3.22k    703ms   12.8%   218μs   3.05MiB    0.6%     993B
     blended DG-FV            3.22k    634ms   11.6%   197μs   1.52MiB    0.3%     496B
     blending factors         3.22k   61.1ms    1.1%  19.0μs   1.52MiB    0.3%     496B
     ~volume integral~        3.22k   7.40ms    0.1%  2.30μs   2.20KiB    0.0%    0.70B
     pure DG                  3.22k   57.1μs    0.0%  17.7ns     0.00B    0.0%    0.00B
   reset ∂u/∂t                3.22k    101ms    1.8%  31.4μs     0.00B    0.0%    0.00B
   interface flux             3.22k   37.0ms    0.7%  11.5μs   1.38MiB    0.3%     448B
   prolong2interfaces         3.22k   22.9ms    0.4%  7.12μs   1.03MiB    0.2%     336B
   mortar flux                3.22k   18.5ms    0.3%  5.74μs   1.76MiB    0.3%     574B
   surface integral           3.22k   16.1ms    0.3%  4.99μs   1.13MiB    0.2%     368B
   prolong2mortars            3.22k   13.7ms    0.2%  4.25μs   1.37MiB    0.3%     446B
   Jacobian                   3.22k   10.0ms    0.2%  3.09μs   1.23MiB    0.2%     400B
   ~rhs!~                     3.22k   2.63ms    0.0%   817ns   9.33KiB    0.0%    2.97B
   prolong2boundaries         3.22k   74.1μs    0.0%  23.0ns     0.00B    0.0%    0.00B
   boundary flux              3.22k   27.7μs    0.0%  8.59ns     0.00B    0.0%    0.00B
   source terms               3.22k   26.6μs    0.0%  8.27ns     0.00B    0.0%    0.00B
 analyze solution                 8   65.9ms    1.2%  8.23ms   43.2MiB    8.3%  5.40MiB
 I/O                              9   28.6ms    0.5%  3.18ms   16.7MiB    3.2%  1.86MiB
   save solution                  8   16.2ms    0.3%  2.02ms   16.0MiB    3.1%  2.00MiB
   save mesh                      8   10.0ms    0.2%  1.25ms    289KiB    0.1%  36.1KiB
   ~I/O~                          9   1.73ms    0.0%   192μs    306KiB    0.1%  34.0KiB
   get element variables          8    725μs    0.0%  90.6μs    126KiB    0.0%  15.8KiB
 calculate dt                   645   21.4ms    0.4%  33.1μs     0.00B    0.0%    0.00B
 initial condition AMR            1    607μs    0.0%   607μs    265KiB    0.0%   265KiB
   AMR                            1    599μs    0.0%   599μs    264KiB    0.0%   264KiB
     indicator                    1    503μs    0.0%   503μs   65.3KiB    0.0%  65.3KiB
     ~AMR~                        1   95.5μs    0.0%  95.5μs    199KiB    0.0%   199KiB
     refine                       1    167ns    0.0%   167ns     64.0B    0.0%    64.0B
     coarsen                      1   41.0ns    0.0%  41.0ns     64.0B    0.0%    64.0B
   ~initial condition AMR~        1   7.79μs    0.0%  7.79μs      752B    0.0%     752B
 ──────────────────────────────────────────────────────────────────────────────────────

This is most likely caused by https://github.com/JuliaSIMD/Polyester.jl/issues/89: a non-allocating @batch loop can, for some reason, cause subsequent allocating, non-threaded loops to freeze.
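
As an illustration, a minimal sketch of this pattern (hypothetical function names; it assumes the mechanism described in the upstream issue, where Polyester's worker tasks keep spin-waiting after a @batch loop returns):

using Polyester

function threaded_kernel!(y, x)
    @batch for i in eachindex(x)  # non-allocating threaded loop
        y[i] = 2 * x[i]
    end
end

function serial_allocating_loop(n)
    v = Vector{Float64}[]  # plain serial loop that allocates
    for _ in 1:n
        push!(v, rand(8))
    end
    return v
end

x = rand(2^16); y = similar(x)
threaded_kernel!(y, x)              # leaves the worker tasks spin-waiting
@time serial_allocating_loop(10^5)  # occasionally stalls on macOS ARM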

Note that I had to use #1462 for these benchmarks, since the indicators don't work on macOS ARM otherwise.

efaulhaber, May 13 '23 12:05

A workaround would be to identify the non-threaded loops that are slow and add

ThreadingUtilities.sleep_all_tasks()

before them.

A better option would obviously be to find an upstream fix for https://github.com/JuliaSIMD/Polyester.jl/issues/89.

efaulhaber, May 13 '23 14:05

Do you know why it only seems to affect the AMRCallback?

sloede, May 13 '23 16:05

I haven't tested much; I just noticed this while working on #1462. Maybe other parts of the code are affected as well.

But the AMR callback is particularly prone to this bug because that's where the allocations happen. As I said, using @batch somewhere can cause completely unrelated allocating loops to become slow.

efaulhaber, May 13 '23 16:05

So the best workaround seems to be something like

ThreadingUtilities.sleep_all_tasks()

# serial, allocating stuff

GC.gc()

ranocha, May 14 '23 05:05
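
A self-contained sketch of this workaround pattern (refine! and coarsen! are hypothetical stand-ins for the serial, allocating AMR work; sleep_all_tasks is the real ThreadingUtilities call):

using ThreadingUtilities

# hypothetical stand-ins for the serial, allocating work
refine!(data)  = append!(data, rand(1024))
coarsen!(data) = resize!(data, max(length(data) - 1024, 0))

function amr_step!(data)
    # put the spinning worker tasks to sleep so they cannot
    # interfere with the allocating serial code below
    ThreadingUtilities.sleep_all_tasks()

    refine!(data)   # serial, allocating stuff
    coarsen!(data)

    # collect the garbage produced above before the next threaded section
    GC.gc()
    return nothing
end

amr_step!(Float64[])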

I found that this affects pretty much all elixirs. Even the default example lags, hanging for about a second every few thousand time steps. Interestingly, I can't see this when I benchmark a single rhs! call; it only shows up when I run multiple calls per benchmark sample:

julia> @benchmark Trixi.rhs!($du_ode, $u_ode, $semi, $0.0)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  16.708 μs … 80.750 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     19.375 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   19.807 μs ±  1.733 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

          ▃       █▇    ▁▁▃▅       ▂        ▂       ▁         ▂
  ▅▁▄▃▃▃▁▇█▁▇█▇▆▃▅██▅▆▇█████▄▁▅▄▅▄██▆▃▃▁▁▁▅██▃▁▄▃▁▃▅█▆▁▁▁▁▁▃▆ █
  16.7 μs      Histogram: log(frequency) by time        26 μs <

 Memory estimate: 1.41 KiB, allocs estimate: 5.

julia> @benchmark for i in 1:10 Trixi.rhs!($du_ode, $u_ode, $semi, $0.0) end
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  154.917 μs …   1.406 s  ┊ GC (min … max): 0.00% … 0.22%
 Time  (median):     202.605 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   345.450 μs ± 14.058 ms  ┊ GC (mean ± σ):  0.09% ± 0.00%

                                        █▆▂                     
  ▂▁▁▁▁▁▂▁▂▁▁▁▁▂▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▂▂▂▃▇███▄▆▆▅▅▄▄▃▃▂▃▃▂▂▂▂▂▂▂ ▃
  155 μs          Histogram: frequency by time          226 μs <

 Memory estimate: 14.06 KiB, allocs estimate: 50.
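
One hedged way to locate the stalls is to time individual calls instead of aggregating (this assumes du_ode, u_ode, and semi from the session above):

julia> times = [@elapsed Trixi.rhs!(du_ode, u_ode, semi, 0.0) for _ in 1:10_000];

julia> count(>(0.1), times)  # calls that took longer than 0.1 s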

Since https://github.com/JuliaSIMD/Polyester.jl/issues/89 still has no solution other than sleep_all_tasks(), I played around with this a bit more. I changed the @threaded macro to

macro threaded(expr)
    return esc(quote
                   # run the loop through Polyester as before ...
                   Trixi.@batch $(expr)
                   # ... but put the worker tasks to sleep immediately afterwards
                   ThreadingUtilities.sleep_all_tasks()
               end)
end

This removed all lagging and made my example go from 15 s to 10 s, but it makes each threaded loop take longer because the threads have to wake up again first, resulting in a 2x slowdown of rhs!:

julia> @benchmark Trixi.rhs!($du_ode, $u_ode, $semi, $0.0)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  32.875 μs … 191.417 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     38.083 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   38.593 μs ±   3.385 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%

           ▂   ▁▆   ▃█   ▁▆    ▄    ▃    ▂    ▁    ▁           ▂
  ▃▄▄▁▆▃▄▅▆█▅▄▅██▆▅▆██▆▄▄██▅▅▄▇█▃▁▃▅█▄▁▁▅█▅▃▁▄█▆▃▁▁█▆▁▁▃█▄▁▁▄█ █
  32.9 μs       Histogram: log(frequency) by time      48.8 μs <

 Memory estimate: 2.31 KiB, allocs estimate: 34.
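
A minimal standalone sketch to measure just this wake-up overhead (hypothetical kernel; assumes Polyester, ThreadingUtilities, and BenchmarkTools are available; timings are machine-dependent):

using BenchmarkTools, Polyester, ThreadingUtilities

function kernel!(y, x)
    @batch for i in eachindex(x)
        y[i] = 2 * x[i]
    end
end

x = rand(2^12); y = similar(x)

# worker tasks keep spinning between calls: low per-call latency
@btime kernel!($y, $x)

# put the tasks to sleep before each call: pays the wake-up cost
@btime (ThreadingUtilities.sleep_all_tasks(); kernel!($y, $x))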

I guess that without this bug, the simulation would go down to about 5-6 s, so this bug effectively makes everything about 2x slower on macOS ARM.

efaulhaber, Nov 09 '23 15:11