
MAC projection maxing out on ABL calculations

Open lawrenceccheung opened this issue 1 year ago • 14 comments

I'm running a stable ABL case on Summit and am now encountering a situation where the MAC projection iterations max out. In contrast to issue https://github.com/Exawind/amr-wind/issues/859, the nodal projections are fine, but the MAC projections hit 200 iterations immediately.

The stable case input file we're trying to run is here: https://github.com/lawrenceccheung/AWAKEN_summit_setup/blob/main/precursor/StableABL1/StableABL_precursor2.inp

A few things to note:

  • If we switch from MLMG to hypre (see this input file), the MAC projection converges, but every timestep takes much longer.
  • If we stick to the level 0 mesh only (i.e., remove all refinements), the case runs fine.
  • This ABL has run fine before: that setup used the exact same boundary conditions as this case, except on a larger domain with refinement levels. It's also the same stable ABL as this case here, which also included mesh refinement.

There was another case where the MAC projections also failed, on an unstable ABL case with multiple refinement levels, and switching to hypre allowed it to keep running (see input file here), but this is again non-optimal -- ideally MLMG should work in these cases.
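For reference, a minimal sketch of what the hypre workaround looks like in an input file, assuming the usual mac_proj.bottom_solver and hypre.* option names (the linked input files are authoritative, not this sketch):

# Sketch only -- option names assumed; see the linked input files for the actual settings
mac_proj.bottom_solver     = hypre
hypre.hypre_solver         = GMRES
hypre.hypre_preconditioner = BoomerAMG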

Lawrence

lawrenceccheung avatar Jul 25 '23 20:07 lawrenceccheung

Lawrence, I've run into similar issues in the past, and maybe our issues are related. Mine was with the bottom solver: the smallest problem handed to the multigrid solver was still too large. I had a round number of grid points in some directions, and it was only divisible by 2 so many times. I also had issues with refinements, and that was especially bad when the refinements were touching the bottom boundary.

Looking at your number of cells, that seems similar to what I had. If you want to rule that out, you could try changing amr.n_cell to powers of 2 or something close (like 4096 5120 96 in your case).
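To make that concrete, a sketch of the change using the numbers suggested above (the factorizations show how many times each extent halves before the multigrid bottom solve):

amr.n_cell = 4096 5120 96    # 4096 = 2^12, 5120 = 2^10 * 5, 96 = 2^5 * 3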

rthedin avatar Jul 26 '23 22:07 rthedin

Thanks @rthedin, these are good suggestions. One thing I tried was moving the refinement zones higher so they wouldn't touch the bottom boundary, but that didn't help with the MAC projection. That said, we do need refinement zones close to the ground surface for this application, so it wouldn't have been a perfect solution anyway.

Yes, I think something is possibly going on with the cell counts or grid sizes. Changing amr.blocking_factor and amr.max_grid_size in the unstable ABL case where the MAC projections failed didn't help, but maybe a strict power of 2 is necessary.
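For context, a sketch of the knobs being varied here (values are illustrative, not the ones from the failing case):

# Illustrative values only -- these are the knobs being varied, not a known fix
amr.blocking_factor = 32     # every grid dimension must be a multiple of this
amr.max_grid_size   = 128    # upper bound on any grid dimension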

Lawrence

lawrenceccheung avatar Jul 27 '23 17:07 lawrenceccheung

This issue is stale because it has been open 30 days with no activity.

github-actions[bot] avatar Aug 27 '23 02:08 github-actions[bot]

More updates on this MAC projection issue: I tried it on Frontier and see the same problem, so it's independent of the machine architecture. However, it does look like it's sensitive to a number of things, including:

  • mesh domain/n_cell
  • the number of cores that are used
  • the forcing used.

If you're interested in trying this out, StableABL_precursor1.inp is another case with the same ABL BCs as StableABL_precursor2.inp, but set up on a slightly smaller domain and with different mesh counts. I'm also only running it for 10 iterations so it fits in the debug queue.

  1. It works on 128 nodes/1024 GPUs on Frontier. The first nodal projection step takes 92 iterations, but after that, both MAC and nodal projections seem to converge within 10 iterations.
  2. If you use 256 nodes/2048 GPUs, the MAC projections max out.
  3. If you turn on ABLMeanBoussinesq, the MAC projections also max out.

Any thoughts @psakievich, @asalmgren, or @jrood-nrel? I can try other cases or mesh counts, but this seems to be fairly hit-or-miss at this point.

Lawrence

lawrenceccheung avatar Sep 11 '23 18:09 lawrenceccheung


Just to confirm: on a recent build of AMR-Wind (a1caec3) on Frontier, the MAC projections are still hitting the iteration limit when I run on GPUs. I haven't been able to test many different combinations of GPUs/CPUs, but other ABL cases with only level 0 have been running fine on Frontier so far.

Lawrence

lawrenceccheung avatar Dec 28 '23 17:12 lawrenceccheung

Do we have a specific case that runs on CPU and fails on GPU?


asalmgren avatar Dec 28 '23 17:12 asalmgren

I don't have a case right now that consistently runs on CPUs and fails on GPUs, but since there seems to be a processor-count dependence, it's likely that the StableABL_precursor1.inp case mentioned above will work on some number of CPUs and fail on some other number of GPUs.

Lawrence

lawrenceccheung avatar Dec 28 '23 17:12 lawrenceccheung

Maybe the best use of everyone’s time is to wait until the next time this happens so we can go after a specific case?


asalmgren avatar Dec 30 '23 18:12 asalmgren


I just changed a working case from a blocking_factor of 32 to 64 and the simulation went from fine to maxing out iterations on all solvers. That was on the GPU; I haven't tried it on the CPU.

jrood-nrel avatar Apr 19 '24 14:04 jrood-nrel

@jrood-nrel - is there any update on this? It would be good to determine if it's machine weirdness or something in the code we can fix...

asalmgren avatar Apr 21 '24 15:04 asalmgren


I am just playing catch-up here and trying to replicate what others have done.

Preliminaries

  • The first thing I noticed was that the time step in the input file I was given was a bit large and led to CFL violation warnings, so I dropped the time step slightly to avoid that.
  • This is the input file I am running: StableABL_precursor1.inp.txt
  • I am using this command to run on Frontier GPUs:
 srun -N128 -n1024 -c1 --gpus-per-node=8 --gpu-bind=closest amr_wind StableABL_precursor1.inp time.max_step=20 amrex.abort_on_out_of_gpu_memory=1 amrex.the_arena_is_managed=0 amr.blocking_factor=16 amr.max_grid_size=128 amrex.use_profiler_syncs=0 amrex.async_out=0
  • I am using this version of amr-wind:
==============================================================================
                AMR-Wind (https://github.com/exawind/amr-wind)

  AMR-Wind version :: v2.1.0-13-ge986100e
  AMR-Wind Git SHA :: e986100e5722648d991f9102c7b3859b1d1a03a5
  AMReX version    :: 24.05-20-g5d02c6480a0d

  Exec. time       :: Fri May 31 16:44:38 2024
  Build time       :: May 24 2024 12:40:23
  C++ compiler     :: Clang 17.0.0

  MPI              :: ON    (Num. ranks = 1024)
  GPU              :: ON    (Backend: HIP)
  OpenMP           :: OFF

  Enabled third-party libraries:
    NetCDF    4.9.2

           This software is released under the BSD 3-clause license.
 See https://github.com/Exawind/amr-wind/blob/development/LICENSE for details.
-----------------------------------------------------------------------------

Observations

  • This case is a pain to debug. Turnaround time in the Frontier debug queue for one run is several hours: the run itself takes about 7 minutes, but you end up sitting in the queue forever. Here's a summary of the grid.
Grid summary:
  Level 0   1848 grids  1200291840 cells  100 % of domain
            smallest grid: 128 x 32 x 80  biggest grid: 128 x 64 x 80
  Level 1   3100 grids  4846387200 cells  50.47092547 % of domain
            smallest grid: 112 x 128 x 96  biggest grid: 128 x 128 x 96
  Level 2   6729 grids  13753548800 cells  17.90391244 % of domain
            smallest grid: 32 x 16 x 128  biggest grid: 128 x 128 x 128
  • I did a run with 128 nodes and one with 256 nodes. Both show MAC_projection hitting 200 iterations, but at different time steps: the 128-node case shows just 1 instance of this in the first 20 steps, while the 256-node case shows 4 instances.
  • These max iters seem to happen early in the run -- in the first 10 steps, not in the next 10. I haven't run further than that.
  • Comparing the steps at which this happens across runs with different node counts doesn't reveal much. In this snapshot, on step 4, the 256-node case hit the max iters. But the values of everything else -- residuals, min/max velocities and their locations, etc. -- are all basically the same. The obvious exception is the mac_projection line itself (O(1e-8) vs O(1e-9)). [Screenshot: side-by-side solver output from the 128- and 256-node runs]

Next steps

@asalmgren I will take suggestions for things to try. I can try to make this case smaller, but it's going to take a while to find a smaller case where this happens, given that things keep changing with node counts, blocking factors, grids, etc. Maybe the MAC projection tolerances are too tight?
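For reference, loosening those tolerances would look roughly like the lines below; the mg_rtol/mg_atol names follow the mac_proj.* pattern used elsewhere in this thread, but treat them (and the values) as an assumption, not a recommendation:

# Sketch only -- assumed option names, illustrative values
mac_proj.mg_rtol = 1.0e-10   # relative tolerance for the MAC projection solve
mac_proj.mg_atol = 1.0e-12   # absolute tolerance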

marchdf avatar Jun 03 '24 16:06 marchdf

Marc -- we need to know why it isn't converging -- so the first thing to look at is whether the bottom solver is converging.

Turn on the bottom solver verbosity enough to tell whether it is maxing out on iterations.

Something that would also help would be to see the residual at each level up and down the V-cycle -- that will tell us whether the issue is in one of the higher AMR levels or at something coarser than AMR level 0.

Can you turn on more verbosity so we can see that as well?

One final thought -- I'd be more comfortable resolving the other MAC issue (with inconsistent filling of boundary values) before going after this one -- is it possible that it has caused this issue (i.e., that the BCs are making the system unsolvable due to inconsistent filling of boundary values)?

So maybe do both -- kick off the runs with a bunch of verbosity, and at the same time see if you can resolve the fillpatch issue?

Those are my best suggestions for a path forward.


asalmgren avatar Jun 03 '24 16:06 asalmgren

Ok will give these a shot. The thought had also occurred to me about the other issue...

marchdf avatar Jun 03 '24 16:06 marchdf

Ok so I am getting a whole bunch of:

MLCGSolver_BiCGStab: Initial error (error0) =        1.349805439e-11
MLCGSolver_BiCGStab: Final: Iteration  201 rel. err. 0.003540589525
MLCGSolver_BiCGStab:: failed to converge!!
MLCGSolver_BiCGStab: Initial error (error0) =        1.344908633e-11
MLCGSolver_BiCGStab: Final: Iteration  201 rel. err. 0.005534598243
MLCGSolver_BiCGStab:: failed to converge!!
MLCGSolver_BiCGStab: Initial error (error0) =        1.340034281e-11
MLCGSolver_BiCGStab: Final: Iteration  201 rel. err. 0.002623853639
MLCGSolver_BiCGStab:: failed to converge!!
MLMG: Timers: Solve = 43.64502304 Iter = 43.60381034 Bottom = 28.08960259
  MAC_projection               200         0.01090075748       3.583547755e-08

and I am running with:

mac_proj.verbose = 1
mac_proj.bottom_verbose = 2

Is that verbose enough, or do you want more? It's hard to know what the verbosity levels correspond to, so I took a wild guess.

marchdf avatar Jun 03 '24 18:06 marchdf

If the bottom solver is going from 1e-11 to 1e-14, that is fine.

Go ahead and set the bottom solver absolute tolerance to 1e-14.

Set mac_proj.verbose to 4
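In input-file form, that suggestion is roughly the following; the verbosity options are the ones already used above, while the bottom-tolerance option name is an assumption, not confirmed in this thread:

mac_proj.verbose        = 4
mac_proj.bottom_verbose = 2
mac_proj.bottom_atol    = 1.0e-14   # assumed name for the bottom-solver absolute tolerance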


asalmgren avatar Jun 03 '24 18:06 asalmgren

Things are getting verbose ;) Here's the output: debug_mac_segfault.o1997522.txt

marchdf avatar Jun 03 '24 20:06 marchdf

Ah ok -- that is useful.

Next question -- if you run this again with exactly the same executable and inputs file on the same number of ranks/nodes, will it fail exactly the same?

i.e. will you see exactly these same numbers at the same steps?

MAC_projection     9   0.01059370304   2.170464121e-09
MAC_projection   200   0.01090075747   3.406772667e-08
MAC_projection   200   0.01070578287   3.728472752e-08
MAC_projection   200   0.01059956843   3.751627153e-08
MAC_projection   200   0.01398727573   3.031940042e-08
MAC_projection     7   0.01349900866   4.186323505e-09
MAC_projection     7   0.0133575733    3.643233593e-09
MAC_projection     6   0.01556601993   1.358084751e-08
MAC_projection     6   0.01523504224   1.387033766e-08
MAC_projection     7   0.01389531987   3.696991249e-09
MAC_projection     7   0.01349196508   3.703382658e-09
MAC_projection     7   0.01397295297   3.724432835e-09
MAC_projection     7   0.0145134622    4.762901849e-09
MAC_projection     7   0.0155179704    4.115332929e-09
MAC_projection     7   0.01654709104   2.063066006e-09
MAC_projection     6   0.01760117101   1.673764457e-08
MAC_projection     6   0.01867633017   1.72870855e-08
MAC_projection     6   0.01973103806   1.747265777e-08
MAC_projection     6   0.02076201332   1.68252392e-08
MAC_projection     6   0.0217731548    1.751393021e-08
MAC_projection     6   0.02276651583   1.749625856e-08
MAC_projection     6   0.02374135767   1.806140752e-08
MAC_projection     6   0.02469769443   1.901520436e-08


asalmgren avatar Jun 03 '24 20:06 asalmgren

It is non-deterministic. Just because things weren't fun enough. Here's the other log file so you can look as well: debug_mac_segfault.o1997967.txt

[Screenshot: MAC_projection iteration counts from the repeat run, differing from the first run]

marchdf avatar Jun 03 '24 21:06 marchdf

OK -- just looking at the second MAC projection in each stack, you can see the differences in the first V-cycle.

This suggests something is either uninitialized or at least not consistently initialized in the data.

Let's get the other MAC (fillpatch) issue fixed first and see if unraveling that thread fixes this as well.


asalmgren avatar Jun 04 '24 06:06 asalmgren

Worked with @WeiqunZhang and @asalmgren on this. I think this is solved once this PR gets merged in: https://github.com/AMReX-Codes/amrex/pull/3991. I ran the case StableABL_precursor1.inp.txt and all MAC_projection iteration counts are around 6-9.

Per @WeiqunZhang:

The observation is that for the failed bottom solves, the bottom solver was able to reduce the residual by 1.e-2, but not the target of 1.e-4. In the development branch of amrex, we discard that result and apply the smoother 8 times. That probably makes things worse compared to the result of bicgstab. In the draft PR, the starting point for the smoothing is the result of bicgstab if it is an improvement, even though unconverged. I also added an option to AMReX-Hydro to change the default number of smoothing after the bicgstab. The default in amrex is 8, but you can use something like mac_proj.num_final_smooth = 16 if 8 is not sufficient.
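For completeness, the workaround named in that quote is a one-line input change (the option name and default of 8 come directly from the quote above; 16 is the example value given there):

mac_proj.num_final_smooth = 16   # default is 8; raise if bicgstab leaves the bottom solve short of tolerance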

I will close this issue once I've updated the submodules.

marchdf avatar Jun 20 '24 15:06 marchdf