amr-wind
Nodal projections maxing out on ABL calculation
I am re-running an ABL case as part of AWAKEN, and with the latest build of amr-wind (a75d2ec013e15c567faa5873098a35a7b484c06a) the nodal_projections are maxing out. This is a case that I've run before, but I'm adding different sampling planes. You can see the basic configuration here: https://github.com/lawrenceccheung/AWAKEN_summit_setup/blob/main/UnstableABL_farmrun1/UnstableABL_farmrun1_noturbs.inp. The last time I ran this, both the nodal projections and MAC projections required only 8 iterations per timestep.
I tried this case with a slightly older build of amr-wind (185c360a434cb002b620c2289b90e2343c96e524) from April, and the case is working fine with that exe. So sometime between then and now, something was introduced which affected the ABL solver. I'll continue trying to find which commit is causing the issue, but I'm curious if anybody else is seeing this problem.
Lawrence
This sounds like what we are seeing for the blade-resolved cases as well. @ashesh2512 @PaulMullowney @marchdf
There is a problem in the amr-wind boundary conditions. I hope that it's the same problem.
The nodal projections maxing out in the middle of a GPU simulation has always been an issue for the large blade-resolved runs. I have observed the issue for over a year now.
This is a pretty high priority issue for AWAKEN. They have several runs planned for Summit before the ALCC allocation is up in the next few weeks. If anyone has ideas please jump on this.
Quick update: the problem occurs somewhere between 257c13c1a634d841627a04f83f1923b2a8556ca5 (May 1) and the current commit a75d2ec013e15c567faa5873098a35a7b484c06a.
Lawrence
@asalmgren Wanted to get this on your radar.
@lawrenceccheung -- could you do some additional git bisection to see which git commit breaks things?
Yes, the latest bisection I did shows that the problem is happening somewhere between bbe0fddb and https://github.com/Exawind/amr-wind/commit/a75d2ec013e15c567faa5873098a35a7b484c06a. I also tried the very latest commit (4b71037), and that also maxes out on the nodal projection.
However, the more frustrating thing I've found is that this problem seems to have a random element to it. On a commit that I thought was working (9eb5e619, from Phil's b/awaken-runs branch), I resubmitted the exact same job with the same executable, and something that was working before is now maxing out on nodal_projections. Is there some Summit hardware component to this issue? Commits that were never working seem to be consistently failing, though.
Lawrence
@lawrenceccheung - when you run those specific commits are they always run with the amrex version in the submodule, or do you sometimes run with an external amrex?
If everything about the commits -- including version of amrex and amrex-hydro -- is the same but it now fails, that does suggest hardware and/or system software.
-- Ann Almgren Senior Scientist; Dept. Head, Applied Mathematics Pronouns: she/her/hers
@asalmgren -- they're run with the amrex library as a submodule. Everything's been built with spack-manager.
Lawrence
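For anyone who wants to double-check that two builds really used the same submodule revisions, here is a minimal sketch that parses the output of `git submodule status --recursive` and lists the paths whose recorded commits differ. The function names are made up for illustration, and the submodule paths in the usage note are placeholders rather than the confirmed layout of this repository.

```python
import re
import subprocess

# One line of `git submodule status`: a flag character (' ', '+', '-' or 'U'),
# the commit hash, the submodule path, and an optional "(describe)" suffix.
STATUS_RE = re.compile(r"^([ +\-U])([0-9a-f]{40,64}) (\S+)(?: \((.*)\))?$")


def parse_submodule_status(text):
    """Map each submodule path to a (commit hash, flag) pair."""
    result = {}
    for line in text.splitlines():
        match = STATUS_RE.match(line)
        if match:
            flag, sha, path = match.group(1), match.group(2), match.group(3)
            result[path] = (sha, flag)
    return result


def read_submodule_status(repo_dir):
    """Run git in repo_dir and return the parsed submodule status."""
    out = subprocess.run(
        ["git", "submodule", "status", "--recursive"],
        cwd=repo_dir, capture_output=True, text=True, check=True,
    ).stdout
    return parse_submodule_status(out)


def differing_submodules(first, second):
    """Return the sorted paths whose commit hashes differ between two builds."""
    paths = sorted(set(first) | set(second))
    return [p for p in paths
            if first.get(p, (None,))[0] != second.get(p, (None,))[0]]
```

Calling `differing_submodules(read_submodule_status(dir_a), read_submodule_status(dir_b))` on two source trees (hypothetical `dir_a` and `dir_b`) should return an empty list if both builds pinned the same amrex and AMReX-Hydro commits. A `+` flag on a line means the checked-out commit differs from the one recorded in the superproject.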
Ok -- I'm assuming AMReX-Hydro is also a submodule.
What is your current perspective -- that there is a code change that broke something or that a change in the system has broken or exposed something?
Yes, everything is a submodule. My perspective is that something changed over the last two months -- either in the ExaWind code, or Summit hardware, or both -- which is causing the bottom solver to not converge. I'd like to eliminate the ExaWind code as a possible source of the problem. There are some commits which seem to always fail, so if we can get to a commit that at least works part of the time, we can go from there.
Ok, so let's go back to the git bisection approach, but maybe the test for "works" vs "fails" needs to be based on multiple runs rather than a single run?
Can you identify a single commit where -- on today's hardware and software stack -- things go from "mostly working" to "mostly failing"?
When you say it's the bottom solver that is failing, is that hypre or the amrex default BiCG?
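To make the "mostly working" versus "mostly failing" test concrete, here is a minimal sketch of a helper that could be driven by `git bisect run`. It runs a command several times and turns the failure count into one of the exit codes that `git bisect run` understands (0 for good, 125 for skip, any other code from 1 to 127 for bad). The 50% threshold, the script name, and the wrapped command are assumptions for illustration. The wrapped command is assumed to block until the case finishes and to return nonzero when the nodal projection hits its iteration cap.

```python
import subprocess
import sys

GOOD, BAD, SKIP = 0, 1, 125  # exit codes understood by `git bisect run`


def classify(failures, runs, bad_fraction=0.5):
    """Return a bisect exit code from the observed failure count."""
    if runs == 0:
        return SKIP  # nothing ran, so make no claim about this commit
    return BAD if failures / runs >= bad_fraction else GOOD


def run_trials(cmd, trials):
    """Run cmd `trials` times; a nonzero return code counts as a failure."""
    failures = 0
    for _ in range(trials):
        if subprocess.run(cmd).returncode != 0:
            failures += 1
    return failures


def main(argv):
    """argv is [TRIALS, CMD, ARGS...]; returns the bisect exit code."""
    trials, cmd = int(argv[0]), argv[1:]
    return classify(run_trials(cmd, trials), trials)


if __name__ == "__main__" and len(sys.argv) > 2:
    sys.exit(main(sys.argv[1:]))
```

Hypothetical usage, with `bisect_flaky.py` and `./run_case.sh` as placeholder names: `git bisect run python3 bisect_flaky.py 5 ./run_case.sh`. Each bisection step would rebuild and run the case five times, so the cost grows quickly. A smaller trial count trades confidence for allocation hours.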
These cases I'm running are without hypre, so with the amrex defaults. Another data point we just got: if we run just a simple, single-level precursor (https://github.com/lawrenceccheung/AWAKEN_summit_setup/blob/main/precursor/StableABL1/KingPlains_stable_precursor9.inp), then the latest commit works fine, with no issues in the MAC or nodal projections. However, the production cases, which use multiple levels and inflow/outflow BCs, see projection problems both with and without turbines.
I'll continue bisecting to see if I can isolate the problem down to a single commit, but obviously we will need to run multiple times to get a sense of whether things are truly working or not.
Lawrence
These symptoms are identical to what Marc and I debugged last week.
Lawrence -- are you running the current version with all of Paul's fixes?
Specifically, you need 5ae3533
Yes, I tried out 4b71037218723e0c63d54c140423ef503ac3c912 which includes Paul's fixes.
Just to keep everyone up to date, I talked with @PaulMullowney and @psakievich earlier, and I'm going to get some debug information from the nodal projection operation to help diagnose things. We will also try this problem with a CPU-only build on Summit to see if that has different behavior.
Lawrence
More data for those interested in this problem. Here's the verbose output from the nodal projection:
Nodal Projection:
>> Before projection:
* On lev 0 max(abs(rhs)) = 0.05129219173
* On lev 1 max(abs(rhs)) = 0.08140351992
* On lev 2 max(abs(rhs)) = 0.06800058622
* On lev 3 max(abs(rhs)) = 0.1416014512
MLMG: # of AMR levels: 4
# of MG levels on the coarsest AMR level: 7
MLMG: Initial rhs = 0.1416014512
MLMG: Initial residual (resid0) = 0.1416014512
MLMG: Iteration 1 Fine resid/bnorm = 0.003253548735
MLMG: Iteration 2 Fine resid/bnorm = 0.0001444938807
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration 3 Fine resid/bnorm = 2.6269658e-05
MLMG: Iteration 4 Fine resid/bnorm = 5.256018808e-06
MLMG: Iteration 5 Fine resid/bnorm = 1.160243444e-06
MLMG: Iteration 6 Fine resid/bnorm = 2.735217813e-07
MLMG: Iteration 6 Crse resid/bnorm = 0.01787294653
MLMG: Iteration 7 Fine resid/bnorm = 6.391661217e-08
MLMG: Iteration 7 Crse resid/bnorm = 0.01787773617
MLMG: Iteration 8 Fine resid/bnorm = 1.489605667e-08
MLMG: Iteration 8 Crse resid/bnorm = 0.01788884284
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration 9 Fine resid/bnorm = 3.472879871e-09
MLMG: Iteration 9 Crse resid/bnorm = 0.01787472629
MLMG: Iteration 10 Fine resid/bnorm = 8.11588805e-10
MLMG: Iteration 10 Crse resid/bnorm = 0.01787888439
MLMG: Iteration 11 Fine resid/bnorm = 1.903855552e-10
MLMG: Iteration 11 Crse resid/bnorm = 0.01787938321
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration 12 Fine resid/bnorm = 4.502377799e-11
MLMG: Iteration 12 Crse resid/bnorm = 0.01787897501
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration 13 Fine resid/bnorm = 1.091071081e-11
MLMG: Iteration 13 Crse resid/bnorm = 0.01787297062
MLMG: Iteration 14 Fine resid/bnorm = 2.759064114e-12
MLMG: Iteration 14 Crse resid/bnorm = 0.01789029449
MLMG: Iteration 15 Fine resid/bnorm = 5.996495376e-13
MLMG: Iteration 15 Crse resid/bnorm = 0.01787960706
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration 16 Fine resid/bnorm = 1.656790951e-13
MLMG: Iteration 16 Crse resid/bnorm = 0.01787786377
MLMG: Iteration 17 Fine resid/bnorm = 3.194018445e-13
MLMG: Iteration 17 Crse resid/bnorm = 0.01787936302
MLMG: Iteration 18 Fine resid/bnorm = 2.408747894e-12
MLMG: Iteration 18 Crse resid/bnorm = 0.01787785641
MLMG: Iteration 19 Fine resid/bnorm = 6.594079131e-14
MLMG: Iteration 19 Crse resid/bnorm = 0.01787782496
MLMG: Iteration 20 Fine resid/bnorm = 1.525122225e-14
MLMG: Iteration 20 Crse resid/bnorm = 0.01787782455
MLMG: Iteration 21 Fine resid/bnorm = 2.666321369e-12
MLMG: Iteration 21 Crse resid/bnorm = 0.01787936213
MLMG: Iteration 22 Fine resid/bnorm = 3.029578972e-13
MLMG: Iteration 22 Crse resid/bnorm = 0.01787939374
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration 23 Fine resid/bnorm = 2.882786317e-13
MLMG: Iteration 23 Crse resid/bnorm = 0.01787785749
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration 24 Fine resid/bnorm = 2.615021368e-13
MLMG: Iteration 24 Crse resid/bnorm = 0.0178782228
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration 25 Fine resid/bnorm = 2.342811696e-13
MLMG: Iteration 25 Crse resid/bnorm = 0.017872955
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration 26 Fine resid/bnorm = 2.099695633e-13
MLMG: Iteration 26 Crse resid/bnorm = 0.01787854881
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration 27 Fine resid/bnorm = 1.886630268e-13
MLMG: Iteration 27 Crse resid/bnorm = 0.01787895266
MLMG: Iteration 28 Fine resid/bnorm = 2.005579417e-12
MLMG: Iteration 28 Crse resid/bnorm = 0.01788886749
MLMG: Iteration 29 Fine resid/bnorm = 1.385191902e-13
MLMG: Iteration 29 Crse resid/bnorm = 0.01787805164
MLMG: Iteration 30 Fine resid/bnorm = 3.14270313e-12
MLMG: Iteration 30 Crse resid/bnorm = 0.0178834543
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration 31 Fine resid/bnorm = 3.28548558e-12
MLMG: Iteration 31 Crse resid/bnorm = 0.01787303399
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration 32 Fine resid/bnorm = 3.230135942e-12
MLMG: Iteration 32 Crse resid/bnorm = 0.01788467295
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration 33 Fine resid/bnorm = 3.156100853e-12
MLMG: Iteration 33 Crse resid/bnorm = 0.01788489252
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration 34 Fine resid/bnorm = 3.081448632e-12
MLMG: Iteration 34 Crse resid/bnorm = 0.01787796392
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration 35 Fine resid/bnorm = 3.007326257e-12
MLMG: Iteration 35 Crse resid/bnorm = 0.01787782794
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration 36 Fine resid/bnorm = 2.935984035e-12
MLMG: Iteration 36 Crse resid/bnorm = 0.01787782485
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration 37 Fine resid/bnorm = 2.865849277e-12
MLMG: Iteration 37 Crse resid/bnorm = 0.01787294671
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration 38 Fine resid/bnorm = 2.798755001e-12
MLMG: Iteration 38 Crse resid/bnorm = 0.01787811886
MLMG: Iteration 39 Fine resid/bnorm = 1.464744421e-12
MLMG: Iteration 39 Crse resid/bnorm = 0.01787865283
MLMG: Iteration 40 Fine resid/bnorm = 1.891242674e-12
MLMG: Iteration 40 Crse resid/bnorm = 0.01787937996
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration 41 Fine resid/bnorm = 1.973565393e-12
MLMG: Iteration 41 Crse resid/bnorm = 0.01787298012
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration 42 Fine resid/bnorm = 1.941662152e-12
MLMG: Iteration 42 Crse resid/bnorm = 0.01787300104
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration 43 Fine resid/bnorm = 1.898929252e-12
MLMG: Iteration 43 Crse resid/bnorm = 0.017873055
MLMG: Iteration 44 Fine resid/bnorm = 1.232316368e-12
MLMG: Iteration 44 Crse resid/bnorm = 0.01787852775
MLMG: Iteration 45 Fine resid/bnorm = 1.714398555e-12
MLMG: Iteration 45 Crse resid/bnorm = 0.01787348297
MLMG: Iteration 46 Fine resid/bnorm = 9.002805586e-14
MLMG: Iteration 46 Crse resid/bnorm = 0.01787286996
MLMG: Iteration 47 Fine resid/bnorm = 2.82221786e-14
MLMG: Iteration 47 Crse resid/bnorm = 0.0178777298
MLMG: Iteration 48 Fine resid/bnorm = 3.003004806e-12
MLMG: Iteration 48 Crse resid/bnorm = 0.01787936012
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration 49 Fine resid/bnorm = 3.137323521e-12
MLMG: Iteration 49 Crse resid/bnorm = 0.01788476414
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration 50 Fine resid/bnorm = 3.091100689e-12
MLMG: Iteration 50 Crse resid/bnorm = 0.01788714677
MLMG: Iteration 51 Fine resid/bnorm = 1.551855651e-12
MLMG: Iteration 51 Crse resid/bnorm = 0.01787955199
MLMG: Iteration 52 Fine resid/bnorm = 2.030891852e-12
MLMG: Iteration 52 Crse resid/bnorm = 0.01787786083
MLMG: Iteration 53 Fine resid/bnorm = 1.185811003e-13
MLMG: Iteration 53 Crse resid/bnorm = 0.01787782542
MLMG: Iteration 54 Fine resid/bnorm = 2.747905981e-12
MLMG: Iteration 54 Crse resid/bnorm = 0.01787936216
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration 55 Fine resid/bnorm = 2.709708154e-12
MLMG: Iteration 55 Crse resid/bnorm = 0.01787785774
MLMG: Iteration 56 Fine resid/bnorm = 1.351785648e-12
MLMG: Iteration 56 Crse resid/bnorm = 0.01787864724
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration 57 Fine resid/bnorm = 1.376379021e-12
MLMG: Iteration 57 Crse resid/bnorm = 0.01787784289
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration 58 Fine resid/bnorm = 1.351846136e-12
MLMG: Iteration 58 Crse resid/bnorm = 0.01787782532
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration 59 Fine resid/bnorm = 1.322915232e-12
MLMG: Iteration 59 Crse resid/bnorm = 0.01787782483
MLMG: Iteration 60 Fine resid/bnorm = 3.455525941e-12
MLMG: Iteration 60 Crse resid/bnorm = 0.01788628357
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration 61 Fine resid/bnorm = 3.537860911e-12
MLMG: Iteration 61 Crse resid/bnorm = 0.01788603893
MLMG: Iteration 62 Fine resid/bnorm = 2.154511215e-12
MLMG: Iteration 62 Crse resid/bnorm = 0.01787798669
MLMG: Iteration 63 Fine resid/bnorm = 1.304578162e-13
MLMG: Iteration 63 Crse resid/bnorm = 0.01787782822
MLMG: Iteration 64 Fine resid/bnorm = 1.465418978e-14
MLMG: Iteration 64 Crse resid/bnorm = 0.01787782462
MLMG: Iteration 65 Fine resid/bnorm = 1.124168693e-14
MLMG: Iteration 65 Crse resid/bnorm = 0.01787294584
MLMG: Iteration 66 Fine resid/bnorm = 7.215115409e-15
MLMG: Iteration 66 Crse resid/bnorm = 0.01787773615
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration 67 Fine resid/bnorm = 7.196787144e-15
MLMG: Iteration 67 Crse resid/bnorm = 0.01788007489
MLMG: Iteration 68 Fine resid/bnorm = 2.608413622e-14
MLMG: Iteration 68 Crse resid/bnorm = 0.01787787559
MLMG: Iteration 69 Fine resid/bnorm = 1.268920833e-13
MLMG: Iteration 69 Crse resid/bnorm = 0.01787782574
MLMG: Iteration 70 Fine resid/bnorm = 5.934529569e-14
MLMG: Iteration 70 Crse resid/bnorm = 0.01787294656
MLMG: Iteration 71 Fine resid/bnorm = 2.371935932e-14
MLMG: Iteration 71 Crse resid/bnorm = 0.01787773583
MLMG: Iteration 72 Fine resid/bnorm = 1.981232829e-14
MLMG: Iteration 72 Crse resid/bnorm = 0.01787782253
MLMG: Iteration 73 Fine resid/bnorm = 1.784378647e-14
MLMG: Iteration 73 Crse resid/bnorm = 0.01787782448
MLMG: Iteration 74 Fine resid/bnorm = 1.577417599e-14
MLMG: Iteration 74 Crse resid/bnorm = 0.01787783286
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration 75 Fine resid/bnorm = 1.472195172e-14
MLMG: Iteration 75 Crse resid/bnorm = 0.01787311883
MLMG: Iteration 76 Fine resid/bnorm = 2.57176628e-12
MLMG: Iteration 76 Crse resid/bnorm = 0.01787880959
MLMG: Iteration 77 Fine resid/bnorm = 1.205481874e-12
MLMG: Iteration 77 Crse resid/bnorm = 0.01787316656
MLMG: Iteration 78 Fine resid/bnorm = 6.95883745e-13
MLMG: Iteration 78 Crse resid/bnorm = 0.01788467394
MLMG: Iteration 79 Fine resid/bnorm = 1.121402704e-13
MLMG: Iteration 79 Crse resid/bnorm = 0.01787795867
MLMG: Iteration 80 Fine resid/bnorm = 2.807411685e-14
MLMG: Iteration 80 Crse resid/bnorm = 0.01787782755
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration 81 Fine resid/bnorm = 2.310720483e-14
MLMG: Iteration 81 Crse resid/bnorm = 0.01788629447
MLMG: Iteration 82 Fine resid/bnorm = 3.203596997e-13
MLMG: Iteration 82 Crse resid/bnorm = 0.01787799248
MLMG: Iteration 83 Fine resid/bnorm = 1.347752089e-12
MLMG: Iteration 83 Crse resid/bnorm = 0.01787894036
MLMG: Iteration 84 Fine resid/bnorm = 1.875598279e-12
MLMG: Iteration 84 Crse resid/bnorm = 0.01787938449
MLMG: Iteration 85 Fine resid/bnorm = 3.467252968e-12
MLMG: Iteration 85 Crse resid/bnorm = 0.01787785312
MLMG: Iteration 86 Fine resid/bnorm = 1.61910842e-13
MLMG: Iteration 86 Crse resid/bnorm = 0.01787294289
MLMG: Iteration 87 Fine resid/bnorm = 2.41671241e-12
MLMG: Iteration 87 Crse resid/bnorm = 0.01788620696
MLMG: Iteration 88 Fine resid/bnorm = 2.710866232e-13
MLMG: Iteration 88 Crse resid/bnorm = 0.01787952809
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration 89 Fine resid/bnorm = 2.688125784e-13
MLMG: Iteration 89 Crse resid/bnorm = 0.01787271123
MLMG: Iteration 90 Fine resid/bnorm = 1.394295738e-12
MLMG: Iteration 90 Crse resid/bnorm = 0.0178788482
MLMG: Iteration 91 Fine resid/bnorm = 1.193897609e-12
MLMG: Iteration 91 Crse resid/bnorm = 0.01788477827
MLMG: Iteration 92 Fine resid/bnorm = 1.825113814e-13
MLMG: Iteration 92 Crse resid/bnorm = 0.01787794673
MLMG: Iteration 93 Fine resid/bnorm = 3.025474972e-14
MLMG: Iteration 93 Crse resid/bnorm = 0.01787782761
MLMG: Iteration 94 Fine resid/bnorm = 2.962240447e-12
MLMG: Iteration 94 Crse resid/bnorm = 0.01787936209
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration 95 Fine resid/bnorm = 3.100728245e-12
MLMG: Iteration 95 Crse resid/bnorm = 0.01787785774
MLMG: Iteration 96 Fine resid/bnorm = 1.396405929e-12
MLMG: Iteration 96 Crse resid/bnorm = 0.01787893771
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration 97 Fine resid/bnorm = 1.306403523e-12
MLMG: Iteration 97 Crse resid/bnorm = 0.01787296948
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration 98 Fine resid/bnorm = 1.309746445e-12
MLMG: Iteration 98 Crse resid/bnorm = 0.01787285825
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration 99 Fine resid/bnorm = 1.289355844e-12
MLMG: Iteration 99 Crse resid/bnorm = 0.01787773427
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration 100 Fine resid/bnorm = 1.263828351e-12
MLMG: Iteration 100 Crse resid/bnorm = 0.01787782264
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration 101 Fine resid/bnorm = 1.23434157e-12
MLMG: Iteration 101 Crse resid/bnorm = 0.01787782463
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration 102 Fine resid/bnorm = 1.207193916e-12
MLMG: Iteration 102 Crse resid/bnorm = 0.01787782464
MLMG: Iteration 103 Fine resid/bnorm = 2.738137542e-13
MLMG: Iteration 103 Crse resid/bnorm = 0.01787782454
MLMG: Iteration 104 Fine resid/bnorm = 5.367004168e-14
MLMG: Iteration 104 Crse resid/bnorm = 0.0178775522
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration 105 Fine resid/bnorm = 5.294853971e-14
MLMG: Iteration 105 Crse resid/bnorm = 0.01787449382
MLMG: Iteration 106 Fine resid/bnorm = 5.030654182e-14
MLMG: Iteration 106 Crse resid/bnorm = 0.0178777665
MLMG: Iteration 107 Fine resid/bnorm = 1.484204732e-13
MLMG: Iteration 107 Crse resid/bnorm = 0.01787822016
MLMG: Iteration 108 Fine resid/bnorm = 7.001659934e-12
MLMG: Iteration 108 Crse resid/bnorm = 0.01788584473
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration 109 Fine resid/bnorm = 7.21933962e-12
MLMG: Iteration 109 Crse resid/bnorm = 0.01787953222
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration 110 Fine resid/bnorm = 7.084311127e-12
MLMG: Iteration 110 Crse resid/bnorm = 0.01787786253
MLMG: Iteration 111 Fine resid/bnorm = 2.959174698e-12
MLMG: Iteration 111 Crse resid/bnorm = 0.01787936305
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration 112 Fine resid/bnorm = 3.067200249e-12
MLMG: Iteration 112 Crse resid/bnorm = 0.01787785776
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration 113 Fine resid/bnorm = 3.029565955e-12
MLMG: Iteration 113 Crse resid/bnorm = 0.01787782597
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration 114 Fine resid/bnorm = 2.968606241e-12
MLMG: Iteration 114 Crse resid/bnorm = 0.01787782462
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration 115 Fine resid/bnorm = 2.900174337e-12
MLMG: Iteration 115 Crse resid/bnorm = 0.01787780239
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration 116 Fine resid/bnorm = 2.829401775e-12
MLMG: Iteration 116 Crse resid/bnorm = 0.0178775525
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration 117 Fine resid/bnorm = 2.76173248e-12
MLMG: Iteration 117 Crse resid/bnorm = 0.01787429843
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration 118 Fine resid/bnorm = 2.694767603e-12
MLMG: Iteration 118 Crse resid/bnorm = 0.01787776272
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration 119 Fine resid/bnorm = 2.631992481e-12
MLMG: Iteration 119 Crse resid/bnorm = 0.0178778233
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration 120 Fine resid/bnorm = 2.571245623e-12
MLMG: Iteration 120 Crse resid/bnorm = 0.01788007727
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration 121 Fine resid/bnorm = 2.51664864e-12
MLMG: Iteration 121 Crse resid/bnorm = 0.01787787558
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration 122 Fine resid/bnorm = 2.46879799e-12
MLMG: Iteration 122 Crse resid/bnorm = 0.01787685507
MLMG: Iteration 123 Fine resid/bnorm = 1.400786336e-12
MLMG: Iteration 123 Crse resid/bnorm = 0.01787892081
MLMG: Iteration 124 Fine resid/bnorm = 1.63741027e-12
MLMG: Iteration 124 Crse resid/bnorm = 0.01787296994
MLMG: Iteration 125 Fine resid/bnorm = 7.471117727e-14
MLMG: Iteration 125 Crse resid/bnorm = 0.01787773658
MLMG: Iteration 126 Fine resid/bnorm = 3.228607661e-14
MLMG: Iteration 126 Crse resid/bnorm = 0.01787782255
MLMG: Iteration 127 Fine resid/bnorm = 3.698972775e-12
MLMG: Iteration 127 Crse resid/bnorm = 0.01787936208
MLMG: Iteration 128 Fine resid/bnorm = 2.936675436e-12
MLMG: Iteration 128 Crse resid/bnorm = 0.01787433599
MLMG: Iteration 129 Fine resid/bnorm = 1.359480073e-12
MLMG: Iteration 129 Crse resid/bnorm = 0.017879301
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration 130 Fine resid/bnorm = 1.317434555e-12
MLMG: Iteration 130 Crse resid/bnorm = 0.01787297835
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration 131 Fine resid/bnorm = 1.266002093e-12
MLMG: Iteration 131 Crse resid/bnorm = 0.01788467087
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration 132 Fine resid/bnorm = 1.218271653e-12
MLMG: Iteration 132 Crse resid/bnorm = 0.01787795302
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration 133 Fine resid/bnorm = 1.176062472e-12
MLMG: Iteration 133 Crse resid/bnorm = 0.01787782788
MLMG: Iteration 134 Fine resid/bnorm = 1.484070739e-12
MLMG: Iteration 134 Crse resid/bnorm = 0.01787893702
MLMG: Iteration 135 Fine resid/bnorm = 3.183126765e-14
MLMG: Iteration 135 Crse resid/bnorm = 0.01788264927
MLMG: Iteration 136 Fine resid/bnorm = 1.688581638e-12
MLMG: Iteration 136 Crse resid/bnorm = 0.01787792994
MLMG: Iteration 137 Fine resid/bnorm = 2.202664114e-13
MLMG: Iteration 137 Crse resid/bnorm = 0.01788884724
MLMG: Iteration 138 Fine resid/bnorm = 8.118799378e-14
MLMG: Iteration 138 Crse resid/bnorm = 0.01787453072
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration 139 Fine resid/bnorm = 7.805496109e-14
MLMG: Iteration 139 Crse resid/bnorm = 0.01787888564
MLMG: Iteration 140 Fine resid/bnorm = 2.020067695e-13
MLMG: Iteration 140 Crse resid/bnorm = 0.01787784561
MLMG: Iteration 141 Fine resid/bnorm = 1.496696186e-13
MLMG: Iteration 141 Crse resid/bnorm = 0.01787782505
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration 142 Fine resid/bnorm = 1.418957543e-13
MLMG: Iteration 142 Crse resid/bnorm = 0.0178786484
MLMG: Iteration 143 Fine resid/bnorm = 1.421040649e-13
MLMG: Iteration 143 Crse resid/bnorm = 0.01787784119
MLMG: Iteration 144 Fine resid/bnorm = 3.490094102e-13
MLMG: Iteration 144 Crse resid/bnorm = 0.01788475439
MLMG: Iteration 145 Fine resid/bnorm = 2.183553907e-13
MLMG: Iteration 145 Crse resid/bnorm = 0.01787796058
MLMG: Iteration 146 Fine resid/bnorm = 3.989054372e-12
MLMG: Iteration 146 Crse resid/bnorm = 0.01788856701
MLMG: Iteration 147 Fine resid/bnorm = 1.226244209e-12
MLMG: Iteration 147 Crse resid/bnorm = 0.01793223411
MLMG: Iteration 148 Fine resid/bnorm = 3.042942957e-12
MLMG: Iteration 148 Crse resid/bnorm = 0.01787886469
MLMG: Iteration 149 Fine resid/bnorm = 9.026924243e-14
MLMG: Iteration 149 Crse resid/bnorm = 0.01787296888
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration 150 Fine resid/bnorm = 6.345982373e-14
MLMG: Iteration 150 Crse resid/bnorm = 0.01787435767
MLMG: Iteration 151 Fine resid/bnorm = 4.328724693e-14
MLMG: Iteration 151 Crse resid/bnorm = 0.01788469694
MLMG: Iteration 152 Fine resid/bnorm = 3.456356695e-14
MLMG: Iteration 152 Crse resid/bnorm = 0.0178779592
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration 153 Fine resid/bnorm = 3.195836915e-14
MLMG: Iteration 153 Crse resid/bnorm = 0.01787936387
MLMG: Iteration 154 Fine resid/bnorm = 3.434313008e-13
MLMG: Iteration 154 Crse resid/bnorm = 0.01787785652
MLMG: Iteration 155 Fine resid/bnorm = 1.503857226e-12
MLMG: Iteration 155 Crse resid/bnorm = 0.01787893309
MLMG: Iteration 156 Fine resid/bnorm = 1.474168308e-12
MLMG: Iteration 156 Crse resid/bnorm = 0.01787784709
MLMG: Iteration 157 Fine resid/bnorm = 1.48779956e-13
MLMG: Iteration 157 Crse resid/bnorm = 0.01787294707
MLMG: Iteration 158 Fine resid/bnorm = 2.233000025e-14
MLMG: Iteration 158 Crse resid/bnorm = 0.01787773064
MLMG: Iteration 159 Fine resid/bnorm = 1.80161583e-14
MLMG: Iteration 159 Crse resid/bnorm = 0.01787294466
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration 160 Fine resid/bnorm = 1.81144514e-14
MLMG: Iteration 160 Crse resid/bnorm = 0.01787927228
MLMG: Iteration 161 Fine resid/bnorm = 3.509517278e-13
MLMG: Iteration 161 Crse resid/bnorm = 0.01788478787
MLMG: Iteration 162 Fine resid/bnorm = 2.954479982e-13
MLMG: Iteration 162 Crse resid/bnorm = 0.0178779613
MLMG: Iteration 163 Fine resid/bnorm = 6.336857099e-12
MLMG: Iteration 163 Crse resid/bnorm = 0.01789193808
MLMG: Iteration 164 Fine resid/bnorm = 5.738945636e-12
MLMG: Iteration 164 Crse resid/bnorm = 0.01787810343
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration 165 Fine resid/bnorm = 5.834140109e-12
MLMG: Iteration 165 Crse resid/bnorm = 0.01787936827
MLMG: Iteration 166 Fine resid/bnorm = 1.713827746e-12
MLMG: Iteration 166 Crse resid/bnorm = 0.01787312027
MLMG: Iteration 167 Fine resid/bnorm = 1.224733587e-12
MLMG: Iteration 167 Crse resid/bnorm = 0.01787885203
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration 169 Fine resid/bnorm = 1.178231285e-12
MLMG: Iteration 169 Crse resid/bnorm = 0.01787297988
MLMG: Iteration 170 Fine resid/bnorm = 3.374586025e-12
MLMG: Iteration 170 Crse resid/bnorm = 0.01787927495
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration 171 Fine resid/bnorm = 3.536430636e-12
MLMG: Iteration 171 Crse resid/bnorm = 0.01787297775
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration 172 Fine resid/bnorm = 3.472631046e-12
MLMG: Iteration 172 Crse resid/bnorm = 0.01787773741
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration 173 Fine resid/bnorm = 3.396518689e-12
MLMG: Iteration 173 Crse resid/bnorm = 0.01788007562
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration 174 Fine resid/bnorm = 3.319256294e-12
MLMG: Iteration 174 Crse resid/bnorm = 0.01787787563
MLMG: Iteration 175 Fine resid/bnorm = 1.43151274e-12
MLMG: Iteration 175 Crse resid/bnorm = 0.01788079541
MLMG: Iteration 176 Fine resid/bnorm = 2.043819596e-12
MLMG: Iteration 176 Crse resid/bnorm = 0.01787789081
MLMG: Iteration 177 Fine resid/bnorm = 4.745514995e-12
MLMG: Iteration 177 Crse resid/bnorm = 0.01793850996
MLMG: Iteration 178 Fine resid/bnorm = 5.700713353e-12
MLMG: Iteration 178 Crse resid/bnorm = 0.01787410392
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration 179 Fine resid/bnorm = 5.871527808e-12
MLMG: Iteration 179 Crse resid/bnorm = 0.01787887286
MLMG: Iteration 180 Fine resid/bnorm = 6.283989768e-13
MLMG: Iteration 180 Crse resid/bnorm = 0.0178847788
MLMG: Iteration 181 Fine resid/bnorm = 1.615490622e-13
MLMG: Iteration 181 Crse resid/bnorm = 0.01787308401
MLMG: Iteration 182 Fine resid/bnorm = 2.862311108e-12
MLMG: Iteration 182 Crse resid/bnorm = 0.01788620955
MLMG: Iteration 183 Fine resid/bnorm = 1.153170421e-13
MLMG: Iteration 183 Crse resid/bnorm = 0.01787952824
MLMG: Iteration 184 Fine resid/bnorm = 3.304176391e-12
MLMG: Iteration 184 Crse resid/bnorm = 0.01787786042
MLMG: Iteration 185 Fine resid/bnorm = 3.238101703e-12
MLMG: Iteration 185 Crse resid/bnorm = 0.01787936302
MLMG: Iteration 186 Fine resid/bnorm = 2.538988068e-12
MLMG: Iteration 186 Crse resid/bnorm = 0.01787785645
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration 187 Fine resid/bnorm = 2.84285386e-12
MLMG: Iteration 187 Crse resid/bnorm = 0.01788007925
MLMG: Iteration 188 Fine resid/bnorm = 2.889767328e-13
MLMG: Iteration 188 Crse resid/bnorm = 0.01788480878
MLMG: Iteration 189 Fine resid/bnorm = 2.383486089e-13
MLMG: Iteration 189 Crse resid/bnorm = 0.01788204861
MLMG: Iteration 190 Fine resid/bnorm = 2.152865787e-13
MLMG: Iteration 190 Crse resid/bnorm = 0.01787791889
MLMG: Iteration 191 Fine resid/bnorm = 6.015901324e-14
MLMG: Iteration 191 Crse resid/bnorm = 0.01787782691
MLMG: Iteration 192 Fine resid/bnorm = 2.895272507e-14
MLMG: Iteration 192 Crse resid/bnorm = 0.01787294658
MLMG: Iteration 193 Fine resid/bnorm = 2.123705182e-14
MLMG: Iteration 193 Crse resid/bnorm = 0.0178777196
MLMG: Iteration 194 Fine resid/bnorm = 1.527093829e-14
MLMG: Iteration 194 Crse resid/bnorm = 0.01788884259
MLMG: Iteration 195 Fine resid/bnorm = 3.0959338e-12
MLMG: Iteration 195 Crse resid/bnorm = 0.01788367561
MLMG: Iteration 196 Fine resid/bnorm = 1.84887117e-13
MLMG: Iteration 196 Crse resid/bnorm = 0.01787948996
MLMG: Iteration 197 Fine resid/bnorm = 3.430498432e-12
MLMG: Iteration 197 Crse resid/bnorm = 0.0178729816
MLMG: Iteration 198 Fine resid/bnorm = 1.228458914e-13
MLMG: Iteration 198 Crse resid/bnorm = 0.01787773691
MLMG: Iteration 199 Fine resid/bnorm = 4.128692974e-14
MLMG: Iteration 199 Crse resid/bnorm = 0.01787782261
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration 200 Fine resid/bnorm = 3.439665053e-14
MLMG: Iteration 200 Crse resid/bnorm = 0.01789094759
MLMG: Timers: Solve = 31.88271638 Iter = 31.85704381 Bottom = 17.38664955
>> After projection:
* On lev 0 max(abs(rhs)) = 0.05281370431
* On lev 1 max(abs(rhs)) = 0.08601693711
* On lev 2 max(abs(rhs)) = 0.06112537326
* On lev 3 max(abs(rhs)) = 0.06269878415
Nodal_projection 200 0.1416014512 0.002533384142
This is run with 9eb5e619dda5dbe6cd5c5683a3caae771772d19e, which is based off of the b/awaken-runs branch that Phil put together.
Now, what's super interesting is that I've got a production run (same ABL setup, but with OpenFAST turbines) going right now on Summit, running simultaneously with this debug case and using the same exe. So far that one has had no issues with the bottom solver (knock on wood). I'm not sure how to explain any of this behavior.
Lawrence
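For anyone comparing verbose logs like the one above, here is a minimal Python sketch that summarizes the MLMG output. It is not part of amr-wind or AMReX; the helper name `summarize_mlmg` and the regular expression are assumptions based only on the line format seen in this log. It counts the "Bottom solve failed" messages and reports the range of the Fine and Crse resid/bnorm values, which makes it easy to see the pattern in the excerpt: the Fine residual sits around 1e-12 to 1e-14 while the Crse residual stays near 0.0179.

```python
import re

# Matches lines such as "MLMG: Iteration 166 Fine resid/bnorm = 1.713827746e-12".
# The pattern is an assumption inferred from the log excerpt in this thread.
RESID_RE = re.compile(
    r"MLMG: Iteration\s+(\d+)\s+(Fine|Crse) resid/bnorm = ([0-9.eE+-]+)"
)


def summarize_mlmg(lines):
    """Count bottom-solve failure messages and collect (min, max) residuals per level."""
    resid = {"Fine": [], "Crse": []}
    bottom_failures = 0
    for line in lines:
        if "Bottom solve failed" in line:
            bottom_failures += 1
            continue
        match = RESID_RE.search(line)
        if match:
            resid[match.group(2)].append(float(match.group(3)))
    summary = {"bottom_failures": bottom_failures}
    for level, values in resid.items():
        if values:
            summary[level] = (min(values), max(values))
    return summary


# Last few lines of the log above, used as sample input.
sample = """\
MLMG: Iteration 199 Fine resid/bnorm = 4.128692974e-14
MLMG: Iteration 199 Crse resid/bnorm = 0.01787782261
MLMG: Bottom solve failed.
MLMG: Bottom solve failed.
MLMG: Iteration 200 Fine resid/bnorm = 3.439665053e-14
MLMG: Iteration 200 Crse resid/bnorm = 0.01789094759
""".splitlines()

print(summarize_mlmg(sample))
```

To run it on a full log, pass `open(path)` (the path being whatever your job output file is) instead of `sample`.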
I recently re-ran the simple case Lawrence mentioned previously (https://github.com/lawrenceccheung/AWAKEN_summit_setup/blob/main/precursor/StableABL1/KingPlains_stable_precursor9.inp) on Summit (with GPUs), but with two added levels of refinement. With the latest amr-wind build, neither the nodal nor the MAC projections maxed out.
@alhs6577 would you comment on your build process? It would be good to see if we can reproduce this. @lawrenceccheung has seen builds that work for a set of runs and then suddenly stop converging so there appears to be an intermittent nature to this issue.
@psakievich I used one of the latest Summit builds from @lawrenceccheung so he would be the person to ask.
Oh, the build that @alhs6577 used is from commit 4b71037, and it was compiled using spack-manager.
In case it helps, I put together a log of different runs I was doing to try and bisect the case. I will keep adding to this list with more data points.
Date | Commit | Job type | Result |
---|---|---|---|
June 18 | a75d2ec | ABL only, bndry I/O, no turbines | failed |
June 19 | a75d2ec | ABL only, bndry I/O, no turbines | failed |
June 19 | f92aae1 | ABL only, bndry I/O, no turbines | works |
June 19 | 185c360a | ABL only, bndry I/O, no turbines | works |
June 21 | 257c13c1 | ABL only, bndry I/O, no turbines | works |
June 21 | 9eb5e619 | ABL only, bndry I/O, no turbines | works |
June 22 | 9eb5e619 | ABL only, bndry I/O, no turbines | works |
June 24 | 9eb5e619 | ABL only, bndry I/O, no turbines | failed |
June 24 | 9eb5e619 | Production run w/turbines | works |
June 27 | f92aae1 | ABL Production run, no turbines | works |
June 26 | 4b71037 | ABL only, bndry I/O, no turbines | failed |
June 27 | bbe0fddb | ABL only, bndry I/O, no turbines | works |
June 27 | 9eb5e619 | Production run w/turbines | failed |
June 28 | 9eb5e61 | ABL only, bndry I/O, no turbines, with nodal verbose output | failed |
June 28 | 9eb5e61 | Production run w/turbines | works |
June 29 | 4b71037 | Periodic ABL, 2 levels of refinement, no turbines | works |
June 30 | 9537522d | Production run w/turbines and radar | failed |
June 30 | 9cb0abaa | Production run w/turbines and radar (debug) | works |
July 6 | 9cb0abaa | Production run w/turbines and radar | works |
July 7 | e97f8472 | Debug run, CPU only | failed (see note below) |
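A quick way to see the intermittent behavior in the table above is to tally results per commit and flag any commit that has both a "works" and a "failed" entry. The Python sketch below does that. The rows are copied from the table (job type and date dropped), and truncating hashes to 7 characters so that `9eb5e61` and `9eb5e619` count as the same commit is my assumption.

```python
from collections import Counter, defaultdict

# (commit, result) pairs copied from the run log table, in table order.
runs = [
    ("a75d2ec", "failed"), ("a75d2ec", "failed"), ("f92aae1", "works"),
    ("185c360a", "works"), ("257c13c1", "works"), ("9eb5e619", "works"),
    ("9eb5e619", "works"), ("9eb5e619", "failed"), ("9eb5e619", "works"),
    ("f92aae1", "works"), ("4b71037", "failed"), ("bbe0fddb", "works"),
    ("9eb5e619", "failed"), ("9eb5e61", "failed"), ("9eb5e61", "works"),
    ("4b71037", "works"), ("9537522d", "failed"), ("9cb0abaa", "works"),
    ("9cb0abaa", "works"), ("e97f8472", "failed"),
]

# Tally results per commit, using a 7-character prefix as the key
# (assumption: short and long forms of a hash refer to the same commit).
tally = defaultdict(Counter)
for commit, result in runs:
    tally[commit[:7]][result] += 1

# Commits that both worked and failed at least once.
mixed = sorted(c for c, counts in tally.items() if len(counts) > 1)

for commit in mixed:
    print(commit, dict(tally[commit]))
```

This flags `9eb5e61` (4 works, 3 failed) and `4b71037` (1 works, 1 failed). Note that the tally ignores job type: the two `4b71037` runs are different cases (bndry I/O vs. periodic ABL), whereas `9eb5e61` shows mixed results even for the same job type.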
@asalmgren @PaulMullowney @psakievich it occurred to me that we might have a way to determine if this is a software or a hardware issue. I have an old executable from commit f92aae1, compiled on April 7, which previously hasn't shown any issues with the nodal projections. We can run a debug test case with this exe many times (say 10 times), and if there aren't any issues with the bottom solver on Summit, then something must have happened to the code itself to cause these changes.
Lawrence
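As a rough sanity check on how convincing 10 clean runs would be, here is a small Python sketch. It assumes each run fails independently with some fixed probability `p_fail`, which is a simplification and not something measured in this thread; the values of `p_fail` are purely illustrative.

```python
def prob_all_clean(p_fail, n_runs):
    """Probability that n_runs independent runs all pass, if each fails with probability p_fail."""
    return (1.0 - p_fail) ** n_runs


# Illustrative per-run failure probabilities (assumed, not measured).
for p in (0.1, 0.3, 0.5):
    print(f"p_fail={p:.1f}: P(10 clean runs) = {prob_all_clean(p, 10):.4f}")
```

Under that assumption, if the problem were environmental and hit as often as every other run, 10 clean runs in a row with the old exe would be very unlikely (about 0.001), so a clean streak would point toward a code change. If the per-run failure rate is low, 10 runs is much less conclusive.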
Just to be clear -- when you say "hardware" I'm assuming you're including the system software, e.g. version of ROCm, etc?
-- Ann Almgren Senior Scientist; Dept. Head, Applied Mathematics Pronouns: she/her/hers
Yes, I'm including system software when I say hardware. That said, I haven't changed the way I compile amr-wind in the last 6 months or so. The builds should all be using the gcc 10.2.0 toolset (see https://github.com/sandialabs/spack-manager/blob/main/configs/summit/compilers.yaml), so hopefully that's not a factor.
Another interesting data point. I just tried a run using 9537522, which is based off the most recent branch 4b71037 with additional radar scan functionality. That failed, but the bottom solver failed differently than before.
Normally when the nodal projections max out, it does so consistently after the first few iterations, like so:
$ grep Nodal_projection wturbs.3018138
Nodal_projection 8 0.3642743485 1.249139105e-07
Nodal_projection 7 0.4217680262 1.907383986e-07
Nodal_projection 7 0.4151295041 1.40615603e-07
Nodal_projection 7 0.4089079335 1.485313398e-07
Nodal_projection 200 0.4104877735 0.004535525282
Nodal_projection 200 0.4062099661 0.003023266983
Nodal_projection 200 0.3979261111 0.001511616422
Nodal_projection 200 0.3886770588 0.008969860162
Nodal_projection 200 0.3817778914 0.004502683967
...
However, with this radar functionality built in, it will fail intermittently:
Nodal_projection 8 0.3642743489 1.249344916e-07
Nodal_projection 7 0.4217680263 1.908786274e-07
Nodal_projection 7 0.4151295047 1.407023882e-07
Nodal_projection 7 0.4089079343 1.486727061e-07
Nodal_projection 7 0.4104877746 1.586428909e-07
Nodal_projection 200 0.4062096545 0.001289872447
Nodal_projection 200 0.397925796 0.002579744251
Nodal_projection 7 0.3886768938 1.959483487e-07
Nodal_projection 200 0.3817772175 0.003831758237
Nodal_projection 7 0.3772330206 2.150512131e-07
Nodal_projection 200 0.3746421235 0.005678088441
Nodal_projection 7 0.37307631 2.163975566e-07
Nodal_projection 200 0.3712630136 0.004975038183
Nodal_projection 200 0.3692193623 0.009950076829
Not sure what to make of that either, but I'm posting it in case it helps.
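To tell the two failure modes above apart automatically, something like the following Python sketch could be run on the grep output. The function names are hypothetical, and the cap of 200 iterations is inferred from the logs in this thread rather than read from the input file.

```python
def nodal_iterations(lines):
    """Return the iteration count from each 'Nodal_projection' summary line."""
    counts = []
    for line in lines:
        parts = line.split()
        if len(parts) >= 2 and parts[0] == "Nodal_projection":
            counts.append(int(parts[1]))
    return counts


def failure_pattern(counts, max_iter=200):
    """Classify a run as 'ok', 'persistent', or 'intermittent'.

    'persistent' means every step after the first capped step is also capped;
    'intermittent' means some later steps converge again.
    max_iter=200 is an assumption based on the logs in this thread.
    """
    capped = [c >= max_iter for c in counts]
    if not any(capped):
        return "ok"
    first = capped.index(True)
    return "persistent" if all(capped[first:]) else "intermittent"


# Iteration counts taken from the two excerpts above.
usual = [8, 7, 7, 7, 200, 200, 200, 200, 200]
radar = [8, 7, 7, 7, 7, 200, 200, 7, 200, 7, 200, 7, 200, 200]

print(failure_pattern(usual))
print(failure_pattern(radar))
```

In practice the counts would come from `nodal_iterations(open(logfile))`, where `logfile` is the job output (for example the `wturbs.3018138` file grepped above).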
The latest run with CPUs only (July 7 with e97f8472 above) also failed, but please note: it failed (as in core dumped) on the first MAC projection step, not on the nodal projection step. So something else is going on too.
Lawrence
Hey Lawrence -- can you turn on the verbosity and see where in the MAC it failed?