Fortran runtime error: Index '1' of dimension 2 of array 'this' outside of expected range SMS_D.f19_g16.I1850ELM.machine_compiler.elm-betr with invalid

Open ndkeen opened this issue 2 years ago • 14 comments

As we closed https://github.com/E3SM-Project/E3SM/issues/5539, I'm opening another issue here for the same error. We are trying to add the invalid check to the Fortran compiler flags.

With SMS_D.f19_g16.I1850ELM.pm-cpu_gnu.elm-betr:

 3: At line 124 of file /global/cfs/cdirs/e3sm/ndk/repos/ndk_mf_gnu-add-invalid-to-DEBUG/components/elm/src/external_models/sbetr/src/betr/betr_core/TracerStateType.F90
 3: Fortran runtime error: Index '1' of dimension 2 of array 'this' outside of expected range (140737046949536:40202912)

/pscratch/sd/n/ndk/e3sm_scratch/pm-cpu/mfgnuinvalid/SMS_D.f19_g16.I1850ELM.pm-cpu_gnu.elm-betr.gh5539
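
The reported range is worth a second look: the lower bound (140737046949536) is larger than the upper bound (40202912). That may hint that the array descriptor held uninitialized values rather than a real extent, though nothing in this thread confirms it. Below is a minimal Python sketch for pulling these fields out of a log line. It assumes the gfortran message format shown above, and the name `parse_bounds_error` is hypothetical.

```python
import re

# Pattern for the gfortran -fcheck=bounds message format seen in this issue.
# Assumption: other gfortran versions may word the message differently.
BOUNDS_RE = re.compile(
    r"Index '(?P<index>-?\d+)' of dimension (?P<dim>\d+) of array "
    r"'(?P<array>[^']+)' outside of expected range "
    r"\((?P<lower>-?\d+):(?P<upper>-?\d+)\)"
)


def parse_bounds_error(line):
    """Return the parsed fields of a bounds-error line, or None if no match."""
    match = BOUNDS_RE.search(line)
    if match is None:
        return None
    fields = {
        "index": int(match.group("index")),
        "dim": int(match.group("dim")),
        "array": match.group("array"),
        "lower": int(match.group("lower")),
        "upper": int(match.group("upper")),
    }
    # A lower bound above the upper bound describes an empty range; with
    # values this large it hints at garbage in the array descriptor.
    fields["suspicious_bounds"] = fields["lower"] > fields["upper"]
    return fields


if __name__ == "__main__":
    log_line = (
        " 3: Fortran runtime error: Index '1' of dimension 2 of array 'this' "
        "outside of expected range (140737046949536:40202912)"
    )
    print(parse_bounds_error(log_line))
```

Run over a whole test log, a helper like this makes it easier to separate a genuine off-by-one from bounds that look like garbage.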

To add invalid check:

login04% git diff cime_config/machines/cmake_macros/gnu.cmake
diff --git a/cime_config/machines/cmake_macros/gnu.cmake b/cime_config/machines/cmake_macros/gnu.cmake
index eae59e3e4b..a8fce54cbf 100644
--- a/cime_config/machines/cmake_macros/gnu.cmake
+++ b/cime_config/machines/cmake_macros/gnu.cmake
@@ -19,7 +19,8 @@ endif()
 if (DEBUG)
   string(APPEND CFLAGS " -g -Wall -fbacktrace -fcheck=bounds -ffpe-trap=invalid,zero,overflow")
   string(APPEND CXXFLAGS " -g -Wall -fbacktrace")
-  string(APPEND FFLAGS " -g -Wall -fbacktrace -fcheck=bounds -ffpe-trap=zero,overflow")
+  string(APPEND FFLAGS " -g -Wall -fbacktrace -fcheck=bounds,pointer -ffpe-trap=invalid,zero,overflow")

ndkeen avatar Jul 24 '23 18:07 ndkeen

@ndkeen I fixed the issue with branch jinyuntang/fix5832, could you do a test? The problem is an array size inconsistency between elm and sbetr. A small update of sbetr fixed the problem as far as I can tell from my test.

jinyun1tang avatar Aug 18 '23 22:08 jinyun1tang

When I add the invalid flag to a recent master and try the test, I now see a different error message than the one reported above.

 95: Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.
 95:
 95: Backtrace for this error:
 95: #0  0x14f0c72dedbf in ???
 95: #1  0x1f604a7 in __tracerparamsmod_MOD_calc_aerecond
 95:    at /global/cfs/cdirs/e3sm/ndk/repos/me24-aug15/components/elm/src/external_models/sbetr/src/betr/betr_para/TracerParamsMod.F90:1271
 95: #2  0x1f4c777 in __betrbgcmod_MOD_stage_tracer_transport
 95:    at /global/cfs/cdirs/e3sm/ndk/repos/me24-aug15/components/elm/src/external_models/sbetr/src/betr/betr_main/BetrBGCMod.F90:203
 95: #3  0x1e420e8 in __betrtype_MOD_step_without_drainage
 95:    at /global/cfs/cdirs/e3sm/ndk/repos/me24-aug15/components/elm/src/external_models/sbetr/src/driver/shared/BeTRType.F90:375
 95: #4  0x1b25651 in __betrsimulationelm_MOD_elmstepwithoutdrainage
 95:    at /global/cfs/cdirs/e3sm/ndk/repos/me24-aug15/components/elm/src/external_models/sbetr/src/driver/elm/BeTRSimulationELM.F90:314
 95: #5  0x6862a8 in __elm_driver_MOD_elm_drv
 95:    at /global/cfs/cdirs/e3sm/ndk/repos/me24-aug15/components/elm/src/main/elm_driver.F90:1178
 95: #6  0x6509c7 in __lnd_comp_mct_MOD_lnd_run_mct
 95:    at /global/cfs/cdirs/e3sm/ndk/repos/me24-aug15/components/elm/src/cpl/lnd_comp_mct.F90:514

If I check out your branch and add invalid, I do not see a crash. However, I'm not sure what changes you made on that branch.

ndkeen avatar Aug 19 '23 00:08 ndkeen

@ndkeen the problem is due to a more recent update of maxpft from a small number to a larger one (50), which caused a mismatch between sbetr and elm. If you find that my fix solves the problem, I will update sbetr, update e3sm, and create a pull request based on this.
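
The kind of mismatch described here can be illustrated outside Fortran. The following is a minimal Python sketch, not the actual sbetr fix: the thread only says maxpft grew from a small number to 50, so the stale value 17 and every name below are hypothetical. It shows a host component handing per-PFT data to a library that still sizes its buffer with an old constant, plus an early consistency check that reports the disagreement clearly.

```python
# Hypothetical constants: the thread says maxpft was raised to 50 in ELM;
# the stale library-side value (17) is invented for illustration.
HOST_MAXPFT = 50
LIB_MAXPFT = 17


def check_shared_dimension(name, host_value, lib_value):
    """Fail early, with a clear message, if two components disagree on a size."""
    if host_value != lib_value:
        raise ValueError(
            f"{name} mismatch: host has {host_value}, library has {lib_value}"
        )


def copy_per_pft(host_values, lib_maxpft):
    """Copy host per-PFT data into a library buffer sized by lib_maxpft."""
    # Without this check, the loop below would index past the buffer's end
    # whenever the host array is longer than the library expects.
    check_shared_dimension("maxpft", len(host_values), lib_maxpft)
    lib_buffer = [0.0] * lib_maxpft
    for i, value in enumerate(host_values):
        lib_buffer[i] = value
    return lib_buffer


if __name__ == "__main__":
    host_data = [1.0] * HOST_MAXPFT
    try:
        copy_per_pft(host_data, LIB_MAXPFT)
    except ValueError as err:
        print(err)
```

Checking shared dimensions once at initialization turns a mismatch into a readable message instead of a bounds error deep in the tracer code.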

jinyun1tang avatar Aug 19 '23 00:08 jinyun1tang

Note that above I show how to add the invalid check, so you can try it yourself. Then go ahead and make a PR.

ndkeen avatar Aug 19 '23 00:08 ndkeen

When I tested the branch "jinyuntang/fix5832", I included the invalid check, but I did not include that change in the push to the branch. For creating the PR, do I also have to include the invalid-check change made to "/cime_config/machines/cmake_macros/gnu.cmake"?

jinyuntang avatar Aug 19 '23 00:08 jinyuntang

Great! Then sounds like you have fixed this issue. You would not want to include that change in your PR -- we would like to add it, but are still trying to fix issues that were uncovered with it (like this one).

ndkeen avatar Aug 19 '23 01:08 ndkeen

Great! I will create a PR then.

jinyuntang avatar Aug 19 '23 01:08 jinyuntang

With an Oct 27th checkout, I still see this error.

ndkeen avatar Oct 27 '23 22:10 ndkeen

I'm still seeing the same error with Jan18th master and Jan23rd master

ndkeen avatar Jan 19 '24 05:01 ndkeen

@ndkeen Is there any change I need to make to run the test? I recall that last time you instructed me to make some changes. Now, after trying ./create_test SMS_D.f19_g16.I1850ELM.pm-cpu_gnu.elm-betr, I get the following error: "FAIL SMS_D.f19_g16.I1850ELM.pm-cpu_gnu.elm-betr (phase CREATE_NEWCASE)". I have no clue what is going on. Thanks.

jinyun1tang avatar Feb 06 '24 19:02 jinyun1tang

Yes, that is the correct command. I don't have enough info there to know what's wrong, but if I were to guess: Are you trying that on perlmutter? If you are on another machine, you need that machine's name instead of pm-cpu. Are you running from cime/scripts? I guess so, as it would otherwise say create_test not found.

When I try this test on master (create_test SMS_D.f19_g16.I1850ELM.pm-cpu_gnu.elm-betr), I still see the same error as noted above.

Note that the change I mention above (regarding compiler flags) should no longer be needed as master has this change (for quite a while).

ndkeen avatar Feb 06 '24 19:02 ndkeen

@ndkeen It appears I had to update the submodules. After that, it is now working. I will report back the result once it is done.

jinyun1tang avatar Feb 06 '24 19:02 jinyun1tang

Ah, yep, that's another common mistake I should have mentioned

ndkeen avatar Feb 06 '24 19:02 ndkeen

@ndkeen, just letting you know that the tests passed.

jinyuntang avatar Feb 06 '24 19:02 jinyuntang