
bug fix in modal_aero_wateruptake_dr

Open whannah1 opened this issue 2 years ago • 5 comments

This fixes a confusing bug that only occurs when running E3SM-MMF on many nodes (~1000) on Summit. I verified that adding explicit bounds to the value initialization in modal_aero_wateruptake_dr fixes the problem. However, in the tests where I varied the CRM workload and node count to trigger the problem, the array bounds in this module should be the same either way, so it's unclear why this fix works.
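The fix amounts to initializing with explicit array bounds instead of a whole-array assignment. A minimal sketch of the pattern, assuming illustrative names (`wetdens`, `pcols`, `nmodes` here are placeholders, not the exact variables in modal_aero_wateruptake_dr):

```fortran
program init_bounds_sketch
  implicit none
  integer, parameter :: r8 = selected_real_kind(12)
  integer, parameter :: pcols = 4, nmodes = 3   ! illustrative sizes
  real(r8) :: wetdens(pcols, nmodes)

  ! Before: whole-array initialization, as in the original code:
  !   wetdens(:,:) = huge(1.0_r8)
  ! After: the same initialization with explicit bounds, which should be
  ! identical in principle but avoids the crash in practice:
  wetdens(1:pcols, 1:nmodes) = huge(1.0_r8)

  print *, wetdens(1, 1)
end program init_bounds_sketch
```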

[BFB]

whannah1 avatar Sep 14 '22 16:09 whannah1

I have seen this in the past. What error message do you get when it crashes? I think initializing explicitly is fine and clearer. While you are at it, would you please declare a parameter like: real(r8), parameter :: huge_real = huge(1.0_r8)

and replace all the huge(1.0_r8) in this file?
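The suggested refactor replaces each repeated `huge(1.0_r8)` literal with one named constant. A short sketch of what that could look like (the array `rh` is a hypothetical stand-in for the variables initialized this way in the file):

```fortran
program huge_param_sketch
  implicit none
  integer, parameter :: r8 = selected_real_kind(12)
  ! The suggested named constant, declared once per module/file:
  real(r8), parameter :: huge_real = huge(1.0_r8)
  real(r8) :: rh(3)

  ! Before: rh(:) = huge(1.0_r8)
  ! After: every occurrence uses the single named constant:
  rh(:) = huge_real

  print *, rh(1)
end program huge_param_sketch
```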

singhbalwinder avatar Sep 14 '22 16:09 singhbalwinder

@singhbalwinder The error tends to look like this:

1: 1395: Backtrace for this error:
1: 1395:
1: 1395: Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
1: 1395:
1: 1395: Backtrace for this error:
1: 1395: #0  0x2000000504d7 in ???
1: 1395: #0  0x2000000504d7 in ???
1: 1395: #0  0x2000000504d7 in ???
1: 1395: #0  0x2000000504d7 in ???
1: 1395: #0  0x2000000504d7 in ???
1: 1395: #1  0x1088bdd4 in __modal_aero_wateruptake_MOD_modal_aero_wateruptake_dr
1: 1395: 	at /autofs/nccs-svm1_home1/hannah6/E3SM/E3SM_SRC1/components/eam/src/chemistry/utils/modal_aero_wateruptake.F90:275
1: 1395: #1  0x1088bdd4 in __modal_aero_wateruptake_MOD_modal_aero_wateruptake_dr
1: 1395: 	at /autofs/nccs-svm1_home1/hannah6/E3SM/E3SM_SRC1/components/eam/src/chemistry/utils/modal_aero_wateruptake.F90:275
1: 1395: #1  0x1088bdd4 in __modal_aero_wateruptake_MOD_modal_aero_wateruptake_dr
1: 1395: #1  0x1088bdd4 in __modal_aero_wateruptake_MOD_modal_aero_wateruptake_dr
1: 1395: 	at /autofs/nccs-svm1_home1/hannah6/E3SM/E3SM_SRC1/components/eam/src/chemistry/utils/modal_aero_wateruptake.F90:275
1: 1395: 	at /autofs/nccs-svm1_home1/hannah6/E3SM/E3SM_SRC1/components/eam/src/chemistry/utils/modal_aero_wateruptake.F90:275
1: 1395: #1  0x1088bdd4 in __modal_aero_wateruptake_MOD_modal_aero_wateruptake_dr
1: 1395: 	at /autofs/nccs-svm1_home1/hannah6/E3SM/E3SM_SRC1/components/eam/src/chemistry/utils/modal_aero_wateruptake.F90:275
1: 1395: #2  0x1065c0ab in __aero_model_MOD_aero_model_wetdep

I can create that parameter as well.

whannah1 avatar Sep 14 '22 17:09 whannah1

In this issue, we had errors right around that same line number. https://github.com/E3SM-Project/scream/issues/1317

ndkeen avatar Sep 14 '22 21:09 ndkeen

This test passed last night on Summit: SMS_Ln9.ne4pg2_ne4pg2.F2010-MMF1.summit_gnugpu

whannah1 avatar Sep 15 '22 15:09 whannah1

Looks like we need a more invasive fix for this problem. While working to get a better load balance on some larger runs (~1k nodes), I was advised to reduce the thread count on the ocean and ice components to 2, while I still used 7 threads in the atmosphere. It turns out this causes previously allocated module variables in the atmosphere to be deallocated on the 5 of the 7 threads that were not used for the sea ice model. Making all the components use the same thread count can fix the problem, but since this only applies to 2 files in the atmosphere, I was also able to fix it by converting the module variables to allocatable variables local to the subroutine where they are used. This fix means that we allocate and deallocate this memory each time the routine is called, which is exactly what the current code is meant to avoid, but it makes the code more robust, so I think it's worth the extra cost of the added allocations.
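The conversion described above moves a persistent module-level work array into the subroutine that uses it. A minimal sketch of the pattern, with illustrative names (this is not the actual modal_aero_wateruptake code):

```fortran
module wateruptake_sketch
  implicit none
  integer, parameter :: r8 = selected_real_kind(12)
  ! Before: a module-level array, allocated once and reused across calls.
  ! Under inconsistent per-component thread counts this allocation can be
  ! lost on threads the other component did not use:
  !   real(r8), allocatable :: wetdens(:,:)
contains
  subroutine wateruptake_dr(ncol, nmodes)
    integer, intent(in) :: ncol, nmodes
    ! After: a local allocatable, allocated fresh on every call
    real(r8), allocatable :: wetdens(:,:)

    allocate(wetdens(ncol, nmodes))
    wetdens(1:ncol, 1:nmodes) = huge(1.0_r8)
    ! ... the actual water-uptake computation would go here ...
    deallocate(wetdens)   ! explicit; unsaved locals also auto-deallocate on return
  end subroutine wateruptake_dr
end module wateruptake_sketch

program demo
  use wateruptake_sketch
  call wateruptake_dr(4, 3)
end program demo
```

The trade-off is exactly the one noted above: per-call allocation overhead in exchange for robustness when components run with different thread counts.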

Also, it's important to note that this is not the whole story because there were other cases with this same error that did not have inconsistent thread counts. The only commonality between all these runs is that they use the MMF on the GPU and use a large number of nodes. I can reproduce the error in a small case (ne30pg2 atmos + coupled ocean on 16 nodes) when the threads are inconsistent, but I'm not sure how to reproduce the problem when the threading is consistent.

whannah1 avatar Sep 20 '22 16:09 whannah1

telecon notes: Walter and Balwinder are working on it.

rljacob avatar Oct 06 '22 17:10 rljacob

telecon notes: might be closed.

rljacob avatar Nov 03 '22 17:11 rljacob

Closing this for now. Might revisit later, but in the meantime it seems this is a widespread problem in the atmosphere code that can be triggered by using different thread counts across model components (see issue #5319).

whannah1 avatar Nov 22 '22 15:11 whannah1