Bug in FSCAM with GNU compilers in DEBUG mode

Open briandobbins opened this issue 4 years ago • 13 comments

Running SCAM with the GNU compilers with DEBUG=TRUE results in an error in CESM 2.2, but works in CESM 2.1.3.

Tested on Cheyenne:

CESM 2.1.3, Intel compiler, DEBUG=FALSE - works fine
CESM 2.1.3, Intel compiler, DEBUG=TRUE - works fine
CESM 2.1.3, GNU compiler, DEBUG=FALSE - works fine
CESM 2.1.3, GNU compiler, DEBUG=TRUE - works fine

CESM 2.2.0, Intel compiler, DEBUG=FALSE - works fine
CESM 2.2.0, Intel compiler, DEBUG=TRUE - works fine
CESM 2.2.0, GNU compiler, DEBUG=FALSE - works fine
CESM 2.2.0, GNU compiler, DEBUG=TRUE - fails, with the message below

Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.

Backtrace for this error:
#0  0x7efc67dc994f in ???
#1  0xf47df0 in __micro_mg3_0_MOD_micro_mg_tend
	at /glade/scratch/bdobbins/scam/cesm2.2.0/components/cam/src/physics/pumas/micro_mg3_0.F90:1969
#2  0xc531c9 in micro_mg_cam_tend_pack
	at /glade/scratch/bdobbins/scam/cesm2.2.0/components/cam/src/physics/cam/micro_mg_cam.F90:2517
#3  0xc710dc in __micro_mg_cam_MOD_micro_mg_cam_tend
	at /glade/scratch/bdobbins/scam/cesm2.2.0/components/cam/src/physics/cam/micro_mg_cam.F90:1310
#4  0xc98a91 in __microp_driver_MOD_microp_driver_tend
	at /glade/scratch/bdobbins/scam/cesm2.2.0/components/cam/src/physics/cam/microp_driver.F90:189
#5  0x662a1d in tphysbc
	at /glade/scratch/bdobbins/scam/cesm2.2.0/components/cam/src/physics/cam/physpkg.F90:2473
#6  0x670449 in __physpkg_MOD_phys_run1
	at /glade/scratch/bdobbins/scam/cesm2.2.0/components/cam/src/physics/cam/physpkg.F90:1073
#7  0x4fa1b3 in __cam_comp_MOD_cam_run1
	at /glade/scratch/bdobbins/scam/cesm2.2.0/components/cam/src/control/cam_comp.F90:259
#8  0x4f4465 in __atm_comp_mct_MOD_atm_init_mct
	at /glade/scratch/bdobbins/scam/cesm2.2.0/components/cam/src/cpl/mct/atm_comp_mct.F90:354
#9  0x427e14 in __component_mod_MOD_component_init_cc
	at /glade/scratch/bdobbins/scam/cesm2.2.0/cime/src/drivers/mct/main/component_mod.F90:248
#10  0x41e9a6 in __cime_comp_mod_MOD_cime_init
	at /glade/scratch/bdobbins/scam/cesm2.2.0/cime/src/drivers/mct/main/cime_comp_mod.F90:2209
#11  0x4243b5 in cime_driver
	at /glade/scratch/bdobbins/scam/cesm2.2.0/cime/src/drivers/mct/main/cime_driver.F90:122
#12  0x424524 in main
	at /glade/scratch/bdobbins/scam/cesm2.2.0/cime/src/drivers/mct/main/cime_driver.F90:23

To reproduce the failure in the CESM 2.2 release, do:

export CESM22ROOT=<path to CESM 2.2 checkout>
${CESM22ROOT}/cime/scripts/create_newcase --compset FSCAM --res T42_T42 --compiler gnu --case foo --user-mods-dir ${CESM22ROOT}/components/cam/cime_config/usermods_dirs/scam_arm97 --run-unsupported
cd foo
./xmlchange DEBUG=TRUE,PIO_TYPENAME=netcdf,STOP_N=1,STOP_OPTION=ndays
./case.setup
./case.build
./case.submit

I've not tested other IOPs, just arm97. I'm going to dig into this at some point, but I'm not familiar with the SCAM code base, so I thought others might have a quick solution or at least ideas.

briandobbins avatar Oct 29 '20 16:10 briandobbins

@jtruesdal @Katetc This is an error in MG3 using SCAM. I've assigned both of you since I'm not sure which code base is the one responsible for the error.

cacraigucar avatar Oct 29 '20 16:10 cacraigucar

That's a great stack trace. It points to this line in MG3:

       if (lamr(i,k) > qsmall .and. 1._r8/lamr(i,k) < Dcs) then

This is probably the same issue that Steve opened in the PUMAS repo: https://github.com/ESCOMP/PUMAS/issues/8 "Invalid code logic tripping up some compilers"

So, we are aware of the general issue in PUMAS, and glad to have a simple case that reproduces the problem here! Also tagging @andrewgettelman .

Katetc avatar Oct 29 '20 16:10 Katetc

Also, you can leave me as the main assignee. I'll fix this and add a test for it going forward when we tackle the PUMAS issue.

Katetc avatar Oct 29 '20 16:10 Katetc

Thanks Brian! I mentioned this to Hugh as well.

I'm happy to try to help fix this if needed. So is it every .and. and .or. conditional? Or just those that might trigger a divide-by-zero error?

andrewgettelman avatar Oct 29 '20 17:10 andrewgettelman

FYI, the fix Kate mentions works for this case.

Do we want to make a PR specifically for this, or allow the larger PUMAS issue to tackle it?

briandobbins avatar Oct 29 '20 17:10 briandobbins

> Thanks Brian! I mentioned this to Hugh as well.
>
> I'm happy to try to help fix this if needed. So is it every .and. and .or. conditional? Or just those that might trigger a divide-by-zero error?

The way to figure out whether the .and. or .or. needs to be split is to look at each section and see if it can always be evaluated independently without any other section. If not, then it needs to be contained in its own if statement with an outer if statement to eliminate the invalid condition(s).

cacraigucar avatar Oct 29 '20 17:10 cacraigucar
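
The splitting rule described above can be sketched in a small standalone program. Fortran does not guarantee short-circuit evaluation of `.and.`, so in the original one-line test a compiler is free to evaluate `1._r8/lamr(i,k)` even when `lamr(i,k)` is zero, and a build that traps floating-point exceptions can then raise SIGFPE. The names `lamr`, `qsmall`, and `Dcs` come from the line quoted in the thread; the values, the 1-D array, and the print statements are illustrative assumptions, and this is not necessarily the change that was made in PUMAS.

```fortran
program split_conditional
  implicit none
  integer, parameter :: r8 = selected_real_kind(12)
  ! Illustrative values only (assumed, not taken from MG3).
  real(r8), parameter :: qsmall = 1.e-18_r8
  real(r8), parameter :: dcs    = 500.e-6_r8
  real(r8) :: lamr(3)
  integer :: i

  ! First entry is zero: the case that makes 1/lamr invalid.
  lamr = [0._r8, 1.e2_r8, 1.e5_r8]

  do i = 1, size(lamr)
     ! Outer test eliminates the invalid condition before any division.
     if (lamr(i) > qsmall) then
        ! Inner test is now safe to evaluate on its own.
        if (1._r8/lamr(i) < dcs) then
           print *, i, 'both conditions hold'
        else
           print *, i, 'lamr is valid, but 1/lamr is not below Dcs'
        end if
     else
        print *, i, 'lamr too small, division skipped'
     end if
  end do
end program split_conditional
```

The nested form answers the question of which conditionals need splitting: only those where one operand is meaningful solely when another operand is true. Independent tests can stay combined with `.and.` or `.or.`.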

Tagging @hmorrison100 on this as well so he sees it.

andrewgettelman avatar Oct 29 '20 17:10 andrewgettelman

I believe that the PUMAS issue for this is ESCOMP/PUMAS#8. Keeping this issue open so that when the fix is tagged in PUMAS, we can update the Externals_CAM.cfg file.

gold2718 avatar Dec 29 '20 22:12 gold2718

@Katetc - Has this issue been addressed, and should it be closed?

cacraigucar avatar Apr 07 '23 17:04 cacraigucar

Yes, it was fixed in pumas tag pumas_cam-release_v1.13 and cam tag 6_3_017.

Katetc avatar Apr 07 '23 19:04 Katetc

@Katetc - We are revisiting this, and I see that the original report says the error was in the cesm2_2 branch. That branch uses pumas_cam-release_v1.3, so it probably isn't fixed there. Should it be? If so, can we just jump to v1.13, or will it take some work from someone to make that big a leap in PUMAS?

cacraigucar avatar May 06 '24 19:05 cacraigucar