CAM
Newest ccs_config tag causes the derecho_intel SCT test to fail.
What happened?
When running with the ccs_config tag updated to ccs_config_cesm0.0.99, the SCT test that is part of the aux_cam test suite on Derecho failed. This test appears to do a 3D CAM run set up to output IOP data, then runs the SCAM model with that output data and checks that the resulting forced fields are within round-off of each other. With the ccs_config_cesm0.0.85 tag, this works. With the ccs_config_cesm0.0.99 tag, the 3D run finishes successfully, but the SCAM simulation hangs very early on. This is very reproducible for me and doesn't seem to be a flaky machine result.
What are the steps to reproduce the bug?
On Derecho
Check out cam6_3_158 or later (my testing done with what will become cam6_3_159).
Change the [ccs_config] tag in Externals.cfg to ccs_config_cesm0.0.99
Run ./manage_externals/checkout_externals
Go to cime/scripts
Run qcmd -- ./create_test SCT_D_Ln7.T42_T42_mg17.QPC5.derecho_intel.cam-scm_prep
View results in /glade/derecho/scratch/[user] space
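The steps above can be sketched as a small shell script. This is only a sketch under stated assumptions: the function names set_ccs_config_tag and reproduce_sct_failure are hypothetical, the externals file is assumed to be named Externals.cfg, and it is assumed to use the usual manage_externals layout (a [ccs_config] section header followed by key = value lines). The commands themselves are the ones listed in the steps.

```shell
#!/usr/bin/env bash
# Sketch of the reproduction steps. Function names are hypothetical.

# Rewrite the "tag = ..." line inside the [ccs_config] section only.
# Assumes an INI-style file: "[ccs_config]" header, then "key = value"
# lines until the next "[section]" header (or end of file).
set_ccs_config_tag() {
    local cfg_file="$1" new_tag="$2"
    # -i.bak keeps a backup copy and works with both GNU and BSD sed.
    sed -i.bak "/^\[ccs_config\]/,/^\[/ s/^tag[[:space:]]*=.*/tag = ${new_tag}/" "$cfg_file"
}

# Run from the top of a CAM checkout (cam6_3_158 or later) on Derecho.
reproduce_sct_failure() {
    set_ccs_config_tag Externals.cfg ccs_config_cesm0.0.99 || return 1
    ./manage_externals/checkout_externals || return 1
    # Subshell so the caller's working directory is left unchanged.
    (
        cd cime/scripts || exit 1
        qcmd -- ./create_test SCT_D_Ln7.T42_T42_mg17.QPC5.derecho_intel.cam-scm_prep
    )
}
```

After sourcing the script, calling reproduce_sct_failure from the top of the checkout should submit the test. Passing ccs_config_cesm0.0.85 to set_ccs_config_tag instead gives the known-good configuration for comparison.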
What CAM tag were you using?
cam6_3_159
What machine were you running CAM on?
CISL machine (e.g. cheyenne)
What compiler were you using?
Intel
Path to a case directory, if applicable
/glade/derecho/scratch/katec/aux_cam_20240423154530/SCT_D_Ln7.T42_T42_mg17.QPC5.derecho_intel.cam-scm_prep.GC.aux_cam_20240423154530
Will you be addressing this bug yourself?
No
Extra info
This ccs_config tag may already be part of the last alpha tag and so I have added this test to alpha testing to ensure that this test is successful in the next beta tag.
It turns out that the SCM test only uses one MPI task. In ccs_config_cesm0.0.85, the Derecho configuration used the raw mpiexec command (https://github.com/ESMCI/ccs_config_cesm/blob/ccs_config_cesm0.0.85/machines/config_machines.xml#L1283-L1288), and it worked fine with the -n 1 option. In ccs_config_cesm0.0.99, it has been updated to use the wrapper script mpibind (https://github.com/ESMCI/ccs_config_cesm/blob/ccs_config_cesm0.0.99/machines/derecho/config_machines.xml#L24-L29), which eases the runtime setup for an MPI/OpenMP hybrid job. By switching back to the raw mpiexec command, I could confirm that ccs_config_cesm0.0.99 also passes the SCM test on Derecho, following Kate's instructions above.
@jedwards4b found out that there was an issue between ESMF and mpibind when only using a single MPI task.
The issue is that the first of the two jobs run in the SCT test uses the entire node (128 tasks). The second uses only one, but ESMF is initializing as if it were 128. I'm working with Rory to find a solution.