Newest ccs_config tag causes the derecho_intel SCT test to fail.

Open Katetc opened this issue 9 months ago • 3 comments

What happened?

When running with the ccs_config tag updated to ccs_config_cesm0.0.99, the SCT test that is part of the aux_cam test suite on Derecho fails. This test appears to do a 3D CAM run set up to output IOP data, then runs the SCAM model with that output data and checks that the resulting forced fields are within round-off of each other. With the ccs_config_cesm0.0.85 tag, this works. With the ccs_config_cesm0.0.99 tag, the 3D run finishes successfully, but the SCAM simulation hangs very early on. This is consistently reproducible for me and doesn't seem to be a flaky machine result.

What are the steps to reproduce the bug?

On Derecho:

1. Check out cam6_3_158 or later (my testing was done with what will become cam6_3_159).
2. Change the [ccs_config] tag in Externals.cfg to ccs_config_cesm0.0.99.
3. Run ./manage_externals/checkout_externals
4. Go to cime/scripts.
5. Run qcmd -- ./create_test SCT_D_Ln7.T42_T42_mg17.QPC5.derecho_intel.cam-scm_prep
6. View the results in the /glade/derecho/scratch/[user] space.
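The reproduction steps could be scripted roughly as follows. This is only a sketch: `set_external_tag` and `reproduce_sct_hang` are hypothetical helper names, not part of CAM or CIME. It assumes GNU sed and that the `[ccs_config]` section of Externals.cfg contains a `tag = ...` line.

```shell
#!/usr/bin/env bash
# Sketch only: both functions are hypothetical helpers, not CAM/CIME tools.

# Rewrite the "tag = ..." line inside one [section] of an externals file.
# Assumes GNU sed (in-place editing with -i).
set_external_tag() {
    local file=$1 section=$2 tag=$3
    sed -i "/^\[${section}\]/,/^\[/ s|^\([[:space:]]*tag[[:space:]]*=\).*|\1 ${tag}|" "$file"
}

# Steps from the report. Run from the top of a CAM checkout
# (cam6_3_158 or later) on a Derecho login node.
reproduce_sct_hang() {
    set_external_tag Externals.cfg ccs_config ccs_config_cesm0.0.99 &&
    ./manage_externals/checkout_externals &&
    cd cime/scripts &&
    qcmd -- ./create_test SCT_D_Ln7.T42_T42_mg17.QPC5.derecho_intel.cam-scm_prep
}
```

After sourcing the file in a CAM checkout, calling `reproduce_sct_hang` submits the test. Calling `set_external_tag` again with ccs_config_cesm0.0.85 restores the tag that passes.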

What CAM tag were you using?

cam6_3_159

What machine were you running CAM on?

CISL machine (e.g. cheyenne)

What compiler were you using?

Intel

Path to a case directory, if applicable

/glade/derecho/scratch/katec/aux_cam_20240423154530/SCT_D_Ln7.T42_T42_mg17.QPC5.derecho_intel.cam-scm_prep.GC.aux_cam_20240423154530

Will you be addressing this bug yourself?

No

Extra info

This ccs_config tag may already be part of the last alpha tag and so I have added this test to alpha testing to ensure that this test is successful in the next beta tag.

Katetc avatar Apr 26 '24 17:04 Katetc

It turns out that the SCM test only uses one MPI task. In ccs_config_cesm0.0.85, it used the raw mpiexec command (https://github.com/ESMCI/ccs_config_cesm/blob/ccs_config_cesm0.0.85/machines/config_machines.xml#L1283-L1288), which worked fine with the -n 1 option. In ccs_config_cesm0.0.99, this was changed to use the wrapper script mpibind (https://github.com/ESMCI/ccs_config_cesm/blob/ccs_config_cesm0.0.99/machines/derecho/config_machines.xml#L24-L29), which eases the runtime setup for an MPI/OpenMP hybrid job. By switching back to the raw mpiexec command and following Kate's instructions above, I could confirm that ccs_config_cesm0.0.99 also passes the SCM test on Derecho.

sjsprecious avatar Apr 26 '24 22:04 sjsprecious

@jedwards4b found out that there was an issue between ESMF and mpibind when only using a single MPI task.

sjsprecious avatar Apr 26 '24 22:04 sjsprecious

The issue is that the first of the two jobs run in the SCT test uses the entire node (128 tasks). The second uses only one task, but ESMF is initializing as if there were 128. I'm working with Rory to find a solution.

jedwards4b avatar Apr 26 '24 22:04 jedwards4b