ESMF regrid error in WACCM-X at 1 degree resolution on Derecho

Open npedatella opened this issue 1 year ago • 14 comments

What happened?

When running WACCM-X at 1 degree resolution on Derecho with CESM2.2 the model crashes due to an error in ESMF. The specific error in the CESM log file is:

edyn_esmf_update: error return from ESMF_FieldRegridStore for 3d mag2geo: rc= 6
ERROR: edyn_esmf_update: ESMF_FieldRegridStore for 3d mag2geo phi3d

The ESMF log file gives the following error:

20231010 151835.112 ERROR PET267 ESMF_FieldRegrid.F90:4329 ESMF_FieldRegridGetIwts Invalid argument - - can't currently regrid a grid that contains a DE of width less than 2
20231010 151835.113 ERROR PET267 ESMF_FieldRegrid.F90:3180 ESMF_FieldRegridStoreNX Invalid argument - Internal subroutine call returned Error
20231010 151835.113 ERROR PET267 ESMF_FieldRegrid.F90:1349 ESMF_FieldRegridStoreNX Invalid argument - Internal subroutine call returned Error
20231010 151835.113 ERROR PET267 ESMF_FieldRegrid.F90:974 ESMF_FieldRegridStoreNX Invalid argument - Internal subroutine call returned Error

What are the steps to reproduce the bug?

CESM2.2 case on Derecho with resolution f09_f09_mg17 and compset FXHIST

Example case is in /glade/derecho/scratch/nickp/tmp/test_wx_1deg/
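
For reference, a minimal sketch of how such a case would typically be created and run (from the cime/scripts directory of a CESM2.2 checkout; the case name and project code are placeholders, not taken from the case above, and --run-unsupported may also be needed depending on the support status of this compset/resolution combination):

./create_newcase --case test_wx_1deg --res f09_f09_mg17 --compset FXHIST --project <PROJECT>
cd test_wx_1deg
./case.setup
./case.build
./case.submit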

What CAM tag were you using?

CESM2.2

What machine were you running CAM on?

CISL machine (Derecho)

What compiler were you using?

Intel

Path to a case directory, if applicable

/glade/derecho/scratch/nickp/tmp/test_wx_1deg/

Will you be addressing this bug yourself?

No

Extra info

No response

npedatella avatar Oct 11 '23 16:10 npedatella

@npedatella Can you try updating esmf to esmf/8.6.0b04 in cime/config/cesm/machines/config_machines.xml, do a clean build, and let me know if you still get the error?

jedwards4b avatar Oct 11 '23 16:10 jedwards4b
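
A sketch of the clean rebuild referred to above, run from the case directory after editing config_machines.xml (the specific clean/reset options are my assumption for a CIME 5.x case, not spelled out in the comment):

./case.build --clean-all
./case.setup --reset
./case.build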

I updated esmf to esmf/8.6.0b04. I get a similar error, though the ESMF log file is slightly different (case /glade/derecho/scratch/nickp/tmp/test_wx_1deg.002):

20231011 112737.111 ERROR PET277 ESMF_FieldRegrid.F90:4404 checkGrid Invalid argument - some types of regridding (e.g. bilinear) are not supported on Grids that contain a DE of width 1.
20231011 112737.112 ERROR PET277 ESMF_FieldRegrid.F90:3191 b_or_p_GridToMesh Invalid argument - Internal subroutine call returned Error
20231011 112737.112 ERROR PET277 ESMF_FieldRegrid.F90:1350 getMeshWithNodesOnFieldLoc Invalid argument - Internal subroutine call returned Error
20231011 112737.112 ERROR PET277 ESMF_FieldRegrid.F90:976 ESMF_FieldRegridStoreNX Invalid argument - Internal subroutine call returned Error

npedatella avatar Oct 11 '23 17:10 npedatella

@npedatella the issue is that the ESMF Grid is being divided finely enough across processors that some DEs end up with less than one complete cell along some dimension. For some types of regridding, ESMF has the constraint that a DE can't contain only part of a Grid cell. This often occurs when a Grid is created and distributed along only one dimension (e.g. divided only along longitude). Can you check whether that is happening by looking at the ESMF_GridCreate() call? If so, dividing the Grid along both dimensions will help. (As a quick fix, running on fewer processors would also help, but I'm not sure whether you'd want to do that.)

(BTW, getting rid of this constraint is on my todo list, but I haven't had a chance to get to it yet.)

oehmke avatar Oct 11 '23 20:10 oehmke
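
For anyone wanting to check this, the grid creation for the edynamo mag/geo regridding lives in the WACCM-X ionosphere code in CAM; a rough way to locate the call oehmke refers to (the path is approximate and may differ between tags):

grep -rn "ESMF_GridCreate" components/cam/src/ionosphere/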

@npedatella - is there a strong reason to use the mct coupler? Can you try with the nuopc driver?

jedwards4b avatar Oct 11 '23 20:10 jedwards4b
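
A sketch of switching the driver on an existing case (whether this actually sets up and builds under CESM2.2 is what is reported below):

./xmlchange COMP_INTERFACE=nuopc
./case.setup --reset
./case.build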

@oehmke I tried running with fewer processors (128, i.e., one node) and still had the same problem.

npedatella avatar Oct 13 '23 19:10 npedatella

@jedwards4b I think the mct coupler is the default setting, which is why it is being used. I tried changing to nuopc (xmlchange COMP_INTERFACE=nuopc) and am unable to run the setup script or build the model.

npedatella avatar Oct 13 '23 19:10 npedatella

Yes, I understand - at this point I am suggesting that you move to cesm2.3.x, where this case works, unless you want to backport the cam changes to 2.2.

jedwards4b avatar Oct 13 '23 19:10 jedwards4b

@jedwards4b OK. Can you recommend a version that should be used going forward?

npedatella avatar Oct 13 '23 20:10 npedatella

cesm2_3_beta15

jedwards4b avatar Oct 13 '23 20:10 jedwards4b

@npedatella - Can this issue be closed as "do not fix"?

cacraigucar avatar Oct 16 '23 15:10 cacraigucar

I have a meeting later this morning to discuss with @fvitt

jedwards4b avatar Oct 16 '23 15:10 jedwards4b

@npedatella For your f09 case in CESM2.2 with 256 mpi tasks, try this namelist setting:

npr_yz = 32,8,8,32

This divides the mag grid across fewer mpi tasks in the latitude direction.

In CESM2.2, which predates the regrid refactoring in waccmx, the decomposition of the mag and oplus grids used the FV dycore grid decomposition settings. In CESM versions after the regrid refactoring, the mag and oplus grids are no longer tied to the FV dycore grid decomposition.

fvitt avatar Oct 16 '23 16:10 fvitt
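
A sketch of applying this setting (my assumption about the usual workflow, not spelled out above): add it to user_nl_cam in the case directory and regenerate the namelists to confirm it was picked up.

echo "npr_yz = 32,8,8,32" >> user_nl_cam
./preview_namelists
grep npr_yz CaseDocs/atm_in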

@fvitt The f09 case works with these namelist settings if I use 256 mpi tasks. However, when I set up a new case it defaults to 512 tasks and the settings do not work. Should the default settings be changed?

npedatella avatar Oct 17 '23 19:10 npedatella

For 512 tasks try: npr_yz = 32,16,16,32

fvitt avatar Oct 17 '23 20:10 fvitt
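
If I read the FV npr_yz convention correctly (npr_y, npr_z, nprxy_x, nprxy_y), each pair has to multiply out to the total number of MPI tasks, which is consistent with both suggestions above: 32*8 = 8*32 = 256 for the 256-task case and 32*16 = 16*32 = 512 for the 512-task case.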