
Problems with "S" compsets -- is it a valid case to run?

Open ekluzek opened this issue 6 months ago • 2 comments

I wanted to get CTSM to work with SATM and had problems, so I tried the "S" compset to see whether it would work. It doesn't: I ran into a simple problem with buildnml, and then a problem on submission.

It looks like stub components are configured so that they are marked as "invalid", and CESM is set up to run without a mediator when there is only one "valid" component. But if "CPL" is the only valid component, that's a contradiction.

Here's the buildnml problem I ran into:

The failure is at the end of the traceback: coupling_times doesn't have an entry for cpl_cpl_dt.

./create_test SMS_D_Ln1.f10_f10_mg37.S.derecho_intel -r .
Testnames: ['SMS_D_Ln1.f10_f10_mg37.S.derecho_intel']
Using project from .cesm_proj: P93300606
create_test will do up to 1 tasks simultaneously
create_test will use up to 160 cores simultaneously
Creating test directory /glade/derecho/scratch/erik/cesm3_0_alpha07a/cime/scripts/SMS_D_Ln1.f10_f10_mg37.S.derecho_intel.20250701_104712_kdjq6k
RUNNING TESTS:
  SMS_D_Ln1.f10_f10_mg37.S.derecho_intel
Starting CREATE_NEWCASE for test SMS_D_Ln1.f10_f10_mg37.S.derecho_intel with 1 procs
Finished CREATE_NEWCASE for test SMS_D_Ln1.f10_f10_mg37.S.derecho_intel in 2.141925 seconds (PASS)
Starting XML for test SMS_D_Ln1.f10_f10_mg37.S.derecho_intel with 1 procs
Finished XML for test SMS_D_Ln1.f10_f10_mg37.S.derecho_intel in 0.605212 seconds (PASS)
Starting SETUP for test SMS_D_Ln1.f10_f10_mg37.S.derecho_intel with 1 procs
Finished SETUP for test SMS_D_Ln1.f10_f10_mg37.S.derecho_intel in 2.237583 seconds (PASS)
Starting SHAREDLIB_BUILD for test SMS_D_Ln1.f10_f10_mg37.S.derecho_intel with 1 procs
Finished SHAREDLIB_BUILD for test SMS_D_Ln1.f10_f10_mg37.S.derecho_intel in 2.026433 seconds (FAIL). [COMPLETED 1 of 1]
    Case dir: /glade/derecho/scratch/erik/cesm3_0_alpha07a/cime/scripts/SMS_D_Ln1.f10_f10_mg37.S.derecho_intel.20250701_104712_kdjq6k
    Errors were:
        Building test for SMS in directory /glade/derecho/scratch/erik/cesm3_0_alpha07a/cime/scripts/SMS_D_Ln1.f10_f10_mg37.S.derecho_intel.20250701_104712_kdjq6k
        Traceback (most recent call last):
          File 
.
.
.
          File "/glade/derecho/scratch/erik/cesm3_0_alpha07a/components/cmeps/cime_config/buildnml", line 506, in _create_drv_namelists
            _create_runseq(case, coupling_times, valid_comps)
            ~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
          File "/glade/derecho/scratch/erik/cesm3_0_alpha07a/components/cmeps/cime_config/buildnml", line 583, in _create_runseq
            dtime = coupling_times[valid_comps[0].lower() + "_cpl_dt"]
                    ~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        KeyError: 'cpl_cpl_dt'

Waiting for tests to finish
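To make the KeyError concrete, here is a minimal sketch of the failing lookup and one possible guard. Only `coupling_times`, `valid_comps`, and the `"_cpl_dt"` key suffix come from the traceback. The helper name `first_component_dtime` and the dictionary contents are assumptions for illustration, not the actual buildnml code.

```python
def first_component_dtime(coupling_times, valid_comps):
    """Return the coupling interval of the first valid component that has one.

    Hypothetical helper: skips entries such as "CPL" that have no
    "<comp>_cpl_dt" key instead of indexing valid_comps[0] directly.
    """
    for comp in valid_comps:
        key = comp.lower() + "_cpl_dt"
        if key in coupling_times:
            return coupling_times[key]
    raise ValueError(
        "no coupling interval found for any of: " + ", ".join(valid_comps)
    )


# Assumed example data: only ATM has a coupling interval defined.
coupling_times = {"atm_cpl_dt": 1800}

# The original expression fails when CPL is the only valid component.
try:
    coupling_times[["CPL"][0].lower() + "_cpl_dt"]
except KeyError as err:
    print("KeyError:", err)

# The guarded lookup reports the problem explicitly instead.
try:
    first_component_dtime(coupling_times, ["CPL"])
except ValueError as err:
    print("ValueError:", err)
```

Whether an all-stub case should raise here or be rejected earlier, at compset creation, is exactly the open question of this issue.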

I got around that by arbitrarily taking dtime from SATM. But then it fails at submit, because minncpl is 0 and that causes a division by zero here:

  File "/glade/derecho/scratch/erik/cesm3_0_alpha07a/cime/CIME/case/case_submit.py", line 174, in _submit
    case.check_case(skip_pnl=skip_pnl, chksum=chksum)
    ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/glade/derecho/scratch/erik/cesm3_0_alpha07a/cime/CIME/case/case_submit.py", line 351, in check_case
    timestep = 86400 / minncpl
               ~~~~~~^~~~~~~~~
ZeroDivisionError: division by zero

I can get past the division by zero by assigning maxcomp = "ATM" and setting minncpl and maxncpl to the ATM values.
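For reference, a small sketch of how the division in check_case could be guarded so that a zero minncpl gives a readable error rather than a ZeroDivisionError. The expression `86400 / minncpl` is from the traceback. The function name `timestep_from_minncpl` and the error message are assumptions, not CIME code.

```python
SECONDS_PER_DAY = 86400


def timestep_from_minncpl(minncpl):
    """Return the coupling timestep in seconds for minncpl couplings per day.

    Hypothetical guard: a non-positive minncpl means no component supplied
    a coupling frequency, which is what happens when every component is a stub.
    """
    if minncpl <= 0:
        raise ValueError(
            "minncpl must be positive; no valid component defines a coupling frequency"
        )
    return SECONDS_PER_DAY / minncpl


print(timestep_from_minncpl(48))  # 48 couplings per day -> 1800.0 seconds
```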

With both workarounds in place, it then fails at runtime with a seg-fault:

dec2206.hsn.de.hpc.ucar.edu 125: forrtl: severe (174): SIGSEGV, segmentation fault occurred
dec2206.hsn.de.hpc.ucar.edu 125: Image              PC                Routine            Line        Source             
dec2206.hsn.de.hpc.ucar.edu 125: libpthread-2.31.s  000014A52A75C8C0  Unknown               Unknown  Unknown
dec2206.hsn.de.hpc.ucar.edu 125: libmpi_intel.so.1  000014A5283993E1  Unknown               Unknown  Unknown
dec2206.hsn.de.hpc.ucar.edu 125: libmpi_intel.so.1  000014A5283997B8  Unknown               Unknown  Unknown
dec2206.hsn.de.hpc.ucar.edu 125: libmpi_intel.so.1  000014A52820BCBE  Unknown               Unknown  Unknown
dec2206.hsn.de.hpc.ucar.edu 125: libmpi_intel.so.1  000014A526C8FE68  MPI_Abort             Unknown  Unknown
dec2206.hsn.de.hpc.ucar.edu 125: libesmf.so         000014A53251FCD2  abort                     904  ESMCI_VMKernel.C
dec2206.hsn.de.hpc.ucar.edu 125: libesmf.so         000014A532519FE3  abort                    3721  ESMCI_VM.C
dec2206.hsn.de.hpc.ucar.edu 125: libesmf.so         000014A532545E61  c_esmc_vmabort_          1252  ESMCI_VM_F.C
dec2206.hsn.de.hpc.ucar.edu 125: libesmf.so         000014A5337B4A7B  c_esmc_vmabort_.t           0  ESMF_VM.F90
dec2206.hsn.de.hpc.ucar.edu 125: libesmf.so         000014A5337AB39C  esmf_vmabort             9525  ESMF_VM.F90
dec2206.hsn.de.hpc.ucar.edu 125: libesmf.so         000014A53337DDCC  esmf_finalize            1712  ESMF_Init.F90
dec2206.hsn.de.hpc.ucar.edu 125: cesm.exe           000000000044702A  MAIN__                    136  esmApp.F90
dec2206.hsn.de.hpc.ucar.edu 125: cesm.exe           000000000041FC5D  Unknown               Unknown  Unknown
dec2206.hsn.de.hpc.ucar.edu 125: libc-2.31.so       000014A5262C929D  __libc_start_main     Unknown  Unknown
dec2206.hsn.de.hpc.ucar.edu 125: cesm.exe           000000000041FB8A  Unknown               Unknown  Unknown

Line 136 of esmApp.F90 is the last line of this excerpt:

  !-----------------------------------------------------------------------------
  ! Call Initialize for the earth system ensemble Component
  !-----------------------------------------------------------------------------

  call ESMF_GridCompInitialize(ensemble_driver_comp, userRc=urc, rc=rc)
  if (ESMF_LogFoundError(rcToCheck=rc, msg=ESMF_LOGERR_PASSTHRU, &
       line=__LINE__, &
       file=__FILE__)) &
       call ESMF_Finalize(endflag=ESMF_END_ABORT)
  if (ESMF_LogFoundError(rcToCheck=urc, msg=ESMF_LOGERR_PASSTHRU, &
       line=__LINE__, &
       file=__FILE__)) &
       call ESMF_Finalize(endflag=ESMF_END_ABORT)

ekluzek, Jul 01 '25 18:07

This might mean that an "S" compset is just an invalid case that we shouldn't run. If so, we should remove it from the compsets and also trap for it in the scripting, so that it is marked as an invalid compset.

However, this seg-fault is also the point where an I compset with SATM fails for me, and that is a case I do want to get working for testing.
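A trap for all-stub compsets could look roughly like the sketch below. It assumes the compset long name is an underscore-separated string that starts with a time period and then lists components, with optional `%` modifiers. The set of stub names, the function name `is_all_stub_compset`, and the example long names are all assumptions for illustration, not taken from CIME. The check uses an explicit set rather than a leading "S", so a non-stub component whose name happens to start with "S" is not misclassified.

```python
# Assumed list of stub component names; adjust to match the real configuration.
STUB_COMPONENTS = frozenset(
    {"SATM", "SLND", "SICE", "SOCN", "SROF", "SGLC", "SWAV", "SESP", "SIAC"}
)


def is_all_stub_compset(longname):
    """Return True if every component in the compset long name is a stub.

    Hypothetical helper: the first underscore-separated field is treated as
    the time period, and any "%modifier" suffix on a component is ignored.
    """
    fields = longname.split("_")[1:]
    components = [field.split("%")[0] for field in fields]
    return bool(components) and all(c in STUB_COMPONENTS for c in components)


# Illustrative long names (assumed, not copied from a compset definition file).
print(is_all_stub_compset("2000_SATM_SLND_SICE_SOCN_SROF_SGLC_SWAV"))
print(is_all_stub_compset("2000_SATM_CLM50%SP_SICE_SOCN_MOSART_SGLC_SWAV"))
```

A check like this would only flag the fully stubbed "S" case. An I compset with SATM still has a valid land component and would pass, so it would not interfere with the testing use case above.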

ekluzek, Jul 01 '25 18:07

Using the CMEPS mediator has eliminated the need for Stub components, so there is no longer an SATM.

jedwards4b, Aug 27 '25 20:08