CAM MPASA restart failure

What happened?

I noticed when testing ERS_Ln9.mpasa7p5_mpasa7p5_mg17.QPC6.derecho_intel.cam-outfrq9s that CLUBB was generating an error on restart: Error in advance_xp2_xpyp. First, this leads to an intolerable amount of output to stdout, which will need to be addressed for high-resolution runs.

Second, I repeated this test with ERS_Ln9.mpasa30_mpasa30.QPC6.derecho_intel.cam-outfrq9s, and it fails with the same error.

What are the steps to reproduce the bug?

Just run the test: ERS_Ln9.mpasa30_mpasa30.QPC6.derecho_intel.cam-outfrq9s

What CAM tag were you using?

cam6_3_119 - cam6_3_122

What machine were you running CAM on?

CISL machine (e.g. cheyenne)

What compiler were you using?

Intel

Path to a case directory, if applicable

No response

Will you be addressing this bug yourself?

Yes, but I will need some help

Extra info

No response

Aug 14 '23 14:08 jedwards4b

I don't know whether this will resolve the restart failure, but it's relevant to the massive restart files. When we updated the CLUBB externals earlier this year, we moved the CLUBB PDF closure to after the CLUBB solver, which requires adding a bunch of high-order moments, generated by the PDF closure, to the restarts. You can experiment with putting the PDF closure back, which I believe trims the restarts to roughly the size they were with the old externals.

clubb_ipdf_call_placement=1

in user_nl_cam will switch it back. @Katetc can you verify whether this will trim the restarts?

Aug 14 '23 23:08 adamrher

@adamrher - that didn't solve the problem, but thank you for the suggestion.

Aug 15 '23 00:08 jedwards4b

OK. In case it helps, to trim the default i/o in h0 tapes, I usually remove all the aerosol/chem species via:

 history_chemistry = .true.
 history_chemspecies_srf = .true.

Aug 15 '23 15:08 adamrher

I'm using empty_htapes=.true.; this is purely a restart issue.

Aug 15 '23 15:08 jedwards4b

I was able to run ERS_Ln9.mpasa120_mpasa120.QPC6.derecho_intel.cam-outfrq9s and ERS_Ln9.mpasa60_mpasa60.QPC6.derecho_intel.cam-outfrq9s successfully, but ERS_Ln9.mpasa30_mpasa30.QPC6.derecho_intel.cam-outfrq9s still gives the same error in spite of extreme changes in timestep:

mpas_dt = 45.0
ATM_NCPL: 80
NCPL_BASE_PERIOD: hour
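
For reference, the coupling interval implied by these settings is simple arithmetic. A minimal sketch follows; the helper name coupling_interval_seconds is hypothetical and not part of CIME, and only two base periods are included for illustration.

```python
# Hypothetical helper (not a CIME function): seconds between atmosphere
# coupling calls, given ATM_NCPL couplings per NCPL_BASE_PERIOD.
BASE_PERIOD_SECONDS = {"hour": 3600, "day": 86400}


def coupling_interval_seconds(atm_ncpl, base_period):
    """Return the coupling interval in seconds."""
    return BASE_PERIOD_SECONDS[base_period] / atm_ncpl


# Settings quoted in the comment: ATM_NCPL=80, NCPL_BASE_PERIOD=hour.
interval = coupling_interval_seconds(80, "hour")
print(interval)  # 45.0, which matches mpas_dt = 45.0
```

So with these values the MPAS dynamics timestep equals the coupling interval of 45 seconds.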

Aug 19 '23 00:08 jedwards4b

@jedwards4b What are the values of P0 and pbuf_time_idx in the restart files of these failing cases? In my high-res (ne120) waccmx runs on derecho these are zero in the restart file, while they should be 100000 and 1, respectively.

Aug 19 '23 00:08 fvitt

@fvitt thanks - there is no P0 in the file but pbuf_time_idx is 0 where it is 1 in the lower resolution files.

Aug 19 '23 13:08 jedwards4b

@fvitt can you provide instructions to reproduce the ne120 case? I would like to add it to my testing.

Aug 21 '23 16:08 jedwards4b

I'll be interested to see the setup as well. Looking through the current set of regression tests, it seems we haven't been testing aquaplanet MPAS, and I don't see an aquaplanet (APE) initial condition. I tried configuring one using analytic atmospheric conditions but got an error about failing to read the u wind, which should have been prescribed by the analytic setting. That seems to be a bug. Aquaplanet should also work given an aquaplanet initial condition file with PHIS set to 0. If you are testing high-resolution MPAS, it might be more straightforward to run a pure analytic case (FHS94) or a full F case like F2000climo, for which we have initial condition files and which is routinely run as part of the regression tests.

Aug 21 '23 16:08 jtruesdal

@jtruesdal Note that the lower-resolution cases work and only the high-res cases fail. I can also print the value of, e.g., pbuf_time_idx in the initial case and see that the value that should be written is correctly passed to PIO, but the value in the file is 0. I think that this is a problem somewhere in the IO stack and not in CAM.

Aug 21 '23 17:08 jedwards4b

@jedwards4b Clone this case: /glade/derecho/scratch/fvitt/fx2000_ne120pg3L273_test08

Aug 21 '23 17:08 fvitt

@fvitt I found that the ERS_Ln9.mpasa30_mpasa30.QPC6.derecho_intel.cam-outfrq9s case works when I increase NTASKS, which confirms that the problem is memory. I will continue to work on trapping this error, but I would suggest that you try a larger pelayout for your case. I see that you are currently using 7200 tasks; it would be better to use a multiple of 128. Maybe NTASKS=12800?
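
To illustrate the multiple-of-128 suggestion, here is a minimal sketch that rounds a task count up to whole nodes. The function name round_up_to_full_nodes is hypothetical, and 128 tasks per node is taken from the comment above rather than queried from the machine configuration.

```python
import math

# Assumption taken from the comment above: layouts should be multiples of 128.
TASKS_PER_NODE = 128


def round_up_to_full_nodes(ntasks, tasks_per_node=TASKS_PER_NODE):
    """Return (nodes, ntasks rounded up so that every node is full)."""
    nodes = math.ceil(ntasks / tasks_per_node)
    return nodes, nodes * tasks_per_node


print(round_up_to_full_nodes(7200))   # (57, 7296): 7200 leaves the last node partly idle
print(round_up_to_full_nodes(12800))  # (100, 12800): already a multiple of 128
```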

Aug 21 '23 20:08 jedwards4b

I was able to run and pass the ERS_Ln9.mpasa30_mpasa30.QPC6.derecho_intel.cam-outfrq9s restart test on 4480 tasks (35 nodes) and on 1536 tasks (12 nodes). The original case that failed had 512 tasks. Working on the theory that this is a memory issue, I went back to NTASKS=512 but used fewer tasks per node: MAX_MPITASKS_PER_NODE=64 (8 nodes) FAILS and MAX_MPITASKS_PER_NODE=32 (16 nodes) PASSES.
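
The node counts quoted above follow from NTASKS divided by tasks per node. A minimal sketch that tabulates the reported runs is below; nodes_needed is a hypothetical helper, and the 128 tasks per node used for the 4480- and 1536-task runs is inferred from the node counts given in the comment.

```python
import math


def nodes_needed(ntasks, tasks_per_node):
    """Number of nodes required to place ntasks at tasks_per_node each."""
    return math.ceil(ntasks / tasks_per_node)


# (NTASKS, tasks per node, outcome reported in the comment above)
runs = [
    (4480, 128, "PASS"),
    (1536, 128, "PASS"),
    (512, 64, "FAIL"),
    (512, 32, "PASS"),
]

for ntasks, per_node, outcome in runs:
    nodes = nodes_needed(ntasks, per_node)
    print(f"NTASKS={ntasks:5d} tasks/node={per_node:3d} nodes={nodes:3d} {outcome}")
```

In these four runs the test passes on 12 or more nodes and fails on 8, which is consistent with the theory that the total memory available to the job matters, though four data points do not establish a threshold.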

Aug 21 '23 21:08 jedwards4b

I can reproduce this issue with: SMS_Ln3.ne30pg3_ne30pg3_mg17.FMTHIST_v0d.derecho_intel running on 128 tasks with REST_N=3,REST_OPTION=nsteps.

Aug 22 '23 13:08 jedwards4b

@jedwards4b - Is this still an issue for you?

Jan 04 '24 22:01 cacraigucar

I think that the question is - is it an issue for you? I would suggest that you rerun this test to see: SMS_Ln3.ne30pg3_ne30pg3_mg17.FMTHIST_v0d.derecho_intel

Jan 04 '24 22:01 jedwards4b

@briandobbins - @PeterHjortLauritzen suggested you might have some input on this

Jul 11 '24 22:07 cacraigucar
