CAM
MPASA restart failure
What happened?
I noticed when testing ERS_Ln9.mpasa7p5_mpasa7p5_mg17.QPC6.derecho_intel.cam-outfrq9s that CLUBB generates an error on restart: "Error in advance_xp2_xpyp". First, this produces an intolerable amount of output to stdout, which will need to be addressed for high-resolution runs.
Second, I repeated this test with ERS_Ln9.mpasa30_mpasa30.QPC6.derecho_intel.cam-outfrq9s and it fails with the same error.
What are the steps to reproduce the bug?
Just run the test: ERS_Ln9.mpasa30_mpasa30.QPC6.derecho_intel.cam-outfrq9s
What CAM tag were you using?
cam6_3_119 - cam6_3_122
What machine were you running CAM on?
CISL machine (e.g. cheyenne)
What compiler were you using?
Intel
Path to a case directory, if applicable
No response
Will you be addressing this bug yourself?
Yes, but I will need some help
Extra info
No response
I don't know that this will resolve the restart failure, but it's relevant to the massive restart files. When we updated the clubb externals earlier this year, we switched the clubb pdf closure to after the clubb solver, which requires adding a bunch of high order moments, generated by the pdf closure, into the restarts. You can experiment with putting the pdf closure back, which I believe trims the restarts to the ~size they were in the old externals.
clubb_ipdf_call_placement=1
in user_nl_cam will switch it back. @Katetc can you verify whether this will trim the restarts?
@adamrher - that didn't solve the problem, but thank you for the suggestion.
OK. In case it helps, to trim the default i/o in h0 tapes, I usually remove all the aerosol/chem species via:
history_chemistry = .false.
history_chemspecies_srf = .false.
I'm using empty_htapes=.true.; this is purely a restart issue.
I was able to run ERS_Ln9.mpasa120_mpasa120.QPC6.derecho_intel.cam-outfrq9s and ERS_Ln9.mpasa60_mpasa60.QPC6.derecho_intel.cam-outfrq9s successfully but ERS_Ln9.mpasa30_mpasa30.QPC6.derecho_intel.cam-outfrq9s still gives the same error in spite of extreme changes in timestep:
mpas_dt = 45.0
ATM_NCPL: 80
NCPL_BASE_PERIOD: hour
@jedwards4b What are the values of P0 and pbuf_time_idx in the restart files of these failing cases? In my high-res (ne120) waccmx runs on derecho these are zero in the restart file, while they should be 100000 and 1, respectively.
@fvitt thanks - there is no P0 in the file but pbuf_time_idx is 0 where it is 1 in the lower resolution files.
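For anyone wanting to repeat this check on their own restart files, here is a minimal sketch in Python. It assumes the expected values quoted above (P0 = 100000, pbuf_time_idx = 1); the function names `check_restart_scalars` and `read_scalars` are hypothetical helpers, and `read_scalars` assumes the third-party netCDF4 package is installed. Pass only the variables that exist for your dycore in `expected` (for example, just pbuf_time_idx for files that carry no P0).

```python
# Expected values taken from the comment above; adjust per configuration.
EXPECTED = {"P0": 100000.0, "pbuf_time_idx": 1}


def check_restart_scalars(values, expected=EXPECTED):
    """Return a list of human-readable problems; an empty list means all checks pass.

    `values` maps variable name -> value as read from a restart file.
    """
    problems = []
    for name, want in expected.items():
        if name not in values:
            problems.append(f"{name}: missing from file")
        elif float(values[name]) != float(want):
            problems.append(f"{name}: found {values[name]}, expected {want}")
    return problems


def read_scalars(path, names=tuple(EXPECTED)):
    """Read the first element of each named variable from a netCDF file.

    Assumption: the netCDF4 and numpy packages are available.
    """
    import numpy as np
    from netCDF4 import Dataset

    with Dataset(path) as ds:
        return {
            n: float(np.asarray(ds.variables[n][...]).ravel()[0])
            for n in names
            if n in ds.variables
        }


if __name__ == "__main__":
    # Values as described for the failing file: no P0, pbuf_time_idx written as 0.
    for line in check_restart_scalars({"pbuf_time_idx": 0}):
        print(line)
```

With a real file, `check_restart_scalars(read_scalars("path/to/restart.nc"))` would do the same comparison; the path here is a placeholder.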
@fvitt can you provide instructions to reproduce the ne120 case? I would like to add it to my testing.
I'll be interested to see the setup as well. Looking through the current set of regression tests, it seems we haven't been testing aquaplanet MPAS, and I don't see an aquaplanet (APE) initial condition file. I tried configuring one using analytic atmospheric conditions but got an error on failing to read the u wind, which should have been prescribed by the analytic setting. That seems to be a bug. Aquaplanet should also work given an aquaplanet initial condition file with PHIS set to 0. If you are testing high-resolution MPAS, it might be more straightforward to run pure analytic (FHS94) or a full F case like F2000climo, for which we have initial condition files and which is routinely run as part of the regression tests.
@jtruesdal Note that the lower-resolution cases work and only the high-res cases fail. I can also print the value of, e.g., pbuf_time_idx in the initial case and see that the value that should be written is correctly passed to PIO, but the value in the file is 0. I think that this is a problem somewhere in the I/O stack and not in CAM.
@jedwards4b
Clone this case:
/glade/derecho/scratch/fvitt/fx2000_ne120pg3L273_test08
@fvitt I found that the ERS_Ln9.mpasa30_mpasa30.QPC6.derecho_intel.cam-outfrq9s case works when I increase the NTASKS which confirms that the problem is memory. I will continue to work on trapping this error but I would suggest that you try using a larger pelayout for your case. I see that you are currently using 7200 - it would be better to use a multiple of 128. Maybe NTASKS=12800?
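As a quick illustration of the "multiple of 128" suggestion, the sketch below rounds a task count up to the next whole-node multiple. The helper name `round_up_to_node_multiple` is hypothetical, and the 128 tasks per node is simply the figure quoted above.

```python
import math


def round_up_to_node_multiple(ntasks, tasks_per_node=128):
    """Smallest multiple of tasks_per_node that is >= ntasks."""
    return math.ceil(ntasks / tasks_per_node) * tasks_per_node


if __name__ == "__main__":
    for ntasks in (7200, 12800):
        remainder = ntasks % 128
        print(ntasks, "remainder:", remainder,
              "next full-node count:", round_up_to_node_multiple(ntasks))
```

7200 leaves a remainder of 32 when divided by 128, so the last node would be only partly filled; 12800 is exactly 100 nodes.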
I was able to run and pass ERS_Ln9.mpasa30_mpasa30.QPC6.derecho_intel.cam-outfrq9s restart tests on 4480 (35 nodes) and 1536 (12 nodes) tasks. The original case that failed had 512 tasks. Working on the theory that this is a memory issue, I went back to NTASKS=512 but used fewer tasks per node: MAX_MPITASKS_PER_NODE=64 (8 nodes) fails, while MAX_MPITASKS_PER_NODE=32 (16 nodes) passes.
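The node counts above follow from dividing the task count by the tasks placed on each node. A minimal sketch of that arithmetic, assuming 128 tasks per node for the runs where no MAX_MPITASKS_PER_NODE override is mentioned (the helper name `nodes_needed` is hypothetical):

```python
import math


def nodes_needed(ntasks, tasks_per_node=128):
    """Number of nodes required to place ntasks at tasks_per_node per node."""
    return math.ceil(ntasks / tasks_per_node)


if __name__ == "__main__":
    # (NTASKS, tasks per node, result reported in this thread)
    layouts = [
        (4480, 128, "passes"),
        (1536, 128, "passes"),
        (512, 64, "fails"),
        (512, 32, "passes"),
    ]
    for ntasks, per_node, result in layouts:
        print(f"NTASKS={ntasks} at {per_node}/node -> "
              f"{nodes_needed(ntasks, per_node)} nodes ({result})")
```

Spreading the same 512 tasks over more nodes gives each task a larger share of node memory, which is consistent with the memory theory but does not by itself prove it.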
I can reproduce this issue with: SMS_Ln3.ne30pg3_ne30pg3_mg17.FMTHIST_v0d.derecho_intel running on 128 tasks with REST_N=3,REST_OPTION=nsteps.
@jedwards4b - Is this still an issue for you?
I think that the question is - is it an issue for you? I would suggest that you rerun this test to see: SMS_Ln3.ne30pg3_ne30pg3_mg17.FMTHIST_v0d.derecho_intel
@briandobbins - @PeterHjortLauritzen suggested you might have some input on this