
MPAS output all zero values in the middle of a v3 coupled simulation

Open zhangshixuan1987 opened this issue 1 year ago • 6 comments

With master (hash: 84e50561a854e1888b0eaa52fc3a44287f3a5924), I've been running a fully coupled simulation with atmospheric nudging to test the impact of wind forcing over the subpolar North Atlantic on AMOC. The simulation was run on pm-cpu with the Intel compiler and is documented on a Confluence page.

In brief,

  • The simulation was first run from 0001-01-01 to 0034-04-11, then paused to check the first 30 years of output.
  • The model was restarted at 0034-01-01 and ran from 0034-01-01 to 0043-09-11, when it was cancelled at the wall-time limit.

A problem showed up when I checked the results from the MPAS diagnostics. A kink appears around years 0034-0035 in the ocean heat content, as shown in the figure below: image. A similar kink is also seen in the AMOC time series.

Further diagnostics traced the problem to the MPAS-Ocean output at 0034-10-01: almost all quantities in the history file (mpaso.hist.am.timeSeriesStatsMonthly.0034-10-01.nc) are zero. Only this file has the issue; the other history files look correct.
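A check like the one described above can be sketched in Python. This is only an illustration, not the diagnostic actually used in this issue: the function names are hypothetical, and `load_numeric_variables` assumes the third-party `numpy` and `netCDF4` packages are installed.

```python
from typing import Dict, Iterable, List, Mapping


def is_all_zero(values: Iterable[float]) -> bool:
    """True if the iterable is non-empty and every value equals zero."""
    seen_any = False
    for value in values:
        seen_any = True
        if value != 0:  # NaN compares unequal to 0, so NaN counts as non-zero
            return False
    return seen_any


def all_zero_variables(variables: Mapping[str, Iterable[float]]) -> List[str]:
    """Return the names of variables whose values are all exactly zero."""
    return [name for name, values in variables.items() if is_all_zero(values)]


def load_numeric_variables(path: str) -> Dict[str, Iterable[float]]:
    """Read each numeric variable of a NetCDF file as a flat array.

    Assumption: the third-party numpy and netCDF4 packages are available.
    Masked (fill-value) entries are dropped before the check.
    """
    import numpy as np
    from netCDF4 import Dataset

    with Dataset(path, "r") as ds:
        return {
            name: np.ma.asarray(var[:]).compressed()
            for name, var in ds.variables.items()
            if np.issubdtype(var.dtype, np.number)
        }
```

Calling `all_zero_variables(load_numeric_variables("mpaso.hist.am.timeSeriesStatsMonthly.0034-10-01.nc"))` would list the suspicious variables in that file. The Python-level loop is slow on full-size history files; a vectorized test such as `not array.any()` would be the faster choice there.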

We note that the 0034-10-01 file was written in the middle of the simulation, and the model neither crashed nor reported an error during the whole simulation period of 0034-01-01 -- 0043-09-11. This suggests a hiccup or a bug in the I/O infrastructure (in the model, the file system, or the I/O nodes, if pm-cpu uses them).

Reported here in case it recurs. For this case, we are going to re-run year 0034 to see if simulation data beyond the problematic month are affected.

zhangshixuan1987 avatar Nov 03 '23 21:11 zhangshixuan1987

@zhangshixuan1987, my first guess would be that this was a glitch of some sort in the Perlmutter file system. I haven't seen a problem like this before that I recall. Could you try rerunning just year 0034 from a restart file and see if the output gets corrected?

xylar avatar Nov 04 '23 07:11 xylar

Following suggestions from @wlin7 and @xylar, I conducted a "continue run" from the restart files saved at 0034-01-01. The simulation was run for 2 years, from 0034-01-01 to 0036-01-01, and the model output was saved. The newly generated output for 0034-01-01 -- 0036-01-01 was used to replace the old output files for that period. I then reran the MPAS diagnostics. The kink around years 0034-0035 in the ocean heat content figure has now disappeared:

image

I also checked the regenerated history file "mpaso.hist.am.timeSeriesStatsMonthly.0034-10-01.nc", and all quantities in it now have reasonable values rather than zeros. I therefore think @xylar is correct that the issue was likely due to "a glitch of some sort in the Perlmutter file system". However, it is still not clear to me why such a glitch showed up in my simulation.

zhangshixuan1987 avatar Nov 08 '23 19:11 zhangshixuan1987

@zhangshixuan1987, I agree, this is mysterious and frustrating. Certainly if it happens again, we need to figure out a way to reproduce it so we can prevent it from happening again. For now, let's hope it's a one-time event!

xylar avatar Nov 08 '23 19:11 xylar

Adding @ndkeen and @jayeshkrishna to note glitch.

rljacob avatar Nov 08 '23 19:11 rljacob

Following suggestions from Wuyin (@wlin7), I also ran "/global/cfs/cdirs/e3sm/tools/cprnc/cprnc" on the file

  • "20231001.v3alpha04_trigrid.nudg3hr.piControl.pm-cpu_intel.eam.h0.0034-10.nc"

This is the history file from the atmosphere component for the same month in which the MPAS-Ocean output was corrupted. I gave cprnc both the file from the original simulation and the file from the restart simulation described above. The log file below contains the cprnc output: cprnc.log

Overall, the two files appear to be bit-for-bit identical, which suggests that the other model components were not affected.
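For anyone repeating this comparison across many files, a small wrapper along these lines may help. It is a rough sketch under stated assumptions: the function names are hypothetical, cprnc is assumed to accept the two file paths as positional arguments, and its summary line is assumed to contain the word IDENTICAL or DIFFERENT. Check both assumptions against the cprnc build on your machine.

```python
import re
import subprocess
from typing import Optional

# Path quoted in this thread; adjust for other machines.
CPRNC = "/global/cfs/cdirs/e3sm/tools/cprnc/cprnc"

# Assumption: the cprnc summary line contains one of these keywords.
_VERDICT = re.compile(r"\b(IDENTICAL|DIFFERENT)\b")


def cprnc_verdict(log_text: str) -> Optional[str]:
    """Return the last IDENTICAL/DIFFERENT keyword in the log, or None."""
    verdict = None
    for line in log_text.splitlines():
        match = _VERDICT.search(line)
        if match:
            verdict = match.group(1)
    return verdict


def run_cprnc(file_a: str, file_b: str, cprnc: str = CPRNC) -> str:
    """Run cprnc on two NetCDF files and return its standard output."""
    result = subprocess.run(
        [cprnc, file_a, file_b],
        capture_output=True,
        text=True,
        check=False,  # inspect the log text instead of relying on the exit code
    )
    return result.stdout
```

With this, `cprnc_verdict(run_cprnc(original_file, rerun_file))` gives a quick yes/no per file pair, and a `None` result signals that the log did not match the assumed format and should be read by hand.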

zhangshixuan1987 avatar Nov 10 '23 01:11 zhangshixuan1987

Just noting that we had a similar-sounding issue a few years ago, but surely it's not the same thing. https://github.com/E3SM-Project/E3SM/issues/4174

ndkeen avatar Nov 10 '23 01:11 ndkeen