IO overwriting of monthly averages
Another concerning issue in EAMxx IO. Consider the following atm.log snippet:
Atmosphere step = 342143
model start-of-step time = 2020-08-31 23:58:20
[EAMxx::output_manager] - Writing model-output:
[EAMxx::output_manager] FILE: 1ma_ne30pg2.AVERAGE.nmonths_x1.2020-06-01-00000.nc
[EAMxx::scorpio_output] Writing variables to file
file name: 1ma_ne30pg2.AVERAGE.nmonths_x1.2020-06-01-00000.nc
The result: at the end of August 2020 the model writes into the file stamped 2020-06-01, overwriting that month's output. This happened on two occasions in one run:
1ma_ne30pg2.AVERAGE.nmonths_x1.2019-08-01-00000.nc
1ma_ne30pg2.AVERAGE.nmonths_x1.2019-09-01-00000.nc
1ma_ne30pg2.AVERAGE.nmonths_x1.2019-10-01-00000.nc
1ma_ne30pg2.AVERAGE.nmonths_x1.2019-11-01-00000.nc
1ma_ne30pg2.AVERAGE.nmonths_x1.2019-12-01-00000.nc
1ma_ne30pg2.AVERAGE.nmonths_x1.2020-01-01-00000.nc <<<<<<<<<<<<<<< overwriting 2020-01-01
1ma_ne30pg2.AVERAGE.nmonths_x1.2020-04-01-00000.nc >>>>>>>>>>>>>>>
1ma_ne30pg2.AVERAGE.nmonths_x1.2020-05-01-00000.nc
1ma_ne30pg2.AVERAGE.nmonths_x1.2020-06-01-00000.nc <<<<<<<<<<<<<<< overwriting 2020-06-01
>>>>>>>>>>>>>>> simulation ends
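Gaps in a listing like the one above can be spotted mechanically. The following is a minimal Python sketch, assuming the `<prefix>.nmonths_x1.YYYY-MM-DD-SSSSS.nc` naming shown here with exactly one file per month. `find_missing_months` is a hypothetical helper, not part of EAMxx. It can only flag missing date stamps: a file whose name is right but whose contents were overwritten by a later month (like the 2020-06-01 file above) cannot be detected from the name alone.

```python
import re
from typing import Iterable, List, Optional, Tuple

# Hypothetical pattern: matches the date stamp in names such as
# 1ma_ne30pg2.AVERAGE.nmonths_x1.2020-06-01-00000.nc
STAMP = re.compile(r"\.(\d{4})-(\d{2})-\d{2}-\d{5}\.nc$")

Month = Tuple[int, int]  # (year, month)


def next_month(ym: Month) -> Month:
    """Advance a (year, month) pair by one month."""
    y, m = ym
    return (y + 1, 1) if m == 12 else (y, m + 1)


def find_missing_months(names: Iterable[str], last: Optional[Month] = None) -> List[str]:
    """Return 'YYYY-MM' stamps absent between the earliest file and `last`
    (or the latest file found, if `last` is None)."""
    present = set()
    for name in names:
        match = STAMP.search(name)
        if match:
            present.add((int(match.group(1)), int(match.group(2))))
    if not present:
        return []
    cur = min(present)
    end = max(present) if last is None else max(max(present), last)
    missing = []
    # Tuples compare lexicographically, so (year, month) ordering just works.
    while cur <= end:
        if cur not in present:
            missing.append(f"{cur[0]:04d}-{cur[1]:02d}")
        cur = next_month(cur)
    return missing
```

Applied to the listing above, `find_missing_months(files)` reports `2020-02` and `2020-03`; passing `last=(2020, 8)` (the month the run ended in) also reports `2020-07` and `2020-08`.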
See the internal outputs at https://acme-climate.atlassian.net/wiki/spaces/EAMXX/pages/4334223933/EAMxx+ERFaer+production from a recent run using commit https://github.com/E3SM-Project/scream/commit/29bdb81 on branch https://github.com/E3SM-Project/scream/tree/mahf708-ff-a73d48a
I think this is the first time we've seen this, but I'm checking with @ndkeen to see whether he has seen something like it. @AaronDonahue @bartgol: any ideas on what might be going on here? If there's a fix, we should make sure it gets into @brhillman's decadal run. Until we find the cause and a solution, we should keep an eye on the averaged output in the decadal sim.
@mahf708, can you share the YAML file for these outputs?
Here's the output YAML: https://acme-climate.atlassian.net/wiki/spaces/EAMXX/pages/3969187877/1ma+ne30pg2.yaml. It is a copy of the outputs Ben is using (circa May 1) with a few small additions.
Thanks, I'll start working on this.
Does this happen w/ a restarted run?
We are unlikely to find a deterministic reproducer for this in a short period of time. This happened in two runs, on two separate occasions in each, so four times total. Here's roughly how it played out:
- model fails with a system-side issue
- model starts overwriting the monthly files the next time it tries to output them
- model keeps doing that wacky stuff
- model finally finishes a good submission (with no fail) and starts behaving normally
The wildest thing? It starts behaving normally.
The short answer: yes, this can only happen in restarts. I think it is important to treat all four issues I have filed so far as one large issue (I suspect they are related).
Note in the OP:
1ma_ne30pg2.AVERAGE.nmonths_x1.2019-08-01-00000.nc
1ma_ne30pg2.AVERAGE.nmonths_x1.2019-09-01-00000.nc
1ma_ne30pg2.AVERAGE.nmonths_x1.2019-10-01-00000.nc
1ma_ne30pg2.AVERAGE.nmonths_x1.2019-11-01-00000.nc
1ma_ne30pg2.AVERAGE.nmonths_x1.2019-12-01-00000.nc
1ma_ne30pg2.AVERAGE.nmonths_x1.2020-01-01-00000.nc <<<<<<<<<<<<<<< overwriting 2020-01-01
1ma_ne30pg2.AVERAGE.nmonths_x1.2020-04-01-00000.nc >>>>>>>>>>>>>>> 2 files gone, 1 misnamed
1ma_ne30pg2.AVERAGE.nmonths_x1.2020-05-01-00000.nc
1ma_ne30pg2.AVERAGE.nmonths_x1.2020-06-01-00000.nc <<<<<<<<<<<<<<< overwriting 2020-06-01
>>>>>>>>>>>>>>> simulation ends; 2 files gone, 1 misnamed
I think this issue is superseded by #3026, so I am going to close it.