
IO overwriting of monthly averages

Open mahf708 opened this issue 1 year ago • 6 comments

Another concerning issue in the EAMxx IO. Consider the following atm.log snippet:

Atmosphere step = 342143
  model start-of-step time = 2020-08-31 23:58:20

[EAMxx::output_manager] - Writing model-output:
[EAMxx::output_manager]      FILE: 1ma_ne30pg2.AVERAGE.nmonths_x1.2020-06-01-00000.nc
[EAMxx::scorpio_output] Writing variables to file
  file name: 1ma_ne30pg2.AVERAGE.nmonths_x1.2020-06-01-00000.nc

The result: the monthly output file was overwritten. This happened in two instances in one run:

1ma_ne30pg2.AVERAGE.nmonths_x1.2019-08-01-00000.nc
1ma_ne30pg2.AVERAGE.nmonths_x1.2019-09-01-00000.nc
1ma_ne30pg2.AVERAGE.nmonths_x1.2019-10-01-00000.nc
1ma_ne30pg2.AVERAGE.nmonths_x1.2019-11-01-00000.nc
1ma_ne30pg2.AVERAGE.nmonths_x1.2019-12-01-00000.nc
1ma_ne30pg2.AVERAGE.nmonths_x1.2020-01-01-00000.nc <<<<<<<<<<<<<<< overwriting 2020-01-01
1ma_ne30pg2.AVERAGE.nmonths_x1.2020-04-01-00000.nc >>>>>>>>>>>>>>>
1ma_ne30pg2.AVERAGE.nmonths_x1.2020-05-01-00000.nc 
1ma_ne30pg2.AVERAGE.nmonths_x1.2020-06-01-00000.nc <<<<<<<<<<<<<<< overwriting 2020-06-01
                                                   >>>>>>>>>>>>>>> simulation ends

See internal outputs https://acme-climate.atlassian.net/wiki/spaces/EAMXX/pages/4334223933/EAMxx+ERFaer+production from a recent run using commit https://github.com/E3SM-Project/scream/commit/29bdb81 on branch https://github.com/E3SM-Project/scream/tree/mahf708-ff-a73d48a
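
A rough way to scan a full atm.log for writes like the one in the snippet above is to compare each file's date stamp with the most recent model time. This is only a sketch: the regular expressions are inferred from the few log lines quoted here, `find_stale_writes` and `max_lag_months` are hypothetical names, and the idea that a healthy monthly file is stamped within one month of the model time is an assumption, not documented EAMxx behavior.

```python
import re
from datetime import datetime

# Patterns inferred from the quoted atm.log lines (assumption: the format is stable).
STEP_TIME = re.compile(
    r"model start-of-step time = (\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})"
)
FILE_LINE = re.compile(r"FILE: (\S+\.(\d{4})-(\d{2})-\d{2}-\d{5}\.nc)")


def find_stale_writes(log_text, max_lag_months=1):
    """Return (model_time, file_name, lag_in_months) for suspicious writes.

    A write is flagged when the month in the file name differs from the
    month of the latest model time by more than max_lag_months (assumed
    threshold, hypothetical parameter).
    """
    current = None
    stale = []
    for line in log_text.splitlines():
        m = STEP_TIME.search(line)
        if m:
            current = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
            continue
        m = FILE_LINE.search(line)
        if m and current is not None:
            name, year, month = m.group(1), int(m.group(2)), int(m.group(3))
            lag = (current.year * 12 + current.month) - (year * 12 + month)
            if abs(lag) > max_lag_months:
                stale.append((current, name, lag))
    return stale
```

Run against the snippet above, this would flag the write to the 2020-06-01 file, because the model time (2020-08-31) is two calendar months past the file stamp.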

mahf708 avatar Jul 05 '24 21:07 mahf708

I think this is the first time we've seen this, but checking with @ndkeen to see if he has seen something like this. @AaronDonahue @bartgol : any ideas on what might be going on here? And if there's a fix, we should make sure to get it into @brhillman's decadal run. And we should keep an eye on the averaged output in the decadal sim until we find the cause and solution.

crterai avatar Jul 05 '24 21:07 crterai

@mahf708, can you share the YAML file for these outputs?

AaronDonahue avatar Jul 08 '24 17:07 AaronDonahue

Here's the output yaml: https://acme-climate.atlassian.net/wiki/spaces/EAMXX/pages/3969187877/1ma+ne30pg2.yaml, which is a verbatim copy of the outputs Ben is using (circa May 1) but with small additions.

mahf708 avatar Jul 08 '24 18:07 mahf708

thanks, I'll start working on this.

AaronDonahue avatar Jul 08 '24 22:07 AaronDonahue

Does this happen w/ a restarted run?

AaronDonahue avatar Jul 08 '24 22:07 AaronDonahue

> Does this happen w/ a restarted run?

We are unlikely to find a deterministic reproducer for this in any short period of time. This happened in two runs, on two separate occasions in each, so four times total. Here's how it played out (roughly):

  • model fails with a system-side issue
  • model starts overwriting the monthly files the next time it tries to output them
  • model keeps doing that wacky stuff
  • model finally finishes a good submission (with no fail) and starts behaving normally

The wildest thing? It starts behaving normally.

The short answer: yes, this can only happen in restarts. I think it is important to consider all four issues I have filed so far as one large issue (I suspect they are related).

Note in OP:

1ma_ne30pg2.AVERAGE.nmonths_x1.2019-08-01-00000.nc
1ma_ne30pg2.AVERAGE.nmonths_x1.2019-09-01-00000.nc
1ma_ne30pg2.AVERAGE.nmonths_x1.2019-10-01-00000.nc
1ma_ne30pg2.AVERAGE.nmonths_x1.2019-11-01-00000.nc
1ma_ne30pg2.AVERAGE.nmonths_x1.2019-12-01-00000.nc
1ma_ne30pg2.AVERAGE.nmonths_x1.2020-01-01-00000.nc <<<<<<<<<<<<<<< overwriting 2020-01-01
1ma_ne30pg2.AVERAGE.nmonths_x1.2020-04-01-00000.nc >>>>>>>>>>>>>>> 2 files gone, 1 misnamed
1ma_ne30pg2.AVERAGE.nmonths_x1.2020-05-01-00000.nc 
1ma_ne30pg2.AVERAGE.nmonths_x1.2020-06-01-00000.nc <<<<<<<<<<<<<<< overwriting 2020-06-01
                                                   >>>>>>>>>>>>>>> simulation ends; 2 files gone, 1 misnamed
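
The "files gone" part of the listing above can be checked mechanically by looking for missing months between the first and last file stamps. A minimal sketch, assuming the `YYYY-MM-DD-SSSSS.nc` suffix seen in these names and one file per month; `missing_months` is a hypothetical helper, not part of EAMxx:

```python
import re

# Date stamp at the end of the file names quoted in this issue (assumed format).
STAMP = re.compile(r"\.(\d{4})-(\d{2})-\d{2}-\d{5}\.nc$")


def missing_months(filenames):
    """Return 'YYYY-MM' strings for months absent between the first and last stamp."""
    months = set()
    for name in filenames:
        m = STAMP.search(name.strip())
        if m:
            # Encode (year, month) as one integer so consecutive months differ by 1.
            months.add(int(m.group(1)) * 12 + int(m.group(2)) - 1)
    if not months:
        return []
    lo, hi = min(months), max(months)
    return [
        f"{i // 12:04d}-{i % 12 + 1:02d}"
        for i in range(lo, hi + 1)
        if i not in months
    ]
```

For the nine file names above this reports 2020-02 and 2020-03 as missing. It cannot see the gap at the end of the simulation, since nothing after 2020-06 exists to bound it; catching that would require passing in the expected last month separately.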

mahf708 avatar Jul 08 '24 23:07 mahf708

I think this issue is superseded by #3026, so I am going to close it.

mahf708 avatar Oct 04 '24 17:10 mahf708