CESM Standardize the naming and attributes of ancillary time arrays across components

A request from @phillips-ad, based on user input:

Standardize the naming and attributes of ancillary time arrays across components (time_bnds in CAM, time_bounds in CLM, no calendar attribute set for either, etc.).

@klindsay28 points out that well-written scripts shouldn't require uniformity in names across components. However, I feel like this is something that might be relatively easy to standardize and could save significantly more time than it takes us to make the change, so we might as well do it.

I'm not sure exactly what needs to be done. When someone gets a chance to coordinate this, we should open issues in the respective component repositories asking for changes and note those issues here to track them.

Nov 03 '21 03:11 billsacks

@billsacks thanks for writing this up. I can definitely come up with a list of proposed changes to output ancillary variable attributes as I'm one of the folks who is pushing for this. I will post the list in a future reply to this thread so others can review.

Nov 03 '21 21:11 phillips-ad

There are more inconsistencies across the components for the output time/time_bounds arrays beyond the naming of the time_bounds arrays. Below is a summary list of potential changes. Examples by component of the current status of these arrays and what the Tier 1+2 proposed changes would look like are shown in a Google Doc.

Tier 1 proposed changes

The time endpoint arrays should all be named "time_bounds", and the two dimensions of this array should be named "time" and "nbnd".
time_bounds should have the same calendar and units attributes as time.
time_bounds@long_name should be set to "time interval endpoints"
GLC output for the time array has a units attribute setting ("common_year since 0000-01-01 0:0:0") different from all other components. This should be changed to either "days since YYYY-MM-DD HH:MM:SS" (set to the beginning of the run, used by atm/land) or "days since 0001-01-01 00:00:00" (used by ocean/ice).
At present there is no time_bounds array for GLC output. This should be added.
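The Tier 1 rules above are concrete enough to check mechanically. Below is a minimal sketch of such a check, not an existing CESM tool. It assumes a file's metadata has already been read into a plain dictionary (the `variables` layout and the `check_tier1` name are hypothetical); a real check would read the same information from the netCDF file.

```python
# Hypothetical helper: `variables` maps a variable name to
# {"dims": tuple of dimension names, "attrs": dict of attributes}.
def check_tier1(variables):
    """Return a list of Tier 1 problems found in one file's time metadata."""
    time = variables.get("time")
    bounds = variables.get("time_bounds")
    if time is None:
        return ["no time variable"]
    if bounds is None:
        return ["no time_bounds variable"]

    problems = []
    # time_bounds must be dimensioned (time, nbnd).
    if tuple(bounds["dims"]) != ("time", "nbnd"):
        problems.append("time_bounds dimensions are not (time, nbnd)")
    # time_bounds must carry the same calendar and units as time.
    for name in ("calendar", "units"):
        time_value = time["attrs"].get(name)
        if time_value is None:
            problems.append(f"time has no {name} attribute")
        elif bounds["attrs"].get(name) != time_value:
            problems.append(f"time_bounds {name} differs from time")
    if bounds["attrs"].get("long_name") != "time interval endpoints":
        problems.append("time_bounds long_name is not 'time interval endpoints'")
    return problems
```

A file that names the array time_bnds would be reported as having no time_bounds variable, which is the naming inconsistency this issue started from.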

Tier 2 proposed change

The time and equivalent time_bounds arrays are of type double for some components and type float for others. All components should output type double for both arrays.

Tier 3 (Inconsistency should be left alone?)

Atmosphere/land models have a time@units setting of "days since YYYY-MM-DD HH:MM:SS" which equates to the start time of the run, while ocean/ice has a setting of "days since 0001-01-01 00:00:00". As long as proper time attributes are set, and subsequent conversions of the time by common tools work, I don't see why this needs to be changed.
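To illustrate why the two reference dates are interchangeable once the attributes are set, here is a small sketch of the conversion a tool would do. It assumes a 365-day (noleap) calendar, which is an assumption on my part since the calendar attribute is not always written, and the function names are hypothetical.

```python
import re

# Cumulative days before each month in a 365-day (noleap) calendar.
DAYS_BEFORE_MONTH = (0, 31, 59, 90, 120, 151, 181, 212, 243, 273, 304, 334)

UNITS_RE = re.compile(
    r"days since (\d{1,4})-(\d{1,2})-(\d{1,2})(?: (\d{1,2}):(\d{1,2}):(\d{1,2}))?"
)

def reference_offset_days(units):
    """Days from 0001-01-01 00:00:00 to the reference date in `units` (noleap)."""
    match = UNITS_RE.fullmatch(units.strip())
    if match is None:
        raise ValueError(f"unsupported units string: {units!r}")
    year, month, day = (int(g) for g in match.groups()[:3])
    hour, minute, second = (int(g) if g else 0 for g in match.groups()[3:])
    if not 1 <= month <= 12:
        raise ValueError(f"invalid month in units string: {units!r}")
    whole_days = (year - 1) * 365 + DAYS_BEFORE_MONTH[month - 1] + (day - 1)
    return whole_days + (hour * 3600 + minute * 60 + second) / 86400.0

def to_fixed_reference(time_value, units):
    """Re-express `time_value` as days since 0001-01-01 00:00:00 (noleap)."""
    return time_value + reference_offset_days(units)
```

For example, a value of 31.0 in "days since 1850-01-01 00:00:00" becomes 674916.0 days since 0001-01-01 00:00:00 under this calendar. The current GLC units string ("common_year since ...") is deliberately rejected by this sketch.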

Nov 04 '21 21:11 phillips-ad

Adam nails every issue exactly.

My only comment is that for Tier 3 it would be nice for all components to use the same fixed absolute time start date of "days since 0001-01-01 00:00:00", as ocean and ice already do. Only atm and lnd reset that, as Adam notes, in certain contexts (a branch run, IIRC).

Nov 05 '21 16:11 strandwg

If we are standardizing the time units (I am in favor), is there a reason we do not adopt ISO 8601 notation? Can our new standard explicitly spell out a standard for paleo runs (technically, any run with dates before 1583)?

Nov 13 '21 17:11 gold2718

There's a discussion of this issue here:

https://github.com/cf-convention/cf-conventions/issues/298

Nov 15 '21 18:11 strandwg

For CTSM, see https://github.com/ESCOMP/CTSM/issues/1693

For CICE, see https://github.com/ESCOMP/CESM_CICE/issues/14

For CISM, see https://github.com/ESCOMP/CISM-wrapper/issues/75

For CAM, see https://github.com/ESCOMP/CAM/issues/554

For RTM, see https://github.com/ESCOMP/RTM/issues/31

For MOSART, see https://github.com/ESCOMP/MOSART/issues/53

For MOM, see https://github.com/ESCOMP/MOM_interface/issues/107

For WW3, see https://github.com/ESCOMP/WW3-CESM/issues/22

Mar 29 '22 20:03 billsacks

@phillips-ad @strandwg - Assuming that the resolution of https://github.com/ESCOMP/CAM/issues/159, https://github.com/ESCOMP/CTSM/issues/1059 and similar issues for other components is that we separate time-averaged from instantaneous values so that a given file only contains one or the other: What would you suggest we do for the time_bounds field on files that only contain instantaneous values? Should we leave time_bounds off in this case? Or make it the start and end of the relevant time step, as @ekluzek suggests in https://github.com/ESCOMP/CTSM/issues/1059#issue-644795852 ?

Mar 30 '22 23:03 billsacks

"time_bounds" doesn't make sense for instantaneous values, and for our MIP data, instantaneous data aren't required to have it for the time axis. I'd leave it out of instantaneous output streams.

Since averaged data and instantaneous data will be in different streams, what should the streams be and how to differentiate them?

Mar 31 '22 15:03 strandwg

I'm honestly ambivalent on this, so I will defer on leaving time_bounds off.

In response to @strandwg 's question, I believe average, min and max will be in one stream, and instantaneous will be in other streams. An example of how I thought this would work for CAM/CTSM streams: h0 would contain A, M and X for monthly, h1 would contain the same for daily, and so on, and then (say) h8 would contain instantaneous (=I) for monthly, h9 would contain I for daily and so on. But it would be user customizable, so instead of h8 containing monthly I the user could set the stream to h1 if they want. I haven't seen a discussion of this, but I might've missed it.

@strandwg, were you wondering whether there should be h0, h1, etc streams for A, M, X, but streams named (say) hi0, hi1, etc for instantaneous streams?

Mar 31 '22 15:03 phillips-ad

Thanks @strandwg and @phillips-ad .

Good question about distinguishing between the different files / streams. I can see how naming files with something like hi would make this more clear for users. This might be something that's hard to make consistent across all components, but I'd welcome suggestions for CTSM, and maybe we can at least try to push for CTSM and CAM to be consistent, since their history infrastructure is somewhat similar.

Mar 31 '22 17:03 billsacks

"time_bounds" isn't applicable for an instantaneous value, so it's not needed.

The critical issue about differentiating instantaneous and averaged fields is keeping them separate. It's not uncommon to save the same field from CAM as both, but in different "cam.hN" streams. That kind of implementation needs to be kept intact.

It's unfortunate that "cam.i.*" files already exist as initial files, because I think replicating the ".h[0-9]" (averaged) as ".i[0-9]" (instantaneous) would be clear and unambiguous. I don't know if that's reasonable or would cause too much confusion.

Mar 31 '22 17:03 strandwg

After today's CISM meeting, I realized that we never spoke again of the fact that some components set time units to the start of the run with "days since YYYY-MM-DD HH:MM:SS" (CAM/CTSM/RTM/MOSART/WW3), while others use or are being pushed to use "days since 0001-01-01 00:00:00" (MOM, CICE, CISM). I previously cited this as an optional Tier 3 change above (https://github.com/ESCOMP/CESM/issues/194#issuecomment-961443965), but @strandwg said it would be nice if this were consistent across components. I agree, but I wonder whether it is too big an ask to implement this. Are we OK with different components setting the time units attribute in one of these two possible ways? In the issues I recently filed for each component, I did propose that each can use either of these two methods.

Apr 04 '22 18:04 phillips-ad

Getting all components to have the same "days since YYYY-MM-DD HH:MM:SS" units for time and time bounds is more important at this point, IMHO. If it's not unreasonable to have all components have the same "days since 0001-01-01 00:00:00", that would be really nice, but I agree that may be asking too much now.

Apr 04 '22 21:04 strandwg

@strandwg gave me an idea when talking about the cam.i files. What if we change the history files ".h" into either ".ha" for average (and min, max), and ".hi" for instantaneous? It seems like that would be a clearer way to figure out what type of data is on a given file stream. Do others like this idea? There's some precedent for this in CLM, as CNDV created ".hv" files.

Apr 04 '22 22:04 ekluzek

What if we change the history files ".h" into either ".ha" for average (and min, max), and ".hi" for instantaneous? It seems like that would be a clearer way to figure out what type of data is on a given file stream. Do others like this idea? There's some precedent for this in CLM, as CNDV created ".hv" files.

I like that idea. It will probably mean more changes to the user interface for setting history file variables: I'm imagining that now, instead of hist_fincl1, hist_fincl2, etc., we'd probably have histi_fincl1, histi_fincl2, hista_fincl1, hista_fincl2, etc. And instead of hist_dov2xy = .true.,.false. we'd have histi_dov2xy = .true.,.false. and hista_dov2xy = .true.,.false.; etc. (I guess we could avoid too many changes by dropping the a - so we'd have hist_ variables referring to the ha files and histi_ variables referring to the hi files, but I'd probably argue for just going ahead and ripping the band-aid off, even if it means users will need to adjust their scripts / user_nl files.) But my intuition is that, once those changes are done, that would make this easier to work with moving forward, both for users and for developers mucking in the history code.

Apr 04 '22 22:04 billsacks

I suggest steering clear of embedding more semantic information in filenames than we already do. CF provides the cell_methods attribute to describe the temporal treatment of variables. This attribute gets propagated by data workflow software like NCO. Semantic information embedded in filenames can get dropped and/or modified unintentionally. Also, having (more) semantic content in multiple places (metadata and filename) sets us up for the content in those places becoming inconsistent, which can lead to confusion.

Apr 04 '22 23:04 klindsay28

@klindsay28 while I definitely see your point, my intuition is that distinguishing between instantaneous vs. time-averaged files might actually make the user interface to setting history-related namelist variables – via user_nl_clm, etc. – more clear and less error-prone both for users and developers needing to work with the relevant code. So I'm thinking about the advantages largely from that perspective.

Apr 04 '22 23:04 billsacks

A couple of brief comments.

I am getting confused between MUST HAVE, HIGH PRIORITY, and NICE TO HAVE (and possibly other levels). The more we pile on, the more expensive this work gets, and I have yet to see any budget for this work, never mind any discussion of what we are going to stop doing to get this work done.
I do not think we have control over how all the components (e.g., MOM, CICE) work in terms of history output.

Apr 04 '22 23:04 gold2718

It's true that while cell_methods is the canonical means to determine if a field is averaged or instantaneous, that can only be determined by opening and inspecting the file. Given that the model now regularly outputs hundreds of fields, checking them all is inefficient and time-consuming, especially once the output has been transposed to timeseries format.
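For reference, the per-variable inspection being described amounts to something like the sketch below. It assumes the cell_methods strings have already been read from the file into a dictionary and that they follow the CF "name: method" form (for example "time: mean" or "time: point"); the function names and the simplified parsing are my own assumptions, not an existing tool.

```python
import re

def time_method(cell_methods):
    """Return the method applied along time (e.g. 'mean', 'point'), or None."""
    if not cell_methods:
        return None
    match = re.search(r"\btime:\s*([A-Za-z_]+)", cell_methods)
    return match.group(1) if match else None

def group_by_time_method(cell_methods_by_var):
    """Group variable names by the time method found in their cell_methods."""
    groups = {}
    for name, cell_methods in cell_methods_by_var.items():
        key = time_method(cell_methods) or "unspecified"
        groups.setdefault(key, []).append(name)
    return groups
```

Every variable in every file has to pass through a step like this before a user knows what the file holds, which is the cost at issue here.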

Since we need to have averaged and instantaneous fields in different output streams, adding a character or two to distinguish between the streams isn't too much to ask, given that it will enable users to determine which data are which quickly and easily.

Apr 05 '22 16:04 strandwg

adding a character or two to distinguish between the streams isn't too much to ask

Have you calculated how much SE work is involved in this? Writing new filenames is not that hard (although not easy since some models such as CAM do not have the infrastructure for multiple history file naming schemes so this is a bunch of work just to get there). However, these name changes must be tracked through every post-processing tool and archiving tool and somehow work with different, incompatible versions and components which implement the changes at different times.

What are we not going to do for CESM3 to get this done? Or are you extending the dates for the CESM3 release to incorporate these new requirements?

Apr 05 '22 18:04 gold2718

What if we change the history files ".h" into either ".ha" for average (and min, max), and ".hi" for instantaneous?

Does i stand for 'instantaneous' or for 'interval'? We have discussed dropping the term 'instantaneous' in CAM history because that is not an accurate description of what 'I' does. Currently, that processing style outputs the last captured value; there is nothing particularly 'instantaneous' about it.

If we adopt a new file-naming scheme, let's please spend a bit of time to come up with names that have less potential for confusion.

Apr 05 '22 18:04 gold2718

I see a few priorities that have the potential to compete with one another:

Implement the changes detailed above with the possible exception of the Tier 3 change.
Minimize issues data users will have with the proposed changes. (Obviously though renaming some variables within codes will be necessary.)
Minimize work required by SEs.

Hopefully we can find a good balance. It is unlikely that we will be able to get complete consistency across component output due to MOM/CICE as @gold2718 correctly notes, and I'm sure there will be some other unforeseen issues that arise. But hopefully there will be a number of tasks that will be easier to implement, and as a result CESM data will be easier for all to use.

So it sounds like at present we have two options. Both would require separating A, M or X output from instantaneous (/last captured value).

Option 1

Keep history file names the same, all output goes in h0 (or h), h1, h2, etc.
Require instantaneous data to be in a separate stream (that would not house any A/M/X data). Checks would need to be written to enforce this behavior. time_bounds array would not be written to instantaneous streams.
Benefits: namelist changes are kept to a minimum (outside of users having to put instantaneous data in its own stream), and file names remain unchanged.
Drawbacks: Users would have to inspect the cell_methods attribute to verify the type of calculation, same as it is with CESM2.

Option 2

Alter history file names, so that instantaneous streams have their own distinct file names (whether "hi" or some other choice).
Require instantaneous data to be in a separate stream (that would not house any A/M/X data). Checks would need to be written to enforce this behavior. time_bounds array would not be written to instantaneous streams. (Same as in 1.)
Benefits: Users will know immediately if the history or timeseries file is instantaneous due to the different file name.
Drawbacks: More SE time required compared to option 1 (?), and renaming output streams may affect users more than if option 1 is implemented.

I'm sure I'm missing some benefits/drawbacks. I honestly see both sides here. However, if the SE development time required is considerably more for option 2 than 1, I lean towards option 1.
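Both options need the same check that a stream does not mix instantaneous fields with A/M/X fields. A minimal sketch of that check follows. It assumes each stream is described by a dictionary of field name to averaging flag (A, M, X or I, as used above); the function names are hypothetical and this is not how any component currently implements it.

```python
INTERVAL_FLAGS = {"A", "M", "X"}   # average, minimum, maximum
INSTANTANEOUS_FLAG = "I"           # instantaneous / last captured value

def classify_stream(field_flags):
    """Return 'interval' or 'instantaneous' for one stream, or raise ValueError."""
    if not field_flags:
        raise ValueError("stream has no fields")
    flags = set(field_flags.values())
    unknown = flags - INTERVAL_FLAGS - {INSTANTANEOUS_FLAG}
    if unknown:
        raise ValueError(f"unknown averaging flags: {sorted(unknown)}")
    if INSTANTANEOUS_FLAG in flags and flags & INTERVAL_FLAGS:
        raise ValueError("stream mixes instantaneous and A/M/X fields")
    return "instantaneous" if INSTANTANEOUS_FLAG in flags else "interval"

def writes_time_bounds(field_flags):
    """time_bounds is written only for streams that hold A/M/X fields."""
    return classify_stream(field_flags) == "interval"
```

Under option 2 the returned label would also select the file name suffix; under option 1 it would only control whether time_bounds is written.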

Apr 05 '22 20:04 phillips-ad

Thanks for laying that out clearly @phillips-ad . Question for you and others: How important is it for components to agree on option 1 vs. option 2? My vague sense is that component history naming conventions already differ for some components, though CAM and CTSM are still in pretty close agreement (maybe due to shared history of those components' history infrastructure). So I think that, before we get to option 1 vs 2 from your last comment, we need to decide between:

Option I: Let each component make its own choice on this question – i.e., whether to use h0, h1 etc. or hi0, ha0, hi1, ha1, etc. (or some other letters)

Option II: Try to at least keep CAM and CTSM consistent, but let other components possibly do something different (e.g., MOM & CICE can be harder to impose rules on because they are not CGD-led projects)

Option III: Try to get as much consistency as possible between as many CESM components as possible

Apr 05 '22 21:04 billsacks

However, if the SE development time required is considerably more for option 2 than 1, I lean towards option 1.

@phillips-ad, thanks for putting this together.

Does anyone have a list of post-processing uses of CESM data? It would be really helpful to have some idea of how big a job this is. A few items that come to mind are:

The short-term archiver (which has a history of being brittle when faced with filename changes).
The history-restart mechanism (for fields / files that have some processing other than 'last sample' (aka Instantaneous)).
Various diagnostics packages (I only really know about the ones used by atmosphere folks).

What else is out there? Discovery tools? Run databases? Post-processing tools such as time-series generators? Others?

Apr 05 '22 21:04 gold2718

Good question @gold2718 . I don't have an answer to your broad question, though I will say, on the subject of the short-term archiver, that the way it works now is pretty flexible in this respect. In addition @jedwards4b put in place some nice tests of this a few years ago that can give confidence that the archiving & restore functionality is working correctly when you do these renames. I recently had to change the naming convention of CISM's history output files (to enable multiple ice sheets), and this wasn't too hard to do, and I could have confidence that it was working correctly due to the testing @jedwards4b added. But I realize that only addresses the pieces that are integrated with the Case Control System, not pieces in external tools like diagnostics packages and other post-processing tools.

Apr 05 '22 21:04 billsacks

Thanks, @phillips-ad, for succinctly describing the two choices. One issue is that output file naming is only by convention, not enforced requirements. "cam.h0" is traditionally monthly-means, "cam.h1" is daily data, and h2 to h9 are whatever CAM's been configured to output. For CTSM, h0, h1 and h2 are monthlies, h3 and h4 are annual.

I hesitate to use one of the "h[0-9]" for instantaneous/last value output. Much cleaner and more obvious to have another denotation. "ha[0-9]" and "hi[0-9]" I think are the best choices.

@billsacks, there's no current requirement that each component's output has to resemble another's; CAM and CTSM and MOSART have .h[0-9]; CICE and CISM have .h (CICE has .h1 also), and MOM is yet different.

There may be cases in which some component other than CAM writes instantaneous/last values, but I've not observed that. Only CAM has a real need for instantaneous/last output.

@gold2718, at some point some years ago, CAM output was changed from "cam2.h[0-9]" to "cam.h[0-9]", without too much fuss, so I expect something similar will be done to adapt tools outside of CESM's control. Adding regular expressions generally takes care of the problem.
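As a rough illustration of the kind of regular expression I mean, the sketch below accepts both the current names (".h", ".h0") and the proposed ones (".ha0", ".hi0"). The example file names and the `parse_stream` helper are hypothetical, and a real tool would need to handle each component's actual conventions.

```python
import re

# Matches ".<component>.h<kind><index>." where kind is "", "a" or "i"
# and index is an optional run of digits.
STREAM_RE = re.compile(
    r"\.(?P<component>[A-Za-z0-9_]+)\.h(?P<kind>[ai]?)(?P<index>\d*)\."
)

def parse_stream(filename):
    """Return (component, kind, index) for a history file name, or None."""
    match = STREAM_RE.search(filename)
    if match is None:
        return None
    return (
        match.group("component"),
        match.group("kind") or None,   # None for the current ".h" naming
        match.group("index") or None,  # None for components that write ".h."
    )
```

With this pattern, a name like "mycase.cam.h0.2000-01.nc" gives ("cam", None, "0"), "mycase.cam.hi1.2000-01.nc" gives ("cam", "i", "1"), and a "cam.i" initial file is not matched at all.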

Apr 05 '22 21:04 strandwg

@gold2718, at some point some years ago, CAM output was changed from "cam2.h[0-9]" to "cam.h[0-9]", without too much fuss, so I expect something similar will be done to adapt tools outside of CESM's control. Adding regular expressions generally takes care of the problem.

Gosh, cam2 was well before CESM1 / CCSM4, never mind CIME. The external infrastructure has all changed since those days.

But thanks for making our work sound easy. Does this mean you are volunteering to do this? I would be happy to hand over the history-restart problem as we are way overbooked.

Apr 05 '22 21:04 gold2718

There may be cases in which some component other than CAM writes instantaneous/last values, but I've not observed that. Only CAM has a real need for instantaneous/last output.

CTSM does have a few variables that are instantaneous by default, and the capability for any others to be written as instantaneous. For runs with a bunch of non-default output files (CMIP6 runs and other big production runs), instantaneous values are used for 10 or so variables. I don't think this capability is as widely used in CTSM as in CAM, but we do need to accommodate it.

at some point some years ago, CAM output was changed from "cam2.h[0-9]" to "cam.h[0-9]"

That reminds me... maybe we should take this opportunity to drop the "2" from "clm2.h*"... or even go so far as to rename from "clm2" to "ctsm".

Apr 05 '22 21:04 billsacks

@gold2718, at some point some years ago, CAM output was changed from "cam2.h[0-9]" to "cam.h[0-9]", without too much fuss, so I expect something similar will be done to adapt tools outside of CESM's control. Adding regular expressions generally takes care of the problem.

Gosh, cam2 was well before CESM1 / CCSM4, never mind CIME. The external infrastructure has all changed since those days.

But thanks for making our work sound easy. Does this mean you are volunteering to do this? I would be happy to hand over the history-restart problem as we are way overbooked.

Apologies for the poor phrasing; what I meant was that when CAM output went from "cam2" to "cam", it wasn't too difficult to adapt postprocessing and other tools to that change.

Apr 05 '22 21:04 strandwg

That reminds me... maybe we should take this opportunity to drop the "2" from "clm2.h*"... or even go so far as to rename from "clm2" to "ctsm".

Ha, that's been around for a while now. Yes please. Either way (clm or ctsm).

Thanks for laying that out clearly @phillips-ad . Question for you and others: How important is it for components to agree on option 1 vs. option 2? My vague sense is that component history naming conventions already differ for some components, though CAM and CTSM are still in pretty close agreement (maybe due to shared history of those components' history infrastructure). So I think that, before we get to option 1 vs 2 from your last comment, we need to decide between:

Option I: Let each component make its own choice on this question – i.e., whether to use h0, h1 etc. or hi0, ha0, hi1, ha1, etc. (or some other letters)

Option II: Try to at least keep CAM and CTSM consistent, but let other components possibly do something different (e.g., MOM & CICE can be harder to impose rules on because they are not CGD-led projects)

Option III: Try to get as much consistency as possible between as many CESM components as possible

I think we would cause users some (initial) confusion if CAM/CTSM has stream names ha/hi, while other components keep h. If we switch to output file names with ha/hi, I'd rather all components move to that nomenclature. Right now, whether POP says "h" or CAM says "h0", there's at least a consistent file naming system across components. I think users are OK with the (minor) differences in file names across components in CESM2. So, if I'm to answer your question about which option, I guess Option III?

@gold2718 I can somewhat speak for diagnostic packages and timeseries generating packages. The older NCL-based component diagnostic packages are for the most part run as part of the CESM postprocessing suite. Timeseries generation can also be done with this suite, and works quite well. The problem with the suite is that it is hard to port, so it is mostly run on cheyenne. Plus there is no support for the suite, and I get the impression that most folks think this package will not be used for CESM3 development/post-processing.

As you know the replacement for the AMWG suite is the ADF, and development on that is quite active. I am unsure where other components are in the development of next gen diagnostics.

@strandwg also has his own postprocessing (timeseries-creation) suite, and cylc is also used to create timeseries and CMIP timeseries.

I agree with @strandwg: I do not see any of these changes being that hard for present or future diagnostic/post-processing suites to handle, with the possible exception of the unsupported CESM_postprocessing suite.

Apr 05 '22 22:04 phillips-ad