cmor CMIP7 requirements: "branded variable" and new mip

(FYI @sashakames, @durack1,@matthew-mizielinski, @wolfiex even though this is primarily for Chris)

It looks likely that some changes to the output requirements for CMIP7 will be agreed shortly and that "branded variables" will be relied on in identifying variables in the cmor output files. It would be good to now consider how this might impact CMOR, so I'll raise this issue now:

How difficult would it be to implement the following?

The user specifies “frequency” as one of the entries in the CMIP6_input.json file rather than it being specified in a CMOR variable table. CMOR then handles “frequency” in the same way it handles, for example, “experiment_id”, and writes it as a global attribute. (We would also remove “frequency” and “approx_interval” from the CMOR variable tables.) I know that CMOR checks that users have sent a time coordinate that is approximately consistent with "approx_interval", but that check could be dropped if it impairs implementation of this new approach.
The user specifies “region” as one of the entries in the CMIP6_input.json file and then CMOR handles it in the same way as, for example, “experiment_id” and writes it as a global attribute?
CMOR writes as a global attribute the "branding suffix", which it would need to obtain by extracting the suffix in the "table_entry" (i.e., the part following the underscore). See below for an example.
CMOR writes as global attributes the values of the elements comprising the branding suffix: temporal_sampling, vertical_sampling, horizontal_sampling, and area_sampling? These would either be extracted, along with other metadata, from the CMOR variable table (as shown in the table example below), or could be obtained from a look-up table given the branding suffix.
In constructing file names and directory structure, rely on a somewhat different set of global attributes than in CMIP6. For example instead of including “table name” in the file name, include instead the “branded variable suffix”. (My guess is that this is trivially done by simply specifying a different template in the CMIP6_input.json file.)

To implement the above, new CMOR variable tables will need to be generated with the following changes (which could be implemented by someone other than Chris):

Remove "approx_interval" from the header of each table.
Remove “frequency” from each entry in the variable tables.
Replace all the variable “table_entries” with branded variable names.
Make sure out_name is set to the root name prefix of the branded variable (i.e., the part of the branded variable name preceding the underscore).
Add 5 new attributes to each variable in the tables: branding_suffix, temporal_type, vertical_type, horizontal_type, and area_type. These will be written by CMOR as global attributes in the netCDF files.
Reorganize and rename tables to group the variables more rationally and independently of frequency and region.

I should think most of the above changes to the variable tables should have little impact on the CMOR code itself.

A new CMOR7 table variable entry would include 5 new attributes (the first 5 lines below), and the "frequency" would be removed from the entry (in CMIP6 it appeared just before the "long_name" attribute), resulting in the following:

"tas_tavg-z0-hxy-x": {

      "branding_suffix":"tavg-z0-hxy-x",
      "temporal_type":"mean",
      "vertical_type":"no vertical dimension",
      "horizontal_type":"gridded",
      "area_type":"unmasked",

      "cell_measures": "area: areacella",
      "cell_methods": "area: time: mean",
      "comment": "near-surface (usually, 2 meter) air temperature",
      "dimensions": [
        "longitude",
        "latitude",
        "time",
        "height2m"
      ],

      "long_name": "Near-Surface Air Temperature",
      "modeling_realm": [
        "atmos"
      ],
      "ok_max_mean_abs": "",
      "ok_min_mean_abs": "",
      "out_name": "tas",
      "positive": "",
      "standard_name": "air_temperature",
      "type": "real",
      "units": "K",
      "valid_max": "",
      "valid_min": ""
    },

Note that the table_entry has been changed from "tas" to the branded variable name: "tas_tavg-z0-hxy-x". Also note that the "out_name" will now without exception be just the root name (in this case tas) appearing before the underscore in the branded variable name. In CMIP6, sometimes the out_name differed from the table_entry.

We could elect to have CMOR generate "temporal_type", "vertical_type", "horizontal_type", and "area_type" by parsing the elements comprising the branding_suffix and then looking up in CVs the associated short text descriptions. That would mean these 4 global attributes would not have to be added to the existing tables.
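
As a rough illustration of that look-up (a Python sketch, not CMOR code), the suffix can be split on hyphens and each label mapped to its short description. The dictionary `SUFFIX_CVS` and the function `describe_suffix` are hypothetical names, and the only CV content included is the four descriptions that appear in the example entry above.

```python
# Illustrative CV fragments. Only the four descriptions used in the example
# entry are included; a real CV would list every permitted label.
SUFFIX_CVS = {
    "temporal_type": {"tavg": "mean"},
    "vertical_type": {"z0": "no vertical dimension"},
    "horizontal_type": {"hxy": "gridded"},
    "area_type": {"x": "unmasked"},
}


def describe_suffix(branding_suffix, cvs=SUFFIX_CVS):
    """Return the four descriptive global attributes for a branding suffix."""
    labels = branding_suffix.split("-")
    if len(labels) != len(cvs):
        raise ValueError(f"expected {len(cvs)} labels in {branding_suffix!r}")
    # Dicts preserve insertion order, so the labels pair up with the CVs in
    # the order temporal, vertical, horizontal, area.
    return {name: cv[label] for (name, cv), label in zip(cvs.items(), labels)}


print(describe_suffix("tavg-z0-hxy-x"))
```

An unknown label raises a KeyError here; a real implementation would presumably report which CV rejected it.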

Oct 04 '24 15:10 taylor13

Is this what the mip-cmor-tables will look like? Would the removal of "frequency" reduce the number of tables since they are currently grouped by modeling realm and frequency?

Are users supposed to select which "branded variables" from a table they are going to use instead of "variable_id"?

I assume "region" is going to be like "realm" in global attributes where its valid entries will be found in the CV, correct?

Will the "approx_interval" come from the CV or some other table? CMOR currently uses this value for a test.

Oct 04 '24 16:10 mauzey1

The tables will be structured the same as old tables with the changes I enumerated above. But, we can group variables into tables any way we like (even placing them all into a single table, if we like), and instead of having a total of 2062 table entries (across tables), we’ll have about 1600 (because the same variable sampled at multiple frequencies will be found in only one table).

As I understand it, “variable_id” records the “out_name” found in the table, which is also the actual name of the variable array written to the netCDF file. That won’t change. As I noted, the out_name in the new tables will be the root name (i.e., prefix) of the branded variable name (e.g., “tas”, which is the prefix appearing in “tas_tavg-z0-hxy-x”).

As for realm, experiment_id, institute_id, etc., the valid regions will be found in a CV (and for CMIP7, there may only be a few options: “global”, “Antarctica”, “Greenland”, and a couple more perhaps).

We might decide to turn off the frequency check in CMIP7, which, as you say, is based on "approx_interval". Or we could provide a CV with "frequency" as the key, and the approximate interval as the value. The user would specify the "frequency" in the input table (as described above), and then CMOR would go to the frequency table and extract the approx_interval so it could perform its check. The frequency CV might look like:

"frequency": {
      "mon": {
            "label": "monthly",
            "approx_interval": "30"
      },
      "day": {
            "label": "daily",
            "approx_interval": "1"
      },
      etc.
}
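
To make the look-up concrete, here is a minimal Python sketch (not CMOR code) of how the approximate interval could be retrieved from such a CV once the user has supplied "frequency". It assumes the key is spelled "approx_interval" and that the value is a number of days stored as a string; the function name `approx_interval_for` is hypothetical.

```python
import json

# Hypothetical CV content containing just the two entries sketched above.
FREQUENCY_CV = json.loads("""
{
  "frequency": {
    "mon": {"label": "monthly", "approx_interval": "30"},
    "day": {"label": "daily", "approx_interval": "1"}
  }
}
""")


def approx_interval_for(frequency, cv=FREQUENCY_CV):
    """Return the approximate interval (assumed to be in days) for a frequency."""
    try:
        entry = cv["frequency"][frequency]
    except KeyError:
        raise ValueError(f"frequency {frequency!r} is not in the CV") from None
    return float(entry["approx_interval"])


print(approx_interval_for("mon"))
```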

Oct 04 '24 22:10 taylor13

Please provide feedback and questions on the following. I've prepared an update of my earlier enumeration of possible changes to CMOR. A nicely-formatted version can be found at https://docs.google.com/document/d/1Hyv87wh0BS9dI0hSOydYubrsdpMe23qw3kCj1kLVuSo/edit?tab=t.0 , but I'll copy and paste here:

CMOR changes that are needed to handle “branded variables”:

Changes needed in user_input file:

User must define in this file the “frequency”, and CMOR must include “frequency” as a global attribute (drawn from a “frequency” CV).
User must define in this file the “region”, and CMOR must include “region” as a global attribute (drawn from a “region” CV).
Modify templates for filename and directory structure (not sure about the underscores):

output_file_template:
<variable_id>_<branding_suffix>_<frequency>_<region>_<grid_label>_<source_id>_<experiment_id>_<member_id>_<time_range>.nc

output_path_template:
<activity_id>_<source_id>_<experiment_id>_<member_id>_<region>_<variable_id>_<branding_suffix>_<grid_label>_<version>
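
For illustration, a template of this form can be expanded by substituting each <name> placeholder with the matching attribute value. The Python sketch below is not how CMOR implements its templates; `fill_template` is a hypothetical helper, and the attribute values (including the source_id "MODEL-X") are made up for the example.

```python
import re


def fill_template(template, attrs):
    """Replace every <name> placeholder with the matching attribute value."""
    def substitute(match):
        key = match.group(1)
        if key not in attrs:
            raise KeyError(f"template needs attribute {key!r}")
        return str(attrs[key])

    return re.sub(r"<([A-Za-z0-9_]+)>", substitute, template)


FILE_TEMPLATE = (
    "<variable_id>_<branding_suffix>_<frequency>_<region>_<grid_label>_"
    "<source_id>_<experiment_id>_<member_id>_<time_range>.nc"
)

# Made-up attribute values, for illustration only.
attrs = {
    "variable_id": "tas",
    "branding_suffix": "tavg-h2m-hxy-u",
    "frequency": "mon",
    "region": "glb",
    "grid_label": "gn",
    "source_id": "MODEL-X",
    "experiment_id": "historical",
    "member_id": "r1i1p1f1",
    "time_range": "185001-201412",
}

print(fill_template(FILE_TEMPLATE, attrs))
```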

Changes needed in CMOR table:

Use the full branded variable name as “entry” for each variable listed.
Add “brand_description” to each variable’s list of attributes.
Remove “out_name” and “frequency” from each variable in table.
Remove “approx_interval” and “mip_era” from table header. (mip_era is implied by data_specs_version.)
If not essential, remove “realm” from table header (but keep modeling_realm as variable attribute).
Update in header: Conventions (= ”CF-1.11 CMIP-7alpha”), data_specs_version (=”CMIP7.0.0.0-alpha”), cmor_version, table_id, and table_date

Changes needed in the CMOR code:

Remove check on time coordinate spacing, which relied on “approx_interval”. The value of approximate interval will be unknown to CMOR, so CMOR must not require it in any part of the code.
Read from input table and write as a variable attribute the "brand_description", "frequency", and "region".
Parse the new branded variable table entries (relying on the underscore and hyphens) as follows, for the sample entry "tas_tavg-h2m-hxy-u":

      branded_variable="tas_tavg-h2m-hxy-u"
      out_name="tas"  (this gets stored as the global attribute variable_id)
      branding_suffix="tavg-h2m-hxy-u"
      temporal_label="tavg"
      vertical_label="h2m"
      horizontal_label="hxy"
      area_label="u"

Store each of the above as global attributes.
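
The parsing step is simple enough to sketch in a few lines of Python. This is only an illustration of the splitting rule (one underscore, then three hyphens), not code from CMOR; the function name `parse_branded_variable` is hypothetical.

```python
def parse_branded_variable(entry):
    """Split a branded variable name into the attributes listed above."""
    root, sep, suffix = entry.partition("_")
    if not sep or not root:
        raise ValueError(f"{entry!r} has no root name and branding suffix")
    labels = suffix.split("-")
    if len(labels) != 4 or not all(labels):
        raise ValueError(f"{suffix!r} is not four hyphen-separated labels")
    temporal, vertical, horizontal, area = labels
    return {
        "branded_variable": entry,
        "variable_id": root,  # the out_name
        "branding_suffix": suffix,
        "temporal_label": temporal,
        "vertical_label": vertical,
        "horizontal_label": horizontal,
        "area_label": area,
    }


for name, value in parse_branded_variable("tas_tavg-h2m-hxy-u").items():
    print(f"{name}={value}")
```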

Sample new CMOR (or MIP) table:

{
    "Header": {
        "data_specs_version": "CMIP_specs7.0.0.0-alpha", 
        "cmor_version": "3.11???", 
        "table_id": "APmon???", 
        **** DELETE: "realm": "atmos atmosChem", 
        "table_date": "???", 
        "missing_value": "1e20", 
        "int_missing_value": "-999", 
        "product": "model-output", 
        **** DELETE: "approx_interval": "30.00000", 
        "generic_levels": "alevel alevhalf", 
        **** DELETE: "mip_era": "CMIP6", 
        "Conventions": "CF-1.11 CMIP-7alpha???"
    }, 
    "variable_entry": {
        "hfss_tavg-u-hxy-u": {              [NOTE: OLD "ENTRY" HAS BEEN REPLACED  WITH BRANDED VARIABLE.]
            "brand_description": "surface upward sensible heat flux: time means reported on a 2-d 
                        horizontal grid"          [NOTE: THIS IS A NEW ATTRIBUTE.]
            **** DELETE: "frequency": "mon", 
            "modeling_realm": "atmos", 
            "standard_name": "surface_upward_sensible_heat_flux", 
            "units": "W m-2", 
            "cell_methods": "area: time: mean", 
            "cell_measures": "area: areacella", 
            "long_name": "Surface Upward Sensible Heat Flux", 
            "comment": "The surface sensible heat flux, also called turbulent 
                    heat flux, is the exchange of heat between the surface 
                    and the air by motion of air.", 
            "dimensions": "longitude latitude time", 
            **** DELETE: "out_name": "hfss", 
            "type": "real", 
            "positive": "up", 
            "valid_min": "", 
            "valid_max": "", 
            "ok_min_mean_abs": "", 
            "ok_max_mean_abs": ""
        },

Note: all variable entries will be similar, but there may be one or two cases where attributes “flag_values” and “flag_meanings” are defined in addition to the above.

Implications for data request:

If the branded variable names and the new MIP table names are not provided by the data request, then whatever variable labels are provided (e.g., root name and CMIP6 table name) will need to be translated into branded variable names and new MIP table names. This presumably could be done by relying on a look-up table.

Feb 13 '25 01:02 taylor13

@taylor13 @mauzey1 we'll need to think about how best to enable (if possible) backward compatibility. The comments in #771 are relevant here, particularly the use of the _cmip6_option optional argument to CMOR.

Feb 13 '25 03:02 durack1

I just noticed that the table entries that had been shown to be deleted in the original google doc lost the "strike through" marks when I copied into this issue. I've now edited the sample CMOR table segment above indicating which entries in the current CMIP table should be deleted.

Feb 13 '25 19:02 taylor13

I've reviewed https://github.com/PCMDI/cmor/issues/762#issuecomment-2655247611 and found it needs to be tweaked. Again, a nicely-formatted version can be found at https://docs.google.com/document/d/1Hyv87wh0BS9dI0hSOydYubrsdpMe23qw3kCj1kLVuSo/edit?tab=t.0 .

For CMIP7 we expect to define 8 MIP tables, one for each realm. Here is a sample header and single entry from the "atmos" table.

{
    "Header": {
        **** MOVE TO input.json FILE: "data_specs_version": "CMIP_specs7.0.0.0-alpha", 
        "checksum":"",   **** This is a new entry to the header and will normally contain a checksum value
        "cmor_version": "3.10???", 
        "table_id": "atmos", 
        "realm": "atmos",   **** This sets a realm default value that can get overridden for individual variables.
        "table_date":"2025-02-14", 
        "missing_value": "1e20", 
        "int_missing_value": "-999", 
        "product": "model-output", 
        **** DELETE: "approx_interval": "30.00000", 
        "generic_levels": "alevel alevhalf", 
        **** MOVE TO input.json FILE: "mip_era": "CMIP6", 
        "Conventions": "CF-1.11 CMIP-7alpha???",
        "type":"real",     **** This and the following 5 attributes are default values that can be overridden for individual variables.
        "positive":"",
        "valid_min":"",
        "valid_max":"",
        "ok_min_mean_abs":"",
        "ok_max_mean_abs":"",
    }, 
    "variable_entry": {
        "hfss_tavg-u-hxy-u": {              [NOTE: OLD "ENTRY" HAS BEEN REPLACED  WITH BRANDED VARIABLE.]
            "long_name": "surface upward sensible heat flux: time means reported on a 2-d 
                        horizontal grid"         
            **** DELETE: "frequency": "mon", 
            **** DELETE: "modeling_realm": "atmos", 
            "standard_name": "surface_upward_sensible_heat_flux", 
            "units": "W m-2", 
            "cell_methods": "area: time: mean", 
            "cell_measures": "area: areacella", 
            "variable_title": "Surface Upward Sensible Heat Flux",   ****THIS IS A NEW ATTRIBUTE, but I'm not sure it will actually get written to the file;  can it be ignored?
            "comment": "The surface sensible heat flux, also called turbulent 
                    heat flux, is the exchange of heat between the surface 
                    and the air by motion of air.", 
            "dimensions": "longitude latitude time", 
            "out_name": "hfss", 
            "positive": "up", 
        },

QUESTIONS ABOUT CMOR (I've asked "yes" or "no" questions, but the real question is "how difficult would it be to make the suggested changes?"):

1. Can we move "data_specs_version" and "mip_era" global attributes from the table header to the CMOR "CMIP7_input.json" file? When this is done, we need to check that the dataset "
2. Currently the "realm" is given in the header and "modeling_realm" is given for each variable. How do these differ and how are they treated by CMOR? This is a global attribute that usually will have a single value for all variables in the table, but there might be some exceptions. Can we specify in the header a default value and possibly override it (or not) under some individual variables?
3. We specify an "approx_interval" (for time-step) in the header so that CMOR can check whether the time-coordinate values are approximately correct. Can we remove this and eliminate this capability from CMOR?
4. There are currently 6 variable attributes that for most variables are set to a single value ("real" for "type" and "" for "positive", "valid_max", "valid_min", "ok_max_mean_abs", and "ok_min_mean_abs"). Can we specify these default values in the header and allow them to be overridden for an individual variable?
5. There are at least two options for handling the new table entry (e.g., tas_tavg-h2m-hxy-u):

Preferred option: Parse the elements separated by "_" or "-" and store as global attributes:

          branding_suffix="tavg-h2m-hxy-u"
          temporal_label="tavg"
          vertical_label="h2m"
          horizontal_label="hxy"
          area_label="u"
          variable_id="tas"  (Alternatively, this might be named "out_name" and handled as before, I think.)

Other option: Put the parsed elements defined above directly into the cmor tables, but that increases file size by about 50% and makes it harder for humans to browse it quickly.

6. We need to handle "frequency" differently in CMIP7. We need to eliminate it from the CMOR table. We need to enable the user to specify "frequency" and another attribute, "region". There are two options:

Preferred option: When calling "cmor_variable", the user passes the "frequency" and "region" along with the required variable name (key to the cmor table definition of a variable). These attributes would be stored as global attributes and also be used in constructing filenames and directory structures.
Other option: Add "frequency" and "region" to the input.json file and handle like other global attributes. Data providers would not like this though because in processing a single simulation, they would have to alter the input.json file several times; in previous phases, the same input.json file would serve all variables from a single simulation.

7. Can we add "variable_title" as new attribute and have CMOR write it as a variable attribute? Can we add it and not have CMOR write it?
8. Certain global and variable attributes should be stored as floats or integers, not "strings". Is that currently possible with CMOR? (I think at least for variables CMOR stores missing_value as a non-text-string.)
9. Are flag_values and flag_meanings needed by any variables? Are they needed by coordinate variables?
10. Can we modify the templates for filename and directory structure in the input.json file and then populate them from user input and cmor table information?

output_file_template:

      <variable_id>_<branding_suffix>_<grid_label>_<source_id>_<experiment_id>_<member_id>_<time_range>.nc

output_path_template:

      <activity_id>_<source_id>_<experiment_id>_<member_id>_<region>_<variable_id>_<branding_suffix>_<grid_label>_<version>

11. Include in the header a checksum value.  In a future version of CMOR, we might record the checksum in the files written by CMOR, and perhaps also ask CMOR to check whether the value in the header is consistent with a value CMOR obtains by performing checksum on the cmor table.  For now, CMOR can completely ignore "checksum", but it should not mind that it appears in the header.

Feb 21 '25 00:02 taylor13

As far as priority for the above, the following are essential for CMIP7: 3, 5, 6, and 10.

Feb 21 '25 16:02 taylor13

I thought of another approach for addressing 5 and 6 that would not involve modifying existing cmor functions.

For item 5, we could require the data provider (user) to call a new cmor function, which we could name "cmor_treat_brand". We would call it right after function "cmor_variable". The only argument of the function would be:

var_id = integer returned by cmor_variable identifying the variable of interest

The function would

use the var_id to extract the brand name for the variable (e.g., "tas_tavg-h2m-hxy-u")
parse the brand to obtain:

          branding_suffix="tavg-h2m-hxy-u"
          temporal_label = "tavg"
          vertical_label="h2m"
          horizontal_label="hxy"
          area_label="u"
          variable_id="tas"

For item 6, after a call to "cmor_variable", we would require the user to call cmor function "cmor_set_variable_attribute" twice:

cmor_set_variable_attribute(var_id, "frequency", "c", value), where value is taken from the frequency CV (e.g., "mon", "day", "6hr", ...)
cmor_set_variable_attribute(var_id, "region", "c", value), where value is taken from the region CV (e.g., "glb", "ant", "grn")

This is really no different than doing these things inside "cmor_variable", as I suggested in the earlier comment, but this would not modify any of the existing cmor functions.

Feb 22 '25 16:02 taylor13

@taylor13 Answering your questions from https://github.com/PCMDI/cmor/issues/762#issuecomment-2673038397

Can we move "data_specs_version" and "mip_era" global attributes from the table header to the CMOR "CMIP7_input.json" file? When this is done, we need to check that the dataset "

data_specs_version is meant to be the version of the CMOR MIP tables being used so it should be part of the table header rather than the user input. I can see mip_era becoming a user input parameter that is checked by the CV.

Currently the "realm" is given in the header and "modeling_realm" is given for each variable. How do these differ and how are they treated by CMOR. This is a global attribute that usually will have a single value for all variables in the table, but there might be some exceptions. Can we specify in the header a default value and possibly override it (or not) under some individual variables.

When defining the realm attribute, CMOR will first check if the variable entry provides a realm value from the modeling_realm attribute. If CMOR doesn't find one, then it will get the value from the realm attribute in the current table's header.

We specify an "approx_interval" (for time-step) in the header so that CMOR can check whether the time-coordinate values are approximately correct. Can we remove this and eliminate this capability from CMOR?

Yes.

There are currently 6 variable attributes that for most variables are set to a single value ("real" for "type" and "" for "positive", "valid_max", "valid_min", "ok_max_mean_abs", and "ok_min_mean_abs"). Can we specify these default values in the header and allow them to be overridden for an individual variable?

Yes. We can follow a similar approach that CMOR takes with realm.
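
A rough Python sketch of that fallback (for discussion only, not CMOR's implementation): start from the header values and let anything defined in the variable entry override them. It assumes the header and the entry use the same attribute names (for example "realm" in both places); `resolve_attributes` and `HEADER_DEFAULT_KEYS` are hypothetical names.

```python
# Header attributes that act as defaults for every variable in the table.
HEADER_DEFAULT_KEYS = (
    "realm", "type", "positive", "valid_min", "valid_max",
    "ok_min_mean_abs", "ok_max_mean_abs",
)


def resolve_attributes(header, entry):
    """Combine header defaults with a variable entry; the entry wins."""
    resolved = {key: header[key] for key in HEADER_DEFAULT_KEYS if key in header}
    resolved.update(entry)
    return resolved


header = {
    "table_id": "atmos", "realm": "atmos", "type": "real", "positive": "",
    "valid_min": "", "valid_max": "",
    "ok_min_mean_abs": "", "ok_max_mean_abs": "",
}
hfss_entry = {"units": "W m-2", "out_name": "hfss", "positive": "up"}

resolved = resolve_attributes(header, hfss_entry)
print(resolved["type"], resolved["positive"], resolved["realm"])
```

Header-only items such as "table_id" are deliberately not copied into the variable's attributes.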

There are at least two options for handling the new table entry (e.g., tas_tavg-h2m-hxy-u):- Preferred option: Parse the elements separated by "_" or "-" and store as global attributes:
     branding_suffix="tavg-h2m-hxy-u"
     temporal_label = "tavg"
     vertical_label="h2m"
     horizontal_label="hxy"
     area_label="u"
     variable_id="tas"  (Alternatively, this might be named "out_name" and handled as before, I think.)
Other option: Put the parsed elements defined above directly into the cmor tables, but that increases file size by about 50% and makes it harder for humans to browse it quickly.

How about we just have the attribute branding_suffix in the variable's table entry? This will allow for backwards compatibility in CMOR by checking for the attribute before proceeding with parsing the elements within the suffix. The CMIP6 tables will skip this check since they won't have the attribute.

We need to handle "frequency" differently in CMIP7. We need to eliminate it from the CMOR table. We need to enable the user to specify "frequency" and another attribute, "region". There are two options:

Preferred option: When calling "cmor_variable", user passes the "frequency" and "region" along with the required variable name (key to the cmor table definition of a variable). These attributes would be stored as global attributes and also be used in constructing filenames and directory structures.

Other option: Add "frequency" and "region" to the input.json file and handle like other global attributes. Data providers would not like this though because in processing a single simulation, they would have to alter the input.json file several times; in previous phases, the same input.json file table would serve all variables from a single simulation.

We can do what you suggested in https://github.com/PCMDI/cmor/issues/762#issuecomment-2676291146 and use cmor_set_variable_attribute to set the frequency and realm for a variable. If frequency and realm are not defined when you use cmor_write then an error message should be raised.
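
The check before writing could look something like the Python sketch below. It is a stand-alone illustration rather than CMOR code: `check_before_write` is a hypothetical helper, the required attributes shown are the "frequency" and "region" pair from the referenced comment, and the permitted values are just the examples quoted there.

```python
# Example CV values taken from the referenced comment; real CVs would be longer.
REQUIRED_ATTRIBUTES = {
    "frequency": {"mon", "day", "6hr"},
    "region": {"glb", "ant", "grn"},
}


def check_before_write(var_attrs, required=REQUIRED_ATTRIBUTES):
    """Raise ValueError if a required attribute is missing or not in its CV."""
    for name, allowed in required.items():
        if name not in var_attrs:
            raise ValueError(f"attribute {name!r} must be set before writing")
        if var_attrs[name] not in allowed:
            raise ValueError(
                f"{name}={var_attrs[name]!r} is not in the CV: {sorted(allowed)}"
            )


check_before_write({"frequency": "mon", "region": "glb"})  # passes silently
```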

Can we add "variable_title" as new attribute and have CMOR write it as a variable attribute? Can we add it and not have CMOR write it?

We can make CMOR add the attribute if we want. CMOR will ignore the attribute in the variable's table entry if it is not programmed to find it.

Certain global and variable attributes should be stored as floats or integers, not "strings". Is that currently possible with CMOR? (I think at least for variables CMOR stores missing_value as a non-text-string.)

Yes, purely numeric values (i.e. numbers without units) for attributes are stored in netCDF files as floats or integers.

Are flag_values and flag_meanings needed by any variables? Are they needed by coordinate variables?

The "basin" variable in the CMIP6_Ofx.json table is the only variable that I know that has the flag_values and flag_meanings attributes.

Can we modify the templates for filename and directory structure in the input.json file and then populate it from user input and cmor table information:

output_file_template:

<variable_id>_<branding_suffix>_<grid_label>_<source_id>_<experiment_id>_<member_id>_<time_range>.nc

output_path_template:

<activity_id>_<source_id>_<experiment_id>_<member_id>_<region>_<variable_id>_<branding_suffix>_<grid_label>_<version>

Yes, we can modify the filename and directory templates to use the branding_suffix attribute value.

Include in the header a checksum value. In a future version of CMOR, we might record the checksum in the files written by CMOR, and perhaps also ask CMOR to check whether the value in the header is consistent with a value CMOR obtains by performing checksum on the cmor table. For now, CMOR can completely ignore checksum, but it should not mind that it appears in the header.

CMOR currently creates a MD5 checksum of the variable table used to write a netCDF file. This checksum is stored in the attribute table_info along with the table file's creation date. This value is currently not checked when running a file through PrePARE. Do we want to create a new checksum attribute? Perhaps we could make a SHA256 sum attribute (just the checksum, no date) for the MIP table's checksum. We can make PrePARE only check this parameter if it is present in the netCDF file.
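
For reference, computing a SHA-256 digest of a table file takes only a few lines with Python's standard hashlib module. This is a sketch of the idea, not what CMOR or PrePARE does today; the function names are hypothetical, and whether the digest should cover the raw file bytes (as here) or some normalized form of the JSON is an open question.

```python
import hashlib


def sha256_of_bytes(data):
    """Return the hexadecimal SHA-256 digest of a bytes object."""
    return hashlib.sha256(data).hexdigest()


def table_sha256(path, chunk_size=65536):
    """Return the SHA-256 digest of a table file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


print(sha256_of_bytes(b'{"Header": {}}'))
```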

Feb 24 '25 21:02 mauzey1

Thanks for clarifying everything. A few follow-up questions/remarks:

Regarding

data_specs_version is meant to be the version of the CMOR MIP tables being used so it should be part of the table header rather than the user input. I can see mip_era becoming a user input parameter that is checked by the CV.

CMOR MIP tables are no longer relied on for identification of datasets or files, so their contents can be modified under the same "data specifications", and I don't think it is important that we define a version of the tables. What is essential is to record the entire set of data specifications that govern the metadata in the netCDF files, the templates for constructing paths and filenames, and CVs relied on by those using CMOR. I think the name "data_specs_version" is an appropriate name to describe the overarching data specifications, so I thought we could just repurpose it.

I know in the past data_specs_version would change if the contents of the tables changed, but is there any reason for that now? Maybe others have an opinion about this.

Regarding "realm" and "modeling_realm", can we make the names consistent? So if "realm" isn't found under the variable entry, then the "realm" specified in the header is used?
Regarding

How about we just have the attribute branding_suffix in the variable's table entry? This will allow for backwards compatibility in CMOR by checking for the attribute before proceeding with parsing the elements within the suffix. The CMIP6 tables will skip this check since they won't have the attribute.

In my example, I use the full branded variable name (root name + branding suffix) as the table entry: "hfss_tavg-u-hxy-u", where "hfss" is the root name (variable_id) and "tavg-u-hxy-u" is the branding suffix. As a service for down-stream users of the data, I wanted to separately store as global attributes these two elements and then parse the suffix and also separately store temporal_label, vertical_label, horizontal_label, and area_label.

As for backward compatibility, if you wanted to handle the old CMIP6 tables with this version of the code, you would need to check whether an underscore were found in the variable entry. If not, then you would skip any parsing or storing of the elements.

I've probably misunderstood something, so will be interested in whether this seems like a good course or not.

Regarding

We can make CMOR add the attribute if we want. CMOR will ignore the attribute in the variable's table entry if it is not programmed to find it.

Yes, that is clear. Does the input file determine which attributes CMOR looks for, or is that hardwired inside the code?

Regarding checksums, I guess the immediate question is "can you include a checksum (or any other attribute) in the header and have CMOR just ignore it"? Or will CMOR error exit if it finds something in the header it doesn't know about?

Feb 25 '25 00:02 taylor13

In my example, I use the full branded variable name (root name + branding suffix) as the table entry: "hfss_tavg-u-hxy-u", where "hfss" is the root name (variable_id) and "tavg-u-hxy-u" is the branding suffix. As a service for down-stream users of the data, I wanted to separately store as global attributes these two elements and then parse the suffix and also separately store temporal_label, vertical_label, horizontal_label, and area_label.

As for backward compatibility, if you wanted to handle the old CMIP6 tables with this version of the code, you would need to check whether an underscore were found in the variable entry. If not, then you would skip any parsing or storing of the elements.

I've probably misunderstood something, so will be interested in whether this seems like a good course or not.

As long as we don't need to worry about variable names containing _ or -, then it should be easy to parse the suffix and its components.

From the Naming Conventions section of CF-Conventions:

It is recommended that variable, dimension, attribute and group names begin with a letter and be composed of letters, digits, and underscores.

Are there any cases of variable names with underscores?

Feb 25 '25 20:02 mauzey1

No, for CMIP the only characters allowed in variable names are alphanumeric characters; no punctuation, underscores, or hyphens.
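
Given that rule, telling a CMIP6-style entry from a branded one reduces to a pattern match. The Python sketch below is illustrative only; it assumes root names and each of the four suffix labels are purely alphanumeric, and `classify_entry` is a hypothetical name.

```python
import re

ROOT_NAME = re.compile(r"[A-Za-z0-9]+")
# Root name, one underscore, then four hyphen-separated alphanumeric labels.
BRANDED = re.compile(r"[A-Za-z0-9]+_[A-Za-z0-9]+(?:-[A-Za-z0-9]+){3}")


def classify_entry(entry):
    """Return "branded" or "legacy" for a table entry, or raise ValueError."""
    if BRANDED.fullmatch(entry):
        return "branded"
    if ROOT_NAME.fullmatch(entry):
        return "legacy"  # CMIP6-style entry: skip parsing of the suffix
    raise ValueError(f"unrecognized table entry {entry!r}")


print(classify_entry("hfss_tavg-u-hxy-u"), classify_entry("hfss"))
```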

Feb 25 '25 21:02 taylor13

Chris, from the above, it appears that the changes we're contemplating would not render CMOR unable to process CMIP6 and CMIP6Plus data. That would be great, but is it true?

Feb 25 '25 22:02 taylor13

@mauzey1 the placeholder MIP table files, following the new format can be found in #778

Feb 26 '25 01:02 durack1

I thought of one additional change from CMIP6 that might possibly have an impact on CMOR. In the past the "initialization_index" set by the user in the "input.json" file was invariably an integer. Now it will be an integer of up to 6 digits, possibly followed by a single lowercase letter. We will not permit that letter to be r, i, p, or f, because that would make parsing the variant label difficult. Examples of legal initialization indexes are "198001" or "198001a". I think you use this index in constructing the "variant_label" following a template ("variant_label") specified in the CMIP6_CV.json file (see below). There could also be a change in specification in the input.json file of the parent_variant_label from something like "r3i1p1f1" to "r3i199508ap1f1", but I suspect that just gets read and then written as a global attribute.
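
To pin down the proposed format, here is how it might be expressed with Python regular expressions. This is a sketch of the rule as described above, not a proposed replacement for the POSIX-style patterns in the CV file; it assumes leading zeros are acceptable, and the names `INIT_INDEX` and `VARIANT_LABEL` are hypothetical.

```python
import re

# 1 to 6 digits, optionally followed by one lowercase letter other than r, i, p, f.
INIT_INDEX = re.compile(r"[0-9]{1,6}[a-eghj-oqs-z]?")
VARIANT_LABEL = re.compile(
    r"r(?P<r>[0-9]+)i(?P<i>[0-9]{1,6}[a-eghj-oqs-z]?)p(?P<p>[0-9]+)f(?P<f>[0-9]+)"
)

for candidate in ("198001", "198001a", "198001r", "1234567"):
    print(candidate, bool(INIT_INDEX.fullmatch(candidate)))

match = VARIANT_LABEL.fullmatch("r3i199508ap1f1")
print(match.groupdict())
```

Excluding r, i, p, and f from the trailing letter is what keeps the split unambiguous: the "p" that follows the initialization index can only be the start of the physics index.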

Besides this change in the index format, it is possible that there will be some implications of changing these two templates in the CMIP6_CV.json file:

  "variant_label":[
            "r[[:digit:]]\\{1,\\}i[[:digit:]]\\{1,\\}p[[:digit:]]\\{1,\\}f[[:digit:]]\\{1,\\}$"
        ],
  "initialization_index":[
            "^\\[\\{0,\\}[[:digit:]]\\{1,\\}\\]\\{0,\\}$"

Mar 04 '25 21:03 taylor13

In the above, I originally misnamed the global attributes for the 4 labels defining the branding suffix. I had named them "x_type" but they should be "x_label", to be consistent with other documents. I've gone back and edited my comments to be consistent with the correct usage, replacing, for example, "temporal_type" with "temporal_label".

Mar 05 '25 01:03 taylor13

@durack1 @taylor13

Did we want to move temporal_label, vertical_label, horizontal_label, and area_label out of the branding_labels object and just be parts of the CV object? This would simplify the validation of the labels since they would use the same validation function as the other global attributes. The only attribute that would need to be treated differently would be the branding_suffix attribute since it would be composed of the 4 branding labels in a specific pattern.

Mar 12 '25 22:03 mauzey1

If I understood the conversation earlier this afternoon, I think the answer is yes. There is no need in the CV_CMIP7.json file to group them under "branding_labels", so that can be removed, and that portion of the file will be flattened, allowing a more uniform validation. And, yes, I think branding_suffix requires special treatment. Should we specify its template in the CV_CMIP7.json file?

branding_suffix=<temporal_label>-<vertical_label>-<horizontal_label>-<area_label>
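
If the template were stored in the CV file, the special treatment could amount to filling it in from the four label attributes and comparing the result with the user-supplied branding_suffix. The following is a minimal sketch under that assumption; `build_branding_suffix` and `check_branding_suffix` are hypothetical helpers, not CMOR functions, and the label values are illustrative.

```python
import re

BRANDING_SUFFIX_TEMPLATE = (
    "<temporal_label>-<vertical_label>-<horizontal_label>-<area_label>"
)


def build_branding_suffix(attrs: dict,
                          template: str = BRANDING_SUFFIX_TEMPLATE) -> str:
    """Replace each <name> placeholder with the matching global attribute."""
    def fill(match: re.Match) -> str:
        name = match.group(1)
        value = attrs.get(name)
        if not value:
            raise ValueError(f"global attribute {name!r} is not set")
        return value

    return re.sub(r"<([a-z_]+)>", fill, template)


def check_branding_suffix(attrs: dict,
                          template: str = BRANDING_SUFFIX_TEMPLATE) -> bool:
    """True if attrs['branding_suffix'] agrees with the four labels."""
    return attrs.get("branding_suffix") == build_branding_suffix(attrs, template)


attrs = {
    "temporal_label": "tavg",
    "vertical_label": "h2m",
    "horizontal_label": "hxy",
    "area_label": "u",
    "branding_suffix": "tavg-h2m-hxy-u",
}
print(build_branding_suffix(attrs), check_branding_suffix(attrs))
```

Keeping the template in data rather than in code would let the four labels be validated like any other flat CV entry, with only this one consistency check handled separately.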

Mar 12 '25 23:03 taylor13

@taylor13 @durack1

In the subhour frequency tables, there are attributes approx_interval_error and approx_interval_warning.

        "approx_interval_error": "0.90", 
        "approx_interval_warning": "0.5",

https://github.com/PCMDI/cmip6-cmor-tables/blob/9b0b36d38c07b746159cadbc920bc7251278696e/Tables/CMIP6_CFsubhr.json#L12-L13
https://github.com/PCMDI/cmip6-cmor-tables/blob/9b0b36d38c07b746159cadbc920bc7251278696e/Tables/CMIP6_Esubhr.json#L12-L13

CMOR sets approx_interval_error to .2 and approx_interval_warning to .1 by default for all other frequencies. These are fractional tolerances (i.e., 20% and 10%) that the function cmor_check_interval uses to compare the time axis intervals with the approx_interval value.

https://github.com/PCMDI/cmor/blob/0d05a7780ea72e6e7b51a35cef40aba0003b5dde/Src/cmor_axes.c#L1508-L1553
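
For readers unfamiliar with this check, the idea can be sketched as follows. This is not a transcription of cmor_check_interval; it assumes the two thresholds are applied to the relative deviation of each time step from approx_interval, and it ignores the handling of time bounds. The function name and return format are hypothetical.

```python
def check_interval(time_values, approx_interval,
                   error_tol=0.2, warning_tol=0.1):
    """Classify each time step as 'ok', 'warning', or 'error'.

    time_values and approx_interval must be in the same units (e.g. days).
    Returns a list of (index, spacing, status) tuples.
    """
    results = []
    for i in range(len(time_values) - 1):
        spacing = time_values[i + 1] - time_values[i]
        # Relative deviation from the expected interval (assumed metric).
        rel_dev = abs(spacing - approx_interval) / approx_interval
        if rel_dev > error_tol:
            status = "error"
        elif rel_dev > warning_tol:
            status = "warning"
        else:
            status = "ok"
        results.append((i, spacing, status))
    return results


# Monthly data in days with a nominal 30-day interval: every step passes.
print(check_interval([0, 31, 59, 90], 30.0))
# Irregular spacing: 30 is ok, 34 triggers a warning, 40 an error.
print(check_interval([0, 30, 64, 104], 30.0))
```

With the looser sub-hourly values quoted above (0.90 and 0.5), the same logic would tolerate much larger relative departures from the nominal interval.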

Will there be subhour frequencies? Even if there are none, maybe we should allow for approx_interval_error and approx_interval_warning to be specified for any frequency, and use them if they are present.

Mar 14 '25 18:03 mauzey1

I was unaware that the tables could control how close the time coordinate spacing should match the expected spacing. I think we might eliminate this option for now and simply rely on the default values. Perhaps, however, you haven't the ability to set default values individually for each frequency, so that wouldn't really work.

If that's the case, then I guess one option would be:

- Include in the CMIP7_CV.json file, along with description and approx_interval, the approx_interval_error and approx_interval_warning values for each of the allowed frequency labels ("mon", "day", "6hr", etc.).
- Once the user had "set global attribute frequency" (the new thing you've recently added to CMOR for CMIP7), CMOR would look up in the CMIP7_CV.json file the corresponding approx_interval_error and approx_interval_warning.
- If this were implemented, I suppose the default values could be eliminated.

Would that be difficult?

By the way, in the code highlighted above, there seems complete duplication under the first "tmp>" if test of what to do if isbounds==1. Looks to me like one of the first two inner "ifs" could be eliminated without changing any outcomes.

Open to other suggestions on how to proceed.

Mar 14 '25 22:03 taylor13

@taylor13 approx_interval_error and approx_interval_warning can still have default values that are used if they are not provided by the CV similar to how it is done with the header values from tables.
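
A minimal sketch of that fallback, assuming a hypothetical CV layout in which each frequency label maps to an object that may or may not carry the two tolerance entries (the actual CMIP7_CV.json structure may differ, and the sample values are only illustrative):

```python
import json

DEFAULT_INTERVAL_ERROR = 0.2
DEFAULT_INTERVAL_WARNING = 0.1

# Hypothetical CV fragment; string values mirror how the table headers store them.
SAMPLE_CV = json.loads("""
{
  "frequency": {
    "mon":   {"description": "monthly mean samples",
              "approx_interval": "30.0"},
    "subhr": {"description": "sampled sub-hourly",
              "approx_interval_error": "0.90",
              "approx_interval_warning": "0.5"}
  }
}
""")


def interval_tolerances(cv: dict, frequency: str) -> tuple:
    """Return (error, warning) tolerances for a frequency, with defaults."""
    entry = cv["frequency"][frequency]  # KeyError if the label is not registered
    error = float(entry.get("approx_interval_error", DEFAULT_INTERVAL_ERROR))
    warning = float(entry.get("approx_interval_warning", DEFAULT_INTERVAL_WARNING))
    return error, warning


print(interval_tolerances(SAMPLE_CV, "mon"))
print(interval_tolerances(SAMPLE_CV, "subhr"))
```

With this arrangement only the frequencies that need non-default tolerances would have to list them in the CV.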

> By the way, in the code highlighted above, there seems complete duplication under the first "tmp>" if test of what to do if isbounds==1. Looks to me like one of the first two inner "ifs" could be eliminated without changing any outcomes.

Yes, the second if (isbounds == 1) is unreachable so I removed it.

Mar 14 '25 22:03 mauzey1

https://github.com/PCMDI/cmor/blob/0d05a7780ea72e6e7b51a35cef40aba0003b5dde/TestTables/CMIP7_CV.json#L37-L39

Can we rename this section as archive_id to match the required global attribute name? Or rename archive_id to data_archive_id in required global attributes? Either way will make it easier for CMOR to validate the attribute.

Mar 19 '25 22:03 mauzey1

yes please change data_archive_id to archive_id. thanks, Karl

Mar 20 '25 22:03 taylor13

@taylor13 Should the attribute sub_experiment_id also be included in the CV? The experiment_id entry in CMIP7_CV.json has a sub_experiment_id section. Do we plan to have subexperiment attributes in CMIP7?

Mar 20 '25 22:03 mauzey1

sub_experiments will not exist in CMIP7, and any references to them can be removed wherever they are found. thanks for checking.

Mar 20 '25 22:03 taylor13

@taylor13 For the realm attribute, do we expect there to be sometimes multiple values listed? For example, realm: "ocean seaIce". I'm trying to get CMOR to validate global attributes defined in the CV as a list within a JSON object. Currently, it just ignores values that are not listed in these objects.

I can make CMOR throw an error if a global attribute value is not found as the key within a CV object. However, it will only allow for single-value attributes. Multiple-value attributes like source_type would cause an error but this is treated as a special case.

When running the CMOR tests, I encountered a test that used two values for the realm attribute, ocean seaIce. Should we consider realm to be another special case where multiple valid values can be in one string?

Mar 28 '25 18:03 mauzey1

Yes, realm can list more than one value. So can source_type. Can't think of any others that allow more than one value. thanks, Karl
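
The resulting rule (single-value attributes must match one CV key, while realm and source_type may hold several space-separated keys) could be sketched like this. The CV fragment and the helper name `invalid_values` are hypothetical, and this is not how CMOR's C code is organized.

```python
# Attributes allowed to carry several space-separated values.
MULTI_VALUE_ATTRIBUTES = {"realm", "source_type"}

# Hypothetical CV fragment with illustrative entries.
SAMPLE_CV = {
    "realm": {"atmos": "Atmosphere", "ocean": "Ocean", "seaIce": "Sea Ice"},
    "source_type": {"AOGCM": "coupled model", "AER": "aerosol treatment"},
    "frequency": {"mon": "monthly", "day": "daily"},
}


def invalid_values(cv: dict, attribute: str, value: str) -> list:
    """Return the parts of `value` that are not registered in the CV."""
    allowed = cv[attribute]
    if attribute in MULTI_VALUE_ATTRIBUTES:
        candidates = value.split()
    else:
        candidates = [value]
    if not candidates:  # an empty string is never valid
        return [value]
    return [v for v in candidates if v not in allowed]


print(invalid_values(SAMPLE_CV, "realm", "ocean seaIce"))
print(invalid_values(SAMPLE_CV, "frequency", "mon day"))
```

An empty list means the attribute passes; anything else lists the offending values, which could be reported in the error message.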

Mar 28 '25 23:03 taylor13

@taylor13 For parent attributes, should we use the same logic for the validation of CMIP6 parent attributes for CMIP7? One issue with this is that the current implementation expects parent_mip_era to be CMIP6. Should it be changed so that the parent_mip_era comes from the CV?

Mar 31 '25 21:03 mauzey1

Depends on what it currently does for CMIP6. Does it look in the CMIP6_experiment_id.json file, and given the experiment_id look up what the parent_experiment_id should be? If so, we could transfer that to the new CMIP7-CVs_experiment.json file (currently being developed/populated here: https://github.com/WCRP-CMIP/CMIP7-CVs/blob/main/CMIP7-CVs_experiment.json).

Another option would be to simply check the user-provided parent_experiment_id has been registered in CMIP7-CVs_experiment.json (i.e., is a valid experiment) without checking whether it should in fact be the parent of a particular experiment_id. That would be less of a check on it, and so not as valuable as the first option.

Apr 01 '25 18:04 taylor13

The parent attributes are validated by two methods.

1. The experiment ID checking function. parent_activity_id and parent_experiment_id are checked against the values in experiment_id's CV entry.
2. The parent attribute checking function. If parent_experiment_id is no parent, then all other parent attributes must be no parent, and branch_time_in_parent and branch_time_in_child must be valid. Otherwise, the values of the following attributes are checked:
- parent_activity_id
- parent_source_id
- parent_variant_label
- parent_time_units
- parent_mip_era
- branch_method

I think both methods should work for CMIP7. The only problem is that parent_mip_era is hard-coded to be either no parent or CMIP6. Is parent_mip_era ever a value other than the dataset's mip_era value or no parent?

https://github.com/PCMDI/cmor/blob/ebbf0ffca574dece42c8fdc834939c1bb97ee202/Src/cmor.c#L3167-L3173
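
To make the second method concrete, here is a simplified sketch in which the allowed parent_mip_era values are passed in (for example, read from the CV) instead of being hard-coded. It is an illustration only: the function name is hypothetical, the branch time checks are omitted, and the "otherwise" branch only checks that each attribute is set, which is weaker than what CMOR does.

```python
NO_PARENT = "no parent"

PARENT_ATTRIBUTES = (
    "parent_activity_id", "parent_source_id", "parent_variant_label",
    "parent_time_units", "parent_mip_era", "branch_method",
)


def check_parent_attributes(attrs: dict, allowed_parent_mip_eras) -> list:
    """Return a list of error messages (empty if the attributes are consistent)."""
    errors = []
    if attrs.get("parent_experiment_id") == NO_PARENT:
        # Without a parent, any parent attribute that is set must say so.
        for name in PARENT_ATTRIBUTES:
            if attrs.get(name, NO_PARENT) != NO_PARENT:
                errors.append(f"{name} must be '{NO_PARENT}'")
        return errors
    for name in PARENT_ATTRIBUTES:
        if not attrs.get(name):
            errors.append(f"{name} is required when a parent exists")
    era = attrs.get("parent_mip_era")
    if era and era not in allowed_parent_mip_eras:
        errors.append(f"parent_mip_era {era!r} is not an allowed value")
    return errors


child = {
    "parent_experiment_id": "piControl",
    "parent_activity_id": "CMIP",
    "parent_source_id": "MODEL-X",  # placeholder model name
    "parent_variant_label": "r1i1p1f1",
    "parent_time_units": "days since 1850-01-01",
    "parent_mip_era": "CMIP7",
    "branch_method": "standard",
}
print(check_parent_attributes(child, {"CMIP6", "CMIP7"}))
```

Passing the allowed eras as data would let the same check serve CMIP6, CMIP6Plus, and CMIP7 without project-specific branches.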

Should the checking of _cmip6_option only be used for checking subexperiments? All of the other checks work with CMIP7 except subexperiments. This might interfere with other projects that don't use these attribute checks, which I assume was the reason for the _cmip6_option check.

    ierr += cmor_CV_checkSourceID(cmor_tables[nVarRefTblID].CV);
    ierr += cmor_CV_checkExperiment(cmor_tables[nVarRefTblID].CV);
    ierr += cmor_CV_checkGrids(cmor_tables[nVarRefTblID].CV);
    ierr += cmor_CV_checkParentExpID(cmor_tables[nVarRefTblID].CV);
    if (cmor_has_cur_dataset_attribute(GLOBAL_IS_CMIP6) == 0) {
        ierr += cmor_CV_checkSubExpID(cmor_tables[nVarRefTblID].CV);
    }

Apr 01 '25 22:04 mauzey1
