GeoZarr domain
I have been digging around in GeoZarr lately and I have found many things that I like. At the same time, I am also quite concerned about what I see as a major deficiency. I'll elaborate after I have explained where this is coming from.
Having just written an R package to interpret the CF Metadata Conventions applied to netCDF files, I have an understanding of CF that goes well beyond reading the conventions document. Wherever my code becomes overly complex, CF is failing. My code is very complex.
Apart from being a complex set of conventions, CF is also very imprecise. It has evolved over a period of nearly 25 years, although the core has not evolved much and is based on the 1995 COARDS conventions for netCDF files. These COARDS conventions do not exude a feeling of thorough analysis and design, but rather feel like a minimalistic set of conventions to avoid a disconnect between data producers and data users.
Trying to find a mechanism to map CF conventions to GeoZarr without a critical assessment of where CF is lacking risks the situation that its design failures and omissions are perpetuated in GeoZarr. I know that more than a few people in the CF community are looking over here precisely to overcome the problems associated with CF so that would be a sad development indeed.
Two recent discussion threads on this site drew my attention: (#84) how to deal with CF scalar coordinate variables and (#90) CRS in GeoZarr. While it may not be immediately obvious, these are, or at least should be, related. I'll start with the scalar coordinate variables as they nicely encapsulate most of what I will argue later on.
Section "5.7. Scalar Coordinate Variables" starts out like this:
When a variable has an associated coordinate which is single-valued, that coordinate may be represented as a scalar variable (i.e. a data variable which has no netCDF dimensions). Since there is no associated dimension these scalar coordinate variables should be attached to a data variable via the coordinates attribute.
This paragraph is imprecise in pretty much every clause. A selection: (1) "a variable": is this a netCDF variable (as in a zarr array) or one of the 11 CF variables that are implemented in a netCDF variable? The CF guru will immediately say "data variable, obviously!" so why not say that? (this may sound petulant but I have brought it up in CF Github discussions and this is the answer I have been getting); (2) "has an associated coordinate": what is a coordinate? and there is no association between a variable and a coordinate. Obviously, this conceptually refers to a "coordinate variable" but that term is already defined as having a dimension, so we are out of words here, obviously; (3) "scalar variable (i.e. a data variable which has no netCDF dimensions)" is both imprecise and plain wrong: a "scalar variable" does not exist (it should be a "scalar coordinate variable") and it is most definitely not a "data variable".
Petulant bickering over language aside, if I rewrite this paragraph, this is what it would be:
An axis of length 1 may be represented as a scalar netCDF variable. Since such an axis has no dimension it must be associated with a data variable via the latter's "coordinates" attribute.
The problem with the above is that CF does not define what an axis is, nor how it relates to data variables (not talking about the "axis" attribute here). This is a major design omission and a source of much complication, not least because CF uses the term "axis" throughout the document. There is even a "discrete axis", which is distinct from the other "coordinate types" because it is "discrete" (more imprecision: this is an "identity axis" which has discrete values). I'll come back to this lack of a definition of what an "axis" is.
Section 5.7 continues with this:
The use of scalar coordinate variables is a convenience feature which avoids adding size one dimensions to variables.
I would not call it a convenience feature. It is often informative to associate additional data with an array. For instance, an array with dimensions ["longitude", "latitude", "time"] of surface temperature could have a scalar "height" axis to indicate that the reference height is 2m. Same goes with "time" for satellite imagery, for instance. This could also be solved with an attribute, but using scalar axes has the advantage of being consistent in representing the domain of a data variable. Plus, adding a length 1 dimension to an array pushes semantics into the structural layer of the array, which I would consider bad practice (and it may not be fully portable either).
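To make the surface temperature example concrete, here is a minimal sketch in Python. The `variables` dictionary is a hypothetical in-memory stand-in for a netCDF file, and the names `tas`, `scalar_axes` and the `value` entry are my own illustrative assumptions, not part of CF or of any library.

```python
# Hypothetical stand-in for a netCDF file: name -> (dimensions, attributes).
# The "value" entry is an illustrative shortcut for the single stored value.
variables = {
    "longitude": (("longitude",), {"units": "degrees_east"}),
    "latitude": (("latitude",), {"units": "degrees_north"}),
    "time": (("time",), {"units": "days since 2000-01-01"}),
    # Scalar coordinate variable: no dimensions, one value (2 m above surface).
    "height": ((), {"units": "m", "positive": "up", "value": 2.0}),
    # Data variable: three dimensions, plus the scalar axis attached through
    # the "coordinates" attribute rather than through a size-one dimension.
    "tas": (("longitude", "latitude", "time"), {"coordinates": "height"}),
}


def scalar_axes(name):
    """Return the dimensionless variables listed in 'coordinates' of `name`."""
    _, attrs = variables[name]
    listed = attrs.get("coordinates", "").split()
    return [v for v in listed if variables[v][0] == ()]


print(scalar_axes("tas"))
```

The array itself stays three-dimensional; the reference height is part of the domain without touching the structural layer.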
And there I used the magic concept: domain. CF does support the domain concept, but only half-heartedly:
The purpose of a domain variable is to provide domain information to applications that have no need of data values at the domain’s locations, thus removing any ambiguity when retrieving a domain from a dataset.
Hmmmmm, really? I have not come across netCDF files with a domain variable and the purpose as presented might explain why: it is superfluous because it describes the same things as a data variable would. There is one case that you are forgiven for not knowing about: the CFA extension for aggregation of multiple netCDF files into one logical view, where the aggregation file has no data variables. This extension has been in the works for a number of years now and it may be integrated into CF at some point in the future.
So how does CF construct the domain of its data variables? The basic case is simple: the dimensions of the data variable. Each dimension must have an associated coordinate variable of the same name. The order in which the dimensions are given in the data variable is the order in which the coordinate variables apply. But now enter scalar coordinate variables. As we've seen above, these are specified in the "coordinates" attribute of the data variable. And this is where it all falls off the rails.
The "coordinates" attribute of a data variable is used to record at least six distinct elements: (1) optionally and redundantly, the names of the dimensions; (2) scalar coordinate variables; (3) two-dimensional coordinate variables (always in a pair, for some added complication); (4) labels (associated with a coordinate variable but recorded with a data variable); (5) alternative coordinates (CF running out of creative names at this point); and (6) a typevar (that's not even a name) for cell methods. The actual coordinate space must be inferred from attributes associated with the various types of variable, such as the "axis" attribute for the orientation [X, Y, Z, T, others], while the direction must be inferred from inspection of the values of the coordinate variable. There is a whole lot of indirection in there rather than explicit definition.
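The indirection can be illustrated with a deliberately simplified sketch in Python. It is not the full CF logic: the `variables` dictionary, the variable names and the `classify` function are hypothetical, and I assume that string-valued variables are stored with a single logical dimension. The point is only that the role of each entry in "coordinates" has to be inferred by inspecting the variable it names.

```python
# Hypothetical dataset: each variable records only its dimensions here.
variables = {
    "time": {"dims": ("time",)},
    "height": {"dims": ()},
    "lat": {"dims": ("y", "x")},
    "lon": {"dims": ("y", "x")},
    "month_name": {"dims": ("time",)},
    "tas": {
        "dims": ("time", "y", "x"),
        "coordinates": "time height lat lon month_name",
    },
}


def classify(entry, data_dims):
    """Guess the role of one 'coordinates' entry from its structure alone."""
    dims = variables[entry]["dims"]
    if dims == ():
        return "scalar coordinate variable"
    if dims == (entry,) and entry in data_dims:
        return "coordinate variable (redundant listing)"
    if len(dims) >= 2:
        return "multi-dimensional coordinate variable"
    # Structure alone cannot separate these; attributes and values are needed.
    return "label or alternative coordinate (needs further inspection)"


data = variables["tas"]
roles = {name: classify(name, data["dims"]) for name in data["coordinates"].split()}
for name, role in roles.items():
    print(f"{name}: {role}")
```

Even in this reduced form the last branch cannot be resolved without looking at further attributes or at the values themselves, which is the kind of inference that an explicit domain definition would make unnecessary.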
This arrangement very clearly demonstrates a lack of design rigour early on or, in more friendly terms, that the scope of CF-compliant datasets has expanded over time, requiring new descriptors that were fitted into the existing attribute set by expanding the scope of what it can represent. Either way, the current arrangement is faulty.
So where does that leave GeoZarr? Not in a good place, unfortunately. Like CF, GeoZarr does not define an axis. Like CF, GeoZarr uses coordinate variables, without the context provided by an explicitly defined domain. I see a slavish copy of the COARDS convention from 1995 and I can assure you that this is complicating matters at the level of implementers.
It doesn't have to be like that. GeoZarr can, and should, learn from the deficiencies in past standards and conventions, otherwise what is the point? And it doesn't have to be difficult or create incompatibilities.
I would strongly advocate developing "domain" and "axis" constructs for GeoZarr. Simply put, a domain is an ordered collection of axes and a CRS (to refer back to the second discussion thread). The axes reference all of the dimensions in the array but there may be additional scalar axes. This can all be implemented with a few attributes, either associated with a single array or placed in a group to be shared among all arrays in the group and nodes below that group. An axis is a 1D array with its own domain to describe its properties in addition to the values in the array (the values do not necessarily equate to the coordinates, as in the case of time coordinates). If it is regular, as is quite common, the axis may also be fully described in the metadata of the array domain (compressed to [start:step:end]). The CRS can be a WKT2 string.
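As a rough illustration of what I have in mind, here is a minimal sketch in Python. None of this is existing GeoZarr or Zarr API: the `Axis` and `Domain` classes, the `regular_axis` helper and all field names are hypothetical, and the CRS string is a truncated placeholder rather than valid WKT2.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Axis:
    """A named 1D set of values with an optional orientation (X, Y, Z, T)."""
    name: str
    values: Tuple[float, ...]
    orientation: str = ""

    def __len__(self):
        return len(self.values)


def regular_axis(name, start, step, end, orientation=""):
    """Expand a regular axis stored compactly as [start:step:end], end inclusive."""
    count = round((end - start) / step) + 1
    return Axis(name, tuple(start + i * step for i in range(count)), orientation)


@dataclass(frozen=True)
class Domain:
    """Ordered axes for the array dimensions, extra scalar axes, and a CRS."""
    axes: Tuple[Axis, ...]
    scalar_axes: Tuple[Axis, ...] = ()
    crs_wkt2: str = ""

    def shape(self):
        return tuple(len(a) for a in self.axes)

    def matches(self, array_shape):
        """True if the ordered axes account for every array dimension."""
        return self.shape() == tuple(array_shape)


domain = Domain(
    axes=(
        regular_axis("longitude", -179.5, 1.0, 179.5, "X"),
        regular_axis("latitude", -89.5, 1.0, 89.5, "Y"),
        regular_axis("time", 0, 1, 11, "T"),
    ),
    scalar_axes=(Axis("height", (2.0,), "Z"),),
    crs_wkt2='GEOGCRS["WGS 84", ...]',  # placeholder, not a complete WKT2 string
)

print(domain.shape(), domain.matches((360, 180, 12)))
```

The ordered `axes` map one-to-one onto the array dimensions, while `scalar_axes` carry additional context such as the 2 m reference height without adding a dimension to the array.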
That's clean and that's logical. Hopefully this can be considered for inclusion into GeoZarr.
Great name! I call domain "grid" in tidync. I should rename that. "Shape" is not a construct in netCDF but some docs do refer to it (a set of specific dimensions, not just their sizes). xarray presents this logic better than ncdump ever did: https://github.com/ropensci/tidync/issues/127#issuecomment-2321545887
Hi Patrick,
Petulant bickering over language aside
Sorry - can't let this go. I think that nearly every point you mention is incorrect, and is clearly addressed or refuted by a more careful reading of the text and the CF data model. Your misgivings appear to be related to the CF-netCDF encoding, which is not really the point in this repo, as I understand it.
It is unhelpful to make assertions that CF is wrong or poorly written, and then offer incorrect or speciously reasoned alternatives. CF is always open to improving the clarity of its text (as you know - some of your own improvements will be in the next CF release), but to imply that CF is "wrong" because you don't like the way certain sections are written is not a constructive way forward.
I'm not going to address every one of your points (there are too many of them, and they are better debated as CF repo discussions), but I will make the general observation that one of the reasons we have a CF data model is "should netCDF ever fail to meet the community needs, the groundwork for applying CF to other file formats will already exist." (https://cfconventions.org/Data/cf-conventions/cf-conventions-1.12/cf-conventions.html#data-model-design-criteria). GeoZarr seems to be this case in action! In particular, the CF data model clearly defines "domain" and "domain axis", and their relationships to encoded scalar coordinate variables (i.e. scalar coordinates do not exist in the data model).
Cheers, David
is not a constructive way forward.
It's hard to say that tight coupling of GeoZarr to CF is a constructive way forward. I'm still waiting to hear a decent argument for why this is a good idea!
Hi David,
I wrote you a private message right after you posted your response to my post, suggesting that you edit your response to make it less impulsive (if you'll allow me) and more substantial, and you confirmed that you would by COB today. Up to this point you have not done so, so I am going to respond to your original response here, and now:
It is impossible to respond to you on any substantial points because you tell me that what I wrote is incorrect but you do not elaborate on how any of the things that I mentioned are incorrect and what the correct way of phrasing them would be.
It is unhelpful to make assertions that CF is wrong or poorly written, and then offer incorrect or speciously reasoned alternatives
to imply that CF is "wrong" because you don't like the way certain sections are written is not a constructive way forward
These two statements are unfair and below you. Nowhere do I say that CF is wrong. I have used the word "faulty", and in a reasoned context; it is not "because [I] don't like the way certain sections are written". I again respectfully ask that you elaborate how my "incorrect or speciously reasoned alternatives" are incorrect, in a tone that is respectful and conducive to arriving at a conclusion that all can agree with.
Enough of this.
GeoZarr and CF
Your misgivings appear to be related to the CF-netCDF encoding, which is not really the point in this repo
In my opinion, there is indeed much in the CF metadata conventions that is in need of a careful and critical review before it gets incorporated into GeoZarr. That careful and critical review, if it has taken place, is not evident to me in the current draft of the GeoZarr specifications. I'd be happy to receive pointers to any documentation where this is reported, if you are aware of any. Adopting the CF coordinate variable as-is, with its problematic indirect relationship to other elements that make up the coordinate system, is a major omission from my perspective and hence why I raise it here. That is exactly the point of this repo, which is all about the GeoZarr specification.
You point to the CF data model and that is a very positive thing. The CF data model has a domain, as well as axes, and it could well serve as a model for incorporating it in GeoZarr. I would invite you, and others, to elaborate on how this concept could work for the GeoZarr specification rather than leave it to the implementers of the specification, as is currently the case with CF. You might even convince @geospatial-jeff Jeff if you present a decent argument!