geozarr-spec
`grid_mapping` variables in GeoZarr
Caveat: I'm not a CF expert, so I would appreciate any corrections from folks who know the spec better.
CF defines semantics for a "grid mapping variable". Please let me know if there's a better reference than this. From the linked text:
When the coordinate variables for a horizontal grid are not longitude and latitude, it is required that the true latitude and longitude coordinates be supplied via the coordinates attribute. If in addition it is desired to describe the mapping between the given coordinate variables and the true latitude and longitude coordinates, the attribute grid_mapping may be used to supply this description. This attribute is attached to data variables so that variables with different mappings may be present in a single file. The attribute takes a string value which is the name of another variable in the file that provides the description of the mapping via a collection of attached attributes. This variable is called a grid mapping variable and is of arbitrary type since it contains no data. Its purpose is to act as a container for the attributes that define the mapping. The one attribute that all grid mapping variables must have is grid_mapping_name which takes a string value that contains the mapping's name. The other attributes that define a specific mapping depend on the value of grid_mapping_name. The valid values of grid_mapping_name along with the attributes that provide specific map parameter values are described in Appendix F, Grid Mappings.
My naive understanding is that the grid mapping variable is really just a collection of attributes. It's only represented as a variable because of limitations in the type system used by netcdf for attributes, namely the lack of a key: value data structure like JSON objects. For this reason CF says the data type of a grid mapping variable can be ignored, as it contains no data.
If we want to translate "grid mapping variable" semantics to GeoZarr, there are a few concerns:
- GeoZarr coordinate variables must be 1D, but in CF grid mapping variables are scalars (0D). Maybe that's not a hard CF requirement. We could totally use empty 1D arrays in GeoZarr here. Except for point number 2:
- Zarr uses JSON for attributes, which supports arbitrary key: value data structures (JSON objects). So, as far as I can tell, in Zarr there is never any need for a "no data, just attributes" array like the grid mapping variable. Instead, the grid mapping variable's attributes can be inlined exactly where that variable is declared.
So my recommendation for GeoZarr would be to describe how the CF Grid Mapping Variable semantics should be translated to Zarr, e.g.:
CF defines a special scalar variable called a "Grid Mapping Variable" which is referenced by name in the "grid_mapping" key of another variable in the same Dataset. The CF grid mapping variable contains no array data and is effectively just a collection of attributes. In GeoZarr, this relationship is simplified: instead of using a separate Zarr array to represent a grid mapping, all the attributes that would be associated with a CF grid mapping variable are defined directly in the attributes of the DataArray under the "grid_mapping" key.
This would of course have examples and probably the language could be made more clear.
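As a first sketch of such an example, here is what the attributes document for the data variable T could look like under this proposal. This is only an illustration of the idea, not settled spec language; the key layout mirrors the CF example below.

```python
import json

# Hypothetical Zarr attributes document for the data variable "T" under
# this proposal: the grid mapping attributes are inlined as a JSON object
# instead of referencing a separate "rotated_pole" variable.
t_attrs = {
    "long_name": "temperature",
    "units": "K",
    "coordinates": "lon lat",
    "grid_mapping": {
        "grid_mapping_name": "rotated_latitude_longitude",
        "grid_north_pole_latitude": 32.5,
        "grid_north_pole_longitude": 170.0,
    },
}

# The attributes are plain JSON, so they serialize and round-trip
# without any helper variable.
encoded = json.dumps(t_attrs)
assert json.loads(encoded) == t_attrs
```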
And I'm wondering how many things in CF take a different form when the attributes model has JSON's type system? That should be generally important for GeoZarr to explicitly declare. For example, in CF the "coordinates" field is a whitespace-separated list of strings. This only makes sense if the type system of the attributes does not support arrays of strings. Zarr's attributes type system does support arrays of strings, so IMO we should use this in GeoZarr.
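To make the "coordinates" point concrete, the translation in both directions is trivial. The helper names here are hypothetical, just to show that the mapping is lossless:

```python
def cf_list_to_json(value: str) -> list:
    # CF encodes lists (e.g. the "coordinates" attribute) as a single
    # whitespace-separated string; split it into a JSON-native array.
    return value.split()

def json_to_cf_list(value: list) -> str:
    # Inverse direction, for lossless translation back to netCDF.
    return " ".join(value)

assert cf_list_to_json("lon lat") == ["lon", "lat"]
assert json_to_cf_list(["lon", "lat"]) == "lon lat"
```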
To make this more clear, I will show what I mean using the example from the CF section on grid mapping variables:
This is that example verbatim:
Example 5.6. Rotated pole grid
dimensions:
rlon = 128 ;
rlat = 64 ;
lev = 18 ;
variables:
float T(lev,rlat,rlon) ;
T:long_name = "temperature" ;
T:units = "K" ;
T:coordinates = "lon lat" ;
T:grid_mapping = "rotated_pole" ;
char rotated_pole ;
rotated_pole:grid_mapping_name = "rotated_latitude_longitude" ;
rotated_pole:grid_north_pole_latitude = 32.5 ;
rotated_pole:grid_north_pole_longitude = 170. ;
float rlon(rlon) ;
rlon:long_name = "longitude in rotated pole grid" ;
rlon:units = "degrees" ;
rlon:standard_name = "grid_longitude";
float rlat(rlat) ;
rlat:long_name = "latitude in rotated pole grid" ;
rlat:units = "degrees" ;
rlat:standard_name = "grid_latitude";
float lev(lev) ;
lev:long_name = "pressure level" ;
lev:units = "hPa" ;
float lon(rlat,rlon) ;
lon:long_name = "longitude" ;
lon:units = "degrees_east" ;
float lat(rlat,rlon) ;
lat:long_name = "latitude" ;
lat:units = "degrees_north" ;
and here is the model I would recommend for GeoZarr, namely removing the need for a special "rotated_pole" variable and putting all the attributes for "rotated_pole" directly under the "grid_mapping" key in the attributes for T:
dimensions:
rlon = 128 ;
rlat = 64 ;
lev = 18 ;
variables:
float T(lev,rlat,rlon) ;
T:long_name = "temperature" ;
T:units = "K" ;
T:coordinates = "lon lat" ;
T:grid_mapping = {
grid_mapping_name: "rotated_latitude_longitude",
grid_north_pole_latitude: 32.5,
grid_north_pole_longitude: 170.0
}
float rlon(rlon) ;
rlon:long_name = "longitude in rotated pole grid" ;
rlon:units = "degrees" ;
rlon:standard_name = "grid_longitude";
float rlat(rlat) ;
rlat:long_name = "latitude in rotated pole grid" ;
rlat:units = "degrees" ;
rlat:standard_name = "grid_latitude";
float lev(lev) ;
lev:long_name = "pressure level" ;
lev:units = "hPa" ;
float lon(rlat,rlon) ;
lon:long_name = "longitude" ;
lon:units = "degrees_east" ;
float lat(rlat,rlon) ;
lat:long_name = "latitude" ;
lat:units = "degrees_north" ;
I think this would not be possible in netcdf, but IMO it's the most natural way to express this information in Zarr.
Please let me know if there's a better reference than this.
I've been basing all my assessments on CF conventions v 1.12: https://cfconventions.org/Data/cf-conventions/cf-conventions-1.12/cf-conventions.html#grid-mappings-and-projections
This is a cool idea for providing a more cloud-optimized structure for the GeoZarr metadata.
NetCDF attributes have types. How would you handle the conversion of different types (single, double, int, double array, etc.) to Zarr's JSON metadata model?
Hi @d-v-b,
That all makes sense to me. The grid_mapping = {...} encoding in your example is fully equivalent to the CF-netCDF formulation, so no problem there.
It might be worth checking out CF-1.12 Example 5.10 in the latest conventions, though, which uses the "extended" grid mapping form, which allows us to associate particular coordinates with particular grid mappings. Is that a use case of interest to GeoZarr?
(I don't know what's going on with https://cfconventions.org/Data/cf-conventions/cf-conventions-1.7/build/ch05s06.html, but it's not right! It predates CF-1.7 by a good couple of years - a legacy copied over from CF's pre-GitHub, pre-asciidoc days, perhaps. Best to get the conventions from the single-page documents under https://cfconventions.org)
NetCDF attributes have types. How would you handle the conversion of different types (single, double, int, double array, etc.) to Zarr's JSON metadata model?
I see two options:

- For maximal compatibility, we could annotate all typed quantities with their netcdf type, e.g.

  { "x_is_a_double_in_netcdf": {"value": 10.0, "type": "double"}, "y_is_a_single_in_netcdf": {"value": 10.0, "type": "single"} }

  This adds a substantial layer of friction when interfacing with any downstream tools, because every numeric value has to be unwrapped. It also makes JSON documents bigger and longer to read.

- Alternatively, we could treat the specific numeric type as an implementation detail that doesn't matter for the data model. This would make the netcdf -> zarr conversion lossy, because the numeric type information would be lost when converting to zarr.

Choosing between these two comes down to which is more valuable -- preserving the type information, or using a simpler API. That's a community decision :)
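A minimal sketch of the trade-off, with hypothetical attribute names; the unwrapping step is exactly the friction option 1 imposes on every consumer:

```python
# Option 1: annotate each value with its netCDF type (lossless, verbose).
typed = {
    "grid_north_pole_latitude": {"value": 32.5, "type": "double"},
    "grid_north_pole_longitude": {"value": 170.0, "type": "single"},
}

# Option 2: plain JSON numbers (simple, but the netCDF type is lost).
# Under option 1, every consumer must perform this unwrapping step.
plain = {k: v["value"] for k, v in typed.items()}

assert plain == {
    "grid_north_pole_latitude": 32.5,
    "grid_north_pole_longitude": 170.0,
}
```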
Thanks @maxrjones and @davidhassell for pointing me to the updated spec document! My mistake for relying on google here.
It looks like the latest version adds some expressiveness to the grid_mapping attribute.
Considering the linked example that uses the "extended form":
dimensions:
z = 100;
y = 100000 ;
x = 100000 ;
variables:
double x(x) ;
x:standard_name = "projection_x_coordinate" ;
x:long_name = "Easting" ;
x:units = "m" ;
double y(y) ;
y:standard_name = "projection_y_coordinate" ;
y:long_name = "Northing" ;
y:units = "m" ;
double z(z) ;
z:standard_name = "height_above_reference_ellipsoid" ;
z:long_name = "height_above_osgb_newlyn_datum_masl" ;
z:units = "m" ;
double lat(y, x) ;
lat:standard_name = "latitude" ;
lat:units = "degrees_north" ;
double lon(y, x) ;
lon:standard_name = "longitude" ;
lon:units = "degrees_east" ;
float temp(z, y, x) ;
temp:standard_name = "air_temperature" ;
temp:units = "K" ;
temp:coordinates = "lat lon" ;
temp:grid_mapping = "crsOSGB: x y crsWGS84: lat lon" ;
float pres(z, y, x) ;
pres:standard_name = "air_pressure" ;
pres:units = "Pa" ;
pres:coordinates = "lat lon" ;
pres:grid_mapping = "crsOSGB: x y crsWGS84: lat lon" ;
int crsOSGB ;
crsOSGB:grid_mapping_name = "transverse_mercator";
crsOSGB:semi_major_axis = 6377563.396 ;
crsOSGB:inverse_flattening = 299.3249646 ;
crsOSGB:longitude_of_prime_meridian = 0.0 ;
crsOSGB:latitude_of_projection_origin = 49.0 ;
crsOSGB:longitude_of_central_meridian = -2.0 ;
crsOSGB:scale_factor_at_central_meridian = 0.9996012717 ;
crsOSGB:false_easting = 400000.0 ;
crsOSGB:false_northing = -100000.0 ;
crsOSGB:unit = "metre" ;
int crsWGS84 ;
crsWGS84:grid_mapping_name = "latitude_longitude";
crsWGS84:longitude_of_prime_meridian = 0.0 ;
crsWGS84:semi_major_axis = 6378137.0 ;
crsWGS84:inverse_flattening = 298.257223563 ;
I think this could be transformed to:
dimensions:
z = 100;
y = 100000 ;
x = 100000 ;
variables:
double x(x) ;
x:standard_name = "projection_x_coordinate" ;
x:long_name = "Easting" ;
x:units = "m" ;
double y(y) ;
y:standard_name = "projection_y_coordinate" ;
y:long_name = "Northing" ;
y:units = "m" ;
double z(z) ;
z:standard_name = "height_above_reference_ellipsoid" ;
z:long_name = "height_above_osgb_newlyn_datum_masl" ;
z:units = "m" ;
double lat(y, x) ;
lat:standard_name = "latitude" ;
lat:units = "degrees_north" ;
double lon(y, x) ;
lon:standard_name = "longitude" ;
lon:units = "degrees_east" ;
float temp(z, y, x) ;
temp:standard_name = "air_temperature" ;
temp:units = "K" ;
temp:coordinates = "lat lon" ;
temp:grid_mapping = {
crsOSGB: {
coordinates: ["x", "y"],
grid_mapping_name: "transverse_mercator",
configuration: {...}
},
crsWGS84: {
coordinates: ["lat", "lon"],
grid_mapping_name: "latitude_longitude",
configuration: {...}
}
}
float pres(z, y, x) ;
pres:standard_name = "air_pressure" ;
pres:units = "Pa" ;
pres:coordinates = "lat lon" ;
pres:grid_mapping = {
crsOSGB: {
coordinates: ["x", "y"],
grid_mapping_name: "transverse_mercator",
configuration: {...}
},
crsWGS84: {
coordinates: ["lat", "lon"],
grid_mapping_name: "latitude_longitude",
configuration: {...}
}
}
There's a cost here associated with repeating the same information twice. But the upside is a bit more expressiveness. And if the repetition is problematic, then the grid_mapping could be a reference to a name of an attribute stored in the dataset attribute namespace.
I would need to study this more extensively to see if the "simple grid mapping" and the "extended grid mapping" can be expressed with the same simple JSON type. It seems like the simple form does not allow declaring coordinate variables, but otherwise they are similar enough.
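For reference, translating the extended string form into the nested JSON structure above is mechanical. This sketch assumes my reading of the extended syntax (a token ending in ":" names a grid mapping variable, and the following tokens are its coordinates); the function name is hypothetical:

```python
def parse_extended_grid_mapping(value: str) -> dict:
    """Parse CF's extended grid_mapping form, e.g.
    "crsOSGB: x y crsWGS84: lat lon", into a mapping from grid mapping
    variable name to its associated coordinate variable names."""
    result: dict = {}
    current = None
    for token in value.split():
        if token.endswith(":"):
            # A token ending in ":" introduces a grid mapping variable name.
            current = token[:-1]
            result[current] = []
        elif current is not None:
            result[current].append(token)
        else:
            raise ValueError(f"coordinate {token!r} appears before any grid mapping name")
    return result

assert parse_extended_grid_mapping("crsOSGB: x y crsWGS84: lat lon") == {
    "crsOSGB": ["x", "y"],
    "crsWGS84": ["lat", "lon"],
}
```

Note that the simple form ("rotated_pole") contains no ":", so the presence of a colon could be used to distinguish the two forms before parsing.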
if the repetition is problematic, then the grid_mapping could be a reference to a name of an attribute stored in the dataset attribute namespace.
Using and referencing a dataset attribute fits nicely with one of CF's Principles for design (section 1.2): "To avoid potential inconsistency within the metadata, the conventions should minimise redundancy."
@d-v-b - rather than duplicating the grid_mapping attribute on each variable, it could be a group-level attribute.
However, the broader problem with what you're proposing is that it will break tools (like CF-Xarray) which assume the grid_mapping is CF compliant and stored on a standalone variable (empty 0-D array). Now those tools will have to check for two different conventions.
The argument for leaving it as is is that NetCDF doesn't support nested dict-style attributes. So if we want to keep interoperability with NetCDF, we need to use the lowest common denominator of features.
I don't really understand the downside to leaving the status quo. We already do this today all the time with Zarr.
- GeoZarr coordinate variables must be 1D, but in CF grid mapping variables are scalars (0D).
This statement doesn't make sense to me. Why can't grid_mapping just be 0-D in GeoZarr?
I don't really understand the downside to leaving the status quo. We already do this today all the time with Zarr.
The status quo requires creating a Zarr Array for the sole purpose of accessing its attributes. Given that we can put these attributes somewhere else (e.g., the group attributes) while preserving the CF data model exactly, the status quo wastes stored objects, compute cycles, IO budgets, and adds complexity to implementations.
The argument for leaving it as is is that NetCDF doesn't support nested dict-style attributes. So if we want to keep interoperability with NetCDF, we need to use the lowest common denominator of features.
I'm not following this. Why should we avoid using JSON features, in Zarr, because of the way NetCDF's attributes work? The model I'm proposing can be transformed back into NetCDF attributes. (It would have no value otherwise). Lossless translation between Zarr and NetCDF seems like the right target, not literally copying netcdf's attributes.
This statement doesn't make sense to me. Why can't grid_mapping just be 0-D in GeoZarr?
Because the GeoZarr spec states that coordinate arrays are one-dimensional. See https://github.com/zarr-developers/geozarr-spec/issues/84 for more discussion on this point.
@d-v-b: here is the model I would recommend for GeoZarr, namely removing the need for a special "rotated_pole" variable and putting all the attributes for "rotated_pole" directly under the "grid_mapping" key in the attributes for T
I suspect this was a legacy development fix when the rotated_pole map projection was invented and needed to be added. Your proposal is good.
Creating an [empty] array is not that expensive.
Having multiple different standards for the same thing also adds complexity to implementations.
Literally copying NetCDF layout, attributes, etc. is what we have done with Zarr in Xarray for the past 9 years. It has worked pretty well.
@rabernat The argument for leaving it as is is that NetCDF doesn't support nested dict-style attributes. So if we want to keep interoperability with NetCDF, we need to use the lowest common denominator of features.
I think that a "lift and shift" of CF-NetCDF features into the new environment will be a serious long term mistake. I've done it in the past with widely deployed systems, and it really hinders the effective use of new environments and gives a legacy lock-in. @d-v-b 's approach of lossless two-way mapping of features is highly desirable.
Look, I totally understand where @d-v-b is coming from! Obviously if you are starting from scratch and only considering what is the best way to map the data model to Zarr, using nested dicts makes total sense.
I'm just coming at this from the perspective of a potential maintainer of Xarray / CF-Xarray / RioXarray / OpenDataCube, etc etc. Any tool that currently assumes NetCDF-style grid_mapping variables will not work out of the box with this new convention. Every different implementation will have to introduce some new if / else logic to figure out which convention is being used for a given dataset.
There is no central point in the ecosystem which manages these semantics; perhaps that's the core issue.
I'm just coming at this from the perspective of a potential maintainer of Xarray / CF-Xarray / RioXarray / OpenDataCube, etc etc. Any tool that currently assumes NetCDF-style grid_mapping variables will not work out of the box with this new convention. Every different implementation will have to introduce some new if / else logic to figure out which convention is being used for a given dataset.
There is no central point in the ecosystem which manages these semantics; perhaps that's the core issue.
This seems to highly constrain GeoZarr. What exactly are the degrees of freedom in this spec?
Perhaps some of the stable points in the ecosystems are the controlled, and perhaps mandated, vocabularies (e.g. meteorology, aviation, defence and even research communities)
And highly flexible systems end up needing constraining profiles to improve interoperability - hence COARDS and CF!
I find @d-v-b's proposal compelling - using Zarr's native JSON capabilities seems like the natural approach. But I understand @rabernat's compatibility concerns. I agree with @chris-little that we shouldn't lock ourselves into legacy patterns. Zarr is positioned to become the next-generation geospatial format for n-dimensional arrays, similar to COG's role for geospatial rasters. But COG had to work within TIFF's constraints, leading to workarounds and limitations that still cause issues today. Do we want GeoZarr to inherit NetCDF's architectural constraints, or should we fully leverage Zarr's modern design? This decision will impact the format for years to come.
Could we explore a transition strategy that addresses both perspectives? Some questions to consider:
On compatibility:
- What if GeoZarr tools supported both patterns during a transition period? Would checking for grid_mapping as JSON first, then falling back to variable references, be feasible?
- How many tools would actually need updates? Is it primarily xarray, cf-xarray, rioxarray, and OpenDataCube, or are there others?
- Could we provide conversion utilities to help maintainers?
- Could we provide conversion utilities to help maintainers?
On the broader principle:
- Beyond grid_mapping, are there other CF conventions that would benefit from JSON's type system (like the coordinates field that @d-v-b mentioned)?
- How do we balance innovation with stability? Is there a way to leverage Zarr's capabilities without creating too much disruption?
Practical next steps:
- Would it help to create a proof-of-concept showing how tools could handle both formats? Or directly propose PRs?
- Should we document all affected components before making a decision?
- What timeline would give maintainers enough runway to adapt?
@rabernat, what would make you more comfortable with this change? And @d-v-b, would a phased approach with backwards compatibility address the concerns raised?
Backwards compatibility would be doable.
The JSONification of these fields is generally f(stringly typed object) -> JSON object | array. I don't think f(stringly typed object) -> string will occur. So for a given field that can occur as a raw NetCDF attribute (a string) or as a JSON-native version (an object, or array, or ...), a type check on the value of the field should be sufficient to distinguish the two cases.
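The type check described above amounts to a single dispatch on the JSON type of the field's value. A sketch, with a hypothetical function name:

```python
def decode_grid_mapping(value):
    """Distinguish the legacy netCDF-style encoding (a string naming a
    grid mapping variable) from a JSON-native encoding (an object with
    the grid mapping attributes inlined)."""
    if isinstance(value, str):
        return ("netcdf-style", value)   # name of a grid mapping variable
    if isinstance(value, dict):
        return ("json-native", value)    # inlined attributes
    raise TypeError(f"unexpected grid_mapping type: {type(value).__name__}")

assert decode_grid_mapping("rotated_pole")[0] == "netcdf-style"
assert decode_grid_mapping({"grid_mapping_name": "latitude_longitude"})[0] == "json-native"
```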
What if GeoZarr tools supported both patterns during a transition period? Would checking for grid_mapping as JSON first, then falling back to variable references, be feasible?
I don't think that this is a good idea :), as software libraries will have to support both patterns forever, even after the netCDF-style pattern has been deprecated/removed, so that they can deal with legacy datasets. Better to make a decision on one of the patterns and move on from there. A software library can still choose to support non-standardised metadata if it wants to, but that should not be GeoZarr's concern, I think.
@rabernat, what would make you more comfortable with this change?
It's not a question of making me personally comfortable. The question is: who is going to be responsible for implementing such a breaking change in how CRS metadata is stored throughout the geospatial ecosystem?
The challenge we have is that there is not a single, central implementation point where the CF conventions are parsed out of the low-level data containers. Instead, there is a vast ecosystem of very loosely coordinated software.
As a first step, we can enumerate exactly which software is relying on the current convention.
Here's a very partial list based on a GitHub search
- CF Xarray - https://github.com/xarray-contrib/cf-xarray/blob/cfe87d0bcdce241c2387460fa9147a390e4b6a19/cf_xarray/accessor.py#L443
- rioXarray - https://github.com/corteva/rioxarray/blob/master/rioxarray/rioxarray.py#L448
- ODC Geo
- Stars (R package)
- pyCordex
- Xee (Google Earth Engine Xarray Adaptor)
- CFCoordinateReferenceSystems (Julia)
So if we adopt this proposal for GeoZarr, none of that software will be able to discover and use the CRS. The maintainers will end up here, asking why we knowingly broke their software.
The situation is perhaps a little easier in Python, because we have Xarray as a single entry point into multiple file formats (specifically, NetCDF, Zarr, and GeoTIFF [via rioXarray]). Xarray is confusing because it does some CF decoding (e.g. datetimes, coordinates vs. data variables) but not all. Notably, Xarray does not do anything geospatial. It does not parse or interpret grid_mapping in any way. It just passes the variables and attributes through to the next layer in the stack. And at the next layer (CRS-aware software), we have (at least) three different implementations of this logic: CF-Xarray, rioXarray, and ODC Geo. (Keep in mind that all of that software is agnostic to whether the data originate in NetCDF, Zarr, or TIFF.)
So if we go the route of changing the grid_mapping convention in GeoZarr, we could perhaps solve this problem in Python by:
- Creating a package (geoxarray) which can interpret multiple different CRS conventions in raw Xarray datasets (NetCDF-style grid_mapping attribute, plus whatever different choice GeoZarr makes) and put them behind a single consistent API (probably an xarray accessor like .crs)
- Factoring out all of the CRS interpretation logic from CF-Xarray, rioXarray, and ODC Geo to rely on this new package. (Keep in mind that these package maintainers are already spread thin and are not actively asking for this sort of contribution.)
We would then want to do the same in R, Julia, maybe Rust, wherever else the convention is being used.
That sounds like a lot of work! And for what? What's the benefit? I'm not convinced by the "reduced I/O" argument. We're just talking about a tiny bit of metadata here! The core benefits of Zarr are its ability to store extremely large datasets, its flexible chunking and compression, and its great cloud-native performance compared to legacy file formats. That's why people are migrating to Zarr--not to have nested attributes! The comments about "lift and shift" are a red herring. Zarr's complete metadata compatibility with NetCDF is a great feature! It is what has enabled so much of Zarr's adoption today!
Let me reiterate that I personally think it "makes sense" to use dictionaries for this type of information. But I'm also keenly aware of the friction that breaking changes cause for downstream packages and the consequences of these changes for developer goodwill. (We are still dealing with the fallout from the API changes in Zarr 2 vs Zarr 3.)
This seems to highly constrain GeoZarr. What exactly are the degrees of freedom in this spec?
From the beginning, the goal of GeoZarr has been interoperability. The proposal here strongly impacts interoperability.
That sounds like a lot of work! And for what? What's the benefit? I'm not convinced by the "reduced I/O" argument. We're just talking about a tiny bit of metadata here! The core benefits of Zarr are its ability to store extremely large datasets, its flexible chunking and compression, and its great cloud-native performance compared to legacy file formats. That's why people are migrating to Zarr--not to have nested attributes! The comments about "lift and shift" are a red herring. Zarr's complete metadata compatibility with NetCDF is a great feature! It is what has enabled so much of Zarr's adoption today!
I agree that it sounds like a lot of work. But I think a stringly typed API is much more expensive in the long run. As long as CF attributes are kept as strings, they will be inaccessible to JSON schema validators, and every implementation that consumes them will have to perform the exact same string parsing + normalization procedure. If these processes are implemented independently, then there's a risk of inconsistencies or bugs. Using our type system lets us make a bunch of invalid states unrepresentable, which saves effort from everyone downstream in the long run.
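To illustrate the validation point: once the grid mapping is a JSON object, its structural constraints can be written down as a schema and checked mechanically; a string-typed CF encoding is opaque to such tooling. This is a sketch with a hand-rolled stand-in for a real JSON Schema validator, and the schema fragment is hypothetical:

```python
# A JSON Schema-style fragment that a JSON-native grid_mapping object
# could be validated against.
grid_mapping_schema = {
    "type": "object",
    "required": ["grid_mapping_name"],
    "properties": {
        "grid_mapping_name": {"type": "string"},
        "grid_north_pole_latitude": {"type": "number"},
        "grid_north_pole_longitude": {"type": "number"},
    },
}

def minimally_valid(obj) -> bool:
    """Hand-rolled stand-in for a JSON Schema validator, checking just
    the required-key and type constraints declared above."""
    if not isinstance(obj, dict):
        return False
    if not isinstance(obj.get("grid_mapping_name"), str):
        return False
    return all(
        isinstance(obj[k], (int, float))
        for k in ("grid_north_pole_latitude", "grid_north_pole_longitude")
        if k in obj
    )

assert minimally_valid({"grid_mapping_name": "rotated_latitude_longitude",
                        "grid_north_pole_latitude": 32.5})
assert not minimally_valid("rotated_pole")  # a bare string cannot be validated
```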
From the beginning, the goal of GeoZarr has been interoperability. The proposal here strongly impacts interoperability.
To me it seems that the entity GeoZarr is supposed to be interoperable with is underdocumented, and that makes interpreting or extending this spec difficult. For example, had I known that JSONifying CF attributes was out of scope, I would not have opened this issue. If there's already a dominant convention for interpreting NetCDF metadata into zarr attributes, then this should be documented in this spec.
classic example
int crsOSGB ;
crsOSGB:grid_mapping_name = "transverse_mercator";
crsOSGB:semi_major_axis = 6377563.396 ;
crsOSGB:inverse_flattening = 299.3249646 ;
crsOSGB:longitude_of_prime_meridian = 0.0 ;
crsOSGB:latitude_of_projection_origin = 49.0 ;
crsOSGB:longitude_of_central_meridian = -2.0 ;
crsOSGB:scale_factor_at_central_meridian = 0.9996012717 ;
crsOSGB:false_easting = 400000.0 ;
crsOSGB:false_northing = -100000.0 ;
crsOSGB:unit = "metre" ;
int crsWGS84 ;
crsWGS84:grid_mapping_name = "latitude_longitude";
Two grid mappings 🤦
One is entirely redundant and damaging (unprojected full-expanded lonlat from compactible x,y coordinates). The other is probably degenerate rectilinear. This is a fundamental problem in the array community. How to declare explicitly which of the cases is correct? (There's utterly no point in storing longlat coords and x,y - that is obviously a problem). Is the x,y space a regular one? (Absolutely very probably). I bet we can't tell without analysis though. This stuff should be left in the dark ages and eschewed for the problems it has caused for so long. This is fundamentally a Zarr problem. Trying to hike it into being a GeoZarr responsibility is a major blocker.
This is fundamentally a Zarr problem.
This is not a Zarr problem at all. Zarr says nothing about coordinates, period. It's a low-level container for storing ND-arrays and metadata, like HDF5. I think the thing you're criticizing is the lifting-and-shifting of NetCDF-style coordinates into Zarr. That's precisely the sort of thing that GeoZarr is intended to fix.
Thanks @rabernat for this insight - you've nailed the core issue that should probably be front and center in the GeoZarr spec.
You're absolutely right that Zarr is fundamentally
"a low-level container for storing ND-arrays and metadata, like HDF5"
and doesn't provide the semantic constructs needed for geospatial workflows.
This highlights what I think should be the primary framing of GeoZarr: Zarr provides an efficient chunked storage system, but it lacks essential concepts like coordinate systems, grid mappings, and semantic metadata that geospatial applications require. GeoZarr exists specifically to add these missing geospatial semantics on top of Zarr's foundation.
The current spec abstract mentions building upon CDM and CF conventions, but doesn't clearly articulate why this is necessary. It should probably lead with something like: "Zarr lacks the semantic constructs required for geospatial data. GeoZarr adds essential concepts like coordinate systems, grid mappings, and CF-compliant metadata to enable interoperable geospatial workflows on Zarr's efficient storage foundation."
Ok, doesn't change my stance though. If Zarr is just a container and provides no guidance about smart and efficient and robust relations and positions then the world will never stop pouring unnecessary curvilinear and rectilinear coordinates into it, the same as happened with NetCDF. The problem is not Zarr, but the way it's touted as a panacea for all things going forward without those champions having faced up these basic problems.
I'm sorry, that was too strong, and out of place here. I need to describe my take properly elsewhere, and find how to contribute here 🙏
In this PR #89, alongside terminology clarifications, I also put propositions to improve the abstract and intro section to describe the purpose of the GeoZarr spec better.
There is no central point in the ecosystem which manages these semantics; perhaps that's the core issue.
xproj is trying to address this issue, at least in the Xarray (broader) ecosystem.
xproj is trying to address this issue, at least in the Xarray (broader) ecosystem.
This looks very promising. I wonder what it would take for odc-geo, rioxarray, xee, etc. to be able to adopt xproj and remove their own CRS-handling logic. @benbovy - have you had any conversations with those maintainers?
I would be more open to supporting bigger changes in CRS encoding if we could also present a plan for maintaining interoperability and compatibility across the ecosystem.
have you had any conversations with those maintainers?
Yes, at least in a few discussion threads.
I wonder what it would take for odc-geo, rioxarray, xee, etc. to be able to adopt xproj and remove their own CRS-handling logic.
I've no clear idea on that yet. We should probably figure it out in conjunction with (hopefully) the adoption of rasterix in those packages as well, since both xproj and rasterix share the same goal: provide core functionality to the Xarray geospatial ecosystem through a small set of composable building blocks (the former is focused on CRS whereas the latter is focused on raster coordinate transformations).
Some related issues / PRs:
- integration of xproj with xvec: https://github.com/xarray-contrib/xvec/pull/120
- integration of xproj with rasterix: https://github.com/xarray-contrib/rasterix/pull/31 (with some remaining quirks, though)
- integration of rasterix with rioxarray: https://github.com/corteva/rioxarray/pull/855 (I need to get back to this PR!)
- rasterix and odc-geo? https://github.com/opendatacube/odc-geo/issues/235
(Putting all this together may help towards better maintenance of interoperability and compatibility, although it is obviously limited to the Xarray / Python ecosystem)
We should probably figure it out in conjunction with (hopefully) the adoption of rasterix in those packages as well, since both xproj and rasterix share the same goal: provide core functionality to the Xarray geospatial ecosystem through a small set of composable building blocks
I love this vision. For maintainability and testing purposes, might it not be easier to combine rasterix and xproj into a single package called geoxarray?