[Parquet-raster] Support > 2D rasters like Zarr and NetCDF
This proposal is restricted to 2D rasters. Is this a conscious choice to not allow n-D rasters such as those permitted by Zarr or netCDF?
Originally posted by @rouault in https://github.com/opengeospatial/geoparquet/pull/259#discussion_r2063291940
Specifically for stuffing these things into Parquet files, we would probably want the 2D slices of the Zarr or NetCDF in separate rows (to reduce the size of each row in the event that a value needs to be materialized). I don't have a lot of experience with the uses of n-D rasters and how they're parameterized and/or used but if there's prior art to draw from it seems like it might be nice to allow it (there can be compute operators to separate slices into rows before a potential load).
I also don’t have a lot of experience with n-D rasters. I imagine that a proper handling would store sub-cubes rather than 2D slices? No strong opinion.
@paleolimbot I think it's ok to have n-D chunks encoded in a single row; anything else is extra complicated (especially now that there are such strong encoding idioms for chunks in Zarr and virtual Zarr). This will certainly be limited to "practical blob sizes", but there's nothing inherently "larger" or problematic about n-D vs 2D chunks: you can obviously have any combination of shapes here. (Within practical size limits this is really interesting and powerful.) That's why I think being able to reference external chunks (à la VirtualiZarr) would be awesome if it could be included.
I think what I'm wondering is how it might complicate specifying these things. With 2D it's fairly straightforward to consider an X and Y axis direction and avoid getting too far into the meaning of the value. I feel like with n-D there are some things that then need specifying (like which dimensions are spatial and which are non-spatial?). Do you have a link to some prior art on how these are specified elsewhere?
Well, Zarr is how to do it, and GeoZarr is the current community trying to merge the "GIS 2D raster with transform" concept with the looser "coordinate model with NetCDF". My take is that for each chunk you store a reference index. For example, a 4D array in tzyx order with shape 24x12x256x512, split into 16 chunks of 12x6x128x256 each, gives:
0.0.0.0
1.0.0.0
...
1.1.1.1
This is the array position of each chunk, counted in chunks; multiply it out by the actual chunk shape to get the cell offset. These are literally the Zarr labels used, though the materialized chunk object names vary (as labels or nested dirs) between V2, V3, and now Icechunk.
There are conventions on how the dimensions tick over, but you only need that overall metadata (global shape in n-D, local chunk shape in n-D) to know where this encoded chunk of values belongs.
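To make the indexing concrete, here is a minimal sketch in plain Python using the shapes from the example above. The `chunk_key` and `slices` column names are hypothetical, and the dot-separated label is just one of the naming styles mentioned; the point is only that the global shape and chunk shape are enough to place each chunk.

```python
from itertools import product
from math import ceil

global_shape = (24, 12, 256, 512)  # t, z, y, x
chunk_shape = (12, 6, 128, 256)

# Number of chunks along each dimension (ceil allows partial edge chunks).
chunk_grid = tuple(ceil(g / c) for g, c in zip(global_shape, chunk_shape))


def chunk_slices(index):
    """Return the region of the global array covered by one chunk index."""
    return tuple(
        slice(i * c, min((i + 1) * c, g))
        for i, c, g in zip(index, chunk_shape, global_shape)
    )


# One table row per chunk: a dot-separated label plus the cells it covers.
# The column names here are illustrative only.
rows = [
    {"chunk_key": ".".join(map(str, idx)), "slices": chunk_slices(idx)}
    for idx in product(*(range(n) for n in chunk_grid))
]
```

Each entry in `rows` could become one Parquet row, with the encoded chunk bytes (or a reference to an external chunk) stored alongside the key.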
The "georeferencing", i.e. the range of coordinates in each dimension or the coordinate for each cell - this is where the GIS folks and NetCDF folks never see the world the same way - is then completely separate from the array/chunk indexing itself, because it corresponds to each dimension independently (usually).
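As a rough illustration of that independence, the sketch below computes the coordinate extent of a chunk along a single dimension. It assumes a regularly spaced dimension described by an origin and a step, which is only one possible case; the function name and parameters are hypothetical.

```python
def dim_extent(chunk_index, chunk_size, dim_size, origin, step):
    """Coordinate range (start edge, end edge) of one chunk along one dimension.

    Assumes regular spacing: edge coordinate = origin + cell_index * step.
    """
    first = chunk_index * chunk_size
    last = min(first + chunk_size, dim_size)
    return origin + first * step, origin + last * step


# Hypothetical x dimension: 512 cells spanning -180..180, chunks of 256 cells.
x_extent = dim_extent(1, 256, 512, origin=-180.0, step=360 / 512)
```

The same function applies to t, z, or y with their own origin and step, without knowing anything about the other dimensions.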
Certainly happy to pursue this, I find it pretty interesting. I've wanted to put together an example, to literally represent a 4D dataset in a table just to make it all explicit and explore the space a bit. xarray and VirtualiZarr have very accessible tools for this but it's a bit hidden under many layers of functionality.
I took this a bit further with an actual example, not sure it's entirely helpful but I can do more going forward
https://github.com/mdsumner/zarrquet
and, obviously the chunks can be referenced virtually as here (remapped to the url as need be), or stored in the table as binary - this happens when you specify "loadable_variables" - something I should add to the example for x,y,z,t at least
I think we are fine storing the transform in a matrix instead of explicitly calling out individual values. Do you want to make a PR and try to support this? @mdsumner
Perhaps a place to start is to store a shape: [...] instead of width, height. It might be that we don't have an answer for exactly how to do this right now, but at least our spec could support it in the future without a breaking change?
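A minimal sketch of how a reader could accept both forms, assuming hypothetical metadata keys `shape`, `width`, and `height` (none of these names are settled in the spec), and assuming a 2D shape is ordered as (height, width):

```python
def raster_shape(meta):
    """Return the raster shape as a tuple from either metadata style.

    All key names are illustrative, not taken from the spec.
    """
    if "shape" in meta:
        # n-D form: a list with one entry per dimension.
        return tuple(meta["shape"])
    # Legacy 2D form; the (height, width) order is an assumption.
    return (meta["height"], meta["width"])


shape_2d = raster_shape({"width": 512, "height": 256})
shape_4d = raster_shape({"shape": [24, 12, 256, 512]})
```

With something like this, 2D-only writers keep working while n-D shapes can be added later without changing how existing files are read.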