
[Parquet-raster] Support > 2D rasters like Zarr and NetCDF

Open jiayuasu opened this issue 11 months ago • 9 comments

          This proposal is restricted to 2D rasters. Is this a conscious choice to not allow n-D rasters such as those permitted by Zarr or netCDF?

Originally posted by @rouault in https://github.com/opengeospatial/geoparquet/pull/259#discussion_r2063291940

jiayuasu avatar May 05 '25 18:05 jiayuasu

Specifically for stuffing these things into Parquet files, we would probably want the 2D slices of the Zarr or NetCDF in separate rows (to reduce the size of each row in the event that a value needs to be materialized). I don't have a lot of experience with the uses of n-D rasters and how they're parameterized and/or used but if there's prior art to draw from it seems like it might be nice to allow it (there can be compute operators to separate slices into rows before a potential load).

paleolimbot avatar May 06 '25 15:05 paleolimbot

I also don’t have a lot of experience with n-D rasters. I imagine that a proper handling would store sub-cubes rather than 2D slices? No strong opinion.

migurski avatar May 06 '25 23:05 migurski

@paleolimbot I think it's ok to have n-D chunks encoded in a single row; anything else is extra complicated (especially now that there are such strong encoding idioms for chunks in Zarr and virtual Zarr). This will certainly be limited to "practical blob sizes", but there's nothing inherently "larger" or problematic about n-D vs 2D chunks: you can obviously have any combination of shapes here. (Within practical size limits this is really interesting and powerful.) That's why I think being able to reference external chunks (à la VirtualiZarr) would be awesome, if it could be included.

mdsumner avatar May 07 '25 00:05 mdsumner

I think what I'm wondering is how it might complicate specifying these things...with 2D it's fairly straightforward to consider an X and Y axis direction and avoid getting too far into the meaning of the value. I feel like with nd there are some things that then need specifying (like which dimensions are spatial and which are non-spatial?) Do you have a link to some prior art on how these are specified elsewhere?

paleolimbot avatar May 07 '25 00:05 paleolimbot

Well, Zarr is how to do it, and GeoZarr is the community currently trying to merge the "GIS 2D raster with transform" concept with the looser "coordinate model with NetCDF". My take is that you store a reference index for each chunk. For example, a 4D array in tzyx order with shape 24x12x256x512 and chunk shape 12x6x128x256 has 16 chunks, indexed:

0.0.0.0
1.0.0.0
...
1.1.1.1

This is the array position of each chunk, to be multiplied out by the actual chunk shape. These are literally the Zarr labels used, though the materialized chunk object names vary (as labels or nested dirs) between V2, V3, and now Icechunk.

There are conventions on how the dimensions tick over, but you only need the overall metadata (global shape in n-D, local chunk shape in n-D) to know where this encoded chunk of values belongs.
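To make the indexing concrete, here is a minimal Python sketch of the arithmetic described above, using the 4D example shape and chunk shape from this comment. The helper names `chunk_grid` and `chunk_slices` are hypothetical, not part of Zarr or any library, and the enumeration order produced by `itertools.product` is just one possible way for the dimensions to tick over.

```python
from itertools import product
from math import ceil


def chunk_grid(shape, chunks):
    """Number of chunks along each dimension (hypothetical helper)."""
    return tuple(ceil(s / c) for s, c in zip(shape, chunks))


def chunk_slices(index, shape, chunks):
    """Array region covered by one chunk index; edge chunks are clipped."""
    return tuple(
        slice(i * c, min((i + 1) * c, s))
        for i, c, s in zip(index, chunks, shape)
    )


shape = (24, 12, 256, 512)   # global shape in t, z, y, x
chunks = (12, 6, 128, 256)   # local chunk shape in t, z, y, x

grid = chunk_grid(shape, chunks)

# Dot-separated chunk labels such as "0.0.0.0" and "1.1.1.1".
keys = [".".join(map(str, idx)) for idx in product(*(range(n) for n in grid))]

# Where chunk 1.0.1.1 belongs in the full array.
region = chunk_slices((1, 0, 1, 1), shape, chunks)
```

Only the global shape and the chunk shape are needed to place a chunk; no coordinate or georeferencing information is involved at this stage.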

The "georeferencing", i.e. the range of coordinates in each dimension or the coordinate for each cell (this is where the GIS folks and NetCDF folks never see the world the same way), is then completely separate from the array/chunk indexing itself, because it corresponds to each dimension independently (usually).

Certainly happy to pursue this, I find it pretty interesting. I've wanted to put together an example that literally represents a 4D dataset in a table, just to make it all explicit and explore the space a bit. xarray and VirtualiZarr have very accessible tools for this, but it's a bit hidden under many layers of functionality.

mdsumner avatar May 07 '25 00:05 mdsumner

I took this a bit further with an actual example, not sure it's entirely helpful but I can do more going forward

https://github.com/mdsumner/zarrquet

mdsumner avatar May 07 '25 01:05 mdsumner

and, obviously the chunks can be referenced virtually as here (remapped to the url as need be), or stored in the table as binary - this happens when you specify "loadable_variables" - something I should add to the example for x,y,z,t at least

mdsumner avatar May 07 '25 01:05 mdsumner

I think we are fine storing the transform in a matrix instead of explicitly calling out individual values. Do you want to make a PR and try to support this? @mdsumner

jiayuasu avatar May 12 '25 06:05 jiayuasu

Perhaps a place to start is to store a shape: [...] instead of width, height. It might be that we don't have an answer for exactly how to do this right now, but at least our spec could support it in the future without a breaking change?
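As a rough sketch of what that could look like for a reader, the snippet below accepts either form. The metadata is assumed to be a plain dict, `raster_shape` is a hypothetical helper, and the `[height, width]` ordering used for the fallback is an assumption rather than anything the spec defines.

```python
def raster_shape(meta):
    """Return the raster shape as a tuple (hypothetical helper).

    Prefers an n-D "shape" entry; falls back to 2D "height"/"width".
    The (height, width) ordering is an assumption for illustration.
    """
    if "shape" in meta:
        return tuple(meta["shape"])
    return (meta["height"], meta["width"])


# A 2D raster described the current way and an n-D raster described by shape.
legacy = raster_shape({"width": 512, "height": 256})
nd = raster_shape({"shape": [24, 12, 256, 512]})
```

A reader written this way would keep working if the spec later allowed more than two dimensions.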

paleolimbot avatar May 12 '25 15:05 paleolimbot