geozarr-spec
How to map GeoZarr to datasets with multiple CRSs
How should a STAC Collection containing many geospatial rasters that span many CRSs map to GeoZarr?
For example, if I have Sentinel-2 or Landsat data spanning UTM zones, I don't want to have to reproject every scene to a common CRS in order to take advantage of chunking and sharding.
As I'm loading scenes from GeoZarr, making an ML prediction, and georeferencing the result, I also want to keep the original CRS so that I can keep the area-preserving projection.
It seems like this has been discussed a bit here
It may be helpful to specify in which cases a dataset has child datasets and how to index the hierarchy (combined with a STAC object, the GeoZarr dataset might be restricted to a single variable and only contain multiscales as child alternative variables)
From #34
and @benbovy had a helpful proposal of how this might be achieved in Xarray. If this is too big a change to Xarray, maybe it would be possible to achieve this index of indexes in an extension to Xarray? rioxarray?
https://github.com/pydata/xarray/pull/9543#issuecomment-2430017410
Curious if others have thoughts on this. I don't have a well-formulated design yet; I'm still in the stage of reading and digesting the discussions up to today.
My opinion is it shouldn't. If you want a standardized index, set one up with the footprint of every source reprojected to a common CRS, then use that to index and filter down to the sources that will contribute to a final "render" in that common CRS, or to any other grid spec required.
This is how standard tools already work (the GDAL API for example), and how specific implementations that load from stac work: gdalcubes, opendatacube, stackstac, the GDAL STACIT driver and (more generally) the GDAL GTI driver.
Look at how odc-geo maps pixel index coordinates from various CRSs to the tile pixel index of a target source. I suppose you could store references that do that, VirtualiZarr-like, but it's so easily calculated on the fly for an on-demand grid spec that I really don't see the value. (Ok maybe I've talked myself into illustrating something ... this was brought up by Deepak in terms of getting the index and weights from GDAL, so maybe that's what's really being asked here, to virtualize and undercut current libs). 😆
here's the mailing-list pointer to how one might get "index and weights" from the warper: https://lists.osgeo.org/pipermail/gdal-dev/2024-April/058906.html
In opendatacube I think it's a slightly different motivation, because they are trying to cull/filter the mapping between scenes and which tile/s they contribute to, so you can line up sources and target tiles (aka 'chunks' ...). So yeah, maybe I've framed this topic in a way I can negotiate now (!)
From GDAL's GTI driver docs:
The tiles are listed as features of any GDAL supported vector format.
This would translate much more naturally to a GeoPandas GeoDataFrame of Xarray DataArray objects than a plain data cube (see here for how this could work). The GeoDataFrame geometry column would store the footprint (bbox) of each scene in a common CRS so we can easily retrieve the scenes by spatial filtering. Basically the same approach as the one @mdsumner describes and that is indeed used by other tools (opendatacube, STAC tools, etc.), although those tools rely on mechanisms other than a GeoDataFrame for indexing the scenes.
This representation of individual scenes + geospatial index has some limitations, e.g., we cannot easily slice along other dimensions (time, band, etc.) as we could with a plain datacube, but I guess that in most applications it is better anyway to do some spatial filtering first. It is also easier to reason about, especially when the scenes are heterogeneous (i.e., different shapes and/or CRS), sparsely distributed and/or partially overlapping. In any case, this is much easier to implement than what I suggest in https://github.com/pydata/xarray/pull/9543#issuecomment-2430017410.
In terms of storage, with this approach we would still want to store the scenes each as a separate GeoZarr dataset. We could then store their footprints as a GeoParquet file or even in a basic relational database.
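To make that idea a bit more concrete, here is a minimal sketch (purely illustrative, not an API or spec proposal) of a GeoDataFrame holding lazily opened DataArrays in their native CRSs, with only the footprints reprojected to a common CRS for spatial filtering; the file and column names are hypothetical:

import geopandas as gpd
import rioxarray
from shapely.geometry import box

scene_paths = ["s2_T32TQM.tif", "s2_T33TTG.tif"]  # hypothetical scenes in different UTM zones

rows = []
for path in scene_paths:
    da = rioxarray.open_rasterio(path, chunks={"x": 2048, "y": 2048})  # lazy, native CRS preserved
    # reproject only the footprint (not the pixels) to a common CRS for indexing
    footprint = gpd.GeoSeries([box(*da.rio.bounds())], crs=da.rio.crs).to_crs("EPSG:4326").iloc[0]
    rows.append({"location": path, "crs": str(da.rio.crs), "data": da, "geometry": footprint})

index = gpd.GeoDataFrame(rows, crs="EPSG:4326")

# spatial filtering: which scenes contribute to an area of interest?
aoi = box(10.0, 45.0, 11.0, 46.0)  # lon/lat bounding box
hits = index[index.intersects(aoi)]

# the footprints (minus the in-memory DataArrays) can be persisted as GeoParquet
index.drop(columns=["data"]).to_parquet("scene_index.parquet")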
IIUC odc-geo's GridSpec is an alternative way of indexing the scenes in the specific (but pretty common?) case where those are arranged like a regular spatial grid. I'm wondering if in this case we couldn't have a Zarr "meta" dataset storing the information of each tile, i.e., their CRSs (e.g., using EPSG codes), their geo-transform (an N+1 dimension array where the last dimension represents the 6 affine parameters) and even their footprint (assuming that GeoZarr would support vector geometries #28). That might be a little far-fetched, though, since we don't really need the full power of Zarr (cloud-optimized / arbitrary n dimensions) to store this kind of information (pretty cheap, 2 dimensions only).
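As a rough sketch of what such a Zarr "meta" dataset could contain (again purely illustrative, with made-up values): one record per tile with its EPSG code, its 6 affine geotransform parameters, and a pointer to the tile's group:

import numpy as np
import xarray as xr

n_tiles = 3
meta = xr.Dataset(
    {
        "epsg": ("tile", np.array([32632, 32633, 32634], dtype="int32")),
        # last dimension holds the 6 affine geotransform parameters (GDAL order)
        "geotransform": (("tile", "affine"), np.zeros((n_tiles, 6))),
        "location": ("tile", np.array(["tiles/a", "tiles/b", "tiles/c"])),
    },
    coords={"tile": np.arange(n_tiles)},
)
meta.to_zarr("scene_meta.zarr", mode="w")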
Also xref https://github.com/geoarrow/geoarrow/issues/24, which is relevant for this use case I think.
What about xvec?
Xvec and vector data cubes in general work best with geometries having attributes that also depend on other, non-geospatial dimensions like time, etc. Here this is a bit different: each geometry materializes the extent of a raster.
Similarly to a GeoDataFrame of DataArrays, maybe we could imagine an xvec DataArray of DataArrays? I'm not sure at all if/how this would work, though... Would there be any advantage of using xarray DataArrays over simpler structures like a GeoDataFrame or odc-geo's GridSpec here?
Perhaps Xvec may be useful for loading and handling the Zarr "meta" dataset (cf. my suggestion above)? Off the top of my head, something similar to how we use stackstac to query and turn a STAC collection into a DataArray, but here exclusively based on (Geo)Zarr and Xarray (Xvec).
From GDAL's GTI driver docs:
The tiles are listed as features of any GDAL supported vector format.
This would translate much more naturally to a GeoPandas' GeoDataFrame of Xarray DataArray objects than a plain data cube (see here for how this could work).
It *is a* vector layer when you treat it as one (it's an abstraction); when you treat it as a "GTI:" source, it's a raster.
Thanks for the clarification @mdsumner ! So if I understand well, GDAL GTI (collection of rasters -> virtual raster) is quite similar to stackstac (STAC items -> lazy xarray DataArray), right?
To be clear, I'm not suggesting either to crush tabular data into Zarr. I guess I was trying to say that a table of rasters (data frame) vs. a mosaic of rasters (single array) both have pros and cons, but that might be slightly off-topic.
I also agree that a multi-CRS array doesn't make much sense and that something like a tile index would be useful. Two questions:
1. Is there any interest in adding the tile index itself to the GeoZarr specs? Or should we leave that to other specs (e.g., STAC + GeoZarr)?
2. If the answer to 1 is yes, what would the spec look like for the tile index? What do we need? Can / should we put both the tile index and all the individual tiles in the same Zarr dataset (leveraging Zarr's group hierarchy)?
So if I understand well, GDAL GTI (collection of rasters -> virtual raster) is quite similar to stackstac (STAC items -> lazy xarray DataArray), right?
yes - I was a bit too reactive on the "crush" thing. I went and made an example, and while it's amazingly simple to craft, it's quite abstract and needs some explaining. I think it's just not worth mapping disparate sources to an actual array; the array has a spec (rank, coordinates) and that's completely abstract, you can pour any sources into that. So I guess it comes down to how to think about the catalogue of possible sources and when the mapping starts to occur.
(and I think that mapping is interesting, like for a given target grid specification, which sources are actually relevant, but maybe filtering spatially is best done after temporal and data type filtering etc)
On your Qs: I don't think so - I think the catalogue topic is very dynamic and will change and consolidate a lot in a fairly short time. I'm concerned that STAC is treated specially, because why not have any old grid sources in a lighter collection - GTI (and the GeoParquet example, already compatible with GTI) shows you can do that, and STAC becomes a bespoke case folded in.
here's a very very simple GTI example I crafted, two disparate polar crs grids to a shared global one: https://gist.github.com/mdsumner/724f1db0bd0d211de6e3775486cd3df0
From previous conversations, I believe @ljstrnadiii may have some experience with the DataFrame-of-DataArrays approach to this problem. Len, am I remembering that right, and do you have any insights to share for this conversation?
Yeah, @maxrjones I used a similar schema as GTI for the purpose of indexing tiles for different collections (features), date times, and crs. Columns in my geopandas df are geometry (bbox in the geopandas df's crs coordinates), path to cog, collection name, datetime, crs.
My goal is to reproject and resample heterogeneous tiles (w.r.t. crs) into a single Xarray data array with the final dims: band, time, y, x. However, gdalwarp wasn't happy when I provided a list of tiles in different crs. This could be a user error thing lol. So, I first query relevant tiles from parquet/geopandas, then groupby crs, mosaic with vrt, warp to target crs, and finally mosaic warped vrts. I then do that per collection and per datetime.
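For reference, here is a sketch of that per-CRS flow using the GDAL Python bindings (paths, the target CRS and resolution are placeholders, and the "scene_index.parquet" columns are assumed to match a schema like the one described above, with a "path" column pointing at each COG):

from osgeo import gdal
import geopandas as gpd

gdal.UseExceptions()

index = gpd.read_parquet("scene_index.parquet")   # columns: geometry, path, collection, datetime, crs
target_srs, res = "EPSG:3857", 10.0

warped_vrts = []
for crs_value, group in index.groupby("crs"):
    tag = crs_value.replace(":", "_")
    # mosaic tiles that already share a CRS
    vrt = gdal.BuildVRT(f"mosaic_{tag}.vrt", list(group["path"]))
    vrt = None  # close to flush the VRT to disk
    # warp that per-CRS mosaic to the target grid, described lazily as a warped VRT
    warped = gdal.Warp(f"warped_{tag}.vrt", f"mosaic_{tag}.vrt", format="VRT",
                       dstSRS=target_srs, xRes=res, yRes=res, resampleAlg="bilinear")
    warped = None
    warped_vrts.append(f"warped_{tag}.vrt")

# final mosaic of the per-CRS warped mosaics
final = gdal.BuildVRT("mosaic_all.vrt", warped_vrts)
final = None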
I am actually really excited to try out GTI soon for larger mosaics, because the current approach seems to take time and I think requires an HTTP request per tif to discover metadata to build the VRT. I can build the GTI data while I'm ingesting the tifs, avoid the lookup, and provide it per tif as recommended in the GTI docs.
My guess is that GTI will help a ton with that, and it seems like it supports heterogeneous tile CRS and resolution!
In regards to the post, I don't see how chunking is relevant across crs areas without reprojecting. I'm failing to follow here, but I don't have a good developer-level understanding of geozarr, GDAL, or Xarray.
I suppose I'd try GTI to reproject/resample to one mosaic Xarray and then apply the model for predictions, etc., and reproject back to UTM per zone. That depends on information loss and is honestly a question I've had: does going from a particular UTM zone to 3857 (for example) and back cause data loss/error? It will certainly generate redundant data. Or maybe make one Xarray data array per UTM zone and chunk within UTM zones mosaicked by GTI?
I have tried using a dimension for tile id in the past, but that was awkward since each tile has different heights/widths (and therefore coordinates indexed by tile id), so concatenating over tile id was not ideal, not to mention edge effects, etc.
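The per-UTM-zone option mentioned above could look something like this (file names are hypothetical; each zone keeps its own CRS and chunking):

import xarray as xr

zone_indexes = {
    32610: "index_utm10n.gti.gpkg",
    32611: "index_utm11n.gti.gpkg",
}

per_zone = {
    epsg: xr.open_dataset(f"GTI:{path}", engine="rasterio", chunks="auto")
    for epsg, path in zone_indexes.items()
}
# each dataset keeps its zone's CRS, so predictions can be georeferenced back
# without a round-trip reprojection through a common mosaic CRS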
However, gdalwarp wasn't happy when I provided a list of tiles in different
That definitely works, keen to follow up on this. (You can see in my gist above that it works.) Certainly it has improved a lot in recent years; I used to encounter crashes a lot ~8 years ago but it's super robust now. I usually use it via the API directly, the high-level Warp() function in C++, or via Python or R.
Super keen to explore GTI with you, GDAL is also very responsive and developing fast, and some parts don't get tested enough by the community (or tested too late via downstream packages).
🙏
I have a problem like Len's currently - not tiles but global heterogeneous surveys at different resolutions and shapes and CRS - so 'domain' is a possible index - gravity, magnetics, etc.
Might not just be errors @ljstrnadiii but artefacts too?
Wow, thanks all for the quick and insightful comments! @mdsumner great recommendation on GTI; that short example you provided is compelling. Looking forward to trying it out.
@ljstrnadiii you laid out a lot of options I've thought of trying and more I hadn't thought of, thanks!
In regards to the post, I don't see how chunking is relevant across crs areas without reprojecting.
My line of thinking was roughly what @benbovy discusses above with more clarity, w.r.t. the Zarr "meta" dataset that tracks CRSs, geo-transform, and footprint. Each Zarr in a group would be for a single CRS. Reprojections would happen only on the fly if needed by the operation, like with GTI. Writing this out, it sounds like GTI with STAC GeoParquet referencing COGs could be a great solution.
The only thing I'm wondering, though, is that Zarr v3 seems to have better support than other array formats for sharding, easy rechunking, and compressing chunks with different methods. Or maybe (likely) this is all available in GDAL and I just need to chat with @mdsumner :)
I suppose I'd try GTI to reproject/resample to one mosaic Xarray and then apply the model for predictions, etc., and reproject back to UTM per zone. That depends on information loss and is honestly a question I've had: does going from a particular UTM zone to 3857 (for example) and back cause data loss/error?
There's definitely some data corruption at higher latitudes, but I'm very unsure of the extent of it, and I think it depends on the size of the reprojected regions and the latitude, since Mercator distorts more toward the poles. This makes me think it would be nice to tell GTI to reproject adjacent UTM scenes not to Web Mercator (or another virtual mosaic CRS) but to something more locally accurate for the two scenes involved. Like one of the UTM projections of one of the scenes, determined by which UTM zone has the majority of pixel area.
Anyway I think the two questions posed by Benoit are good to discuss more! I'm curious to hear more from @mdsumner on his take that GeoZarr should not have a tile index.
the catalogue topic is very dynamic and will change and consolidate a lot in a fairly short time.
I had the sense that GeoZarr is where this sort of OGC standardization on cataloguing should take place.
However, gdalwarp wasn't happy when I provided a list of tiles in different
That definitely works, keen to follow up on this. (You can see in my gist above that it works.) Certainly it has improved a lot in recent years; I used to encounter crashes a lot ~8 years ago but it's super robust now. I usually use it via the API directly, the high-level Warp() function in C++, or via Python or R.
Super keen to explore GTI with you, GDAL is also very responsive and developing fast, and some parts don't get tested enough by the community (or tested too late via downstream packages).
🙏
How far have you seen this taken, Michael? In terms of lots of CRSs, size, etc.?
More broadly than to any grand scale, I've been flushing through many GDAL issues that prevent it from being used with problematic sources; the vrt:// protocol in particular avoids having to instantiate an actual VRT or apply workarounds after open in R or Python (which is often too late to still exploit core GDAL laziness and other features).
There's usually a way to stream any 2D slice (or 2D multiband interpretation) of a source through the warper, even curvilinear model output, and that includes sources in netcdf/hdf/legacy files or linking multiple component files together. The downstream packages tend to have a too-high view of these things. I like to be able to open any source, perhaps with some light assigned vrt:// fixes, and then simply read any window in any CRS for a quick look. That's also scalable to massive outputs that are written target-tile at a time, and is exactly how odc-geo/opendatacube works.
I think the entire question behind this issue comes down to 1) familiarity with the GDAL warper and what it can do, and 2) wanting to have some kind of pre-mapped index of what source/s will impact what parts of a target output, i.e. essentially what odc-geo and other cubes do to farm out actual calls to the warper API between huge Sentinel collections and a target tiling.
At the high level, you can take all your input sources and ask GDAL to scan them to write "just this one tile" (imagine the COP30 VRT with 10000s of longlat geotiffs), but that's inefficient if you don't have the VRT structure - GTI allows GDAL to pre-scan more efficiently, using a spatial index to find which of even disparate sources are actually relevant, and odc-geo takes that further by mapping actual source footprint fragments to output tiles and culling any calls to the warper that won't have any impact (it also applies masking logic to source tiles upfront as well).
I suppose a Zarr target grid that stored the list of relevant sources that will impact each chunk would be useful, and, for extra points, where those footprints fall between source and target chunks. The scheduler could farm out the sources to chunks very efficiently, and save that index for new data that's coming in from the same grid/s for a given target grid.
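A rough sketch of that chunk-to-sources mapping (assuming the footprints have already been written to an index like the one above and reprojected to the target CRS, and with made-up grid parameters):

import geopandas as gpd
from shapely import STRtree
from shapely.geometry import box

index = gpd.read_parquet("scene_index.parquet").to_crs("EPSG:3857")
tree = STRtree(list(index.geometry))

# target grid: origin, resolution and chunk size (pixels) in target CRS units
x0, y0, res, chunk = -2.0e6, 2.0e6, 10.0, 2048
nx_chunks, ny_chunks = 50, 50

chunk_sources = {}
for j in range(ny_chunks):
    for i in range(nx_chunks):
        cell = box(x0 + i * chunk * res, y0 - (j + 1) * chunk * res,
                   x0 + (i + 1) * chunk * res, y0 - j * chunk * res)
        hit_rows = tree.query(cell, predicate="intersects")
        if len(hit_rows):
            chunk_sources[(j, i)] = index.iloc[hit_rows]["location"].tolist()
# chunk_sources could be persisted next to the target Zarr so a scheduler only
# farms out warper calls that actually touch data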
e.g., my first look is
gdaltindex -f GPKG grid_tile_index.gti.gpkg cog/*.tif
bash: /home/ubuntu/miniconda3/envs/grids/bin/gdaltindex: Argument list too long
so presumably you then have to build it in code
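One way around the shell's argument limit is indeed to build the index in code, e.g. with rasterio + geopandas writing a GTI-compatible GeoPackage directly ("location" is the field name the GTI driver looks for by default; the common CRS for the footprints and everything else here is illustrative):

from pathlib import Path

import geopandas as gpd
import rasterio
from shapely.geometry import box

rows = []
for path in Path("cog").glob("*.tif"):
    with rasterio.open(path) as src:
        # reproject each footprint to one common CRS for the index itself
        footprint = gpd.GeoSeries([box(*src.bounds)], crs=src.crs).to_crs("EPSG:4326").iloc[0]
        rows.append({"location": str(path), "crs": str(src.crs), "geometry": footprint})

gpd.GeoDataFrame(rows, crs="EPSG:4326").to_file("grid_tile_index.gti.gpkg", driver="GPKG")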
This is 25K 'tiles' - a LOT of different crs flavours - a bunch of different domains - so what would your approach be there, from low-level on up?
Can we get an example? I realised I have one; I'll come back once I put a couple of things together.
An example of what might be in there, small version? I could certainly find plenty of public analogues. For example, a small project raster in SA, say 200x300, 32353; a chunk of Brazil in UTM 19S or something; a map of Sweden in 3006 in the thousands; and a 20kx10k chunk of Africa ... and tons of other small things, probably with resolutions varying from 25m to a few km.
ok sure, I see, but what would you do with that? Do you want to create an image server or tile pyramid of the whole world?
(my example consists of two 2m DEMs, one Antarctic polar stereographic in a VRT of thousands of COGs with additional overviews added, a 2m DEM of Tasmania in UTM, and two background global topographies: COP30m for land-only, and GEBCO for the entire world topo and bathy) - from that I can get a map of topography anywhere at reasonable scale, and by collecting the world's local hi-res DEMs we could have one grand master topography index, but I think about it for random queries in particular places, or large areas at "reasonable scale" with no special-case problems like "no bathymetry". It covers my needs really well, topo for any creature flying or swimming, and maps of cool places I'm always making, but that's a pretty niche requirement I think.
I'm struggling to understand motivations, but excited to put mine above together to show what I'm on about a bit!
No, a dataset for analysis mostly, not for people to look at in the eyeball sense [although of course that is nice too if we can tile it for a WMS or whatever] - could slice by domain, use to infill or combine with other data - and it is also machine-learning usable.
Geophysics in this case, not 'look at a mountain location' type map.
Being able to query - give me this country/polygon/box is useful too at times.
This is just one group, there will be others from other places to add in the future, and more new data as it gets produced.
As I understand it though @mdsumner:
- Make a vector dataset with the bounds of each raster in the geometry field and the file path in the location field
- GTI creation references this
- You tell GTI a CRS and resolution [or other relevant GDAL info] that you want this to end up as.
- Now if you have a LOT - e.g. Len's above - should each raster be a VRT?
Do you want to create an image server or tile pyramid of the whole world?
Not uncommon to need to serve a large fraction of it to a machine learning model. Like all tiles intersecting most continental coastlines, all tiles intersecting major shipping traffic, all US states, etc. Imagery of interest is 10 meter or possibly higher. And for a long time range, multiple years for sensors with weekly or monthly revisits. Example: https://skytruth.org/cerulean/
- Now if you have a LOT - e.g. Len's above - should each raster be a VRT?
It doesn't matter; you can use VRT to collate things into a single source, including external overviews (which I think is amazing and I do here, soon to be updated with object storage ...)
GTI can collate them from disparate sources.
I hear your examples, but I don't get what Zarr brings to it, other than a virtual grid frame to standardize to (?). Maybe that's it: effectively what the Sentinel cubes do, but you'd use a sparse Zarr to cache materialized pixel values in an n-D frame.
Yes, caching/persisting the final target grid feels like the most common use case.
Maybe there is or could be some overlap with the purpose of VirtualiZarr here, in the sense that if a user has many collection-dates (one GTI per date and per collection, e.g. the product of 10 dates with Sentinel-2, Landsat, MODIS would be 30 GTI files/indexes), we could build a single aggregate GTI manifest of the combined data if the user doesn't want to duplicate data by writing to zarr. But I would imagine persisting the final combined collection-dates Xarray/zarr dataset would be ideal, given the resources/time used to reproject/resample. I very commonly take this approach of one VRT per year and collection in order to build one "ML ready" dataset -- excited to try GTI very soon! I'll often have on the order of 10s of dates and 10s of collections I'm hoping to put on a standardized grid.
On the other hand, we might not want to cache/persist to zarr if we need to parameterize inputs to a connection string, e.g. resolution or overview, etc. (assuming GTI somewhat supports that, which I do recall from the docs, just not sure to what extent). Then a user would have to index multiple GTI files, open the dataset with engine "rasterio", concat along band, concat along time each time we open, which is fine, but I'm sure someone will have an example that pushes the limits there.
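A sketch of that "open multiple GTIs, concat along band, concat along time" pattern (all file names and dates are hypothetical):

import pandas as pd
import xarray as xr

dates = ["2023-01-01", "2023-02-01"]
collections = ["sentinel2", "landsat"]

per_date = []
for d in dates:
    per_collection = [
        xr.open_dataset(f"GTI:{c}_{d}.gti.gpkg", engine="rasterio", chunks="auto")
        for c in collections
    ]
    per_date.append(xr.concat(per_collection, dim="band"))

cube = xr.concat(per_date, dim=pd.Index(pd.to_datetime(dates), name="time"))
# cube has dims (time, band, y, x) on the shared target grid; persisting it with
# cube.to_zarr(...) is the "cache the combined dataset" option discussed above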
Thinking through some of these examples above and writing to zarr, it seems the sparse nature of some of this high res data would motivate us to only write non-empty chunks. Given we write out from GTI to zarr, having some index on which chunks intersect the geometry induced by GTI might be helpful for future dask optimizations around resolving empty chunks as no-ops... This might not be required though for the case we do persist/cache to zarr given missing chunks (not written if providing write_empty_chunks=False) could already be a way to infer no-ops. This is more of a dask/Xarray topic only though I suppose.
It doesn't seem like geozarr/zarr would help with any of this?
imagine persisting the final combined collection-dates Xarray/zarr dataset would be ideal given resources/time used to reproject/resample
Yes - GTI = new, so no-one knows how long this is likely to take at scale? If making many models and it takes significant time each time, what is the computation time vs excess storage cost [assuming that it works]?
Correct me if I'm wrong, but I would imagine the reprojection and resampling computation (assume a GTI or VRT already exists) would be identical between xr.open_dataset with GTI or VRT? And similarly if chunks is not None and we have a dask cluster?
I think some speed gains are won (assuming you already have geometry and location and potentially other fields as recommended) by avoiding the need to inspect each tif needed to make each VRT. It sounds like as long as you already have the metadata for each tif, the GTI index gives us immediate lazy access to the data on the target grid, e.g. xr.open_dataset("GTI:...", engine="rasterio", chunks="auto"). That could shake things up in the resample/reproject-on-the-fly dilemma. I suppose the more sparse it is, the better it is for computing on the fly. I also bet the difference in resolution and some projection transforms influence computational needs.
Ok, I create a geopackage [which I create with EPSG:6933 as a test] with the required fields, then
ogr2ogr -f GPKG -mo SRS=EPSG:4326 -mo RESX=10000 -mo RESY=10000 output.gti.gpkg testgti.gpkg
then
import xarray
ds = xarray.open_dataset("GTI:output.gti.gpkg", engine="rasterio")
gives
this is just using the first 10 rasters from a list as a test
<xarray.Dataset> Size: 508MB
Dimensions: (band: 1, x: 2321, y: 27326)
Coordinates:
* band (band) int64 8B 1
* x (x) float64 19kB 4.201e+05 4.203e+05 ... 6.718e+05 6.719e+05
* y (y) float64 219kB 8.495e+06 8.495e+06 ... 5.531e+06 5.531e+06
spatial_ref int64 8B ...
Data variables:
band_data (band, y, x) float64 507MB ...
@mdsumner's example had metadata creation options I think - the above being wrong, so I need to fix something, I imagine.
gdalinfo output.gti.gpkg
Driver: GTI/GDAL Raster Tile Index
Files: output.gti.gpkg
output.gti.gpkg.aux.xml
Size is 2321, 27326
Coordinate System is:
PROJCRS["WGS 84 / NSIDC EASE-Grid 2.0 Global",
BASEGEOGCRS["WGS 84",
ENSEMBLE["World Geodetic System 1984 ensemble",
MEMBER["World Geodetic System 1984 (Transit)"],
MEMBER["World Geodetic System 1984 (G730)"],
MEMBER["World Geodetic System 1984 (G873)"],
MEMBER["World Geodetic System 1984 (G1150)"],
MEMBER["World Geodetic System 1984 (G1674)"],
MEMBER["World Geodetic System 1984 (G1762)"],
MEMBER["World Geodetic System 1984 (G2139)"],
MEMBER["World Geodetic System 1984 (G2296)"],
ELLIPSOID["WGS 84",6378137,298.257223563,
LENGTHUNIT["metre",1]],
ENSEMBLEACCURACY[2.0]],
PRIMEM["Greenwich",0,
ANGLEUNIT["degree",0.0174532925199433]],
ID["EPSG",4326]],
CONVERSION["US NSIDC EASE-Grid 2.0 Global",
METHOD["Lambert Cylindrical Equal Area",
ID["EPSG",9835]],
PARAMETER["Latitude of 1st standard parallel",30,
ANGLEUNIT["degree",0.0174532925199433],
ID["EPSG",8823]],
PARAMETER["Longitude of natural origin",0,
ANGLEUNIT["degree",0.0174532925199433],
ID["EPSG",8802]],
PARAMETER["False easting",0,
LENGTHUNIT["metre",1],
ID["EPSG",8806]],
PARAMETER["False northing",0,
LENGTHUNIT["metre",1],
ID["EPSG",8807]]],
CS[Cartesian,2],
AXIS["easting (X)",east,
ORDER[1],
LENGTHUNIT["metre",1]],
AXIS["northing (Y)",north,
ORDER[2],
LENGTHUNIT["metre",1]],
USAGE[
SCOPE["Environmental science - used as basis for EASE grid."],
AREA["World between 86°S and 86°N."],
BBOX[-86,-180,86,180]],
ID["EPSG",6933]]
Data axis to CRS axis mapping: 1,2
Origin = (420087.500000000000000,8495380.000000000000000)
Pixel Size = (108.502270812988854,-108.502270812988854)
Metadata:
RESX=20000
RESY=20000
SRS=EPSG:4326
Corner Coordinates:
Upper Left ( 420087.500, 8495380.000)
Lower Left ( 420087.500, 5530446.948) ( 4d21'13.89"E, 48d59'55.44"N)
Upper Right ( 671921.271, 8495380.000)
Lower Right ( 671921.271, 5530446.948) ( 6d57'50.06"E, 48d59'55.44"N)
Center ( 546004.385, 7012913.474) ( 5d39'31.97"E, 72d50'49.58"N)
Band 1 Block=256x256 Type=Float64, ColorInterp=Gray
NoData Value=nan
That is still in 6933 from the geopackage - and maybe at whatever resolution the first raster was.
Correct me if I'm wrong, but I would imagine the reprojection and resampling computation (assume a GTI or VRT already exists) would be identical between xr.open_dataset with GTI or VRT? And similarly if chunks is not None and we have a dask cluster?
I think some speed gains are won (assuming you already have geometry and location and potentially other fields as recommended) by avoiding the need to inspect each tif needed to make each VRT. It sounds like as long as you already have the metadata for each tif, the GTI index gives us immediate lazy access to the data on the target grid, e.g. xr.open_dataset("GTI:...", engine="rasterio", chunks="auto"). That could shake things up in the resample/reproject-on-the-fly dilemma. I suppose the more sparse it is, the better it is for computing on the fly. I also bet the difference in resolution and some projection transforms influence computational needs.
That's right, GTI is more general than VRT (or a list of sources) and lazier, because the footprints can be spatially indexed, as well as allowing convenient filtering by other fields upfront (and it's very generic and lightweight).
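For that upfront filtering, one library-agnostic route is to subset the index by attribute before opening it as a GTI (the GTI driver may also expose more direct options for this; check its docs). The column names here are whatever the index was built with:

import geopandas as gpd
import xarray as xr

index = gpd.read_file("grid_tile_index.gti.gpkg")
subset = index[(index["collection"] == "sentinel2") & (index["datetime"] >= "2023-01-01")]
subset.to_file("filtered.gti.gpkg", driver="GPKG")

ds = xr.open_dataset("GTI:filtered.gti.gpkg", engine="rasterio", chunks="auto")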