intake-stac
intake-stac copied to clipboard
Possible approach for stacking items
Been working on this PR for a couple days to try and understand nuances of intake by splitting off from the intake-xarray plugin and thinking about how stacking items might work in practice.
I'm also trying to sort out the advantages of intake-stac versus working directly with STAC catalogs and https://github.com/stac-utils/stac-vrt/issues/11. So I'm going to add some comments in the code and try to get feedback from @jhamman @martindurant @TomAugspurger and @rmg55 .
Currently this is a functional implementation that opens files in sequence, which is fine for small catalogs, but scaling up it would probably be smart to add an optional dependency on stac-vrt
and inject the build_vrt() function as a catalog method.
addresses #65
Screenshot illustrating how this currently works:
data:image/s3,"s3://crabby-images/6f35f/6f35f559c861f2de8032193e6d12d59e01cb33fd" alt="image"
Your issues are similar to the general conversation in intake-xarray and my fsspec/zarr PR to xarray: that the many backends need different inputs, so it's difficult to know when to invoke fsspec. I don't know how to make a comprehensive solution. Maybe keeping things specific, like here, is the way forward.
Hi @scottyhq - thanks for putting this together. Still working through the code and getting familiar with how STAC collections get mapped into intake-stac - so apologies if I am missing some key points in the suggestions below! I think adding these types of features (stacking/mosaicing) into intake-stac would really improve a lot of workflows.
Currently this is a functional implementation that opens files in sequence, which is fine for small catalogs, but scaling up it would probably be smart to add an optional dependency on stac-vrt
If dropping fsspec use and going straight to urls, we could also leverage the STAC metadata to create xarray objects with dask.delayed and dask.array.from_delayed that would be fast. Here is an example opening 500 assets in ~1 sec: https://gist.github.com/rmg55/ae5bc19b94345cc2033ef9261a229e24
trying to sort out the advantages of intake-stac versus working directly with STAC catalogs and stac-utils/stac-vrt#11
I am also having trouble wrapping my head around the best approach for this. I am conceptualizing this into two categories spatial mosaic and temporal stacking. Here are a few thoughts about what might be great to try and support long-term (realizing we probably want to start small and build features in).
Mosaic: Images collected at the same time and need to be stitched together. This is probably better accomplished with the stac_vrt approach. However, I think an interesting intake-stac + stac_vrt workflow could accomplish something like:
-
catalog.mosaic(by='time')
, which would use a "group_by" approach to group assets by date, and pass to stac_vrt, and produce a new catalog with the mosaiced vrts. Perhaps even be able to do something likecatalog.mosaic(by='time.weekofyear',how='max')
- similar to pandas/xarray approach and leverage vrt pixel functions for the aggregations.
Stacking: Temporally stacking assets to create a time series (x,y,band(s),time). I think this could be accomplished in either intake-stac (perhaps with the approach mentioned above) or in stac_vrt. Thoughts below on using both approaches:
- stac_vrt:
- Advantage is that you are left with an catalog, that could be written to disk and shared with others. In this approach I think we would need to write to disk the in-mem vrt files (perhaps to a .vrt folder co-located with the catalog or use the intake caching mechanism).
- Disadvantage - limited to 3dims, therefore we would need to have a vrt per band (although this is perhaps something we want to do anyways?)
- intake-stac:
- Advantage: supports 3+ dimensions (e.g. x,y,time,band).
- Disadvantage: End up with an xarray object, not an intake-stac catalog.
I'd be happy to contribute to this effort once we can settle on a best approach.
If dropping fsspec use and going straight to urls, we could also leverage the STAC metadata to create xarray objects with dask.delayed and dask.array.from_delayed that would be fast. Here is an example opening 500 assets in ~1 sec: https://gist.github.com/rmg55/ae5bc19b94345cc2033ef9261a229e24
Thanks @rmg55 this is awesome! This approach of getting the dimension and coordinate info from STAC JSON to delay file I/O is really slick! The disadvantage I see is the possibility for discrepancies when data is actually read. For example, note that you have an error with coordinate parsing in your function, which I came across when I tried to plot the resulting dataarray:
def _delayed_open(stac_item, band):
transform = stac_item['assets'][band]['proj:transform'][0:6]
nx,ny = stac_item['assets'][band]['proj:shape']
# NOTE: xarray and rioxarray docs are wrong here, no need for from_gdal...
# see https://github.com/pydata/xarray/issues/5005
#x, y = Affine.from_gdal(*transform) * (np.arange(nx)+0.5, np.arange(ny)+0.5)
x, y = Affine(*transform) * (np.arange(nx)+0.5, np.arange(ny)+0.5)
href = stac_item['assets'][band]['href']
date = datetime.datetime.fromisoformat(stac_item['properties']['datetime'][0:10]).date()
# changed to darray because i always do `da=xr.open_rasterio()` ;)
data = darray.from_delayed(dask.delayed(xr.open_rasterio)(href).data, (1,nx,ny), dtype=float)[np.newaxis,:,:,:]
# swap order of x and y to be consistent with ordering returned from xr.open_rasterio
da = xr.DataArray(data=data, coords=([date],[band],y,x),dims=('time','band','y','x'))
return da
you bring up good points and great ideas for stacking and mosaicing options, continuing to prototype some examples here or via gists is definitely helping to figure out an approach.
@scottyhq - thanks for pointing out those bugs in the gist (updated recently to incorporate your edits). Agreed on your assessment that errors are not surfaced until the data is read - which could make tracing errors more confusing for users.
Also, I spent some time exploring the stackstac project by @gjoseph92 - which is really awesome. Here is a gist of using it with NASA HLS STAC: https://gist.github.com/rmg55/b144cb273d9ccfdf979e9843fdf5e651
One thing to note is that stackstac is able to work with STACs that are missing the projection extension by using the bbox, user input on resolution+crs:epsg, and warped_vrt. Note this can lead to slightly different shape/transform than the original file (see the above gist). Perhaps something to consider for stac_vrt as well?
Hey @scottyhq! It does seem like we're working on similar things. Do you think stackstac could be helpful here? I could imagine intake-stac using it under the hood to generate the xarray, but I'm not sure if there are any intake-specific details that it can't handle right now.
Do you think stackstac could be helpful here? I could imagine intake-stac using it under the hood to generate the xarray
Definitely! I abandoned this PR for a bit but would like to come back to it soon with STAC 1.0 coming out in the near future. I think I'll probably delay integrating stackstac and do that in a follow up.
I'm also watching https://github.com/corteva/rioxarray/issues/246 and the linked PR in rasterio https://github.com/mapbox/rasterio/pull/2141 , which will have some bearing on how fsspec and intake-xarray are currently used (as discussed in above comments)
@scottyhq is there anything that this would handle that stackstac
doesn't? At a glance it seems like it already handles everything, though I might be missing things.
So I could imagine loading a catalog, maybe a .search()
to filter it, and a .stack()
to pass the ItemCollection
to stackstac
.
is there anything that this would handle that stackstac doesn't? At a glance it seems like it already handles everything, though I might be missing things.
This approach is much less-sophisticated than stackstac, so I think I might abandon this PR, or need to re-do the .stack() method adding an optional dependency on stackstac. That said, I think it would be nice to have consistency in what to_dask()
returns whether it's a single asset or stack of assets. Specifically, whether we automatically set up the rioxarray .rio accessor with CRS stored in the coordinates... https://github.com/gjoseph92/stackstac/issues/50
👍 great point about the consistent CRS handling. I'm planning to turn to that after getting pystac 1.0 support and collection-level assets handled in intake-stac.
FWIW, my hope is that intake-stac can be a pretty simple wrapper around pystac, pystac-client, and stackstac. Ideally we limit the complexity here.
FWIW, my hope is that intake-stac can be a pretty simple wrapper around pystac, pystac-client, and stackstac. Ideally we limit the complexity here.
In full agreement :)
I just added you as an admin for this repo @TomAugspurger, so that you have the ability to merge things