
Add STACAPI dataset

Open nilsleh opened this issue 3 years ago • 30 comments

This PR closes #403 by adding a STACAPI dataset. Here are some explanations for the choices I made:

  • the constructor takes the usual root, crs, res, and transforms arguments, like other raster datasets
  • additionally, there are bands, api_endpoint, and **query_parameters arguments. I thought bands should be separate because other raster datasets also have this argument specifically (see the sketch below)
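
A minimal sketch of the proposed signature (the argument names follow the bullets above; the defaults and endpoint are illustrative assumptions, not this PR's code):

from typing import Callable, Optional, Sequence

from rasterio.crs import CRS

class STACAPIDataset:
    # Sketch only: queries a STAC API for Items intersecting a bbox query
    # and loads the requested bands as rasters.
    def __init__(
        self,
        root: str,
        crs: Optional[CRS] = None,
        res: Optional[float] = None,
        transforms: Optional[Callable] = None,
        bands: Optional[Sequence[str]] = None,
        api_endpoint: str = "https://planetarycomputer.microsoft.com/api/stac/v1",
        **query_parameters,
    ) -> None:
        ...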

When the __getitem__ method receives a bbox query:

  • a single item among the returned hits is merged into a single raster and computed, resulting in a 4D xarray of (time x channel x height x width)
  • since multiple hits can be returned for a particular bbox query, these have to be merged. Usually this is done with rasterio, because the inputs are data-reader objects or file paths rather than the xarrays that stackstac produces. The only library I found that merges xarrays is rioxarray, so this library is currently also required
  • however, since rioxarray.merge only supports 2D or 3D arrays, and a STAC query can potentially return many samples over a time period, I loop over the time dimension, merge with rioxarray per time step, and afterwards stack the results back into one final tensor, which is returned as the image in the sample (see the sketch after this list)
  • the time dimension is squeezed out if there is only a single time instance
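
A hedged sketch of that per-timestep merge, assuming hits is a list of 4D (time, band, y, x) DataArrays produced by stackstac over a shared time axis (the function name is made up for illustration):

import xarray as xr
from rioxarray.merge import merge_arrays

def merge_per_timestep(hits: list) -> xr.DataArray:
    # merge_arrays only handles 2D/3D arrays, so merge one timestep at a time;
    # each array needs a CRS set (e.g. via .rio.write_crs) for merge_arrays
    timestamps = hits[0].time.values
    merged_steps = [
        merge_arrays([hit.sel(time=t) for hit in hits]) for t in timestamps
    ]
    out = xr.concat(merged_steps, dim="time")
    # squeeze out the time dimension if there is only a single time instance
    return out.squeeze("time") if out.sizes["time"] == 1 else out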

Questions / Left to do:

  • several new libraries are needed; maybe some, like rioxarray, can be avoided?
  • computing the aoi can potentially take fairly long, although it is much faster than I expected
  • handling the resolution self.res parameter when each separate band has a different resolution
  • deciding whether an image will be under the image or mask key, because a STAC API can serve both images and labels, so maybe another argument for the constructor?
  • given the last point, need to implement a plot method that can handle both scenarios of the dataset containing images or labels
  • testing with sampler and dataloader

nilsleh avatar Feb 17 '22 21:02 nilsleh

instead of what is currently in __getitem__ we can use something like:

import stackstac

# list of the STAC Items that the current query intersects with
# (based on the results from the rtree)
items = TODO

stack = stackstac.stack(
    items,
    assets=self.bands,
    resolution=self.res,
    epsg=TODO,  # convert self._crs to an EPSG code
)

# y coordinates decrease, so slice from maxy down to miny
aoi = stack.loc[..., maxy:miny, minx:maxx]
# can do some checking here to see what's up with the time dimension
data = aoi.compute()
out_image = data.data

calebrob6 avatar Feb 20 '22 19:02 calebrob6

several new libraries are needed; maybe some, like rioxarray, can be avoided?

Generally, we want to minimize direct dependencies

computing the aoi can potentially take fairly long, although it is much faster than I expected

I have previously benchmarked different ways of transforming coordinates (https://gist.github.com/calebrob6/334af85deb4c862689020ae4b3344bfc) and pyproj was by far the fastest. The constructor may be sped up if you replace the fiona.transform.transform(...) call with a pyproj version.
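
For reference, a minimal sketch of the pyproj equivalent of fiona.transform.transform (the CRS codes and coordinates are placeholders):

from pyproj import Transformer

# build the Transformer once and reuse it; always_xy keeps (x, y) axis order
transformer = Transformer.from_crs("EPSG:4326", "EPSG:32610", always_xy=True)
xs, ys = transformer.transform([-122.3, -122.2], [47.6, 47.7])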

handling the resolution self.res parameter when each separate band has a different resolution

I think we have to reproject to a single resolution, and it makes sense to choose the highest resolution if the user doesn't specify one (which is what stackstac does by default).

deciding whether an image will be under the image or mask key, because a STAC API can serve both images and labels, so maybe another argument for the constructor?

I think this makes sense -- what do you think @adamjstewart?

given the last point, need to implement a plot method that can handle both scenarios of the dataset containing images or labels

I think we can't have a plot for this class as it is too generic (e.g. this could be used to grab anything on the Planetary Computer).

calebrob6 avatar Feb 20 '22 19:02 calebrob6

Another thought -- for the planetary computer specifically, we will need to sign STAC Items before we can load data from them. Perhaps there should be a Callable "item_transformer" argument that takes an Item as input and returns an Item, and is an identity transform by default? For the planetary computer case we can make a transform that signs the item.
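
A sketch of that hook for the Planetary Computer case; the item_transformer name follows the proposal above and is not an existing torchgeo argument:

from typing import Callable

import planetary_computer
import pystac

def identity(item: pystac.Item) -> pystac.Item:
    # default: pass the Item through unchanged
    return item

# for the Planetary Computer, sign Asset hrefs before loading
item_transformer: Callable[[pystac.Item], pystac.Item] = planetary_computer.sign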

@TomAugspurger, any thoughts in general here?

calebrob6 avatar Feb 20 '22 19:02 calebrob6

For me, the biggest remaining question is how to integrate this with RasterDataset/VectorDataset. Is the intention that we will have subclasses of STACAPIDataset for each dataset on the Planetary Computer? For things like Landsat and Sentinel that are available in both STAC and non-STAC forms, will we have a single class or will we have Landsat and LandsatSTAC classes?

adamjstewart avatar Feb 20 '22 19:02 adamjstewart

how to integrate this with RasterDataset/VectorDataset

What do you mean here? (e.g. I would consider this integrated as it extends RasterDataset)

Is the intention that we will have subclasses of STACAPIDataset for each dataset on the Planetary Computer?

No, I don't think so. Users could easily use it with the Landsat8 collection on the Planetary Computer, but would need to do more work if they wanted the features that come with the Landsat dataset.

calebrob6 avatar Feb 20 '22 20:02 calebrob6

@calebrob6 I guess we're not on the same page then; I've always thought of STAC integration as no different than finding files on your filesystem. My hope was that we could have a single dataset Landsat (or rather, Landsat8, etc.) that would work both for locally downloaded files without a STAC index and for remote files with a STAC index. Just like we have subclasses of RasterDataset for various datasets, I think we'll want subclasses of STACAPIDataset for each dataset. Is there any reason we don't need this? Or would you suggest that Landsat8 isn't needed and users should just use RasterDataset directly and pass in a file glob or list of files or something?

adamjstewart avatar Feb 20 '22 21:02 adamjstewart

I see what you mean now -- I think we're not on the same page simply because you are one chapter ahead of me ;)

I've always thought of STAC integration as no different than finding files on your filesystem

I like this! Some thoughts:

  • If we implement this for e.g. Landsat8 then we would probably be coupled to a particular STAC API (e.g. "the planetary computer API").
  • I think STACAPIDataset should still be able to function independently, similar to how you can pass some TIFFs to RasterDataset, so that we don't have to implement a specific dataset for every possible collection that a user wants to use that is hosted behind a STAC API.
  • Maybe a "mixin" type pattern is appropriate for implementing this? Landsat8 is "STACAPI'able", Sentinel2 is "STACAPI'able", etc. (see the sketch below)
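
A purely illustrative sketch of that mixin idea; none of these class names exist in torchgeo, and a real Landsat8 would also inherit from RasterDataset:

from pystac_client import Client

class STACAPIMixin:
    # Mixin adding pystac-client based Item discovery to a dataset class
    api_endpoint: str
    collection: str

    def search_items(self, bbox, datetime=None):
        client = Client.open(self.api_endpoint)
        search = client.search(
            collections=[self.collection], bbox=bbox, datetime=datetime
        )
        return list(search.get_items())

class Landsat8(STACAPIMixin):
    api_endpoint = "https://planetarycomputer.microsoft.com/api/stac/v1"
    collection = "landsat-c2-l2"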

calebrob6 avatar Feb 23 '22 17:02 calebrob6

instead of what is currently in __getitem__ we can use something like:

import stackstac

# list of the STAC Items that the current query intersects with
# (based on the results from the rtree)
items = TODO

stack = stackstac.stack(
    items,
    assets=self.bands,
    resolution=self.res,
    epsg=TODO,  # convert self._crs to an EPSG code
)

# y coordinates decrease, so slice from maxy down to miny
aoi = stack.loc[..., maxy:miny, minx:maxx]
# can do some checking here to see what's up with the time dimension
data = aoi.compute()
out_image = data.data

@calebrob6 I have been trying out your suggestion with different bounds and stacks, but here the .loc[] indexing always returns an array of shape [num_items, channels, 0, 0] so no height or width, and I can't seem to figure out what is causing that.

nilsleh avatar Feb 27 '22 11:02 nilsleh

At the moment this STACAPIDataset implementation seems to be specific to just Sentinel-2, and it also assumes the use of the Planetary Computer. Maybe we should have a think about how to make this a bit more generic; the easiest way I can think of is to use some try...except or if...then statements to see if the api_endpoint contains planetarycomputer, but this will need a bit more thought.

Thank you for reviewing and finding the error; I now understand what I was doing wrong. Mainly, I intended to make a draft, self-contained example of how such a dataset class could work, which is why it is so specific at the moment.

nilsleh avatar Apr 19 '22 17:04 nilsleh

Does this need any support with development? I'd like to contribute my experience with STAC if that would help. In particular to generalize the approach for a range of different STACs. Let me know the best way to do that if so (PR to your fork @nilsleh ?).

I'm aware potential changes to torchgeo's approach to DataLoaders are incoming (https://github.com/microsoft/torchgeo/issues/576#issuecomment-1192893790). Will that require a complete rewrite of this?

jamesvrt avatar Aug 31 '22 14:08 jamesvrt

I'm aware potential changes to torchgeo's approach to DataLoaders are incoming (#576 (comment)). Will that require a complete rewrite of this?

If we're going for torchdata, then yes, this will need to be rewritten completely. I've been thinking of implementing a torchdata reader that does just the STAC reading part (i.e. no transforms or anything). The idea is to then feed the STAC urls to another DataPipe that uses stackstac or odc-stac (see also https://github.com/opendatacube/odc-stac/issues/54#issuecomment-1103313511), and then feed that into other 'geo'-DataPipes. Would be great to hear your opinion on this!

weiji14 avatar Aug 31 '22 16:08 weiji14

Yes, in the long run, the plan is to move towards TorchData, although I don't know when that will happen. I'm fine with having a STACAPIDataset base class while we slowly transition towards TorchData, or I'm fine with experimenting with this as our first TorchData use case. I'm not super familiar with TorchData myself, need to block off a week where I can experiment with it to better understand what it entails.

adamjstewart avatar Aug 31 '22 17:08 adamjstewart

@weiji14 @jamesvrt having looked through the code for this current STACAPIDataset, it appears there is only support for catalogs served over an API and queried with pystac_client. Is the intention for this custom Dataset to support any STAC catalog, or only those served over an API via pystac_client? How would Torchdata change that integration (I'm not familiar with it either)?

The reason I ask is that I'm going to create a new custom base Dataset class for working with datasets downloaded from Radiant MLHub using our radiant-mlhub Python client v0.5.1+, which will include functionality to build image and mask tensors by traversing a STAC catalog locally. Seems there's potential for overlap here, but perhaps best to keep them separate. Thoughts?

KennSmithDS avatar Aug 31 '22 22:08 KennSmithDS

Seems there's potential for overlap here, but perhaps best to keep them separate.

We can keep them separate at the start, but I imagine a future in which RadiantDataset is a subclass of STACDataset (names TBD). We definitely want to avoid code duplication if we can.

adamjstewart avatar Aug 31 '22 22:08 adamjstewart

having looked through the code for this current STACAPIDataset, it appears there is only support for catalogs served over an API and queried with pystac_client.

Currently yes, this PR does use pystac_client, and is somewhat hardcoded to the Planetary Computer STAC API.

Is the intention for this custom Dataset to support any STAC catalog, or only those served over an API via pystac_client?

At the very least, we should generalize this PR's implementation to work for any Planetary Computer dataset on https://planetarycomputer.microsoft.com/catalog. Ideally though, it should work for any API endpoint supported by pystac_client.

How would Torchdata change that integration (I'm not familiar with it either)?

Torchdata (see https://pytorch.org/blog/pytorch-1.11-released/#introducing-torchdata) would decompose the current implementation into multiple parts as it follows a composition over inheritance philosophy.

The reason I ask is that I'm going to create a new custom base Dataset class for working with datasets downloaded from Radiant MLHub using our radiant-mlhub Python client v0.5.1+, which will include functionality to build image and mask tensors by traversing a STAC catalog locally. Seems there's potential for overlap here, but perhaps best to keep them separate. Thoughts?

Now, I'm not familiar with radiant-mlhub either, but one idea might be to have a PySTAC torch Dataset/DataPipe that has an engine parameter with options of radiant-mlhub or pystac-client? Or we could settle on the common ground and build a torch DataPipe on just pystac, with separate radiant-mlhub or pystac-client interfaces on top.

weiji14 avatar Aug 31 '22 22:08 weiji14

@adamjstewart @weiji14 thanks for the quick responses!

It seems there is possibly a need for an even more generic base class STACDataset based on the STAC specifications. Yet the way the catalog references STAC objects might differ depending on whether it's a local directory of Catalog, Collection, and Item objects in JSON files or an API; the key distinction I can think of off the top of my head being that the former has access to the entire catalog on disk, whereas an API returns Items with pagination.

I was not intending to use the PySTAC library for the MLHubDataset (or RadiantDataset, whichever) because of performance issues we've run into when loading and validating STAC objects and item relationships within catalogs. When a catalog grows to hundreds of thousands or millions of STAC objects (as is the case in a few of our recent datasets), the computation can become $O(n^2)$ (if I recall correctly; I'll have to double-check the code base). We have found using multiprocessing and direct JSON object manipulation to be much more performant in our workflows. However, PySTAC does a great job of enforcing STAC object compliance with the specification and extensions.

@weiji14 the radiant-mlhub Python client is a library that half a dozen Torchgeo datasets use to fetch datasets from our platform. When we released v0.5.1, this gave us reason to consider 1) updating the existing Torchgeo datasets to be compliant with the new client functionality, and/or 2) creating a new base class that interacts with our Python client directly, upon which to build child classes that extend the MLHubDataset base class. Our STAC catalogs contain some metadata that is not standard but is still STAC compliant, and we also use a slightly unconventional architecture for serving our catalogs over API using stac-fastapi, which would make it less than straightforward for STACAPIDataset to work with our datasets directly.

KennSmithDS avatar Aug 31 '22 23:08 KennSmithDS

@KennSmithDS didn't realize GitHub supports LaTeX now, you've just made my day!

adamjstewart avatar Sep 01 '22 00:09 adamjstewart

If I understand all this correctly, there are two main aspects to consider:

  • Providing access to local STAC Catalogs as well as STAC APIs
    • For local Catalogs, how much flexibility for the reading engine?
    • For APIs, how hardcoded to the Planetary Computer should this be?
    • For APIs, how much flexibility for unconventional architectures (e.g. Radiant MLHub)?
  • Using torchdata or the existing approach
    • If torchdata, focus now on only the STAC reading part (no transforms etc.)?

Given these, I think it makes sense to use torchdata already. It allows more granular control over reading STAC, since having separate STACAPI and STACLocal DataPipes would cleanly separate two quite different approaches to accessing a STAC. These DataPipes should cover most use cases without modification, but could act as a base for more niche requirements (reading engine, API configuration, auth, etc.). These DataPipes could be connected by a user to whichever raster-reading DataPipe makes sense for their target datastore.

To get a workable first pass at this we might focus first on a STACAPI DataPipe and a stackstac DataPipe, with the intention of serving users of Planetary Computer and open STAC APIs (e.g. element-84's earth search).

What do you think?

jamesvrt avatar Sep 01 '22 13:09 jamesvrt

Given these, I think it makes sense to use torchdata already. It allows more granular control over reading STAC, since having separate STACAPI and STACLocal DataPipes would cleanly separate two quite different approaches to accessing a STAC. These DataPipes should cover most use cases without modification, but could act as a base for more niche requirements (reading engine, API configuration, auth, etc.). These DataPipes could be connected by a user to whichever raster-reading DataPipe makes sense for their target datastore.

To get a workable first pass at this we might focus first on a STACAPI DataPipe and a stackstac DataPipe, with the intention of serving users of Planetary Computer and open STAC APIs (e.g. element-84's earth search).

What do you think?

Yep, totally agree on going straight for torchdata! Let's start from a pure and simple STAC-something DataPipe using the basic tools in https://github.com/stac-utils. I've coded up a draft pystac.Item.from_file() DataPipe over at https://github.com/weiji14/zen3geo/pull/46, and we can slowly add in the pystac-client and stackstac DataPipe pieces next :smile:

weiji14 avatar Sep 01 '22 15:09 weiji14

@weiji14 thanks for sharing the development on weiji14/zen3geo#46, this is great! It raises a few questions in my mind looking at the zen3geo readthedocs:

  1. I see that the PySTACItemReader class can take either a URL or a local file path and yields a STAC Item. Doesn't this already address both the STACLocal and STACAPI datapipe requirements?
  2. Is there functionality for reading in the catalog and collections as well or just items? (looking at the source code just Items) :)
  3. Above I saw mention of stackstac, which takes in an ItemCollection and returns an Xarray. To generalize the workflow a bit, regardless of whether the catalog is local or remote, something will read in STAC Items and build an iterable/collection of Items, which can then be passed to stackstac to create the xarray.DataArray object from the raster assets.

KennSmithDS avatar Sep 01 '22 16:09 KennSmithDS

  1. I see that the PySTACItemReader class can take either a URL or a local file path and yields a STAC Item. Doesn't this already address both the STACLocal and STACAPI datapipe requirements?

Yes for STACLocal, no for STACAPI? My impression of a STACAPI DataPipe was that it would be a wrapper around https://github.com/stac-utils/pystac-client, i.e. for STAC Catalogs/APIs. But maybe I'm getting confused by the terminology...

  2. Is there functionality for reading in the catalog and collections as well or just items? (looking at the source code just Items) :)

We could have a PySTACCatalogReader wrapping pystac.Catalog.from_file and PySTACCollectionReader wrapping pystac.ItemCollection.from_file perhaps?
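
A minimal sketch of the first of those as a torchdata DataPipe; the class name follows the suggestion above but is hypothetical:

import pystac
from torchdata.datapipes.iter import IterDataPipe

class PySTACCatalogReader(IterDataPipe):
    def __init__(self, source_datapipe: IterDataPipe) -> None:
        self.source_datapipe = source_datapipe  # yields catalog hrefs

    def __iter__(self):
        for href in self.source_datapipe:
            # resolve each href into a pystac.Catalog object
            yield pystac.Catalog.from_file(href)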

  3. Above I saw mention of stackstac, which takes in an ItemCollection and returns an Xarray. To generalize the workflow a bit, regardless of whether the catalog is local or remote, something will read in STAC Items and build an iterable/collection of Items, which can then be passed to stackstac to create the xarray.DataArray object from the raster assets.

Yep, and there's probably a few ways to go from many STAC items to an xarray.DataArray. The power of torchdata is that because it's using composition over inheritance, each of these DataPipe pieces can be built separately, and the data engineer would then chain them up based on what they think is best. E.g. STAC Items -> ItemCollection -> stackstac xarray.DataArray; STAC items -> xarray.DataArrays -> xarray.Dataset; etc.
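
Roughly, that chaining could look like the following; IterableWrapper and .map are real torchdata APIs, while the hrefs are placeholders for this example:

import pystac
import stackstac
from torchdata.datapipes.iter import IterableWrapper

hrefs = ["https://example.com/item1.json", "https://example.com/item2.json"]
dp = IterableWrapper(hrefs).map(pystac.Item.from_file)  # href -> pystac.Item
items = list(dp)
data_array = stackstac.stack(items)  # many Items -> one xarray.DataArray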

weiji14 avatar Sep 01 '22 17:09 weiji14

The power of torchdata is that because it's using composition over inheritance, each of these DataPipe pieces can be built separately, and the data engineer would then chain them up based on what they think is best.

@weiji14 just to make sure I understand, is this something that TorchGeo users will need to do, or something that TorchGeo developers will need to do? I'm still hoping to keep a single Dataset instance for each dataset that optionally allows users to choose where to download data from. Is that possible, or is that counter to the idea of DataPipes? Still catching up on this TorchData stuff...

adamjstewart avatar Sep 01 '22 17:09 adamjstewart

I see that the PySTACItemReader class can take either a URL or a local file path and yields a STAC Item. Doesn't this already address both the STACLocal and STACAPI datapipe requirements?

@KennSmithDS @weiji14 Maybe I confused the issue by calling the proposed class STACLocal. It should be STACStatic. A static STAC Catalog can be hosted locally or remotely, which is why PySTACItemReader can take either a local file path or a URL. The difference versus accessing a STAC API using pystac_client is that you can search Items by attribute when using the API, something you cannot readily do with a static catalog.

jamesvrt avatar Sep 01 '22 17:09 jamesvrt

The power of torchdata is that because it's using composition over inheritance, each of these DataPipe pieces can be built separately, and the data engineer would then chain them up based on what they think is best. E.g. STAC Items -> ItemCollection -> stackstac xarray.DataArray; STAC items -> xarray.DataArrays -> xarray.Dataset; etc.

Been following this discussion and very interested in the solutions being discussed! I think there might be some solutions for iterating over xarray DataArrays that TorchGeo could standardize on, rather than leaving it to a data engineer to decide whether to represent an imagery dataset as either a DataArray or Dataset.

Above I saw mention of stackstac, which takes in an ItemCollection and returns an Xarray. To generalize the workflow a bit, regardless of whether the catalog is local or remote, something will read in STAC Items and build an iterable/collection of Items, which can then be passed to stackstac to create the xarray.DataArray object from the raster assets.

After creating the xarray.DataArray, to then make an iterable and performant xarray object for dataloading, this sounds like a possible job for xbatcher

I haven't used xbatcher yet, but on the face of it, it seems like this might be the library to consolidate performance improvements to loading remotely sensed imagery from xarray objects, and maybe there could be a TorchDataPipe built around xbatcher. Performance of data loading for ML is an open issue https://github.com/xarray-contrib/xbatcher/issues/37

rbavery avatar Sep 01 '22 18:09 rbavery

Before we get too far down the line with xarray discussions, note that TorchGeo does not currently use xarray for anything. We currently rely on rasterio/fiona (aka GDAL) for all file I/O since it can handle reprojection/resampling. Ultimately, all arrays will end up as PyTorch Tensors, so it doesn't really matter what the intermediate step is. But we need something that can support reprojection/resampling, which numpy/torch/xarray cannot. Maybe something like rioxarray is needed.

adamjstewart avatar Sep 01 '22 18:09 adamjstewart

After creating the xarray.DataArray, to then make an iterable and performant xarray object for dataloading, this sounds like a possible job for xbatcher

I haven't used xbatcher yet, but on the face of it, it seems like this might be the library to consolidate performance improvements to loading remotely sensed imagery from xarray objects, and maybe there could be a TorchDataPipe built around xbatcher.

Done at https://github.com/weiji14/zen3geo/pull/22, see tutorial at https://zen3geo.readthedocs.io/en/v0.3.0/chipping.html. (Disclaimer: I'm actually an xbatcher dev).

But we need something that can support reprojection/resampling, which numpy/torch/xarray cannot. Maybe something like rioxarray is needed.

Done at https://github.com/weiji14/zen3geo/pull/6, see tutorial at https://zen3geo.readthedocs.io/en/v0.3.0/walkthrough.html#read-using-rioxarrayreader

Anyways, this DataPipe discussion really should be in #576, let's focus on the STAC discussion :slightly_smiling_face:

weiji14 avatar Sep 01 '22 18:09 weiji14

@adamjstewart Xarray comes as a requirement of stackstac and odc-stac, if we were to make DataPipes using either of those packages (similar to what @nilsleh has done in this PR). They're not necessary by any means, but they may simplify the STAC Item -> Tensor piece. Agreed that rioxarray is an extra requirement for reprojection and general spatial awareness.

More STAC related... shall I make a start on the STACAPI DataPipe? Maybe we can use a draft PR to drive further discussion.

jamesvrt avatar Sep 01 '22 18:09 jamesvrt

Apologies, I was busy finishing my master's thesis until a couple of days ago. I was planning to pick this up again, but it seems there is already a lot of interest and expertise around this. When I started, this was merely an exploration of one approach. I have no preference for how you proceed, so feel free to take this over @jamesvrt @weiji14 @KennSmithDS

nilsleh avatar Sep 02 '22 08:09 nilsleh

Sorry @nilsleh, I didn't mean to sidetrack from your work here. Congratulations on finishing your masters. Happy to move this discussion elsewhere if that makes more sense to the organisers of this project.

I made a start at writing a DataPipe for a generic STAC API, but having done a little digging into torchdata, I'm not sure that DataPipes are intended for API calls, because they operate on iterable or mappable objects. Given that pystac_client already exists for the purpose of retrieving Items from a STAC API as an iterable, writing a DataPipe around it that is called only once might add redundancy and introduce an unnecessary requirement to torchgeo. I don't have a good grasp on torchdata yet, so please let me know if these findings are inaccurate.

If I'm correct about the above, maybe it makes more sense for the user to retrieve all of their desired STAC Items up front using their own method (probably pystac related) and feed these as an iterable into a class built on torchdata.datapipes.iter.IterableWrapper, like the one @weiji14 created in https://github.com/weiji14/zen3geo/pull/46. Something like this:

from pystac_client import Client
from torchgeo.datapipes import STACItemReader  # proposed class; does not exist yet

catalog = Client.open("https://planetarycomputer.microsoft.com/api/stac/v1")

# bbox_of_interest and time_of_interest are defined by the user beforehand
search = catalog.search(
    collections=["landsat-c2-l2"],
    bbox=bbox_of_interest,
    datetime=time_of_interest,
    query={"eo:cloud_cover": {"lt": 10}},
)

items = list(search.get_items())
dp = STACItemReader(iterable=items)

This approach has the advantage of allowing users to use their own Item discovery methods for static or API-based STACs.

The difficult part, as I see it, is selecting the required Item Assets and providing an appropriate method for IO. For those who aren't familiar with STAC, a single Item can contain many Assets (e.g. individually stored bands, masks, thumbnails), with no requirement for these Assets to be stored in the same location (e.g. S3) or format (e.g. Cloud Optimized GeoTIFF). Thankfully, torchdata already contains many DataPipes for different locations and formats (see https://github.com/pytorch/data/issues/672 and the TorchData IO DataPipes)

In my mind, an ideal STACItemReader would:

  • Take an iterable of STAC Items (Iterable[pystac.Item] or Iterable[Dict])
  • Allow the user to specify which Assets to retrieve from each Item (e.g. list of bands)
  • Provide the correct IO for each Asset given its (location, format)
  • Capture the appropriate metadata from the STAC Item. For example, imagery stored in a non-georeferenced format will require the geometry attribute of the Item.

stackstac and odc-stac already do these in their own way. I don't know how well they will work in a DataPipe, given they use xarray and its lazy loading.
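
A hedged sketch of such a reader as an IterDataPipe, covering only the first two bullets (the class and parameter names are hypothetical):

from typing import Iterator, Optional, Sequence, Tuple

import pystac
from torchdata.datapipes.iter import IterDataPipe

class STACItemReader(IterDataPipe):
    def __init__(
        self,
        source_datapipe: IterDataPipe,  # yields pystac.Item objects
        assets: Optional[Sequence[str]] = None,  # e.g. ["B04", "B03", "B02"]
    ) -> None:
        self.source_datapipe = source_datapipe
        self.assets = assets

    def __iter__(self) -> Iterator[Tuple[str, pystac.Item]]:
        for item in self.source_datapipe:
            keys = self.assets or list(item.assets)
            for key in keys:
                # yield each Asset href together with its Item so a downstream
                # DataPipe can pick the right IO and keep the geo metadata
                yield item.assets[key].href, item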

jamesvrt avatar Sep 02 '22 17:09 jamesvrt

In my mind, an ideal STACItemReader would:

  • Take an iterable of STAC Items (Iterable[pystac.Item] or Iterable[Dict])
  • Allow the user to specify which Assets to retrieve from each Item (e.g. list of bands)
  • Provide the correct IO for each Asset given its (location, format)
  • Capture the appropriate metadata from the STAC Item. For example, imagery stored in a non-georeferenced format will require the geometry attribute of the Item.

You're on the right track :smile: This entire thing sounds more like a DataPipeLine rather than a DataPipe though, but I think a curated DataPipeLine like this (which chains-together multiple DataPipes) is what torchgeo will end up doing (once the matter of optional dependencies at https://github.com/microsoft/torchgeo/discussions/662 is resolved).

While it's possible to create a DataPipe with several DataPipes inside, you only have to look at torchdata's code (e.g. at https://github.com/pytorch/data/tree/v0.4.0/torchdata/datapipes/iter) to see that each official torch DataPipe focuses on one task only, e.g. one function with one dependency. So I recommend having STACItemReader do only STAC Item reading, and having a separate StackSTACReader read these STAC Item objects into an xarray.DataArray (while handling the Asset loading/filtering). This separation of concerns perhaps makes it less user friendly, but it's easier for developers to write and maintain, and more suited to power users wanting full customization of their data pipeline.

P.S. I've started a page at https://github.com/weiji14/zen3geo/discussions/48 summarizing some of the ideas in this thread, feel free anyone to continue the discussion there!

P.P.S. Sorry @nilsleh for going off track from your PR, I'll keep quiet now :slightly_smiling_face:

weiji14 avatar Sep 02 '22 19:09 weiji14