
Document best practice: I/O-free STAC item generation

Open TomAugspurger opened this issue 3 years ago • 4 comments

This issue is discussing what is (IMO) a best-practice for stactools packages: the ability to generate a STAC item without any I/O.

Currently most stactools packages have a high-level stac.create_item(asset_href: str, ...) -> pystac.Item function that generates a STAC item from a string. If the method requires reading any data / metadata, it will handle that I/O. This is very convenient, and ideally every stactools package has a way of doing this (especially useful when using a CLI).

Some of the more complicated stactools packages also generate cloud-optimized assets from the "source" asset at asset_href. In some of these packages, whether the output STAC item catalogs the cloud-optimized asset is directly tied to that function creating the cloud-optimized asset itself (see https://github.com/stactools-packages/goes-glm/blob/c9c3bc42685e66e0eaace599096ef6050c05eb57/src/stactools/goes_glm/stac.py#L46-L47 for example).

At a minimum, it should be easy to regenerate STAC metadata (including metadata for the cloud-optimized assets) without having to regenerate the cloud-optimized assets.

Now we have a couple ways to handle this:

  1. The user passes all the hrefs to both the source asset and the cloud-optimized asset. The create_item method is responsible for reading the data:
def create_item(source_asset_href, cloud_optimized_asset_hrefs, ...):
    ...

If the user provides cloud_optimized_asset_hrefs, then cloud-optimized asset (re)generation can be skipped.

  2. The user passes in the data (and perhaps the hrefs, to easily set the href for each asset).

def create_item(source_data, cloud_optimized_data):
    ...

Of these, I think we should steer package developers towards option 2, but I'm curious to hear others' thoughts. That's the approach taken by stac-table and xstac, and I think it works pretty well. Users are able to provide (essentially) any dataframe or Dataset and we can generate STAC metadata for it. Crucially, all of rasterio, pyarrow / dask.dataframe, and xarray can lazily read data so creating / passing around a DataFrame or Dataset doesn't actually read data (unless it's required by the method).
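To make option 2 concrete, here is a minimal sketch of what such a signature could look like. Everything here is hypothetical: `Item` and `Asset` are plain stand-ins for the pystac classes so the snippet is self-contained, and `source_data` stands in for a lazily opened rasterio / pyarrow / xarray object.

```python
from dataclasses import dataclass, field

# Stand-ins for pystac.Item / pystac.Asset, so the sketch runs without pystac.
@dataclass
class Asset:
    href: str

@dataclass
class Item:
    id: str
    assets: dict = field(default_factory=dict)

def create_item(source_data, source_href,
                cloud_optimized_data=None, cloud_optimized_href=None):
    """Option 2: callers hand in already-opened (possibly lazy) data objects.

    `source_data` could be an xarray.Dataset, a rasterio dataset, or a
    pyarrow table; nothing here forces a read, so lazy readers stay lazy.
    """
    item = Item(id=source_data["id"])
    item.assets["source"] = Asset(href=source_href)
    if cloud_optimized_data is not None:
        item.assets["cog"] = Asset(href=cloud_optimized_href)
    return item

item = create_item({"id": "scene-001"}, "s3://bucket/scene-001.nc")
```

The point of the signature is that I/O, if any, happened before the call; the function itself only inspects objects it was given.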

TomAugspurger avatar Nov 01 '22 14:11 TomAugspurger

At a minimum, it should be easy to regenerate STAC metadata (including metadata for the cloud-optimized assets) without having to regenerate the cloud-optimized assets.

I agree.

The user passes in the data (and perhaps the hrefs, to easily set the href for each asset).

So you would pass, e.g., a pandas dataframe to create_item (and potentially the href), rather than just the href to the dataframe? A COG would be handled by passing the src from something like with rasterio.open("cog_href") as src:? I'm not clear on the advantage of passing the "data". Why not allow any data or metadata reading from the href to happen inside the create_item function?

pjhartzell avatar Nov 01 '22 18:11 pjhartzell

The user passes in the data (and perhaps the hrefs, to easily set the href for each asset).

I think we need hrefs for each Asset.href -- if we just have the data, how does the asset know where to point?

gadomski avatar Nov 04 '22 13:11 gadomski

Hmm I had a reply but I might have closed that browser tab before submitting it.

We chatted a bit about this on Tuesday, but I unfortunately mixed up two things here.

  1. Easily regenerate (all) STAC metadata without having to regenerate cloud-optimized assets.
  2. Provide APIs for generating STAC metadata from data objects, rather than (just) from HREFs

Hopefully 1 is uncontroversial, and is fairly straightforward to implement. Any function that takes an asset_href and generates cloud-optimized assets should also take hrefs for the cloud-optimized assets. If provided, those (existing) cloud-optimized assets should be used to generate the STAC metadata.
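As a sketch of that minimum, assuming a hypothetical cog_hrefs parameter and a placeholder create_cogs helper (neither is an existing stactools API), the expensive path is only taken when the COG hrefs are absent:

```python
def create_item(asset_href, cog_hrefs=None, cog_directory=None):
    """If `cog_hrefs` is given, catalog those existing COGs instead of
    regenerating them from `asset_href`."""
    if cog_hrefs is None:
        # Expensive path: (re)create the cloud-optimized assets.
        cog_hrefs = create_cogs(asset_href, cog_directory)
    # Cheap path: from here on, only STAC metadata is built.
    return {"assets": {f"cog-{i}": {"href": h} for i, h in enumerate(cog_hrefs)}}

def create_cogs(asset_href, directory):
    # Placeholder for real COG generation (e.g. via GDAL / rio-cogeo).
    return [f"{directory}/{asset_href.rsplit('/', 1)[-1]}.tif"]
```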

Item 2 is more subjective. I've just found it handy in the past to avoid filesystem(s) and I/O in functions where possible. One example use-case is generating STAC metadata from in-memory data structures as a performance optimization (reading / writing to disk can be relatively slow, and if you already have the data in memory, why pay that cost?).

My hope is that the only cost on package developers is a single extra layer of indirection. I think most packages would be structured like

def create_item(asset_href, ...):
    data = read_href(asset_href)  # into a rasterio dataset / dataframe / table / xarray structure / ...
    return create_item_from_data(data, asset_href)

def create_item_from_data(data, asset_href):
    ...

I suspect most packages are doing something like this, only the create_item_from_data might not be refactored into a standalone function.
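Made runnable with stand-ins (dicts in place of pystac objects, a fake read_href in place of rasterio / xarray), the two-layer structure is just:

```python
def read_href(asset_href):
    # Stand-in for rasterio.open / xr.open_dataset / a pyarrow read.
    return {"shape": (512, 512), "epsg": 4326}

def create_item(asset_href):
    # Convenience wrapper: handles the I/O, then delegates.
    data = read_href(asset_href)
    return create_item_from_data(data, asset_href)

def create_item_from_data(data, asset_href):
    # I/O-free core: usable by callers who already hold the data in memory.
    return {
        "id": asset_href.rsplit("/", 1)[-1],
        "properties": {"proj:shape": data["shape"], "proj:epsg": data["epsg"]},
        "assets": {"data": {"href": asset_href}},
    }
```

Callers who already hold the data call create_item_from_data directly and skip disk I/O entirely; the CLI-friendly create_item wrapper stays.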

The user passes in the data (and perhaps the hrefs, to easily set the href for each asset). I think we need hrefs for each Asset.href -- if we just have the data, how does the asset know where to point?

Indeed, an href would be required in addition to the data object.

TomAugspurger avatar Nov 04 '22 16:11 TomAugspurger

Provide APIs for generating STAC metadata from data objects, rather than (just) from HREFs

I think it's possible, but it gets sticky when you need to look up some static information that's not contained in the dataset, such as classification semantics for a multi-band NetCDF dataset where each band is its own COG. In mostly-code:

RASTER_BANDS = {
    "sea_ice_concentration": ...,
    "sea_ice_other_variable": ...,
}

def create_item_from_data(data, asset_href):  # data is a rasterio dataset, asset_href is a COG href
    item = Item(...)
    item.add_asset("data", Asset(...))
    raster = RasterExtension.ext(item.assets["data"], add_if_missing=True)
    raster.bands = [RASTER_BANDS[variable]]  # <-- where do I learn what variable this COG represents?

I could parse information about what variable from the file name, but that feels icky to me. I think you're still going to need to hit the original source NetCDF. So, I think the pattern is good for a one-to-one use-case, but it gets harder for one-to-many.
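One possible (hypothetical, not from any existing package) way to avoid parsing the file name is to push the ambiguity onto the caller, who generated the COGs and therefore knows which variable each one holds:

```python
# Static lookup table, as in the example above (values illustrative).
RASTER_BANDS = {
    "sea_ice_concentration": {"unit": "percent"},
    "sea_ice_other_variable": {"unit": "unknown"},
}

def create_item_from_data(data, asset_href, variable):
    """An explicit `variable` argument resolves what the COG alone can't
    tell us, without re-opening the source NetCDF or parsing file names."""
    return {
        "assets": {"data": {"href": asset_href}},
        "raster:bands": [RASTER_BANDS[variable]],
    }
```

This keeps the function I/O-free, at the cost of a wider signature for the one-to-many case.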

I suspect most packages are doing something like this, only the create_item_from_data might not be refactored into a standalone function.

Agreed, especially for the simple one-to-one case.

For a real world example of how I'm trying to work around this, here's me skipping re-creation of COGs when they already exist for a many-to-many NetCDF->COG dataset: https://github.com/stactools-packages/noaa-cdr/pull/39

gadomski avatar Nov 04 '22 17:11 gadomski