
Define and Store metadata in STAC

ymoisan opened this issue 6 years ago • 7 comments

Currently, training is performed on a list of GeoTIFF input images using reference data in GeoPackage files. That list of inputs is stored in CSV files. For the results, we store only the weights of our model (a .pth file).

To make our models interoperable, we need to write out the model definition together with its weights; those items are our final shareable outputs. Also, if we want to implement checks on whether a particular dataset is amenable to inference with a given model, we need to store all the inputs somewhere.

Initially we thought of using HDF to store both the inputs to and outputs of our models. It now appears one of the STAC extensions might be a more logical approach, as STAC is much more web-friendly than HDF.

ymoisan avatar Feb 05 '19 19:02 ymoisan

Mandatory information to store with the model, for re-usability (see the sketch after these lists for how the fields might be serialized):

  • Weights (.pth)
  • Model definition (e.g. Unet model)
  • Task type (e.g. classification or semantic segmentation)
  • Number of classes and their definitions (e.g. 1: vegetation, 2: lake, 3: building, etc.)
  • Number of bands used for training and their definitions (e.g. 4 bands: R-G-B-NIR);
    • The definition should describe the source of each band:
      • Sensor type (e.g. satellite, LiDAR, aerial photo, radar, etc.)
      • Acquisition date
      • Wavelength (if applicable)
      • Preprocessing (if applicable)
  • Spatial resolution at which the training was conducted
  • Geographic location where the training/validation and tests were conducted (e.g. bounding box or footprint, maybe?)

Optional information to store:

  • Training and validation accuracy
  • Training parameters (e.g. learning rate, number of epochs, class weights, etc.)
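For discussion only, here is a rough sketch of how these fields might be serialized as properties of a STAC-like Item. Every key name below (including the `dl:` prefix) is a placeholder I made up for illustration, not an agreed convention, and all values are dummy examples:

```python
# Rough sketch only: placeholder property names for the metadata listed above.
import json

model_item = {
    "id": "unet-rgbnir-example",  # hypothetical identifier
    "assets": {
        "weights": {"href": "model.pth", "type": "application/octet-stream"},
        "definition": {"href": "unet.py", "type": "text/x-python"},
    },
    "properties": {
        "dl:task": "semantic-segmentation",
        "dl:classes": {"1": "vegetation", "2": "lake", "3": "building"},
        "dl:bands": [
            {"name": "R", "sensor": "aerial photo", "acquired": "2018-06-01"},
            {"name": "G", "sensor": "aerial photo", "acquired": "2018-06-01"},
            {"name": "B", "sensor": "aerial photo", "acquired": "2018-06-01"},
            {"name": "NIR", "sensor": "aerial photo", "acquired": "2018-06-01"},
        ],
        "dl:spatial_resolution_m": 0.2,
        "dl:training_extent": [-75.8, 45.3, -75.5, 45.5],  # lon/lat bbox (W, S, E, N)
        # Optional fields
        "dl:validation_accuracy": 0.87,
        "dl:training_parameters": {"learning_rate": 0.001, "epochs": 100},
    },
}

print(json.dumps(model_item, indent=2))
```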

mpelchat04 avatar Feb 06 '19 15:02 mpelchat04

A nice way of validating whether inputs are applicable to a given model is to implement the check as a decorator: see "input validation" in A comprehensive guide to putting a machine learning model in production using Flask, Docker, and Kubernetes.
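For illustration, a minimal sketch of such a decorator (nothing here is geo-deep-learning code; the `band_names` attribute on the input object and the band list are assumptions):

```python
import functools

def validate_inputs(required_bands):
    """Reject inference calls whose input raster lacks the bands the model was trained on."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(raster, *args, **kwargs):
            # `raster.band_names` is an assumed attribute of whatever wraps our GeoTIFF inputs.
            missing = [b for b in required_bands if b not in raster.band_names]
            if missing:
                raise ValueError(f"Input raster is missing required bands: {missing}")
            return func(raster, *args, **kwargs)
        return wrapper
    return decorator

@validate_inputs(required_bands=["R", "G", "B", "NIR"])
def run_inference(raster, model):
    return model(raster)
```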

ymoisan avatar Feb 07 '19 21:02 ymoisan

If we wanted to devise some kind of standard for model interoperability around HDF5, we would likely come up with an HDF5 product definition. Interesting excerpts from [HDF Product Designer](https://wiki.earthdata.nasa.gov/display/HPD/HDF+Product+Designer):

> The Hierarchical Data Format (HDF5) provides a flexible container that supports groups and datasets, each of which can have attributes. In many ways, HDF5 is similar to a directory structure in a file and, like directory structures, the same data can be structured and annotated in many ways. This flexibility empowers HDF5 users to arrange data in ways that make sense to them. However, it can make it difficult to share data ... Many communities have successfully addressed this problem by creating conventional structures and annotations for data in HDF5. This approach depends on data files (e.g., products) that carefully follow these conventions.
>
> A HDF5 product is the content that should exist in a single HDF5 file. This content is defined by the HDF5 objects (groups, attributes, datasets), their names, the hierarchies they create (links and references), and attribute values. Dataset values are typically not stored in such files (unless they qualify as metadata) thus this software cannot be used as a data server. Once completed, a HDF5 product is replicated in many files (commonly on the order of tens of thousands or more) and filled with real data.
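To make the product-definition idea concrete, here is a minimal sketch (assuming h5py; every group and attribute name is illustrative only, not an agreed convention) of what such a conventional layout could look like:

```python
# Minimal sketch with h5py; group/attribute names are illustrative only.
import h5py
import numpy as np

with h5py.File("model_product.h5", "w") as f:
    model = f.create_group("model")
    model.attrs["task"] = "semantic-segmentation"
    model.attrs["architecture"] = "unet"
    model.attrs["classes"] = "1:vegetation;2:lake;3:building"

    bands = f.create_group("inputs/bands")
    bands.attrs["names"] = "R,G,B,NIR"
    bands.attrs["spatial_resolution_m"] = 0.2

    # The weights would normally be datasets holding the real tensors; a dummy
    # array stands in here just to show where they would live in the hierarchy.
    f.create_dataset("model/weights/encoder.conv1",
                     data=np.zeros((64, 4, 3, 3), dtype=np.float32))
```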

How would the use of HDF5 help us in forming totally independent DL containers that would contain all the information needed for interoperability? Could we implement something in relation to "standardised environments" as per OGC Testbed 14?

ymoisan avatar Feb 08 '19 19:02 ymoisan

How well does HDF5 play with Big Data infrastructures and OGC services like WCS? Could the H5Server be useful?

ymoisan avatar Feb 08 '19 21:02 ymoisan

Could we integrate STAC fields?

ymoisan avatar Apr 05 '19 17:04 ymoisan

deepdish? torch hdf5?
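If deepdish turned out to be suitable, saving a state dict plus metadata to HDF5 might look roughly like this (a sketch assuming deepdish's io.save/io.load and a trained torch.nn.Module named `model`; nothing here is committed code):

```python
# Sketch only: deepdish serializes nested Python dicts (including NumPy arrays) to HDF5.
import deepdish as dd

# `model` is assumed to be a trained torch.nn.Module.
payload = {
    "weights": {k: v.cpu().numpy() for k, v in model.state_dict().items()},
    "metadata": {
        "task": "semantic-segmentation",
        "bands": ["R", "G", "B", "NIR"],
        "classes": {"1": "vegetation", "2": "lake", "3": "building"},
    },
}
dd.io.save("model_with_metadata.h5", payload)
restored = dd.io.load("model_with_metadata.h5")
```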

ymoisan avatar Jul 16 '19 16:07 ymoisan

EO profile of STAC includes items such as sun azimuth and elevation: https://github.com/radiantearth/stac-spec/blob/master/extensions/eo/schema.json. Type 20170831_162740_ssc1d1 in your browser search bar and you'll end up here:

[screenshot of the resulting STAC Item metadata]

All we need is there...

I suggest we investigate creating STAC Items of the label extension type. Note: models per se are not STAC Items for now. I think there is an opportunity for us to think about how we could make that happen.
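To seed that investigation, a bare-bones sketch of what a label-extension Item for one of our GeoPackage label layers might look like; all values are placeholders and the exact required fields should be checked against the extension spec:

```python
# Placeholder values throughout; consult the label extension spec for required fields.
label_item = {
    "type": "Feature",
    "stac_version": "0.8.0",
    "id": "training-labels-example",
    "bbox": [-75.8, 45.3, -75.5, 45.5],
    "geometry": {
        "type": "Polygon",
        "coordinates": [[[-75.8, 45.3], [-75.5, 45.3], [-75.5, 45.5],
                         [-75.8, 45.5], [-75.8, 45.3]]],
    },
    "properties": {
        "datetime": "2018-06-01T00:00:00Z",
        "label:description": "Vegetation / lake / building polygons used for training",
        "label:type": "vector",
        "label:classes": [{"name": "class",
                           "classes": ["vegetation", "lake", "building"]}],
        "label:tasks": ["segmentation"],
    },
    "assets": {
        "labels": {"href": "labels.gpkg", "type": "application/geopackage+sqlite3"}
    },
    "links": [],
}
```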

ymoisan avatar Aug 01 '19 20:08 ymoisan