MLHub dataset
(this is similar in spirit to https://github.com/microsoft/torchgeo/issues/403)
Radiant Earth's MLHub is a repository for geospatial ML datasets. We have already created VisionDatasets for a few of the datasets they host (we use their API to download each dataset's entire archive to local disk), e.g. NASAMarineDebris, BeninSmallHolderCashews, CV4AKenyaCropType, and TropicalCycloneWindEstimation.
In addition to storing the entire archive of a dataset (as a zip file or similar), Radiant Earth have formatted each dataset as one or more STAC Collections. They are also developing a way for users to download just the STAC metadata associated with each dataset. This would let users subset very large datasets (e.g. BigEarthNet) without having to download the entire dataset. We'd like to create a generic MLHubDataset(...) that uses this upcoming feature to build a GeoDataset for any dataset hosted on MLHub. The rough idea is as follows:
- Download the STAC Collection files associated with a requested dataset
- Build a RasterDataset, creating the index using the metadata from each item
- For datasets that have both imagery and labels, the label items have a pointer to their corresponding imagery item
- In __getitem__ we can download the necessary files on the fly and cache them to disk
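To make the indexing steps above a bit more concrete, here is a minimal sketch using only the standard library. It assumes each STAC item is already loaded as a dict with `id`, `bbox`, `properties.datetime`, and `links`, and that label items point at their imagery through links with `rel` set to `source`. The function names, the example items, and the plain dict standing in for a real spatial index are all hypothetical, not existing torchgeo API. Items that only carry `start_datetime`/`end_datetime` are not handled.

```python
from datetime import datetime

# Hypothetical items, shaped like STAC Items but trimmed to the fields used here.
EXAMPLE_ITEMS = [
    {
        "id": "image_0",
        "bbox": [0.0, 0.0, 1.0, 1.0],
        "properties": {"datetime": "2018-01-01T00:00:00Z"},
        "links": [],
    },
    {
        "id": "label_0",
        "bbox": [0.0, 0.0, 1.0, 1.0],
        "properties": {"datetime": "2018-01-01T00:00:00Z"},
        "links": [{"rel": "source", "href": "../imagery/image_0.json"}],
    },
]


def parse_time(value):
    """Parse a timestamp such as '2018-01-01T00:00:00Z' into an aware datetime."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def build_index(items):
    """Map item id -> bbox, datetime, and hrefs of the imagery it was derived from."""
    index = {}
    for item in items:
        source_hrefs = [
            link["href"]
            for link in item.get("links", [])
            if link.get("rel") == "source"
        ]
        index[item["id"]] = {
            "bbox": tuple(item["bbox"]),
            "datetime": parse_time(item["properties"]["datetime"]),
            "source_hrefs": source_hrefs,
        }
    return index


def query(index, bbox):
    """Return sorted ids of items whose bbox intersects (minx, miny, maxx, maxy)."""
    minx, miny, maxx, maxy = bbox
    return sorted(
        key
        for key, entry in index.items()
        if entry["bbox"][0] <= maxx
        and entry["bbox"][2] >= minx
        and entry["bbox"][1] <= maxy
        and entry["bbox"][3] >= miny
    )
```

A real implementation would insert these bounds into the dataset's spatial index rather than scanning a dict, but the metadata needed per item is the same.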
A reasonable signature for the constructor would be something like MLHubDataset(collection_name, root="data/", max_cache_size=None).
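One way max_cache_size could behave is as a byte budget with least-recently-used eviction. The sketch below is only an illustration of that idea: `DiskCache`, its `get` method, and the caller-supplied `fetch` function are hypothetical names, file names are assumed to be flat (no subdirectories), and the bookkeeping lives in memory, so it would not survive a restart.

```python
import os
from collections import OrderedDict


class DiskCache:
    """Store fetched files under root, evicting least recently used past max_size bytes."""

    def __init__(self, root, max_size=None):
        self.root = root
        self.max_size = max_size  # None means the cache is unbounded
        self._sizes = OrderedDict()  # file name -> size in bytes, oldest first
        os.makedirs(root, exist_ok=True)

    def get(self, name, fetch):
        """Return the local path for name, calling fetch(name) -> bytes on a miss."""
        path = os.path.join(self.root, name)
        if name in self._sizes:
            self._sizes.move_to_end(name)  # mark as most recently used
            return path
        data = fetch(name)
        with open(path, "wb") as f:
            f.write(data)
        self._sizes[name] = len(data)
        self._evict()
        return path

    def _evict(self):
        """Delete oldest files until under budget, always keeping the newest one."""
        if self.max_size is None:
            return
        while sum(self._sizes.values()) > self.max_size and len(self._sizes) > 1:
            oldest, _ = self._sizes.popitem(last=False)
            os.remove(os.path.join(self.root, oldest))
```

In this design __getitem__ would call `cache.get(...)` for each file it needs and open the returned path, so repeated samples over the same region avoid a second download.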
Similar to #403, this would require new dependencies for working with STAC.
If you are interested in working on this, I can send you an example "STAC metadata archive" that corresponds to the LandCoverNet dataset.
Just being able to easily convert all existing MLHub datasets from VisionDataset to GeoDataset would be a huge win!
MLHub is dead, long live Source Cooperative!