Train from STAC DatasetConfig
In order to train, a user must provide a `DatasetConfig`, e.g.:
https://github.com/azavea/raster-vision/blob/c63b2ce21d20b8055fed4cd6595ab6a807523fdf/rastervision_pytorch_backend/rastervision/pytorch_backend/examples/tiny_spacenet.py#L52-L59
This specifies:
- Source of training scenes
- Source of validation scenes
- Per-scene label file location
- Per-scene raster location
In addition, a `ClassConfig` is required, e.g.:
https://github.com/azavea/raster-vision/blob/c63b2ce21d20b8055fed4cd6595ab6a807523fdf/rastervision_pytorch_backend/rastervision/pytorch_backend/examples/tiny_spacenet.py#L23-L24
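For orientation, the configuration in the linked example looks roughly like the following sketch (paraphrased; the exact task type, field names, and URIs depend on the Raster Vision version at that commit):

```python
from rastervision.core.data import (
    ClassConfig, DatasetConfig, SceneConfig, RasterioSourceConfig,
    GeoJSONVectorSourceConfig, RasterizedSourceConfig, RasterizerConfig,
    SemanticSegmentationLabelSourceConfig)

# Placeholder URIs for illustration; the real example points at hosted sample data.
train_image_uri, train_label_uri = 'train-image.tif', 'train-labels.geojson'
val_image_uri, val_label_uri = 'val-image.tif', 'val-labels.geojson'

# Classes the model should predict.
class_config = ClassConfig(names=['building', 'background'])

def make_scene(scene_id: str, image_uri: str, label_uri: str) -> SceneConfig:
    """Pair one raster with its per-scene GeoJSON labels."""
    return SceneConfig(
        id=scene_id,
        raster_source=RasterioSourceConfig(uris=[image_uri]),
        label_source=SemanticSegmentationLabelSourceConfig(
            raster_source=RasterizedSourceConfig(
                vector_source=GeoJSONVectorSourceConfig(
                    uri=label_uri, default_class_id=0),
                rasterizer_config=RasterizerConfig(background_class_id=1))))

dataset = DatasetConfig(
    class_config=class_config,
    train_scenes=[make_scene('train-1', train_image_uri, train_label_uri)],
    validation_scenes=[make_scene('val-1', val_image_uri, val_label_uri)])
```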
One possible source of the above information is a STAC catalog using the label extension.
There should be a library-level feature that provides a convenient way to train a model from such a catalog with minimal configuration. Perhaps a `StacDatasetConfig`.
Proposed scheme:
- Use STAC collections to group multiple STAC label items
- Specify which STAC collection to use for training data
- Read `ClassConfig` from the STAC collection JSON
- Use each label item to create a label source
  - Each label item links to either GeoJSON labels or raster labels
- Use each label item to create a raster source
  - Each label item links to the source imagery to which the labels apply
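To make the crawling concrete, here is a sketch of pulling those raw ingredients out of label items with pystac. The catalog URI, the collection id, and the `labels` asset key are assumptions about how such a catalog might be laid out, not an existing Raster Vision convention:

```python
import pystac

# Assumed layout: a root catalog containing a collection of label items for training.
catalog = pystac.Catalog.from_file('catalog/catalog.json')
train_labels = catalog.get_child('train-labels')  # hypothetical collection id

for item in train_labels.get_items():
    # The label extension's `label:classes` property lists class names; the
    # proposal above is to read the ClassConfig from the collection JSON instead.
    label_classes = item.properties.get('label:classes', [])
    class_names = [c for lc in label_classes for c in lc.get('classes', [])]

    # The label item's asset points at the GeoJSON (or raster) labels.
    label_uri = item.assets['labels'].href  # asset key is an assumption

    # The label extension links each label item to its source imagery via rel="source".
    source_link = item.get_single_link('source')
    image_item_uri = source_link.href if source_link else None

    print(item.id, class_names, label_uri, image_item_uri)
```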
To clarify, you mean that there will be a STAC collection for the training scenes, and a separate STAC collection for validation scenes?
As I mentioned the other day, a simple way to implement this might be to have a `to_dataset_config()` method that parses the underlying STACs and returns a standard `DatasetConfig` that links to standard `SceneConfig` objects. After that conversion happens, everything else in Raster Vision can work the same way it currently does.
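A minimal sketch of that conversion, assuming the training and validation label items can each be reached from their own catalog, and assuming GeoJSON labels. `StacDatasetConfig` and its fields are hypothetical names, while `DatasetConfig`, `SceneConfig`, and the source configs are the existing Raster Vision classes (exact signatures may vary by version):

```python
import pystac
from rastervision.core.data import (
    ClassConfig, DatasetConfig, SceneConfig, RasterioSourceConfig,
    ChipClassificationLabelSourceConfig, GeoJSONVectorSourceConfig)


class StacDatasetConfig:
    """Hypothetical config that points at STAC catalogs of label items."""

    def __init__(self, class_config: ClassConfig, train_catalog_uri: str,
                 val_catalog_uri: str):
        self.class_config = class_config
        self.train_catalog_uri = train_catalog_uri
        self.val_catalog_uri = val_catalog_uri

    def _scenes_from_catalog(self, uri: str) -> list:
        scenes = []
        for item in pystac.Catalog.from_file(uri).get_items():
            # Label asset and source imagery, as in the crawl sketched earlier.
            label_uri = item.assets['labels'].href              # asset key assumed
            source_item = (item.get_single_link('source')
                           .resolve_stac_object().target)       # the imagery item
            image_uri = source_item.assets['image'].href        # asset key assumed
            # A chip classification label source keeps the sketch short; the
            # task-appropriate label source would be chosen from the label type.
            scenes.append(SceneConfig(
                id=item.id,
                raster_source=RasterioSourceConfig(uris=[image_uri]),
                label_source=ChipClassificationLabelSourceConfig(
                    vector_source=GeoJSONVectorSourceConfig(
                        uri=label_uri, default_class_id=0))))
        return scenes

    def to_dataset_config(self) -> DatasetConfig:
        """Parse the underlying STACs into a standard DatasetConfig."""
        return DatasetConfig(
            class_config=self.class_config,
            train_scenes=self._scenes_from_catalog(self.train_catalog_uri),
            validation_scenes=self._scenes_from_catalog(self.val_catalog_uri))
```

The `ClassConfig` is passed in explicitly here; per the proposed scheme it could instead be read from the collection JSON.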
To think through the data formatting issues, I started by creating a sample STAC catalog we can train from.
See https://gist.github.com/echeipesh/2d2a18b59d634ecbfd97b7d32bba6164
Note that the `image-` and `label-` prefixes in the file names indicate that the files would live in `image/` and `label/` subfolders if Gist supported subfolders. All the links are built as if those folders exist.
Separating the images and labels into their own collections is good and very uncontroversial. There is some discussion on how to achieve the training and testing split.
I wanted to avoid using a property that tags each item as belonging to a "testing" or "training" split, since that would imply there can only be a single split without re-creating the catalog. So far I'm exploring having sub-catalogs that point to the items of either set. Raster Vision could discover the items by being pointed at the appropriate sub-catalog and crawling it. Of course, each item still refers to its collection.
As you can see, that results in a situation where there are two ways to reach each item from the root of the catalog: one path goes through its collection, and the other goes through the training split catalog. I believe this is "OK", but I'm seeking feedback on the idea as a whole (@lewfish).
Edit: After a conversation with Lewis, we decided it would be better if the training/testing split lived in a parallel top-level catalog that references items in the source catalog. This would avoid having to create a convention for tracking multiple splits inside the source catalog and would further insulate it from such changes. Something like:
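(The layout below is illustrative; directory and catalog names are placeholders.)

```
source-catalog/
  catalog.json                  # root of the source catalog
  images/
    collection.json             # imagery items
  labels/
    collection.json             # label items (label extension, rel="source" -> imagery items)

splits-catalog/                 # parallel top-level catalog, kept outside the source catalog
  catalog.json
  train/
    catalog.json                # item links -> ../../source-catalog/labels/...
  test/
    catalog.json                # item links -> ../../source-catalog/labels/...
```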
I think the convention and the logic for generating this split still need some figuring out; I'll spend some time exploring those options.