torchgeo
TileDatasets
- Introduces a new class of datasets called TileDatasets that are indexed by filename, xoffset, yoffset, and patch size.
- Implements samplers for these datasets
- Implements a L7IrishDataModule using this scheme
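To make the indexing scheme concrete, here is a minimal sketch of what a (filename, xoffset, yoffset, patch size) index could look like, and how a grid sampler might enumerate such indices for one image. The names `TileIndex` and `grid_indices` are hypothetical and are not taken from this PR.

```python
from typing import Iterator, NamedTuple


class TileIndex(NamedTuple):
    """Hypothetical index for one patch: which file, where, and how big."""

    filename: str
    xoffset: int  # column of the patch's top-left pixel
    yoffset: int  # row of the patch's top-left pixel
    size: int  # patch width and height in pixels


def grid_indices(
    filename: str, width: int, height: int, size: int, stride: int
) -> Iterator[TileIndex]:
    """Yield every full patch that fits inside a width x height image."""
    for yoffset in range(0, height - size + 1, stride):
        for xoffset in range(0, width - size + 1, stride):
            yield TileIndex(filename, xoffset, yoffset, size)


# A 512 x 256 image holds two non-overlapping 256 x 256 patches.
indices = list(grid_indices("scene.tif", width=512, height=256, size=256, stride=256))
```

Because the index is expressed in pixels of a single file, reading a patch needs no reprojection; the trade-off is that such an index cannot be intersected with another dataset.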
Thanks for opening this proof of concept!
There are a few questions here:
- Do we want to add this base class to TorchGeo?
- Do we want all of our curated benchmark GeoDatasets to subclass this?
- Do we want to search for files or pass in a list of files?
My current opinions:
- I think we'll likely want something like this in TorchGeo
- This one is more nuanced. Using TileDataset for all benchmark GeoDatasets lets us avoid reprojection, but prevents us from combining those datasets with other GeoDatasets. We'll have to decide how important that functionality is. Depending on how much time I have this summer, I may try to tackle #409, which would reduce the gains.
- I know you've been looking for something like this for a long time. I would like to be consistent here: Raster, Vector, and Tile datasets should all either search for files, accept a list of files, or support both options.
Do we want to add this base class to TorchGeo?
Maybe, although it feels like we should be able to get there by extending RasterDataset and the samplers instead. I think I sketched out a way for RasterDataset's __getitem__ to accept two different index types: the current bounding box, and the one used in this implementation (call these BoundingBox and Patch or something; note that you can convert from Patch to BoundingBox). If we're in a situation with overlapping datasets or something similar, you can just stick with the current implementation.
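The Patch-to-BoundingBox conversion mentioned above could look roughly like the sketch below. It assumes a north-up raster with square pixels and a known top-left origin; `Patch`, `Bounds`, and `patch_to_bounds` are hypothetical names, and `Bounds` is a simplified stand-in that omits the time dimension rather than TorchGeo's actual BoundingBox.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Patch:
    """Hypothetical pixel-space index within a single file."""

    xoffset: int  # column of the top-left pixel
    yoffset: int  # row of the top-left pixel
    size: int  # width and height in pixels


@dataclass(frozen=True)
class Bounds:
    """Simplified spatial bounds in the file's CRS (no time dimension)."""

    minx: float
    maxx: float
    miny: float
    maxy: float


def patch_to_bounds(patch: Patch, origin_x: float, origin_y: float, res: float) -> Bounds:
    """Convert a pixel patch to CRS bounds.

    Assumes a north-up raster with square pixels of side `res`, where
    (origin_x, origin_y) is the top-left corner of the image.
    """
    minx = origin_x + patch.xoffset * res
    maxx = minx + patch.size * res
    maxy = origin_y - patch.yoffset * res  # y decreases as rows increase
    miny = maxy - patch.size * res
    return Bounds(minx, maxx, miny, maxy)


bounds = patch_to_bounds(
    Patch(xoffset=256, yoffset=512, size=256),
    origin_x=500000.0,
    origin_y=4000000.0,
    res=30.0,
)
```

The reverse direction is only exact when the bounding box is aligned to the file's pixel grid and CRS, which is why the pixel-space index avoids resampling.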
Do we want all of our curated benchmark GeoDatasets to subclass this?
Depends on the first question
Do we want to search for files or pass in a list of files?
We've talked about this a bunch of times before -- I almost never want to search a file system and keep things that match a regex. The nice thing about passing a list is that you can rename fns to uris, pass in lists of COG files on remote servers, and it will work, which you definitely can't do with glob.
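As a rough illustration of the "support both options" idea, a dataset constructor could delegate to a helper like the one below. `resolve_files` and its parameters are hypothetical, and the example URLs are placeholders.

```python
import glob
import os
from typing import Iterable, List, Optional


def resolve_files(
    paths: Optional[Iterable[str]] = None,
    root: Optional[str] = None,
    pattern: str = "*.tif",
) -> List[str]:
    """Return an explicit list of files, or search `root` if none is given."""
    if paths is not None:
        # Explicit lists are passed through untouched, so remote URIs work.
        return list(paths)
    if root is None:
        raise ValueError("Provide either `paths` or `root`.")
    # Fallback: recursive search of a local directory tree.
    return sorted(glob.glob(os.path.join(root, "**", pattern), recursive=True))


# Remote files can only be supplied explicitly; glob cannot discover them.
uris = resolve_files(
    paths=["https://example.com/cogs/a.tif", "https://example.com/cogs/b.tif"]
)
```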
I'd like to revive this one, but maybe we can also make the samplers work with non-georeferenced images. Some of the datasets we have, like GID-15 and LEVIR-CD, consist of large images that I may want to sample smaller patches from for training. I don't want to manually preprocess them to a specific patch size beforehand, in case I want to run an ablation by varying the patch size.
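A sampler for non-georeferenced images only needs the image dimensions and a patch size, so the patch size can be varied without re-tiling anything on disk. The sketch below is a minimal illustration; `random_patches` is a hypothetical name and the image dimensions are illustrative.

```python
import random
from typing import Iterator, Optional, Tuple


def random_patches(
    width: int, height: int, size: int, length: int, seed: Optional[int] = None
) -> Iterator[Tuple[int, int, int]]:
    """Yield `length` random (xoffset, yoffset, size) patches inside an image."""
    if size > width or size > height:
        raise ValueError("Patch size exceeds image dimensions.")
    rng = random.Random(seed)
    for _ in range(length):
        # randint is inclusive, so the patch always fits inside the image.
        xoffset = rng.randint(0, width - size)
        yoffset = rng.randint(0, height - size)
        yield xoffset, yoffset, size


# Patch-size ablation: same image, different patch sizes, no preprocessing.
for size in (128, 256, 512):
    patches = list(random_patches(width=7200, height=6800, size=size, length=4, seed=0))
```

Each tuple can then be used to slice the image array or to build a windowed read, depending on how the dataset loads pixels.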