TileDatasets

Open calebrob6 opened this issue 2 years ago • 12 comments

  • Introduces a new class of datasets called TileDatasets that are indexed by filename, xoffset, yoffset, and patch size.
  • Implements samplers for these
  • Implements a L7IrishDataModule using this scheme
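
To make the indexing scheme concrete, here is a minimal sketch of a grid sampler that emits (filename, xoffset, yoffset, patch size) indices. It is not the PR's actual code: `TileIndex` and `grid_tile_indices` are hypothetical names, and image sizes are passed in explicitly rather than read from the files.

```python
from typing import Iterator, NamedTuple


class TileIndex(NamedTuple):
    """Hypothetical index: which file, and which pixel window inside it."""

    filename: str
    xoffset: int
    yoffset: int
    patch_size: int


def grid_tile_indices(
    sizes: dict[str, tuple[int, int]], patch_size: int, stride: int
) -> Iterator[TileIndex]:
    """Yield every full patch on a regular grid for each file.

    `sizes` maps filename -> (width, height) in pixels. Patches that would
    extend past the right or bottom edge are skipped.
    """
    for filename, (width, height) in sizes.items():
        for yoffset in range(0, height - patch_size + 1, stride):
            for xoffset in range(0, width - patch_size + 1, stride):
                yield TileIndex(filename, xoffset, yoffset, patch_size)


# A 512x256 image holds two non-overlapping 256x256 patches.
indices = list(
    grid_tile_indices({"scene_a.tif": (512, 256)}, patch_size=256, stride=256)
)
```

Because each index refers to pixel offsets within a single file, a dataset built this way never needs to reproject, which is the trade-off discussed in the comments below.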

calebrob6, May 20 '23 21:05

Thanks for opening this proof of concept!

There are a few questions here:

  1. Do we want to add this base class to TorchGeo?
  2. Do we want all of our curated benchmark GeoDatasets to subclass this?
  3. Do we want to search for files or pass in a list of files?

My current opinions:

  1. I think we'll likely want something like this in TorchGeo.
  2. This one is more nuanced. Using this for all benchmark GeoDatasets lets us avoid reprojection, but prevents us from combining those datasets with other GeoDatasets. We'll have to decide how important this functionality is. Depending on how much time I have this summer, I may try to tackle #409, which would shrink those gains.
  3. I know you've been looking for something like this for a long time. I would like to be consistent here: Raster/Vector/Tile datasets should either all search for files, all take a list of files, or all support both options.

adamjstewart, May 21 '23 03:05

Do we want to add this base class to TorchGeo?

Maybe -- although it feels like we should be able to get there by making RasterDataset and the samplers more complex. I think I sketched out a way to let RasterDataset's __getitem__ accept two different kinds of bounding box: the current one, and the one used in this implementation (call them BoundingBox and Patch or something; note that you can convert from Patch to BoundingBox). If we're in a situation with overlapping datasets or something similar, you can just stick with the current implementation.
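
The Patch-to-BoundingBox conversion mentioned above can be sketched as follows. This is an illustration only, assuming a north-up raster with square pixels and a known top-left origin; `Patch`, `Bounds`, and `patch_to_bounds` are hypothetical names, and the time range that TorchGeo's BoundingBox also carries is omitted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Patch:
    """Hypothetical pixel-space window within a single file."""

    xoffset: int
    yoffset: int
    size: int


@dataclass(frozen=True)
class Bounds:
    """Hypothetical spatial bounds in the file's own CRS (no time range)."""

    minx: float
    maxx: float
    miny: float
    maxy: float


def patch_to_bounds(
    patch: Patch, origin_x: float, origin_y: float, res: float
) -> Bounds:
    """Convert a pixel window to spatial bounds.

    Assumes (origin_x, origin_y) is the top-left corner of the raster,
    rows run southward, and every pixel is `res` units on a side.
    """
    minx = origin_x + patch.xoffset * res
    maxx = minx + patch.size * res
    maxy = origin_y - patch.yoffset * res
    miny = maxy - patch.size * res
    return Bounds(minx, maxx, miny, maxy)


bounds = patch_to_bounds(
    Patch(xoffset=10, yoffset=20, size=256),
    origin_x=500000.0,
    origin_y=4000000.0,
    res=30.0,
)
```

The reverse direction is not always exact, since an arbitrary bounding box need not align with one file's pixel grid. That is why the conversion is described as going from Patch to BoundingBox.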

Do we want all of our curated benchmark GeoDatasets to subclass this?

Depends on the first question

Do we want to search for files or pass in a list of files?

We've talked about this a bunch of times before -- I almost never want to search a file system and keep whatever matches a regex. The nice thing about taking a list is that you can rename fns to uris, pass in lists of COG files on remote servers, and it will work, whereas you definitely can't do that with glob.
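
One way to support both options is a small resolver that prefers an explicit list and falls back to a filesystem search. This is a sketch under assumptions, not an existing TorchGeo function: `resolve_files` and its parameters are hypothetical, and the URL in the example is a placeholder.

```python
import glob
import os
from typing import Iterable, Optional


def resolve_files(
    root: Optional[str] = None,
    pattern: str = "*.tif",
    paths: Optional[Iterable[str]] = None,
) -> list[str]:
    """Return the files a dataset should index.

    If `paths` is given, it is used as-is, so entries may be local paths
    or remote URIs. Otherwise `root` is searched recursively for files
    matching `pattern`, which only works on a local filesystem.
    """
    if paths is not None:
        return list(paths)
    if root is None:
        raise ValueError("Provide either `paths` or `root`.")
    return sorted(glob.glob(os.path.join(root, "**", pattern), recursive=True))


# Explicit list: remote URIs pass straight through (placeholder URL).
remote = resolve_files(paths=["https://example.com/scene_a.tif"])
```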

calebrob6, May 21 '23 04:05

I'd like to revive this one, but maybe we can also make the samplers work with non-georeferenced images. Some of our datasets, like GID-15 and LEVIR-CD, consist of large images that I may want to sample smaller patches from for training. I don't want to manually preprocess them to a fixed patch size beforehand, in case I want to run an ablation that varies the patch size.
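
For non-georeferenced images, sampling needs only the image dimensions and a patch size, so the patch size can stay a run-time parameter. The sketch below assumes plain pixel coordinates; `random_patch_offsets` is a hypothetical helper and the 1024x1024 image size is made up for illustration.

```python
import random
from typing import Iterator, Optional


def random_patch_offsets(
    width: int,
    height: int,
    patch_size: int,
    length: int,
    seed: Optional[int] = None,
) -> Iterator[tuple[int, int]]:
    """Yield `length` random (xoffset, yoffset) pairs for square patches.

    Every patch lies fully inside a `width` x `height` image.
    """
    if patch_size > width or patch_size > height:
        raise ValueError("patch_size must fit inside the image")
    rng = random.Random(seed)
    for _ in range(length):
        xoffset = rng.randint(0, width - patch_size)  # inclusive upper bound
        yoffset = rng.randint(0, height - patch_size)
        yield xoffset, yoffset


# Ablation over patch size without re-tiling the data on disk.
offsets_by_size = {
    patch_size: list(
        random_patch_offsets(1024, 1024, patch_size, length=4, seed=0)
    )
    for patch_size in (128, 256, 512)
}
```

Each offset pair would then be used to crop the loaded array, for example `image[y : y + patch_size, x : x + patch_size]` for a row-major image.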

isaaccorley, Dec 17 '23 03:12