
Multi-bbox timeseries querying

Open sfalkena opened this issue 3 months ago • 1 comments

This is a PR that is part of TorchGeo Timeseries support (#2382 - Return time series). The goal is to provide an interface that allows for the following options for querying:

sample = dataset[bbox]
sample = dataset[[bbox1, bbox2, bbox3]]
sample = dataset[([dataset1_bbox1], [dataset2_bbox1, dataset2_bbox2])]

The idea is that anything within a single bbox gets merged into a single raster, and that any query with multiple bboxes gets stacked along the time dimension. Querying with a tuple of (iterables of) bboxes would split the subqueries across different datasets.
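As a rough illustration of how the three query forms could be told apart, here is a minimal sketch. It is not the code in this PR: `BBox` is a hypothetical stand-in for TorchGeo's bounding box type, and `classify_query` is an illustrative helper name.

```python
from typing import NamedTuple


class BBox(NamedTuple):
    """Hypothetical stand-in for a spatiotemporal bounding box."""

    minx: float
    maxx: float
    miny: float
    maxy: float
    mint: float
    maxt: float


def _is_bbox_list(obj: object) -> bool:
    """True for a non-empty list that contains only BBox instances."""
    return isinstance(obj, list) and len(obj) > 0 and all(
        isinstance(b, BBox) for b in obj
    )


def classify_query(query: object) -> str:
    """Return which of the three proposed query forms `query` matches."""
    # BBox is itself a tuple subclass, so it has to be checked first.
    if isinstance(query, BBox):
        return "single"  # everything in the bbox is merged into one raster
    if _is_bbox_list(query):
        return "multi"  # one merged raster per bbox, stacked along time
    if isinstance(query, tuple) and len(query) > 0 and all(
        _is_bbox_list(sub) for sub in query
    ):
        return "per-dataset"  # one list of bboxes per underlying dataset
    raise TypeError(f"Unsupported query type: {type(query).__name__}")
```

A dataset that wraps only one source could call the same helper and raise an error on `"per-dataset"`, which matches the fallback mentioned in the to-do list.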

As long as this PR is in draft mode, I will keep test.ipynb for anyone interested in helping to try out the new method.

Current status: The first two ways of querying have been implemented for RasterDataset and tested with a small example.

What has changed:

  • I've refactored quite a bit to reduce the complexity of the __init__ and __getitem__ methods. try_set_metadata, _get_bounds, _compile_and_check_filename_regex and _populate_index are examples of this.
  • Existing functionality to merge everything within a bbox has been moved into __merge_single_bbox, where the biggest difference is that I keep track of a dataframe for all regex metadata instead of lists and dicts. The main reason is that it makes it easy to group filepaths per band, as well as to keep track of which dates went into which merged raster. More about that later.
  • Multi-bbox queries are stacked along a new leading dimension, giving (t, c, h, w), by __merge_query. We need to agree on whether non-temporal datasets should also get a time dimension of size 1.
  • Apart from the tensor with timeseries imagery, I am returning sample['dates'] in datetime format too. This is a list, per bbox, of the dates that went into that bbox's merged raster. So in the multi-bbox case this will be [[bbox1_t1, bbox1_t2, ...], [bbox2_t1]]. I chose datetime format for now since that worked well for me in practice, but this can be converted to any format by the transforms. I was thinking that instead of a list of dates, maybe we could return a daterange or something, but that mostly depends on the downstream use.
  • The filename_glob has been relaxed, so that all files/bands end up in the dataset index.
  • A new class variable nodata_value has been added, and a drop_nodata argument has been added to the class's __init__. Setting drop_nodata to True will drop any merged raster that contains nodata values. This came from using the class in practice with Sentinel-2 and seeing that some timestamps contained black (parts of) imagery, since some Sentinel tiles are not square. In theory, this could become a separate PR, but I chose to add it here because the effect of nodata pixels becomes more pronounced with timeseries.
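To make the stacking and nodata behaviour described above concrete, here is a minimal sketch. It uses NumPy arrays in place of tensors, and `stack_timeseries`, the `"image"` key and the default nodata value of 0 are assumptions made for illustration. It is not the `__merge_query` implementation from this PR.

```python
from datetime import datetime

import numpy as np


def stack_timeseries(
    rasters: list[np.ndarray],
    dates: list[list[datetime]],
    nodata_value: float = 0,  # assumed default, not taken from the PR
    drop_nodata: bool = False,
) -> dict:
    """Stack per-bbox merged rasters of shape (c, h, w) into (t, c, h, w).

    `dates[i]` holds the acquisition dates that went into `rasters[i]`.
    """
    kept_rasters, kept_dates = [], []
    for raster, raster_dates in zip(rasters, dates):
        # With drop_nodata set, skip any raster containing a nodata pixel.
        if drop_nodata and np.any(raster == nodata_value):
            continue
        kept_rasters.append(raster)
        kept_dates.append(raster_dates)
    if not kept_rasters:
        raise ValueError("All rasters were dropped because they contain nodata")
    # np.stack adds the new leading time axis.
    return {"image": np.stack(kept_rasters, axis=0), "dates": kept_dates}
```

Dropping a raster also drops its entry in `dates`, so the outer list always lines up with the time axis of the stacked array.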

What is still left to do:

  • Implement the third query method, querying with a tuple (or raise an error when trying to index a "single" dataset with a tuple).
  • Implement/copy querying strategy to other datasets.
  • Pass all pre-commit checks.

sfalkena, Nov 13 '24 13:11