Multi-bbox timeseries querying
This is a PR that is part of TorchGeo Timeseries support (#2382 - Return time series). The goal is to provide an interface that allows for the following options for querying:
sample = dataset[bbox]
sample = dataset[[bbox1, bbox2, bbox3]]
sample = dataset[([dataset1_bbox1], [dataset2_bbox1, dataset2_bbox2])]
The idea is that anything within a single bbox gets merged into a single raster, and that any query with multiple bboxes is stacked along the time dimension. Querying with a tuple of (iterable) bboxes would split the subqueries across different datasets.
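To make the three query forms concrete, here is a minimal sketch of how they could be normalized into one list of bboxes per dataset. It is not the code in this PR: `BBox`, `normalize_query` and the `num_datasets` parameter are hypothetical names used only for illustration, and the error raised for a tuple query on a single dataset is an assumption about the intended behaviour.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BBox:
    """Hypothetical stand-in for a spatiotemporal bounding box."""

    minx: float
    maxx: float
    miny: float
    maxy: float
    mint: float
    maxt: float


def normalize_query(query, num_datasets=1):
    """Return one list of bboxes per dataset (illustrative only).

    BBox            -> [[bbox]]              one merged raster
    list of BBox    -> [[bbox1, bbox2, ...]] stacked along time
    tuple of lists  -> [[...], [...]]        one sub-query per dataset
    """
    if isinstance(query, BBox):
        return [[query]]
    if isinstance(query, tuple):
        # Assumption: a tuple only makes sense when several datasets are combined.
        if num_datasets < 2 or len(query) != num_datasets:
            raise ValueError(
                f"tuple query of length {len(query)} does not match "
                f"{num_datasets} dataset(s)"
            )
        return [list(sub_query) for sub_query in query]
    return [list(query)]


bbox1 = BBox(0, 10, 0, 10, 0, 1)
bbox2 = BBox(0, 10, 0, 10, 1, 2)
per_dataset = normalize_query(([bbox1], [bbox1, bbox2]), num_datasets=2)
```

Keeping the normalization in one place would let the merging and stacking code assume a single input shape, whichever of the three forms the caller used.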
As long as this PR is in draft mode, I will keep `test.ipynb` for anyone interested in helping to try out the new method.
Current status: The first two ways of querying have been implemented for the RasterDataset and tested with a small example.
What has changed:
- I've refactored quite a bit to reduce the complexity of the `__init__` and `__getitem__` methods. `try_set_metadata`, `_get_bounds`, `_compile_and_check_filename_regex` and `_populate_index` are examples of this.
- Existing functionality to merge everything within a bbox has been moved into `__merge_single_bbox`, where the biggest difference is that I keep track of a dataframe for all regex metadata instead of `lists` and `dicts`. The main reason is that it makes it easy to group filepaths per band, as well as to keep track of which dates went into which merged raster. More about that later.
- Multi-bbox queries are stacked across a new dimension (t, c, h, w) by `__merge_query`. We need to agree on whether non-temporal datasets should get a time dimension of size 1.
- Apart from the tensor with timeseries imagery, I am also returning sample['dates'] in datetime format. This is a list of the dates that went into every single bbox, so in the multi-bbox case it will be `[[bbox1_t1, bbox1_t2, ...], [bbox2_t1]]`. I chose datetime format for now since that worked well for me in practice, but it can be converted to any format by the transforms. I was thinking that instead of a list of dates we could return a date range or something similar, but that mostly depends on the downstream use.
- The `filename_glob` has been relaxed, so that all files/bands end up in the dataset index.
- A new class variable `nodata_value` has been added, and `drop_nodata` has been added to the `__init__` of the class. Setting `drop_nodata` to True will ignore any merged raster that contains nodata values. This came from using the class in practice with Sentinel-2 and seeing that some timestamps contained black (parts of) imagery, since some Sentinel tiles are not square. In theory, this could become a separate PR, but I chose to add it here because the effect of nodata pixels becomes more pronounced with timeseries.
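As a rough illustration of the stacking, the `dates` list and the `drop_nodata` behaviour described above, here is a small pure-Python sketch. It is not the implementation in this PR: nested lists stand in for tensors, the helper names `contains_nodata` and `stack_merged` are hypothetical, and the `"image"` key and a nodata value of 0 are assumptions made for the example.

```python
from datetime import datetime


def contains_nodata(raster, nodata_value):
    """True if any pixel of a (c, h, w) nested-list raster equals nodata_value."""
    return any(
        pixel == nodata_value for band in raster for row in band for pixel in row
    )


def stack_merged(merged, drop_nodata=False, nodata_value=0):
    """Stack per-bbox results along a leading time axis (illustrative only).

    merged: list of (raster, dates) pairs, one pair per queried bbox, where
    raster is a (c, h, w) nested list and dates lists the acquisitions that
    were merged into it.
    """
    image, dates = [], []
    for raster, raster_dates in merged:
        # With drop_nodata enabled, skip any merged raster with nodata pixels.
        if drop_nodata and contains_nodata(raster, nodata_value):
            continue
        image.append(raster)
        dates.append(list(raster_dates))
    # image is now (t, c, h, w); dates has one inner list per time step.
    return {"image": image, "dates": dates}


# Two bboxes, one band, 2x2 pixels each; the second raster has a nodata pixel.
full = [[[5, 6], [7, 8]]]
partial = [[[5, 0], [7, 8]]]
merged = [
    (full, [datetime(2024, 1, 1), datetime(2024, 1, 6)]),
    (partial, [datetime(2024, 2, 1)]),
]
sample = stack_merged(merged, drop_nodata=True)
```

In this sketch the raster with a nodata pixel is dropped together with its dates, so `sample["image"]` and `sample["dates"]` stay aligned along the time axis.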
What is still left to do:
- Implement method 3, query with tuple (or raise an error if trying to index a "single" dataset with a tuple).
- Implement/copy querying strategy to other datasets.
- Pass all pre-commit checks.