Support Time Series Datasets and sampling for GeoDatasets
This PR aims to support time-series datasets and sampling, following the discussion in #640. I will try to explain my thought process below, given my understanding of the existing torchgeo framework. The sampling strategy I am currently considering is the following:
ForecastingGeoSampler
- specifies a time range and a duration for training and testing (e.g., train on one month of data, predict the next week), then returns pairs of inputs in an ordered, sequential fashion
Limitation of the current RasterDataset:
- limitation: currently a RasterDataset accepts a bounding box that includes the time dimension, but then calls the _merge_files function, where the time dimension is dropped in order to call rasterio.merge.merge to merge overlapping tiles for a given geographical region. As a consequence, if there are multiple scenes across time, this function simply disregards the time dimension.
- approach: I created a new TimeSeriesRasterDataset which inherits from RasterDataset and overwrites the _merge_files method. Here, as before, scenes are merged based on their geographical location, but only after all scenes retrieved from a bounding box hit have been grouped by their date, so that geographical areas are merged only within a single time step. Afterwards, they are concatenated along a new dimension to form a 4D tensor that includes the time dimension, and this 4D tensor is returned from the __getitem__ method (a rough sketch follows below).
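As a rough sketch (illustrative only, not the actual PR code; extract_date stands in for parsing the date from a filename via the dataset's filename_regex), the override could look like this:

```python
from collections import defaultdict

import rasterio
import rasterio.merge
import torch


def _merge_files_per_timestep(filepaths, bounds):
    """Merge tiles spatially per time step, then stack along a new time axis."""
    # Group the retrieved file paths by acquisition date; extract_date is a
    # hypothetical helper that applies filename_regex to the path.
    groups = defaultdict(list)
    for path in filepaths:
        groups[extract_date(path)].append(path)

    # Merge overlapping tiles separately for each time step.
    tensors = []
    for date in sorted(groups):
        sources = [rasterio.open(p) for p in groups[date]]
        dest, _ = rasterio.merge.merge(sources, bounds=bounds)  # (left, bottom, right, top)
        tensors.append(torch.from_numpy(dest))
        for src in sources:
            src.close()

    # Stack along a new leading dimension: [time, bands, height, width].
    return torch.stack(tensors, dim=0)
```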
Sampling/DataLoader:
- limitation: to my understanding, when using a PyTorch DataLoader, one can either pass a sampler, which gives instructions for retrieving a single sample (by returning a single bounding box), or a batch_sampler, which samples one whole batch (by returning a list of bounding boxes). However, for time-series tasks we need "something in between": following the single-sample approach, we need to return a tuple of bounding boxes that share the same geographic location but cover sequential time spans (one for the input sequence, and one for the target sequence that follows the input sequence in time).
- idea: define a sampler which returns a tuple of bounding boxes (one for the input and one for the target sequence). To my understanding, GeoDatasets will mostly be used in conjunction with IntersectionDatasets, because they are not curated and hence only contain either inputs or targets (however one might define that for their use case). If the sampler returns a tuple, then the __getitem__ method of the IntersectionDataset accepts Union[BBox, Tuple[BBox, BBox]] and retrieves the input sequence from the "image" dataset and the target sequence from the "mask" dataset, as sketched below.
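A minimal sketch of that dispatch (the tuple handling is the proposed extension; BoundingBox and IntersectionDataset are existing torchgeo classes, while the class name and the attribute access on the wrapped datasets are assumptions):

```python
from typing import Any, Dict, Tuple, Union

from torchgeo.datasets import BoundingBox, IntersectionDataset


class ForecastingIntersectionDataset(IntersectionDataset):  # hypothetical name
    def __getitem__(
        self, query: Union[BoundingBox, Tuple[BoundingBox, BoundingBox]]
    ) -> Dict[str, Any]:
        if isinstance(query, tuple):
            input_query, target_query = query
            # Input sequence from the "image" dataset, target sequence from
            # the "mask" dataset, shifted forward in time by the sampler.
            sample = self.datasets[0][input_query]  # assumes self.datasets holds the pair
            sample.update(self.datasets[1][target_query])
            return sample
        # Single bounding box: unchanged IntersectionDataset behaviour.
        return super().__getitem__(query)
```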
Current Issues:
- Retrieving a possible geo bounding box is done through the rtree self.index.intersection. However, the scenes populating the index each have a single timestamp, so when calling self.index.intersection we would need to ensure that, for a given geo location, the hits contain all available timestamps that respect the input and target sequence lengths we are interested in.
- Following the idea of specifying a "continuous" time range (days, months) for the ForecastingGeoSampler: how do you ensure that scenes actually fall within a randomly chosen input or target time range, given that the scenes themselves carry discrete timestamps? The approach I saw in the PyTorch Forecasting package is to define an increasing time_idx variable for all steps within a time series, and to define the input and target sequence lengths as the number of time indices to select from time_idx.
This current draft is just one idea and by no means exhaustive. I would still need to think about all kinds of edge cases and testing procedures. I am sure I missed or misunderstood several things, but hopefully it helps in thinking about a suitable implementation to support time-series tasks for GeoDatasets.
Sorry it took so long to review this. Here are my thoughts on your approach.
Sampler
The trickiest aspect of the sampler is that there are so many different ways someone might want to sample a dataset. Consider the following corner cases/situations:
- The user might want to specify a range of time to sample from, similar to spatial ROI
- The user might want to specify a train/val/test split across the time axis, by percent or by time range
- The user may want to do rolling window predictions (train on 3 days, predict next 24 hrs, shift 24 hrs)
- The user may want to work with cyclic data, such as only using data during the agricultural growing season, but for several years
- The user may require regular sequential data (every 16 days) or may allow irregular sequential data (as often as we have data)
  - In the latter case, this means the dimensions may not match from one batch to the next, or within a single batch
- Depending on overlap of WRS tiles, some areas will have 4X the temporal coverage as others
- When using IntersectionDataset, some datasets may have different temporal resolution than others
It feels difficult if not impossible to support all of these use cases with a single sampler. We'll probably need multiple samplers, or a lot of configuration options. It would be good to see what other time-series libraries call these samplers and what options they have.
Dataset
The problem with your approach is that we would need a TimeSeriesRasterDataset variant of every single RasterDataset subclass (Landsat-9, Sentinel-2, etc.). I think we can pretty easily avoid this limitation. Here are a couple of alternative ideas:
- Change RasterDataset.__getitem__ to accept multiple bounding boxes.
Basically, this would mean changing:
def __getitem__(self, query: BoundingBox) -> Dict[str, Any]:
to:
def __getitem__(self, *queries: BoundingBox) -> Union[Dict[str, Any], List[Dict[str, Any]]]:
If the sampler passes a single bounding box to the dataset, it behaves like it used to. If the sampler passes multiple bounding boxes, the dataset will return multiple separate sample dicts.
Added benefit: this would better support alternative sampling strategies like Tile2Vec or SeCo where we need to be able to return a (query, neighbor, distant) tuple per data point.
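A minimal sketch of that dispatch (note that Python's indexing syntax packs ds[bbox1, bbox2] into a single tuple argument, so the method can branch on the query type; _load_single is a hypothetical stand-in for the existing single-query logic):

```python
from typing import Any, Dict, List, Tuple, Union

from torchgeo.datasets import BoundingBox


def __getitem__(  # sketch of a RasterDataset method, not the actual code
    self, query: Union[BoundingBox, Tuple[BoundingBox, ...]]
) -> Union[Dict[str, Any], List[Dict[str, Any]]]:
    # ds[bbox] passes a single BoundingBox; ds[bbox1, bbox2] passes the
    # tuple (bbox1, bbox2) as one argument.
    if isinstance(query, BoundingBox):
        return self._load_single(query)  # hypothetical existing single-query logic
    # Multiple queries, e.g. (query, neighbor, distant) for Tile2Vec/SeCo:
    return [self._load_single(q) for q in query]
```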
- Change RasterDataset.__init__ to add a split_time (name TBD) parameter, and modify _merge_files to not merge across time depending on this flag.
In this case, the sampler would still return a single BoundingBox covering the entire time span, but the dataset wouldn't merge any files that are from different times. The dataset would return separate sample dicts, or a single sample dict with an extra dimension for time. This approach is more limited, since you can't pass multiple bounding boxes for different time splits; you just get the same data as before, but without merging across time. A rough sketch follows below.
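A rough sketch of that flag, reusing the per-date grouping from the earlier sketch (split_time, _merge_single_timestep, and group_files_by_date are all placeholders, not existing torchgeo code):

```python
import torch


def _merge_files(self, filepaths, query):  # sketch of the flag-controlled variant
    if not self.split_time:
        # Existing behaviour: merge everything into one [bands, H, W] tensor.
        return self._merge_single_timestep(filepaths, query)
    # split_time=True: merge tiles only within each time step, then stack
    # the per-step results into a [time, bands, H, W] tensor.
    per_step = [
        self._merge_single_timestep(paths, query)
        for _, paths in sorted(group_files_by_date(filepaths).items())
    ]
    return torch.stack(per_step, dim=0)
```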
I have more thoughts on this (in addition to a time axis, we may want to optionally support a Z axis for 3D weather/climate data), but don't want to bog this PR down by thinking too broadly.
Thank you for your feedback. Your proposed integration/change within the RasterDataset is a better solution, so I am working on that at the moment:
Dataset
- changing _merge_files: I am working on it
- when changing the __getitem__ method, I thought about how this would be used with the IntersectionDataset, because for GeoDatasets one could have the case that you specify an image and a mask/target dataset and take the intersection over them to sample queries. Thus, I think we also have to think about changing the __getitem__ method of the IntersectionDataset so that, in the case of a time-series prediction, the first bounding box queries the image dataset and the second, time-sequential bounding box samples only the mask/target dataset. Thoughts on this?
Sampler
I agree that for all these specific cases, different samplers could be a viable solution. I did some looking around at more classical time-series libraries and how they handle datasets. This is a little summary of some libraries I found (not exhaustive, and since I am not an experienced user of them, I might not do them justice):
- PyTorch Forecasting has a TimeSeriesDataSet class. This is all tabular data, and at a minimum the class expects the dataframe to have a time_idx column that specifies the time steps and a group_idx column which indicates which time series a data point belongs to. Additionally, one would specify which columns hold static or time-varying and categorical/continuous variables.
value | group | time_idx |
---|---|---|
0.201798 | 0 | 0 |
0.389338 | 0 | 1 |
-0.285848 | 0 | 2 |
-0.044911 | 1 | 0 |
-0.127036 | 1 | 1 |
-0.495227 | 1 | 2 |
-0.343048 | 1 | 4 |
Missing time steps like in group 1 above can be filled on the fly if a flag is specified.
Additionally, one has to specify an encoder_length that represents the length of the input sequence to a model, and a decoder_length that specifies the length of the prediction target sequence. Such time sequences are then sampled at random to generate input samples.
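For reference, a minimal sketch of these concepts in pytorch-forecasting (written from memory against the TimeSeriesDataSet API, so parameter names should be checked against the library docs):

```python
import pandas as pd
from pytorch_forecasting import TimeSeriesDataSet

# Tabular time series matching the table above (group 1 is missing time_idx 3).
df = pd.DataFrame(
    {
        "value": [0.20, 0.39, -0.29, -0.04, -0.13, -0.50, -0.34],
        "group": ["0", "0", "0", "1", "1", "1", "1"],
        "time_idx": [0, 1, 2, 0, 1, 2, 4],
    }
)

dataset = TimeSeriesDataSet(
    df,
    time_idx="time_idx",
    target="value",
    group_ids=["group"],
    max_encoder_length=2,          # input sequence length
    max_prediction_length=1,       # prediction target sequence length
    time_varying_unknown_reals=["value"],
    allow_missing_timesteps=True,  # fill the gap at time_idx 3 in group 1
)
```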
- darts has a TimeSeries object, which is also based on dataframes and follows a similar principle as described above. Here there are input_chunk_length and output_chunk_length parameters that decide what sequences to sample. Their dataset class also constructs a sequentially increasing time_idx from which the starting point is randomly sampled to retrieve an input_length + output_length long sequence for training.
- GluonTS also relies on a pandas dataframe. For their dataset class, they rely on a provided monotonically increasing timestep column. As far as I can tell, the sampling here works as it does in the libraries above: randomly drawn sequences from the time series provided in the dataframe are used to train the model.
From what I can tell, if you wanted to do time-series prediction for the different tasks you mentioned in #640, such as cycles, seasonal data, or any other data selection process that determines the input/output sequence, these libraries rely on you formatting your data yourself so that it includes the data in your desired frequency and length.
I guess what we are trying to do, in contrast, is to have a fixed dataset instance on which all kinds of different tasks are handled by sampling strategies that decide the time_idx column, frequency, and length. For remote sensing data, the latter approach also seems more necessary, because it would be much more of a burden to ask users to carefully construct their datasets from a collection of .tif files, which are more difficult to handle than a single dataframe that can be manipulated with pandas.
Edit
Looking into this again: time-series prediction can, in a sense, also be framed as video prediction. PyTorchVideo defines some clip samplers that could serve as inspiration.
This last commit contains a proposal for changing RasterDataset according to your two proposed changes. I also added some pytests for the dataset, but not yet for the samplers.
Given the complexity of supporting all the different time-series scenarios, my idea is to begin by focusing on a single one that should already be quite powerful for standard tasks: a "RandomSequenceSampler" (name TBD) that randomly samples consecutive input and target sequences of a given input and target length during training, like
- RandomClipSampler in pytorchvideo
- SequentialTrainingData in darts,
- the default TimeSeriesDataSet in pytorch-forecasting,
- or the default InstanceSplitter in gluonts.
I assume that a video by default has all possible time steps (frames) in each video sequence, i.e. not a video with blanks in between. That makes sampling sequences quite straightforward in the video framework. For the time-series libraries, the assumption is that the user provides a dataframe which includes a time_idx column specifying the consecutive data points within a time series. This also makes sampling index-based sequences easier, because one does not need to worry about whether the indices denote hours, days, months, etc.
- However, for geospatial data, I don't think we can expect the user to provide information that would give us indices to timestamped files, so this needs to be handled by the GeoDataset and sampler. I am not an expert with rtree, but maybe it would be reasonable to have a sampler argument time_unit: str (one of ["hour", "day", "month", "year"]) which gives us the time unit of a single index step, and then two more arguments, input_window: int and target_window: int, which specify the lengths of the sequences. So for (time_unit="month", input_window=10, target_window=2), one would get random sequences that let the model train on 10 months and predict the following 2 months. These would have to be converted to an rtree-indexable time format given the time unit (see the sketch after this list).
- Additionally, I am not sure we can assume that, given (time_unit="month", input_window=10, target_window=2), the available tiff images in the dataset actually respect the indices. For example, if I download Sentinel-2 imagery from Copernicus for two years, I don't think I can assume that for any given 10-month subsequence there are actually 10 images. If there are fewer, the standard time-series libraries have a flag that does some NaN interpolation and fills the gap with some scheme (would we do something similar for raster data?). We might also have more images for a given time period and would have to somehow condense those into a representation of [10, 3, 256, 256] (given the above example), since many model architectures expect constant sequence lengths.
- Another question is whether we can assume that, for a given dataset, all geospatial locations have all time sequences of interest available. And if we can't, how do we properly check for the scenarios where this is not the case?
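A small sketch of that conversion (torchgeo's index stores mint/maxt as UNIX timestamps; the month/year lengths here are crude approximations, and a real implementation would use calendar arithmetic):

```python
# Approximate seconds per time unit; "month" and "year" are rough averages.
SECONDS_PER_UNIT = {"hour": 3600, "day": 86400, "month": 30 * 86400, "year": 365 * 86400}


def time_windows(start, time_unit, input_window, target_window):
    """Return ((input_mint, input_maxt), (target_mint, target_maxt))."""
    step = SECONDS_PER_UNIT[time_unit]
    input_maxt = start + input_window * step
    target_maxt = input_maxt + target_window * step
    return (start, input_maxt), (input_maxt, target_maxt)


# (time_unit="month", input_window=10, target_window=2): train on ~10 months
# starting at `start`, predict the following ~2 months.
```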
So I am not sure about the preferred way of handling this, but I could also be overthinking it. Either way, I am interested in opinions!
The above commits contain a new design proposal that samples consecutive sequences for time-series forecasting tasks.
It contains the following new pieces:
- ForecastDataset: inherits from IntersectionDataset, with the idea that there are arguments for an input_dataset, from which we sample input sequences, and a target_dataset, from which we sample the corresponding target sequences. The RasterDataset is changed back to only accept a single bounding box, but the __getitem__ method of ForecastDataset expects two bounding box queries (input and target) that are generated by the sampler.
- SequentialGeoSampler: specifies an input sequence length and a target sequence length, can restrict the time range to sample from, and supports time units that are supposed to make it easier to specify the input and target lengths.
The idea is then to use the two together as follows:
class TimeSeriesInputRaster(RasterDataset):
filename_glob = "test_*.tif"
filename_regex = r"test_(?P<date>\d{8})_(?P<band>B0[234])"
date_format = "%Y%m%d"
is_image = True
separate_files = True
all_bands = ["B02", "B03", "B04"]
class TimeSeriesTargetRaster(RasterDataset):
filename_glob = "test_*.tif"
filename_regex = r"test_(?P<date>\d{8})_(?P<band>target)"
date_format = "%Y%m%d"
is_image = True
separate_files = True
all_bands = ["target"]
ds = ForecastDataset(
input_dataset=TimeSeriesInputRaster(root, as_time_series=True),
target_dataset=TimeSeriesTargetRaster(root, as_time_series=True)
)
sampler = SequentialGeoSampler(
ds,
sample_window=10,
target_window=2,
time_unit="months",
)
sample = ds[next(iter(sampler))]
input = sample["image"] # [sample_window, num_bands, height, width]
target = sample["target"] # [target_window, num_targets, height, width]
This would be used to learn a sequence of 10 months and predict the next two months.
A design choice I have not implemented yet is how to generate sampling hits with self.index.intersection(roi). Given both the spatial and the time dimension, the number of samples could be huge for a given patch_size. Questions about this and beyond:
- Would you first generate all possible time-sequence hits in a rolling-window fashion, and then draw random geo-location samples from each time sequence? Or the other way around? Are there recommended approaches with the rtree? (A sketch of the first option follows below.)
- Either way, it would probably be useful to specify a max_number_of_samples parameter that gives some control over the number of samples, like darts' num_samples_per_ts parameter?
- Assumptions about the available time steps of data: do we handle gaps and/or too many available time steps that could potentially be gathered by a bbox sample?
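For the first option, a sketch might look like the following (draw_patch is a hypothetical helper that picks a patch_size crop from a hit; the coordinate order matches torchgeo's interleaved=False index, i.e. (minx, maxx, miny, maxy, mint, maxt)):

```python
import random


def rolling_time_windows(mint, maxt, window, stride):
    """Yield (start, stop) time ranges sweeping across [mint, maxt]."""
    start = mint
    while start + window <= maxt:
        yield start, start + window
        start += stride


def sample_sequences(index, roi, window, stride, patches_per_window, draw_patch):
    for t0, t1 in rolling_time_windows(roi.mint, roi.maxt, window, stride):
        # Restrict the rtree query to the current time slice of the ROI.
        coords = (roi.minx, roi.maxx, roi.miny, roi.maxy, t0, t1)
        hits = list(index.intersection(coords, objects=True))
        if not hits:
            continue  # temporal gap: no scenes fall inside this window
        for _ in range(patches_per_window):
            yield draw_patch(random.choice(hits), t0, t1)  # draw_patch is hypothetical
```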
I would appreciate feedback on this, and also want to mention the following questions for later on:
- Dataset splitting into training, val, and test sets. I think it would be amazing if we could provide some utilities for that as well, because it is a pain to deal with given the added time dimension.
- A ForecastingTask class with basic models like Conv-LSTM, UNet-LSTM, or some TimeSpatialTransformer.
Maybe it is useful to settle on some terminology that can be used for models, datasets, samplers, and tasks in the context of time-series prediction. These are just a few terms encountered so far.
- input sequence length: input_chunk_length in darts, encoder_length in pytorch-forecasting, a form of sequence_length in gluon-ts; 'window' is also sometimes encountered
- prediction sequence length: output_chunk_length in darts, prediction_length in pytorch-forecasting, prediction_length in gluon-ts
- what to call the keys in the data sample returned by the dataloader:
  - input for the input sequence images, and target for the target sequence images?
  - prefix the bbox and crs entries with the above, respectively?
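For instance, one possible sample layout under that naming (purely illustrative, not implemented; shapes follow the 10-month/2-month example from earlier in the thread):

```python
import torch

sample = {
    "input": torch.zeros(10, 3, 256, 256),  # input sequence images
    "target": torch.zeros(2, 1, 256, 256),  # target sequence images
    "input_bbox": (0.0, 1.0, 0.0, 1.0, 0.0, 1.0),  # bbox keys prefixed per sequence
    "target_bbox": (0.0, 1.0, 0.0, 1.0, 1.0, 2.0),
    "crs": "EPSG:4326",
}
```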
I am not an expert with rtree, but I think there might be a couple of things that could be changed.
- At the moment, the filename_regex is only supposed to pick up a single band with separate_files=True, otherwise the length of the dataset is confusing. However, we now need some control over extracting the date from every individual filename, but there won't be a match if the regex does not cover all band names in __getitem__.
- I don't quite understand how the for hit in self.index.intersection(tuple(self.roi), objects=True): line in the samplers works, because the number of samples is also dependent on the number of items in the index? And I think there is no way to have self.index.intersection return only sequences of a given time duration, so extra steps are required (see the sketch below).
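One possible extra step is to query the full ROI and then bin the returned hits by timestamp, so that fixed-duration sequences can be assembled afterwards. A sketch (hit.bounds follows torchgeo's (minx, maxx, miny, maxy, mint, maxt) order):

```python
from collections import defaultdict


def hits_by_timestamp(index, roi):
    """Group rtree hits for an ROI by their start time (mint)."""
    grouped = defaultdict(list)
    for hit in index.intersection(tuple(roi), objects=True):
        mint = hit.bounds[4]  # (minx, maxx, miny, maxy, mint, maxt)
        grouped[mint].append(hit)
    # Sorted by time, so consecutive sequences of a given duration can be
    # built by scanning adjacent keys.
    return dict(sorted(grouped.items()))
```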
@nilsleh Please may I know how to use these changes for a time-series segmentation task?
I have a prechipped raster dataset with observations throughout the year, and a vector dataset for the segmentation labels. I'm trying to use the new RasterDataset logic alongside the SequentialGeoSampler, but I keep getting ValueError: 'a' cannot be empty unless no samples are taken.
Do you have a brief worked example to share?
Thanks
Hi @Spiruel, this PR is still ongoing work: it is not rigorously tested yet, and some design choices remain unanswered, so it is very likely to change, maybe even substantially. However, it's exciting to hear that you are looking to use it for your task at hand, and I will try to help out. I would also be grateful for some feedback, because I have only tested it on a personal project (that I cannot share), while there are of course many more use cases out there.
From your error message, I am guessing that it occurs in this line, but to make sure, could you post a longer error trace? The error here would mean that during the __init__ call of the sampler (around here), no regions and/or timestamps were found that could serve as samples. This could have multiple reasons, but that would be my starting point.
In my setup, the data loading and sampling look as in the comment above.
Maybe you can check that you can retrieve samples from your RasterDataset class without the SequentialGeoSampler, and also from the ForecastDataset? Common issues there are the filename_glob and the regex expression needed for the time; a quick sanity check is sketched below. I hope this gives some pointers, but let me know.
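For example (RandomGeoSampler is existing torchgeo API; TimeSeriesInputRaster, root, and as_time_series come from the example earlier in this thread):

```python
from torchgeo.samplers import RandomGeoSampler

input_ds = TimeSeriesInputRaster(root, as_time_series=True)
sampler = RandomGeoSampler(input_ds, size=32, length=1)
sample = input_ds[next(iter(sampler))]
print(sample["image"].shape)  # expect a leading time dimension with as_time_series=True
```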
Thanks - I'd be happy to offer feedback on this as I test it out on my use case and get my head around it. I've got a number of crop classification use cases that I should be able to try out once I get this working.
Yes, my error does occur on this line and I'll attempt to debug this further. There should be plenty of regions and timestamps to serve as samples.
I followed your setup above and adapted it to my time-series segmentation task, based on my understanding. I am able to retrieve samples from my input_ds as normal using samplers like RandomGeoSampler, and I get the correct spatial and temporal information (I checked the bbox, mint, maxt attributes etc. and all looks good). However, when I try to use SequentialGeoSampler to get a time series, I get the error mentioned above.
#prechipped harmonised landsat-sentinel rasters across a calendar year
class TimeSeriesInputRaster(RasterDataset):
filename_glob = "*.tif"
filename_regex = r"^\d+_\d+_(?P<date>\d{8})T?.tif$"
date_format = "%Y%m%d"
is_image = True
separate_files = False
all_bands = ["B02", "B03", "B04"]
#yearly cover crop masks for oct-apr
class TimeSeriesTargetRaster(RasterDataset):
filename_glob = "*_cc2class.tif"
filename_regex = r"""
^(?P<date>\d+)
_cc2class.*$
"""
date_format = "%Y"
is_image = True
separate_files = False
all_bands = ["target"]
input_root = "S30/4"
label_root = "data"
input_ds = TimeSeriesInputRaster(input_root, as_time_series=True)
label_ds = TimeSeriesTargetRaster(label_root, as_time_series=True)
print(len(input_ds))
#don't need forecast dataset in this example
ds = ForecastDataset(
input_dataset=input_ds,
target_dataset=label_ds
)
sampler = SequentialGeoSampler(
input_ds,
encoder_length=3,
prediction_length=1,
time_unit="weeks",
size=32,
length=10
)
sample = ds[next(iter(sampler))]
input = sample["image"] # [sample_window, num_bands, height, width]
target = sample["target"] # [target_window, num_targets, height, width]
#expect input features of size [3, 14, 32, 32] and corresponding label of size [1, 2, 32, 32]
Let me know if this discussion is best had on this thread, or whether I should move it elsewhere! Thanks
Edit: Continued in #1544