
Support Time Series Datasets and sampling for GeoDatasets

Open nilsleh opened this issue 2 years ago • 10 comments

This PR aims to support time-series datasets and sampling, following the discussion in #640. I will try to explain the thought process below, given my understanding of the existing torchgeo framework. The current sampling strategy I am considering is the following:

ForecastingGeoSampler

  • specifies a time range, and a duration for training and testing (e.g., train on one month of data, predict next week), then return pairs of inputs in an ordered sequential fashion

Limitation current RasterDataset:

  • limitation: currently a RasterDataset accepts a bounding box that includes the time dimension, but then calls the _merge_files function, where the time dimension is dropped in order to call rasterio.merge.merge to merge overlapping tiles for a given geographical region. As a consequence, if there are multiple scenes across time, this function simply disregards the time dimension
  • approach: I created a new TimeSeriesRasterDataset which inherits from RasterDataset and overrides the _merge_files method. As before, scenes are merged by geographical location, but only after all scenes retrieved from a bounding box hit have been grouped by date, so that geographical areas are merged per single time step. Afterwards, the per-date mosaics are concatenated along a new dimension to form a 4D tensor that includes the time dimension, and this 4D tensor is returned from the __getitem__ method
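The group-by-date-then-stack step described above can be sketched roughly as follows. This is a minimal illustration only: merge_per_date, load_and_merge, and the use of plain numpy arrays as stand-ins for rasterio mosaics are all assumptions, not torchgeo API.

```python
from collections import defaultdict

import numpy as np


def merge_per_date(filepaths_and_dates, load_and_merge):
    """Group files by timestamp, merge each group spatially, then stack in time.

    filepaths_and_dates: iterable of (filepath, date) pairs from an rtree hit
    load_and_merge: callable that spatially merges a list of files into a
        [channels, height, width] array (stand-in for rasterio.merge.merge)
    """
    groups = defaultdict(list)
    for filepath, date in filepaths_and_dates:
        groups[date].append(filepath)
    # Merge overlapping tiles separately for each time step...
    merged = [load_and_merge(files) for _, files in sorted(groups.items())]
    # ...then concatenate along a new leading time axis -> 4D [T, C, H, W]
    return np.stack(merged, axis=0)
```

With two dates and a 3-band mosaic per date, the result would be a [2, 3, H, W] array, matching the 4D tensor described above.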

Sampling/DataLoader:

  • limitation: to my understanding, when using the PyTorch DataLoader, one can either pass a sampler, which gives instructions for retrieving a single sample (by returning a single bounding box), or a batch_sampler, which samples a single batch (by returning a list of bounding boxes). However, for time-series tasks we need "something in between": following the single-sample sampler, we need to return a tuple of bounding boxes that share the same geographic location but cover sequential time durations (one for the input sequence, and one for the target sequence that follows the input sequence in time).
  • idea: define a sampler which returns a tuple of bounding boxes (one for the input and one for the target sequence). To my understanding, GeoDatasets will mostly be used in conjunction with IntersectionDatasets, because they are not curated and hence contain either only inputs or only targets (however one defines those for a given use case). If the sampler returns a tuple, then the __getitem__ method of the IntersectionDataset accepts Union[BBox, Tuple[BBox, BBox]], retrieves the input sequence from the "image" dataset, and retrieves the target sequence from the "mask" dataset
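The tuple-dispatch idea could look roughly like the sketch below. Everything here is a simplified stand-in: BBox is modeled as a plain tuple and the member datasets as callables, so none of this mirrors torchgeo's actual IntersectionDataset internals.

```python
from typing import Any, Dict, Tuple, Union

BBox = Tuple[float, ...]  # placeholder for torchgeo's BoundingBox


class PairedDataset:
    """Toy stand-in for an IntersectionDataset accepting single or paired queries."""

    def __init__(self, image_ds, mask_ds):
        self.image_ds = image_ds  # callable: query -> {"image": ...}
        self.mask_ds = mask_ds    # callable: query -> {"mask": ...}

    def __getitem__(self, query: Union[BBox, Tuple[BBox, BBox]]) -> Dict[str, Any]:
        if isinstance(query, tuple) and len(query) == 2 and isinstance(query[0], tuple):
            # Time-series case: the first box queries the image dataset, the
            # second (time-shifted) box queries the mask/target dataset.
            input_query, target_query = query
            return {**self.image_ds(input_query), **self.mask_ds(target_query)}
        # Single-box case: both member datasets receive the same query.
        return {**self.image_ds(query), **self.mask_ds(query)}
```

The tuple check is the fragile part of this sketch; a dedicated query type would make the dispatch unambiguous.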

Current Issues:

  • Retrieving candidate bounding boxes is done through the rtree's self.index.intersection. However, the index is populated with scenes that each carry a single timestamp, so when calling self.index.intersection we would need to ensure that, for a given geographic location, the hit contains all available timestamps that satisfy the input and target sequence lengths we are interested in
  • Following the idea of specifying a "continuous" time range for the ForecastingGeoSampler (e.g., days or months), how do you ensure that the randomly chosen input or target time range actually contains scenes, given that the scenes themselves carry discrete timestamps? The approach I saw in the PyTorch Forecasting package is to define an increasing time_idx variable for all steps within a time series and then define the input and target sequence lengths as the number of indices to select from time_idx.
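The time_idx bookkeeping mentioned above could be sketched as follows (a hedged illustration mirroring the PyTorch Forecasting convention; build_time_index and window_hits are invented names):

```python
def build_time_index(timestamps):
    """Map sorted distinct timestamps to consecutive integer indices."""
    return {t: i for i, t in enumerate(sorted(set(timestamps)))}


def window_hits(timestamps, input_len, target_len):
    """Yield (input, target) timestamp windows fully covered by available data."""
    ordered = sorted(set(timestamps))
    total = input_len + target_len
    for start in range(len(ordered) - total + 1):
        # consecutive indices, regardless of the wall-clock gaps between them
        yield ordered[start:start + input_len], ordered[start + input_len:start + total]
```

Expressing sequence lengths in index steps rather than wall-clock durations sidesteps the "is there actually a scene in this range?" problem, at the cost of ignoring irregular gaps.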

This current draft is just one idea and by no means exhaustive. I would still need to think about all kinds of edge cases and testing procedures. I am sure I missed or misunderstood several things, but hopefully it helps in thinking about a suitable implementation to support time-series tasks for GeoDatasets.

nilsleh avatar Oct 30 '22 10:10 nilsleh

Sorry it took so long to review this. Here are my thoughts on your approach.

Sampler

The trickiest aspect of the sampler is that there are so many different ways someone might want to sample a dataset. Consider the following corner cases/situations:

  • The user might want to specify a range of time to sample from, similar to spatial ROI
  • The user might want to specify a train/val/test split across the time axis, by percent or by time range
  • The user may want to do rolling window predictions (train on 3 days, predict next 24 hrs, shift 24 hrs)
  • The user may want to work with cyclic data, such as only using data during the agricultural growing season, but for several years
  • The user may require regular sequential data (every 16 days) or may allow irregular sequential data (as often as we have data)
  • In the latter case, this means the dimensions may not match from one batch to the next, or within a single batch
  • Depending on overlap of WRS tiles, some areas will have 4X the temporal coverage as others
  • When using IntersectionDataset, some datasets may have different temporal resolution than others

It feels difficult if not impossible to support all of these use cases with a single sampler. We'll probably need multiple samplers, or a lot of configuration options. It would be good to see what other time-series libraries call these samplers and what options they have.

Dataset

The problem with your approach is that we would need a TimeSeriesRasterDataset variant of every single RasterDataset subclass (Landsat-9, Sentinel-2, etc.). I think we can pretty easily avoid this limitation. Here are a couple of alternative ideas:

  1. Change RasterDataset.__getitem__ to accept multiple bounding boxes.

Basically, this would mean changing:

def __getitem__(self, query: BoundingBox) -> Dict[str, Any]:

to:

def __getitem__(self, *queries: BoundingBox) -> Union[Dict[str, Any], List[Dict[str, Any]]]:

If the sampler passes a single bounding box to the dataset, it behaves like it used to. If the sampler passes multiple bounding boxes, the dataset will return multiple separate sample dicts.

Added benefit: this would better support alternative sampling strategies like Tile2Vec or SeCo where we need to be able to return a (query, neighbor, distant) tuple per data point.
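A minimal sketch of option 1, under stated assumptions: MultiQueryDataset and _load are hypothetical stand-ins for the real RasterDataset machinery, which does far more per query.

```python
from typing import Any, Dict, List, Union


class MultiQueryDataset:
    """Toy dataset whose __getitem__ accepts one or several queries."""

    def _load(self, query) -> Dict[str, Any]:
        # stand-in for file lookup + merge for one bounding box
        return {"bbox": query}

    def __getitem__(self, *queries) -> Union[Dict[str, Any], List[Dict[str, Any]]]:
        samples = [self._load(q) for q in queries]
        # Preserve the old single-box behavior when only one query is given
        return samples[0] if len(samples) == 1 else samples
```

One Python subtlety worth noting: `ds[a, b]` actually invokes `__getitem__((a, b))` with a single tuple argument, so either the method must unpack an incoming tuple of boxes itself or the caller must invoke `ds.__getitem__(a, b)` directly.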

  2. Change RasterDataset.__init__ to add a split_time (name TBD) parameter, and modify _merge_files to not merge across time when this flag is set.

In this case, the sampler would still return a single BoundingBox covering the entire time span, but the dataset wouldn't merge files from different times. The dataset would return separate sample dicts, or a single sample dict with an extra dimension for time. This approach is more limited, since you can't pass multiple bounding boxes for different time splits; you just get the same data as before, but without merging across time.
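The effect of such a flag can be illustrated like this (merge_files, merge_fn, and the dict-of-groups input are simplifications for illustration, not the actual torchgeo signature):

```python
def merge_files(groups_by_date, merge_fn, split_time=False):
    """groups_by_date: {date: [filepaths]}; merge_fn merges one group spatially."""
    if split_time:
        # one mosaic per timestamp, ordered by date; a caller could stack
        # these along a new time axis
        return [merge_fn(files) for _, files in sorted(groups_by_date.items())]
    # legacy behavior: merge everything into one mosaic regardless of timestamp
    all_files = [f for files in groups_by_date.values() for f in files]
    return merge_fn(all_files)
```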

I have more thoughts on this (in addition to a time axis, we may want to optionally support a Z axis for 3D weather/climate data), but don't want to bog this PR down by thinking too broadly.

adamjstewart avatar Nov 18 '22 19:11 adamjstewart

Thank you for your feedback. Your proposed integration/change within the RasterDataset is a better solution, so I am working on that at the moment:

Dataset

  • I am working on the change to _merge_files
  • when changing the __getitem__ method, I thought about how this would be used with the IntersectionDataset, because for GeoDatasets one could specify an image and a mask/target dataset and take the intersection over them to sample queries. Thus, I think we also have to consider changing the __getitem__ method of the IntersectionDataset so that, in the case of a time-series prediction, the first bounding box queries the image dataset and the second, time-sequential bounding box queries only the mask/target dataset. Thoughts on this?

Sampler

I agree that different samplers could be a viable solution for all these specific cases. I did some looking around more classical time-series libraries and how they handle datasets. Here is a little summary of some libraries I found (not exhaustive, and I am not an experienced user, so I might not do them justice):

  • PyTorch Forecasting has a TimeSeriesDataSet class. This is all tabular data; at a minimum, the class expects the dataframe to have a time_idx column that specifies time steps and a group_idx column that indicates which time series a data point belongs to. Additionally, one specifies which columns hold static or time-varying and categorical/continuous variables.
        value  group  time_idx
     0.201798      0         0
     0.389338      0         1
    -0.285848      0         2
    -0.044911      1         0
    -0.127036      1         1
    -0.495227      1         2
    -0.343048      1         4

Missing time steps like in group 1 above can be filled on the fly if a flag is specified.
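A small pandas sketch of that fill-on-the-fly behavior: reindex each group onto its full time_idx range and forward-fill the gap. The column names follow the example table above; the forward-fill strategy is just one possible scheme.

```python
import pandas as pd


def fill_missing_steps(df: pd.DataFrame) -> pd.DataFrame:
    """Reindex each group onto its full time_idx range and forward-fill gaps."""
    filled = []
    for group, g in df.groupby("group"):
        full_idx = range(int(g["time_idx"].min()), int(g["time_idx"].max()) + 1)
        g = g.set_index("time_idx").reindex(full_idx).ffill()
        g.index.name = "time_idx"
        g = g.reset_index()
        g["group"] = group  # restore the group id on newly created rows
        filled.append(g)
    return pd.concat(filled, ignore_index=True)
```

For group 1 in the table above, this would insert a row at time_idx 3 carrying the value from time_idx 2.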

Additionally, one has to specify an encoder_length that represents the length of the input sequence to a model, and a decoder_length that specifies the length of the prediction target sequence. Such time sequences are then sampled at random to generate input samples.

  • darts has a TimeSeries object, which is also based on dataframes and follows a similar principle to the one described above. Here there are input_chunk_length and output_chunk_length parameters that decide what sequences to sample. Their dataset class also constructs a sequentially increasing time_idx, from which a starting point is randomly sampled to retrieve an input_chunk_length + output_chunk_length long sequence for training.

  • GluonTS also relies on a pandas dataframe. For their dataset class they rely on a provided monotonically increasing timestamp column. As far as I can tell, the sampling here works as it does in the libraries above: randomly drawn sequences from the time series provided in the dataframe are used to train the model.

From what I can tell, if you wanted to do time-series prediction for the different tasks mentioned in #640, such as cycles, seasonal data, or any other data selection process that determines the input/output sequence, these libraries expect you to format your data yourself so that it contains the data at your desired frequency and length.

I guess what we are trying to do, in contrast, is to have a fixed dataset instance on which all kinds of different tasks are handled by sampling strategies that decide the time_idx column, frequency, and length. For remote sensing data the latter approach also seems more necessary, because it would be far more of a burden to ask users to carefully construct their datasets from a collection of .tif files, which are more difficult to handle than a single dataframe that can be manipulated with pandas.

Edit

Looking into this again, time-series prediction can in a sense also be framed as video prediction. PyTorchVideo defines some clip samplers that could serve as inspiration.

nilsleh avatar Nov 21 '22 08:11 nilsleh

This last commit contains a proposal for changing RasterDataset according to your two proposed changes. And I added some pytests for the dataset but not yet for the samplers.

nilsleh avatar Nov 22 '22 11:11 nilsleh

Given the complexity of supporting all the different time-series scenarios, my idea is to first focus on a single one that should already be quite powerful for standard tasks: a "RandomSequenceSampler" (name TBD) that randomly samples consecutive input and target sequences for a given input and target length during training.

I assume that a video by default has all possible time steps (frames) in each video sequence, i.e. not a video with blanks in between. That makes sampling sequences quite straightforward in the video framework. For the time-series libraries, the assumption is that the user provides a dataframe which includes a time_idx column specifying the consecutive data points within a time series. This also makes sampling index-based sequences easier, because one does not need to worry about whether the indices denote hours, days, months, etc.

  1. However, for geospatial data, I don't think we can expect the user to provide information that gives us indices into the timestamped files, so this needs to be handled by the GeoDataset and sampler. I am not an expert with rtree, but maybe it would be reasonable to have a sampler argument time_unit: str (one of ["hour", "day", "month", "year"]) which gives the time unit of a single index step, plus two more arguments, input_window: int and target_window: int, which specify the lengths of the sequences. So for (time_unit="month", input_window=10, target_window=2) one would get random sequences that let the model train on 10 months and predict the following 2 months. These would have to be converted to an rtree-indexable time format given the time unit.

  2. Additionally, I am not sure we can assume that, given (time_unit="month", input_window=10, target_window=2), the available tiff images in the dataset actually respect these indices. For example, if I download Sentinel-2 imagery from Copernicus for two years, I don't think I can assume that any given 10-month subsequence actually contains 10 images. If there are fewer, the standard time-series libraries have a flag that does some NaN interpolation and fills the gap with some scheme (would we do something similar for raster data?). We might also have more images for a given time period and would have to somehow condense those into a representation of [10, 3, 256, 256] in the above example, since many model architectures expect constant sequence lengths.

  3. Another question is whether we can assume that, for a given dataset, all geospatial locations have all time sequences of interest available, and if not, how we properly check for the scenarios where this is not the case.
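The conversion hinted at in point 1 could look roughly like this. The unit table and helper names are assumptions for illustration, not torchgeo API, and the month/year lengths are deliberately coarse approximations.

```python
from datetime import datetime, timezone

SECONDS_PER_UNIT = {
    "hours": 3600,
    "days": 24 * 3600,
    "weeks": 7 * 24 * 3600,
    "months": 30 * 24 * 3600,   # coarse approximation
    "years": 365 * 24 * 3600,   # coarse approximation
}


def sequence_time_bounds(start: datetime, time_unit: str, input_window: int, target_window: int):
    """Return ((input_mint, input_maxt), (target_mint, target_maxt)) as POSIX seconds.

    The target interval starts exactly where the input interval ends, so both
    can be fed to an rtree intersection query as numeric time bounds.
    """
    step = SECONDS_PER_UNIT[time_unit]
    t0 = start.timestamp()
    input_bounds = (t0, t0 + input_window * step)
    target_bounds = (input_bounds[1], input_bounds[1] + target_window * step)
    return input_bounds, target_bounds
```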

So I am not sure about the preferred way of handling this, but I could also be overthinking it. Either way, I am interested in opinions!

nilsleh avatar Jan 21 '23 17:01 nilsleh

The above commits contain a new design proposal that samples consecutive sequences for time-series forecasting tasks.

It contains the following new pieces:

  • ForecastDataset: inherits from IntersectionDataset, with the idea that there are arguments input_dataset, from which we sample input sequences, and target_dataset, from which we sample the corresponding target sequences. The RasterDataset is changed back to only accept a single bounding box, but the __getitem__ method of ForecastDataset expects two bounding box queries (input and target) that are generated by the sampler
  • SequentialGeoSampler: specifies an input sequence length and a target sequence length, can restrict the time range to sample from, and accepts time units that are supposed to make it easier to specify input and target lengths

The idea is then to use the two together as follows:

class TimeSeriesInputRaster(RasterDataset):
    filename_glob = "test_*.tif"
    filename_regex = r"test_(?P<date>\d{8})_(?P<band>B0[234])"
    date_format = "%Y%m%d"
    is_image = True
    separate_files = True
    all_bands = ["B02", "B03", "B04"]

class TimeSeriesTargetRaster(RasterDataset):
    filename_glob = "test_*.tif"
    filename_regex = r"test_(?P<date>\d{8})_(?P<band>target)"
    date_format = "%Y%m%d"
    is_image = True
    separate_files = True
    all_bands = ["target"]

ds = ForecastDataset(
    input_dataset=TimeSeriesInputRaster(root, as_time_series=True),
    target_dataset=TimeSeriesTargetRaster(root, as_time_series=True)
)

sampler = SequentialGeoSampler(
    ds,
    sample_window=10,
    target_window=2,
    time_unit="months",
)

sample = ds[next(iter(sampler))]
input = sample["image"] # [sample_window, num_bands, height, width]
target = sample["target"] # [target_window, num_targets, height, width]

This would be used to learn a sequence of 10 months and predict the next two months.

A design choice I have not implemented yet is how to do sampling hits with self.index.intersection(roi). Given both the spatial and time dimensions, the number of samples could be huge for a given patch_size. Questions about this and beyond:

  1. Would you first generate all possible time-sequence hits in a rolling-window fashion and then draw random geo-location samples from each time sequence, or the other way around? Are there recommended approaches with rtree?
  2. Either way, it would probably be useful to specify a max_number_of_samples parameter that gives some control over the number of samples, like darts' num_samples_per_ts parameter.
  3. Assumptions about the available time steps of data: do we handle gaps and/or too many available time steps that could potentially be gathered by a bbox sample?
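One possible shape for the first question, sketched under heavy assumptions: rolling_window_samples and spatial_sampler are invented names, and a real implementation would draw bounding boxes from the rtree rather than call an arbitrary callable.

```python
import random


def rolling_window_samples(timestamps, window, stride, patches_per_window, spatial_sampler, seed=0):
    """Enumerate rolling time windows first, then draw random spatial patches per window.

    spatial_sampler: callable taking an RNG and returning one spatial patch
        (stand-in for a RandomGeoSampler-like component).
    """
    rng = random.Random(seed)
    ordered = sorted(set(timestamps))
    samples = []
    for start in range(0, len(ordered) - window + 1, stride):
        time_slice = ordered[start:start + window]
        for _ in range(patches_per_window):
            # each sample pairs one time window with one random spatial patch
            samples.append((time_slice, spatial_sampler(rng)))
    return samples
```

A max_number_of_samples cap, as asked in question 2, would simply truncate or subsample the returned list.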

I would appreciate feedback on this and also want to mention the following questions for later on:

  1. Dataset splitting into train, val, and test. I think it would be amazing if we could provide some utilities for that as well, because it is a pain to deal with once the time dimension is added
  2. ForecastingTask Class with basic models like Conv-LSTM, UNET-LSTM, some TimeSpatialTransformer

nilsleh avatar Jan 23 '23 20:01 nilsleh

Maybe it would be useful to settle on some terminology that can be used for models, datasets, samplers, and tasks in the context of time-series prediction. These are just a few terms encountered so far.

  • input sequence length: input_chunk_length in darts, encoder_length in pytorch-forecasting, a form of sequence_length in gluon-ts; 'window' is also sometimes encountered
  • prediction sequence length: output_chunk_length in darts, prediction_length in pytorch-forecasting, prediction_length in gluon-ts
  • what to call the keys in the data sample returned by the dataloader:
    • input for the input sequence images, and target for the target sequence images?
    • prepend the bbox and crs data with the above respectively?

nilsleh avatar Jan 26 '23 14:01 nilsleh

I am not an expert with rtree but I think there might be a couple of things that could be changed.

  1. At the moment, the filename_regex is only supposed to pick up a single band when separate_files=True, otherwise the length of the dataset is confusing. However, we now need some control over extracting the date from every individual filename, but there won't be a match if the regex does not cover all band_names in __getitem__
  2. I don't quite understand how the for hit in self.index.intersection(tuple(self.roi), objects=True): line in the samplers works, because the number of samples is also dependent on the number of items in the index. And I think there is no way to make self.index.intersection return only sequences of a given time duration, so extra steps are required

nilsleh avatar Jan 26 '23 14:01 nilsleh

@nilsleh Please may I know how to use these changes for a time series segmentation task?

I have a prechipped raster dataset with observations throughout the year, and a vector dataset for the segmentation labels. I'm trying to use the new RasterDataset logic alongside the SequentialDataSampler but I keep getting ValueError: 'a' cannot be empty unless no samples are taken.

Do you have a brief worked example to share?

Thanks

Spiruel avatar Sep 04 '23 12:09 Spiruel

Hi @Spiruel, this PR is still work in progress, not yet rigorously tested, and has some unanswered design choices, so it is very likely to change, maybe even substantially. However, it's exciting to hear that you are looking to use it for your task, and I will try to help out. I would also be grateful for feedback, because I have only tested it on a personal project (that I cannot share), while there are of course many more use cases out there.

From your error message, I am guessing that it occurs in this line, but to make sure, could you post a longer error trace? The error would mean that during the __init__ call of the sampler - around here - no regions and/or timestamps were found that could serve as samples. This could have multiple reasons, but that would be my starting point.

In my setup the dataloading and sampling looks like in the comment above.

Maybe you can check that you can retrieve samples from your RasterDataset subclass without the SequentialGeoSampler, and also from the ForecastDataset? Common issues there are the filename_glob and the regex expression needed for the date. I hope this gives some pointers; let me know.

nilsleh avatar Sep 04 '23 12:09 nilsleh

Thanks - I'd be happy to offer feedback on this as I test it out on my use case and get my head around it. I've got a number of crop classification use cases that I should be able to try out once I get this working.

Yes, my error does occur on this line and I'll attempt to debug this further. There should be plenty of regions and timestamps to serve as samples.

I followed your setup above and adapted it to my time-series segmentation task, based on my understanding. I am able to retrieve samples from my input_ds as normal using samplers like RandomGeoSampler, and I get the correct spatial and temporal information (I checked the bbox, mint, maxt attributes etc. and all looks good). However, when I try to use SequentialGeoSampler to get time series, I get the error mentioned above.

# prechipped harmonised Landsat-Sentinel rasters across a calendar year
class TimeSeriesInputRaster(RasterDataset):
    filename_glob = "*.tif"
    filename_regex = r"^\d+_\d+_(?P<date>\d{8})T?.tif$"
    date_format = "%Y%m%d"
    is_image = True
    separate_files = False
    all_bands = ["B02", "B03", "B04"]

#yearly cover crop masks for oct-apr
class TimeSeriesTargetRaster(RasterDataset):
    filename_glob = "*_cc2class.tif"
    filename_regex = r"""
        ^(?P<date>\d+)
        _cc2class.*$
    """
    date_format = "%Y"
    is_image = True
    separate_files = False
    all_bands = ["target"]

input_root = "S30/4"
label_root = "data"

input_ds = TimeSeriesInputRaster(input_root, as_time_series=True)
label_ds = TimeSeriesTargetRaster(label_root, as_time_series=True)

print(len(input_ds))

#don't need forecast dataset in this example
ds = ForecastDataset(
    input_dataset=input_ds,
    target_dataset=label_ds
)

sampler = SequentialGeoSampler(
    input_ds,
    encoder_length=3,
    prediction_length=1,
    time_unit="weeks",
    size=32,
    length=10
)

sample = ds[next(iter(sampler))]
input = sample["image"] # [sample_window, num_bands, height, width]
target = sample["target"] # [target_window, num_targets, height, width]

#expect input features of size [3, 14, 32, 32] and corresponding label of size [1, 2, 32, 32]

Let me know if this discussion is best had on this thread, or whether I should move it elsewhere! Thanks

Edit: Continued in #1544

Spiruel avatar Sep 04 '23 18:09 Spiruel