torchgeo Add Digital typhoon dataset

This PR adds the Digital Typhoon Dataset.

The implementation allows the following features:

Create an input sequence of single-channel images concatenated along the channel dimension for the nowcasting task (predicting the label of the last image in that sequence)
Filter samples by min or max feature values
Datamodule that lets you split by storm id (disjoint sets of storm ids) or over the time domain (disjoint sets over the time domain)
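
A minimal sketch of the two split modes described above, using only the standard library. The sample layout (dicts with hypothetical "id" and "t" keys) and both function names are assumptions made for illustration; they are not the datamodule's actual API, and the real time split may be implemented differently.

```python
import random


def split_by_storm_id(samples, train_frac=0.8, seed=0):
    """Assign whole storms to train or test, so no storm id appears in both."""
    ids = sorted({s["id"] for s in samples})
    random.Random(seed).shuffle(ids)  # fixed seed keeps the split reproducible
    n_train = int(len(ids) * train_frac)
    train_ids = set(ids[:n_train])
    train = [s for s in samples if s["id"] in train_ids]
    test = [s for s in samples if s["id"] not in train_ids]
    return train, test


def split_by_time(samples, train_frac=0.8):
    """Within each storm, put the earliest frames in train and the rest in test."""
    by_id = {}
    for s in samples:
        by_id.setdefault(s["id"], []).append(s)
    train, test = [], []
    for frames in by_id.values():
        frames = sorted(frames, key=lambda s: s["t"])
        n_train = int(len(frames) * train_frac)
        train += frames[:n_train]
        test += frames[n_train:]
    return train, test


# Hypothetical toy data: 5 storms with 10 time steps each.
samples = [{"id": f"storm{i}", "t": t} for i in range(5) for t in range(10)]
id_train, id_test = split_by_storm_id(samples)
time_train, time_test = split_by_time(samples)
```

With the storm id split, every storm lands entirely on one side; with the time split, every storm appears on both sides but its time steps do not overlap.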

TODO:

Target Normalization for regression task

Sample Image:

Nov 30 '23 21:11 nilsleh

This is really cool! I wonder if there is any generalization between this and the Cyclone dataset

Dec 02 '23 06:12 calebrob6

This is really cool! I wonder if there is any generalization between this and the Cyclone dataset

Stay tuned:)

Dec 02 '23 07:12 nilsleh

@adamjstewart not sure how I can fix the Read the Docs error. Do I need to add the TypedDict to init?

Dec 18 '23 08:12 nilsleh

This RtD error means that:

It's trying to document the data module class
And add a link to where the return type is documented
But the SampleSequenceDict class itself doesn't appear in the docs

Some options:

1. Make split_dataset a hidden method so it doesn't appear in the docs
2. Use Any instead
3. Add SampleSequenceDict to the docs

I'm leaning towards 1. What are your thoughts?

Dec 21 '23 17:12 adamjstewart

Thanks, yeah option 1 makes sense, because the TypedDict is nicer for understanding the code.

Dec 21 '23 17:12 nilsleh

@nilsleh when you find the time we should finish this up.

Jan 15 '24 13:01 adamjstewart

@adamjstewart I tried addressing your suggestions to finish this up, but wanted to get your thoughts on the following:

There is actually a complication that I am not entirely sure how to handle regarding target normalization.

I think having target normalization out of the box for regression tasks is a nice feature. Handling it yourself is a bit annoying, because you have to collect the targets from the relevant sources inside the dataset or datamodule and overwrite some methods. For example, this would be a nice feature in the Tropical Cyclone dataset.
However, this dataset does not have a predefined train/test split from the authors (unlike Tropical Cyclone), and we only implement a random split in the datamodule so people can use the dataset more easily.
This implies that computing the target statistics over the entire dataset is technically information leakage to the test set.

Feb 08 '24 08:02 nilsleh

Don't think I've ever used target normalization before, but if you have a random train/val/test split, you can either:

1. Use a fixed seed so that it's the same split every time, calculate stats only on train
2. Generate a random split, save it to disk, distribute on HF and combine with the dataset

1 is much easier, 2 is more formal.
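
A rough sketch of option 1, assuming the regression targets are available as a plain list of floats. The helper names and the toy wind speeds are hypothetical and not part of this PR; the point is only that a fixed seed makes the split reproducible, so the mean/std can be computed once on the train indices without touching the test set.

```python
import random
import statistics


def fixed_seed_split(n, train_frac=0.8, seed=0):
    """Return reproducible train/test index lists for n samples."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # same seed -> same split every time
    n_train = int(n * train_frac)
    return indices[:n_train], indices[n_train:]


def train_target_stats(targets, train_idx):
    """Compute mean and (population) std over the train targets only."""
    train_targets = [targets[i] for i in train_idx]
    return statistics.fmean(train_targets), statistics.pstdev(train_targets)


def normalize(value, mean, std):
    """Standardize a single target with precomputed train statistics."""
    return (value - mean) / std


# Made-up wind speeds standing in for the real regression targets.
targets = [float(w) for w in range(0, 100, 5)]
train_idx, test_idx = fixed_seed_split(len(targets))
mean, std = train_target_stats(targets, train_idx)
normalized_test = [normalize(targets[i], mean, std) for i in test_idx]
```

Because the split only depends on the seed, the resulting mean and std could be computed once and stored as constants rather than recomputed during training.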

Feb 08 '24 09:02 adamjstewart

In case 1 the normalization would only be available through the datamodule, and you would have to implement it in on_after_batch_transfer; it would not be available for the dataset class.

Case 2 is not possible I think, because the target range will change depending on the args for min_feature_value and max_feature_value

Feb 08 '24 10:02 nilsleh

In both cases you can manually compute the normalization, then copy-n-paste it and store it in the dataset class. You don't need to compute it on the fly during training. The great thing about data modules is that they don't change too much. If there are parameters that select what features you are using, this is similar to which bands are used in the So2SatDataModule.

Feb 08 '24 10:02 adamjstewart

But in case 2 I actually need to compute it on the fly because the regression target range changes based on the range restriction.

Feb 08 '24 10:02 nilsleh

How is that different from case 1? The only thing that changes is whether the split is recorded on disk or not.

Feb 08 '24 10:02 adamjstewart

It's not different from case 1. Since the target range can change and there are no defined train/test splits at the dataset class level, the target normalization needs to happen in the datamodule. But that would imply that the normalization needs to move to the datamodule instead of the dataset (where it is currently). So I just wanted to inquire about that :)

Feb 08 '24 11:02 nilsleh

Gotcha. Yeah I would definitely move all transforms/data aug from the dataset to the data module to match our other datasets. Lack of a pre-defined train/val/test split shouldn't matter. Allowing the user to specify min/max feature values does matter. Do we need that? Can't we just compute that based on train and store it permanently with no option to override? Is that designed to serve the same purpose as mean/std?

Feb 08 '24 11:02 adamjstewart

These cyclone datasets usually have a very imbalanced target distribution (there are few images of high hurricane categories). The min feature value would let the user select only hurricane-category images and exclude, for example, images that are just clouds with a wind speed of 0. I basically rewrote the Tropical Cyclone dataset locally to have that functionality because it makes running experiments a lot easier, so I thought I would put that option in this dataset as well.

Feb 08 '24 12:02 nilsleh

Up to you I guess. You can use no normalization by default and allow the user to subclass and override the normalization values. Then the user is responsible for calculating mean/std themselves based on split/min/max.

Or you can just subtract min and divide by (max - min). Is there any reason why this wouldn't be a good idea?
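
A small sketch of that min-max idea, assuming the min and max are taken from the train targets after any min/max feature filtering. The function names and wind speeds are made up for illustration.

```python
def fit_min_max(train_targets):
    """Return (lo, hi) computed from the train targets only."""
    lo, hi = min(train_targets), max(train_targets)
    if hi == lo:
        raise ValueError("targets are constant; min-max scaling is undefined")
    return lo, hi


def scale(value, lo, hi):
    """Map [lo, hi] to [0, 1]."""
    return (value - lo) / (hi - lo)


def unscale(value, lo, hi):
    """Invert scale() to get predictions back in the original units."""
    return value * (hi - lo) + lo


# Made-up wind speeds left after filtering by a minimum feature value.
train_targets = [35.0, 50.0, 64.0, 90.0, 115.0]
lo, hi = fit_min_max(train_targets)
scaled = [scale(v, lo, hi) for v in train_targets]
out_of_range = scale(130.0, lo, hi)  # a test value above the train max
```

One caveat with this approach: a validation or test target outside the train range maps outside [0, 1], and a single extreme train value stretches the scale for everything else.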

Feb 08 '24 13:02 adamjstewart
