torchgeo Add augmentation to USAVars Dataset from paper code base

This PR adds the resize augmentation from the paper's code base, found here: https://github.com/Global-Policy-Lab/mosaiks-paper/blob/master/code/analysis/1_feature_extraction/2_featurize_models_deep_pretrained.py

Jun 20 '23 15:06 nilsleh

Poking around the code, I also see:

Not sure which of these are actually run and which just exist in the code.

@calebrob6 why did we call this dataset USAVars instead of MOSAIKS?

Jun 20 '23 16:06 adamjstewart

MOSAIKS is the name of a method (Multi-task Observation using Satellite Imagery & Kitchen Sinks (MOSAIKS)) that can be applied generally. USAVars is a better name for a dataset.

Jun 20 '23 23:06 calebrob6

Poking around the code, I also see:

I want to use this dataset for a project and am trying to reproduce their reported results with a Lightning setup instead of their big custom code base. I will report which augmentations are needed to reproduce their scores.

Jun 21 '23 08:06 nilsleh

Computed Image statistics on torchgeo train dataset split:

min: array([0., 0., 0., 0.], dtype=float32)
max: array([1., 1., 1., 1.], dtype=float32)
mean: array([0.4101762, 0.4342503, 0.3484594, 0.5473533], dtype=float32)
std: array([0.17361328, 0.14048962, 0.12148701, 0.16887303], dtype=float32)

Quite different from the ImageNet stats they use: mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]
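For reference, a minimal sketch of how per-channel statistics like these can be computed and applied with plain NumPy. The helper names `channel_stats` and `normalize` are hypothetical (not torchgeo API), the batch is synthetic stand-in data, and a real run over the full train split would likely need to accumulate sums in a streaming fashion rather than hold everything in memory.

```python
import numpy as np


def channel_stats(images):
    """Per-channel mean and std over a stack of images shaped (N, C, H, W)."""
    images = np.asarray(images, dtype=np.float64)
    mean = images.mean(axis=(0, 2, 3))
    std = images.std(axis=(0, 2, 3))
    return mean, std


def normalize(image, mean, std):
    """Normalize one (C, H, W) image with per-channel mean and std."""
    mean = np.asarray(mean, dtype=np.float32)[:, None, None]
    std = np.asarray(std, dtype=np.float32)[:, None, None]
    return (np.asarray(image, dtype=np.float32) - mean) / std


# Values reported above for the torchgeo train split (4 bands).
USAVARS_MEAN = [0.4101762, 0.4342503, 0.3484594, 0.5473533]
USAVARS_STD = [0.17361328, 0.14048962, 0.12148701, 0.16887303]

# Synthetic stand-in for a batch of 4-band images scaled to [0, 1].
rng = np.random.default_rng(0)
batch = rng.random((8, 4, 16, 16)).astype(np.float32)

mean, std = channel_stats(batch)
normalized = normalize(batch[0], USAVARS_MEAN, USAVARS_STD)
```

Note that the ImageNet statistics only cover three channels, so they cannot be applied to all four bands as-is.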

Jun 21 '23 09:06 nilsleh

More normalization

Whitening

I think those normalizations are unique to the MOSAIKS model they use. But these are the augmentations for the CNN-based approach.

Jun 21 '23 11:06 nilsleh

In that case should we add RandomHorizontalFlip and ImageNet normalization?

Jun 21 '23 15:06 adamjstewart

yeah, I want to try and reproduce results first and will update the PR here then.

Jun 22 '23 09:06 nilsleh

@calebrob6 do the train/val/test splits that come with the torchgeo version of the dataset correspond to any of the checkerboard-style splits seen in Figure 3 of the MOSAIKS paper, or are these random splits?

Additionally, target variable normalization is also relevant for regression tasks. This is done here in their code. Should we add this target variable normalization as well, or at least document the mean/std values somewhere so people don't have to compute these values themselves?
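To make the question concrete, here is a rough sketch of what target variable normalization could look like. `TargetScaler` is a hypothetical helper (not part of torchgeo or the paper's code), and the target values are made up for illustration. The statistics are fit on the training split only and then reused for validation and test.

```python
import numpy as np


class TargetScaler:
    """Standardize regression targets with statistics from the training split."""

    def fit(self, y_train):
        y_train = np.asarray(y_train, dtype=np.float64)
        self.mean_ = y_train.mean(axis=0)
        self.std_ = y_train.std(axis=0)
        return self

    def transform(self, y):
        # Map targets to zero mean and unit variance (w.r.t. the train split).
        return (np.asarray(y, dtype=np.float64) - self.mean_) / self.std_

    def inverse_transform(self, z):
        # Map model predictions back to the original target units.
        return np.asarray(z, dtype=np.float64) * self.std_ + self.mean_


# Made-up target values, for illustration only.
y_train = np.array([10.0, 20.0, 30.0, 40.0])
scaler = TargetScaler().fit(y_train)
z = scaler.transform(y_train)
y_back = scaler.inverse_transform(z)
```

Documenting the fitted mean/std per target variable would let users skip the `fit` step entirely.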

Jun 23 '23 07:06 nilsleh

I'm pretty sure they are random splits.

Also, it looks like the download isn't working (the storage account permissions were automatically switched from anonymous access to private), so I need to move this to huggingface.

Also, this isn't an exact replication of their dataset, as they used Google Earth imagery (I think) while this is NAIP imagery.

Jun 23 '23 15:06 calebrob6

With a ResNet-18 baseline I get a 0.95 R-squared score for treecover (paper: 0.91) when doing proper normalization. Since we cannot replicate their results directly anyway, as Caleb pointed out, I would suggest just using the normalization statistics computed on this dataset, and I think adding support for target value normalization would be good as well.

Jun 29 '23 12:06 nilsleh

Hi, just saw this. Chiming in on a few things and please let me know if I can be helpful with anything else @nilsleh!

Yes we do target variable normalization as is standard for regression. Note also that some of the target variables are transformed as y_transformed = log(1+y) (and performance is then reported with respect to the logged variables).
As Caleb pointed out, the USAVars data here is based on NAIP imagery whereas the analysis in our paper is based on Google imagery, so unfortunately don't expect the results to match up exactly with the numbers in the paper.
In light of ^, if choosing to resize the imagery during preprocessing (or not), there is likely going to be a different optimal patch size for the NAIP imagery than for the imagery we use in the paper.
It's possible that a different preprocessing of the images would be helpful for the CNN baseline or for MOSAIKS, especially in light of these results: https://arxiv.org/abs/2305.13456. At the time of doing the experiments, we did what made the most sense for a solid and reasonable baseline: ZCA whitening for RCF (implemented here, following the explanation in footnote 14 here) and standard augmentation strategies for the ResNet-18 model, as you've noted above.
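As a rough illustration of two of the points above, the sketch below shows the y_transformed = log(1+y) target transform and a textbook ZCA whitening of flattened image patches. This is not the implementation from the paper's code base: the function names, the `eps` regularizer, and the synthetic patches are all assumptions made for the example.

```python
import numpy as np


def log1p_targets(y):
    """Apply y_transformed = log(1 + y) to non-negative targets."""
    return np.log1p(np.asarray(y, dtype=np.float64))


def zca_whiten(patches, eps=1e-5):
    """Textbook ZCA whitening; eps is an assumed regularizer, not a paper value."""
    X = np.asarray(patches, dtype=np.float64)
    X = X.reshape(X.shape[0], -1)            # flatten each patch to a vector
    mean = X.mean(axis=0)
    Xc = X - mean                            # center every feature
    cov = Xc.T @ Xc / Xc.shape[0]            # feature covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)   # cov is symmetric, so eigh applies
    eigvals = np.clip(eigvals, 0.0, None)    # guard against tiny negative values
    W = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + eps)) @ eigvecs.T
    return Xc @ W, mean, W


# Synthetic stand-in for 500 small 2-band patches of size 3x3.
rng = np.random.default_rng(0)
patches = rng.random((500, 2, 3, 3))
whitened, patch_mean, W = zca_whiten(patches)

# After whitening, the feature covariance should be close to the identity.
cov_w = whitened.T @ whitened / whitened.shape[0]
```

Since performance is reported with respect to the logged variables, any comparison against the paper should evaluate on `log1p_targets(y)` rather than on the raw targets.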

Jul 11 '23 15:07 estherrolf

@nilsleh should we try to sneak this into v0.5.2?

Feb 29 '24 12:02 adamjstewart
