torchgeo icon indicating copy to clipboard operation
torchgeo copied to clipboard

RandomGeoSampler for Pre-chipped Datasets

Open yichiac opened this issue 10 months ago • 6 comments

This PR replaces RandomBatchGeoSampler with RandomGeoSampler for Agrifieldnet, Sentinel2_CDL, Sentinel2_NCCM, Sentinel2_EuroCrops, and Sentinel2_SouthAmericaSoybean datamodules. RandomGeoSampler works better for the pre-chipped datasets because it returns the correct random bounding boxes.

yichiac avatar Apr 02 '24 14:04 yichiac

We should do this for all the datamodules you're using, not just these two.

adamjstewart avatar Apr 05 '24 14:04 adamjstewart

Tests fail because the fake data consists of a single file and 10% of 1 is 0.

adamjstewart avatar Apr 15 '24 10:04 adamjstewart

That's all I want to add to this PR. Thanks for reviewing the code, @adamjstewart!

yichiac avatar Apr 15 '24 15:04 yichiac

This PR adds 1K new files to git. Do we really need that many files?

adamjstewart avatar May 01 '24 14:05 adamjstewart

If we change the split portion from [0.8, 0.1, 0.1] to something like [0.6, 0.2, 0.2], we can reduce 50% of test images. random_bbox_assignment(self.dataset, [0.8, 0.1, 0.1], generator=generator) For the current split portion, we need at least 10 images to get at least 1 val/test image. For [0.6, 0.2, 0.2], we just need 5 images. The split portion requiring the minimum number of images would be [0.3, 0.3, 0.3], which only needs 3 images.

That being said, are 500 images too many to be added to git? We can manually change the split portion later using [0.8, 0.1, 0.1].

yichiac avatar May 01 '24 15:05 yichiac

If we change the split portion from [0.8, 0.1, 0.1] to something like [0.6, 0.2, 0.2], we can reduce 50% of test images.

We shouldn't change the default split just to make testing easier. If you want to make the split ratio an input parameter (like we do for many other data modules), then we could use [0.33, 0.33, 0.33] only in CI and use [0.8, 0.1, 0.1] by default.

we need at least 10 images

I'm fine with at least 10 images, I'm less fine with at least 1000 images.

adamjstewart avatar May 02 '24 11:05 adamjstewart

@yichiac is this still needed?

adamjstewart avatar Aug 06 '24 11:08 adamjstewart

I think it is still better to include this change. If we can reduce the number of test files, it would be great to have RandomGeoSampler for pre-chipped datasets.

yichiac avatar Aug 08 '24 02:08 yichiac

I don't think the sampler is important here, the splitter is what's important.

I just feel like most users will either use raw Sentinel-2 tiles or write their own NonGeoDataset for the pre-chipped data. GeoDatasets don't make as much sense for pre-chipped data, they're really for raw data.

adamjstewart avatar Aug 08 '24 09:08 adamjstewart

@yichiac We should make a decision on this PR before the TorchGeo 0.6.0 release next week. Basically, it boils down to one question. Are these data modules designed for working with:

  1. arbitrary Sentinel-2 tiles and raw mask products, or
  2. the pre-chipped harmonized version of the dataset you created?

If 1, this PR should be closed, as the data modules already work as intended. If 2, we need to document this. We could also consider automatically downloading the harmonized version of the dataset in datamodule.prepare_date and any other hacks you needed for your paper reproducibility.

adamjstewart avatar Aug 21 '24 09:08 adamjstewart

I lean to option 1. This PR could be closed.

yichiac avatar Aug 21 '24 18:08 yichiac