Dataset downloading expected behavior pt. 2
When I instantiate a dataset with `download=False` and `checksum=False`, I expect it to assume everything is in place; however, our current setup usually checks that the archive file exists. If the archive file is 100+ GB, it is totally reasonable for users to delete it but keep the extracted data.
I think `_check_integrity` should be something like this:
```python
def _check_integrity(self) -> bool:
    """Check integrity of dataset.

    Returns:
        True if dataset MD5s match, else False
    """
    return check_integrity(
        os.path.join(self.root, self.filename),
        self.md5,
    )
```
and we would only call it if `self.checksum` is `True`.
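A rough sketch of that calling convention, assuming the usual `root`, `filename`, `md5`, and `checksum` attributes (the error message is just a placeholder):

```python
# Called from the dataset's __init__; skip the (potentially slow) MD5
# check entirely unless the user asked for it.
if self.checksum and not self._check_integrity():
    raise RuntimeError("Archive MD5 does not match the expected checksum")
```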
Datasets that still use `_check_integrity` (and presumably follow the old convention):
- [ ] Advance
- [x] Benin Cashews
- [ ] CBF
- [ ] COWC
- [x] CV4A Kenya
- [x] Cyclone
- [ ] ETCI 2021
- [ ] EuroCrops
- [ ] Eurosat
- [ ] GID15
- [ ] LEVIRCD
- [ ] Loveda
- [ ] SEN12MS
- [ ] So2Sat
- [x] SpaceNet
- [ ] UCMerced
- [ ] VHR-10
For a good example to copy, I think CDL was the first dataset I updated to use the new download style.
Our current approach is similar to what torchvision does (although we added `checksum=False` because our files are just too big). I personally think there are a lot of downsides to the way torchvision does things. If I were going to design things from scratch, I would like the behavior to be something like:
- Check to see if the extracted files exist (this may mean checking for a STAC index file or looking for a single image, since we can't check for the existence of every file)
- If the extracted files don't exist, check for the downloaded tarball/zipfile and extract/decompress it if it exists
- If neither exists, download it
This has the benefit that if you download the tarball but don't extract it, torchgeo won't re-download it.
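A minimal sketch of that flow, assuming illustrative attribute names like `self.image_dir`, `self.filename`, `self.url`, `self.md5`, `self.download`, and `self.checksum` (none of these are the actual torchgeo API):

```python
import os

from torchvision.datasets.utils import (
    check_integrity,
    download_and_extract_archive,
    extract_archive,
)


def _verify(self) -> None:
    """Check for extracted data, then the archive, and only then download."""
    # 1. If the extracted files already exist (e.g. a STAC index file or a
    #    single known image), assume the dataset is ready to use.
    if os.path.exists(os.path.join(self.root, self.image_dir)):
        return

    # 2. If the archive exists, optionally verify it, then extract it.
    archive = os.path.join(self.root, self.filename)
    if os.path.exists(archive):
        if self.checksum and not check_integrity(archive, self.md5):
            raise RuntimeError("Archive found but its MD5 does not match")
        extract_archive(archive, self.root)
        return

    # 3. Neither exists: download and extract the archive.
    if not self.download:
        raise RuntimeError(
            "Dataset not found in `root`; either download it manually "
            "or pass download=True"
        )
    download_and_extract_archive(
        self.url, self.root, md5=self.md5 if self.checksum else None
    )
```

Note that each failure mode gets its own error message instead of a single "not found or corrupted".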
@estherrolf (one of our first users!) just hit this problem with a manually downloaded copy of the ChesapeakeCVPR dataset. I think it is worth making this change, as the problem will keep coming up, especially with the larger datasets.
I also don't like the "dataset not found or corrupted" error message. Like, which is it??
I think we need a list here of all datasets that do not follow these conventions / a good example dataset to copy.
EDIT: added this list to the initial comment so GitHub can track progress.