
Dataset downloading expected behavior pt. 2

Open calebrob6 opened this issue 4 years ago • 5 comments

When I instantiate a dataset with download=False and checksum=False, I expect it to assume everything is in place. However, our current setup usually checks that the archive file exists. If the archive file is 100+ GB, it is totally reasonable for users to delete it but keep the downloaded data.

I think _check_integrity should be something like this:

    def _check_integrity(self) -> bool:
        """Check integrity of dataset.

        Returns:
            True if dataset MD5s match, else False
        """
        return check_integrity(
            os.path.join(self.root, self.filename),
            self.md5,
        )

and we should only call it if self.checksum is True.
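To make the proposal concrete, here is a minimal sketch of gating the check on the checksum flag. `ExampleDataset` and `file_md5` are hypothetical names used only for illustration (they are not torchgeo classes or helpers), and `file_md5` stands in for whatever `check_integrity` does internally.

```python
import hashlib
import os
from typing import Optional


def file_md5(path: str) -> str:
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


class ExampleDataset:
    """Hypothetical dataset used only to illustrate the proposed behavior."""

    filename = "archive.zip"  # assumed archive name

    def __init__(
        self, root: str, md5: Optional[str] = None, checksum: bool = False
    ) -> None:
        self.root = root
        self.md5 = md5
        self.checksum = checksum
        # Only look at the archive when the user explicitly asked for it.
        if self.checksum and not self._check_integrity():
            raise RuntimeError(f"Checksum verification failed for {self.filename}")

    def _check_integrity(self) -> bool:
        """Return True if the archive exists and its MD5 matches, else False."""
        path = os.path.join(self.root, self.filename)
        return os.path.isfile(path) and file_md5(path) == self.md5
```

With checksum=False the constructor never looks for the archive, so deleting a 100+ GB file after extraction would not trigger an error in this sketch.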

Datasets that still use _check_integrity (and presumably follow the old convention):

  • [ ] Advance
  • [x] Benin Cashews
  • [ ] CBF
  • [ ] COWC
  • [x] CV4A Kenya
  • [x] Cyclone
  • [ ] ETCI 2021
  • [ ] EuroCrops
  • [ ] Eurosat
  • [ ] GID15
  • [ ] LEVIRCD
  • [ ] Loveda
  • [ ] SEN12MS
  • [ ] So2Sat
  • [x] SpaceNet
  • [ ] UCMerced
  • [ ] VHR-10

For a good example to copy, I think CDL is the first dataset I updated to use the new download style.

calebrob6, Sep 01 '21 20:09

Our current approach is similar to what torchvision does (although we added checksum=False because our files are just too big). I personally think there are a lot of downsides to the way torchvision does things. If I were designing things from scratch, I think I would like the behavior to be something like:

  1. Check to see if the extracted files exist (this may mean checking for a STAC index file or looking for a single image, since we can't check for the existence of every file)
  2. If the extracted files don't exist, check for the downloaded tarball/zipfile and extract/decompress it if it exists
  3. If neither exists, download it

This has the benefit that if you download the tarball but don't extract it, torchgeo won't re-download it.
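The three steps above could be sketched roughly as follows. This is only an illustration under stated assumptions: `verify`, its parameters, and the `fetch` callable are hypothetical names, the archive is assumed to be a zip file, and a single marker file (e.g. an index file) stands in for "the extracted files exist".

```python
import os
import zipfile
from typing import Callable, Optional


def verify(
    root: str,
    marker: str,
    archive: str,
    download: bool = False,
    fetch: Optional[Callable[[str], None]] = None,
) -> str:
    """Locate, extract, or download a dataset; return which step succeeded.

    All names here are hypothetical; `fetch(path)` is assumed to write the
    archive to `path`.
    """
    archive_path = os.path.join(root, archive)

    # 1. Extracted files already exist: nothing else to do.
    if os.path.exists(os.path.join(root, marker)):
        return "found"

    # 2. Archive exists but was never extracted: extract it.
    if os.path.isfile(archive_path):
        with zipfile.ZipFile(archive_path) as zf:
            zf.extractall(root)
        return "extracted"

    # 3. Neither exists: download only if the user asked for it.
    if not download:
        raise RuntimeError(
            f"Dataset not found in {root!r}; pass download=True to fetch it"
        )
    if fetch is None:
        raise ValueError("download=True requires a fetch callable")
    fetch(archive_path)
    with zipfile.ZipFile(archive_path) as zf:
        zf.extractall(root)
    return "downloaded"
```

Because step 1 never looks at the archive, deleting the archive after extraction is harmless here, and the error in step 3 only ever means "not found", never "corrupted".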

adamjstewart, Sep 02 '21 20:09

@estherrolf (one of our first users!) just hit this problem with a manually downloaded version of the ChesapeakeCVPR dataset. I think it is worth making the change as this will happen especially with the larger datasets.

calebrob6, Sep 04 '21 22:09

I also don't like the "dataset not found or corrupted" error message. Like, which is it??

adamjstewart, Sep 07 '21 15:09

I think we need a list here of all datasets that do not follow these conventions / a good example dataset to copy.

calebrob6, Feb 15 '22 18:02

EDIT: added this list to the initial comment so GitHub can track progress.

adamjstewart, Mar 16 '22 19:03