Dataset downloading expected behavior pt. 2
When I instantiate a dataset with `download=False` and `checksum=False`, I expect it to assume everything is in place; however, our current setup usually checks that the archive file exists. If the archive file is 100+ GB, it is totally reasonable for users to delete it but keep the extracted data.
I think `_check_integrity` should be something like this:
```python
def _check_integrity(self) -> bool:
    """Check integrity of dataset.

    Returns:
        True if dataset MD5s match, else False
    """
    return check_integrity(
        os.path.join(self.root, self.filename),
        self.md5,
    )
```
and we would only call it if `self.checksum` is `True`.
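A rough sketch of that calling convention, assuming the usual `root`, `filename`, `md5`, and `checksum` attributes (the error message is just a placeholder):

```python
# Called from the dataset's __init__; skip the (potentially slow) MD5
# check entirely unless the user asked for it.
if self.checksum and not self._check_integrity():
    raise RuntimeError("Archive MD5 does not match the expected checksum")
```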
Datasets that still use `_check_integrity` (and presumably follow the old convention):
- [ ] Advance
- [x] Benin Cashews
- [ ] CBF
- [ ] COWC
- [x] CV4A Kenya
- [x] Cyclone
- [ ] ETCI 2021
- [ ] EuroCrops
- [ ] Eurosat
- [ ] GID15
- [ ] LEVIRCD
- [ ] Loveda
- [ ] SEN12MS
- [ ] So2Sat
- [x] SpaceNet
- [ ] UCMerced
- [ ] VHR-10
For a good example to copy, I think CDL was the first dataset I updated to use the new download style.
Our current approach is similar to what torchvision does (although we added `checksum=False` because our files are just too big). I personally think there are a lot of downsides to the way torchvision does things. If I were going to design things from scratch, I would like the behavior to be something like:
- Check to see if the extracted files exist (this may mean checking for a STAC index file or looking for a single image, since we can't check for the existence of every file)
- If the extracted files don't exist, check for the downloaded tarball/zipfile and extract/decompress it if it exists
- If neither exists, download it
This has the benefit that if you download the tarball but don't extract it, torchgeo won't re-download it.
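A minimal sketch of that flow, assuming illustrative attribute names like `self.image_dir`, `self.filename`, `self.url`, `self.md5`, `self.download`, and `self.checksum` (none of these are the actual torchgeo API):

```python
import os

from torchvision.datasets.utils import (
    check_integrity,
    download_and_extract_archive,
    extract_archive,
)


def _verify(self) -> None:
    """Check for extracted data, then the archive, and only then download."""
    # 1. If the extracted files already exist (e.g. a STAC index file or a
    #    single known image), assume the dataset is ready to use.
    if os.path.exists(os.path.join(self.root, self.image_dir)):
        return

    # 2. If the archive exists, optionally verify it, then extract it.
    archive = os.path.join(self.root, self.filename)
    if os.path.exists(archive):
        if self.checksum and not check_integrity(archive, self.md5):
            raise RuntimeError("Archive found but its MD5 does not match")
        extract_archive(archive, self.root)
        return

    # 3. Neither exists: download and extract the archive.
    if not self.download:
        raise RuntimeError(
            "Dataset not found in `root`; either download it manually "
            "or pass download=True"
        )
    download_and_extract_archive(
        self.url, self.root, md5=self.md5 if self.checksum else None
    )
```

Note that each failure mode gets its own error message instead of a single "not found or corrupted".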
@estherrolf (one of our first users!) just hit this problem with a manually downloaded copy of the ChesapeakeCVPR dataset. I think it is worth making this change, as the problem will keep coming up, especially with the larger datasets.
I also don't like the "dataset not found or corrupted" error message. Like, which is it??
I think we need a list here of all datasets that do not follow these conventions / a good example dataset to copy.
EDIT: added this list to the initial comment so GitHub can track progress.