torchgeo
torchgeo copied to clipboard
Consistency of checking dataset integrity/verify
Checking whether the data has been downloaded or can be found in root
directory is handled in two ways in our dataset implementations:
Option 1 roughly:
def __init__(self):
if download:
self._download(api_key)
if not self._check_integrity():
raise RuntimeError(
"Dataset not found or corrupted. "
+ "You can use download=True to download it"
)
def _check_integrity(self) -> bool:
"""Check integrity of dataset.
Returns:
True if dataset files are found and/or MD5s match, else False
"""
def _download(self, api_key: Optional[str] = None) -> None:
"""Download the dataset and extract it."""
Option 2 roughly:
def __init__(self):
self._verify()
def _verify(self) -> None:
"""Verify the integrity of the dataset.
Raises:
RuntimeError: if ``download=False`` but dataset is missing or checksum fails
"""
def _download(self) -> None:
"""Download the dataset."""
def _extract(self) -> None:
"""Extract the dataset."""
Should we settle on one or the other? Personally I think the _verify()
option is a bit cleaner.
This partially relates to #47 and #99. The latter is the new style I would like to use moving forward. Would love it if someone would convert all of the old style datasets to the new style but haven't had time. If we ever get around to using torchdata we can kill two birds with one stone.
I can help with changing to the new style.
Should prob focus on torchdata first, otherwise you'll have to redo all of that work.