torchgeo icon indicating copy to clipboard operation
torchgeo copied to clipboard

Consistency of checking dataset integrity/verify

Open nilsleh opened this issue 1 year ago • 3 comments

Checking whether the data has been downloaded or can be found in root directory is handled in two ways in our dataset implementations:

Option 1 roughly:

def __init__(self):
    if download:
        self._download(api_key)

    if not self._check_integrity():
        raise RuntimeError(
              "Dataset not found or corrupted. "
              + "You can use download=True to download it"
        )

def _check_integrity(self) -> bool:
        """Check integrity of dataset.

        Returns:
            True if dataset files are found and/or MD5s match, else False
        """

 def _download(self, api_key: Optional[str] = None) -> None:
        """Download the dataset and extract it."""

Option 2 roughly:

def __init__(self):
    self._verify()

def _verify(self) -> None:
        """Verify the integrity of the dataset.

        Raises:
            RuntimeError: if ``download=False`` but dataset is missing or checksum fails
        """
    
def _download(self) -> None:
        """Download the dataset."""

def _extract(self) -> None:
        """Extract the dataset."""

Should we settle on one or the other? Personally I think the _verify() option is a bit cleaner.

nilsleh avatar Apr 18 '23 07:04 nilsleh

This partially relates to #47 and #99. The latter is the new style I would like to use moving forward. Would love it if someone would convert all of the old style datasets to the new style but haven't had time. If we ever get around to using torchdata we can kill two birds with one stone.

adamjstewart avatar Apr 18 '23 15:04 adamjstewart

I can help with changing to the new style.

nilsleh avatar Apr 19 '23 17:04 nilsleh

Should prob focus on torchdata first, otherwise you'll have to redo all of that work.

adamjstewart avatar Apr 19 '23 17:04 adamjstewart