NLCD - Dataset not found in `paths='data'`
Description
With download=True, a 42-byte tif is downloaded, but the following error is raised: DatasetNotFoundError: Dataset not found in `paths='data'` and cannot be automatically downloaded, either specify a different `paths` or manually download the dataset.
Steps to reproduce
from torchgeo.datasets import NLCD

dataset = NLCD(
    years=[2023],
    download=True,
    checksum=False,
)
Version
0.7.1
I took a look at this, and it appears that the download returns an invalid file that can't be opened with rasterio. Needs further investigation.
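A quick way to confirm that a downloaded file like the 42-byte one is not a real GeoTIFF, without involving rasterio, is to look at its first four bytes. This is only a rough sketch: `looks_like_tiff` is a hypothetical helper, not part of TorchGeo, and it checks the TIFF/BigTIFF signature only, not whether the raster is otherwise readable.

```python
from __future__ import annotations

from pathlib import Path

# Signatures for classic TIFF and BigTIFF, little- and big-endian.
TIFF_MAGIC = (b"II*\x00", b"MM\x00*", b"II+\x00", b"MM\x00+")


def looks_like_tiff(path: str | Path) -> bool:
    """Return True if the file starts with a TIFF or BigTIFF signature.

    Hypothetical helper for debugging; a True result does not guarantee
    that rasterio can open the file.
    """
    with open(path, "rb") as f:
        return f.read(4) in TIFF_MAGIC
```

A server error page or empty placeholder saved under a .tif name fails this check immediately, which would be consistent with what rasterio reports.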
edit: It seems that all the NLCD download links are broken and return a 403 Forbidden status, except for 2023, which just returns a corrupt file.
edit2: It seems the files are now hosted here, as zip files: https://www.mrlc.gov/downloads/sciweb1/shared/mrlc/data-bundles/Annual_NLCD_LndCov_2024_CU_C1V1.zip. We could update the dataset to download from this URL structure instead and unzip.
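For illustration, the download-and-unzip approach could look roughly like the sketch below, using only the standard library. The names `bundle_url`, `extract_bundle`, and `download_and_extract` are hypothetical, and the filename template is an assumption: only the 2024 link above is known, so other years may use a different suffix. The actual fix would presumably go through TorchGeo's existing download utilities instead.

```python
from __future__ import annotations

import zipfile
from pathlib import Path
from urllib.request import urlretrieve

BASE_URL = "https://www.mrlc.gov/downloads/sciweb1/shared/mrlc/data-bundles/"
# Assumption: every year follows the naming of the 2024 link quoted above.
FILENAME = "Annual_NLCD_LndCov_{year}_CU_C1V1.zip"


def bundle_url(year: int) -> str:
    """Build the (assumed) zip URL for a given year."""
    return BASE_URL + FILENAME.format(year=year)


def extract_bundle(zip_path: str | Path, root: str | Path) -> list[str]:
    """Extract a zip archive into root and return the member names."""
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(root)
        return zf.namelist()


def download_and_extract(year: int, root: str | Path) -> list[str]:
    """Download the zip for one year (if not already present) and unzip it."""
    root = Path(root)
    root.mkdir(parents=True, exist_ok=True)
    zip_path = root / FILENAME.format(year=year)
    if not zip_path.exists():
        urlretrieve(bundle_url(year), zip_path)
    return extract_bundle(zip_path, root)
```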
Hello, I'm currently working on this bugfix. As the downloaded file type changed to zip, I am going to use Chesapeake as a template for processing the zip file.
One question: is there a reason why the zip files are not deleted after extraction? Do you think it would be worth deleting the zip file after extraction by default, with a parameter to keep it if necessary?
@adamjstewart can confirm, but I assume we want to keep the current behaviour
We rely on torchvision.datasets.utils for most of our download/extract utilities. Many of these functions do have a remove_finished flag we could use for this, but I don't think any datasets in TorchGeo or torchvision use it. I don't really care either way, but would like to remain consistent. If people want to manually delete the zipfile themselves, we should not redownload it, just use the extracted version.
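To make the lookup order described above concrete, here is a minimal standard-library sketch. It is not TorchGeo's actual verification code: `ensure_extracted` and its `download` callback are hypothetical names, and `remove_finished` here only mimics the idea of the torchvision flag. The point is that the extracted file is checked first, so a user who deletes the zip manually never triggers a new download.

```python
from __future__ import annotations

import zipfile
from pathlib import Path
from typing import Callable


def ensure_extracted(
    root: str | Path,
    tif_name: str,
    zip_name: str,
    download: Callable[[Path], None] | None = None,
    remove_finished: bool = False,
) -> Path:
    """Return the path to the extracted file, downloading only if needed."""
    root = Path(root)
    tif_path = root / tif_name
    zip_path = root / zip_name

    # 1. Extracted file already present: use it, even if the zip is gone.
    if tif_path.exists():
        return tif_path

    # 2. No archive on disk: download it if a downloader was provided.
    if not zip_path.exists():
        if download is None:
            raise FileNotFoundError(f"{tif_name} not found in {root}")
        download(zip_path)

    # 3. Extract the archive, optionally deleting it afterwards.
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(root)
    if remove_finished:
        zip_path.unlink()

    if not tif_path.exists():
        raise FileNotFoundError(f"{tif_name} missing from {zip_name}")
    return tif_path
```

Defaulting `remove_finished` to False keeps the current behaviour of leaving the zip file in place.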