torchgeo
Download state data on demand in the ChesapeakeCVPR dataset
Currently, downloading the entire 140GB dataset zip file in one request often fails partway through with HTTP errors. Potentially better solutions would be:
- Split up the zip file and have the Dataset object only download the files it needs (i.e. if a user only wants to run Delaware experiments, they will only need to download ~10-20GB)
- Use some azure library to directly download the files (148GB total)
Regardless, we should link to the azcopy download instructions (bottom of https://lila.science/datasets/chesapeakelandcover) in the docs so that users can easily try other download options if the download=True method isn't working.
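As a rough illustration of the first option, here is a minimal sketch of how a dataset could work out which per-state archives still need downloading. Everything specific in it is an assumption for illustration only: the base URL, the `<state>.zip` naming, the list of state codes, and the helper name `archives_to_download` are hypothetical and not part of torchgeo or the actual hosting layout.

```python
from pathlib import Path

# Hypothetical values: the real host, archive names, and state codes may differ.
BASE_URL = "https://example.com/chesapeake"
STATES = ("de", "md", "ny", "pa", "va", "wv")


def archives_to_download(root, states):
    """Return (url, destination) pairs for requested states not yet on disk."""
    unknown = sorted(set(states) - set(STATES))
    if unknown:
        raise ValueError(f"Unknown states: {unknown}")

    pending = []
    for state in states:
        filename = f"{state}.zip"  # assumed one archive per state
        dest = Path(root) / filename
        if not dest.exists():
            # Only archives that are missing locally are queued for download.
            pending.append((f"{BASE_URL}/{filename}", dest))
    return pending
```

With a layout like this, a user who only requests Delaware would queue a single archive instead of the full zip file.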
@hannah-rae for visibility!
I would advocate for the first solution, since I have run into this issue as well: a remote machine with a GPU does not have enough space for all 140GB, but the current implementation always checks that the files for every state are present. I can open a PR to change the code.
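To make the proposed change concrete, here is a small sketch of a presence check that only looks at the states a user asked for. It assumes a hypothetical on-disk layout with one directory per state under the dataset root, and the function name `missing_states` is made up for illustration; the actual torchgeo check may be organized differently.

```python
from pathlib import Path


def missing_states(root, states):
    """Return the requested states whose data directory is absent under root.

    Assumes a hypothetical layout of one directory per state, e.g. root/de.
    """
    return [state for state in states if not (Path(root) / state).is_dir()]
```

A dataset could then raise an error, or trigger a download, only when this list is non-empty, rather than requiring every state to be present.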
Okay, I'll look into creating a zip file for each state.
Yes, this is also how the non-CVPR version of the dataset is distributed.
Is this and #484 still a WIP or should we close them?