
Download state data on demand in the ChesapeakeCVPR dataset

Open calebrob6 opened this issue 2 years ago • 4 comments

Currently, downloading the entire 140GB dataset zip file often fails partway through because of HTTP errors. Potentially better solutions would be:

  • Split up the zip file and have the Dataset object download only the files it needs (e.g. a user who only wants to run Delaware experiments would need to download only ~10-20GB)
  • Use an Azure library to download the files directly (148GB total)
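
A minimal sketch of what the first option could look like, assuming one zip archive per state. The base URL, the `<state>.zip` naming, and the helper names below are hypothetical and are not torchgeo's actual API; the point is only that the check and the download are restricted to the states the user asked for.

```python
from pathlib import Path
from typing import Callable, Iterable, List
from urllib.request import urlretrieve

# Hypothetical location and naming of the per-state archives (assumption).
BASE_URL = "https://example.com/chesapeake-cvpr"


def state_archive_name(state: str) -> str:
    """Return the assumed archive file name for a state, e.g. 'de' -> 'de.zip'."""
    return f"{state}.zip"


def missing_states(root: Path, states: Iterable[str]) -> List[str]:
    """List the requested states whose archive is not yet present under root."""
    return [s for s in states if not (root / state_archive_name(s)).exists()]


def default_fetch(url: str, dest: Path) -> None:
    """Download a single archive with the standard library."""
    urlretrieve(url, str(dest))


def download_states(
    root: Path,
    states: Iterable[str],
    fetch: Callable[[str, Path], None] = default_fetch,
) -> List[str]:
    """Download only the missing archives for the requested states.

    Returns the list of states that were actually fetched.
    """
    root.mkdir(parents=True, exist_ok=True)
    todo = missing_states(root, states)
    for state in todo:
        name = state_archive_name(state)
        fetch(f"{BASE_URL}/{name}", root / name)
    return todo
```

With something like this, `download_states(Path("data"), ["de"])` would fetch only the Delaware archive and skip it on later calls once the file exists. Passing `fetch` as a parameter also leaves room to swap in azcopy or another downloader.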

Regardless, we should link to the azcopy download instructions (bottom of https://lila.science/datasets/chesapeakelandcover) in the docs so that users can easily try other download options if the download=True method isn't working.

@hannah-rae for visibility!

calebrob6 avatar Mar 03 '22 16:03 calebrob6

I would advocate for the first solution, since I have run into this issue as well: a remote GPU machine did not have enough space for all 140GB, but the current implementation always checks that the files for every state are present. I can open a PR to change the code.

nilsleh avatar Mar 04 '22 11:03 nilsleh

Okay, I'll see about creating zip files for each state.

calebrob6 avatar Mar 04 '22 17:03 calebrob6

Yes, this is also how the non-CVPR version of the dataset is distributed.

adamjstewart avatar Mar 04 '22 19:03 adamjstewart

Are this and #484 still WIP, or should we close them?

adamjstewart avatar Sep 29 '23 20:09 adamjstewart