Datasets: add azcopy download support

Open adamjstewart opened this issue 1 year ago • 0 comments

This PR adds an azcopy function to torchgeo.datasets.utils that makes it easier to download datasets from Azure Blob Storage (such as Source Cooperative). It's basically just a wrapper around subprocess.run, but with a more useful error message if azcopy isn't installed. It can be used as follows:

from torchgeo.datasets.utils import azcopy

azcopy("sync", "https://radiantearth.blob.core.windows.net/mlhub/nasa-tropical-storm-challenge", ".", "--recursive=true")
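To illustrate the "wrapper around subprocess.run with a more useful error message" idea, here is a minimal sketch. It is not the code from this PR: the name `run_tool` and the wording of the error message are hypothetical, and the real function in torchgeo.datasets.utils may differ.

```python
import shutil
import subprocess
import sys


def run_tool(executable: str, *args: str) -> subprocess.CompletedProcess:
    """Run a command-line tool, failing early if it is not installed.

    Hypothetical helper for illustration only; not torchgeo's implementation.
    """
    # shutil.which returns None when the executable cannot be found on PATH.
    if shutil.which(executable) is None:
        raise FileNotFoundError(
            f"'{executable}' is not installed or not on your PATH. "
            "Please install it and try again."
        )
    # check=True raises CalledProcessError on a non-zero exit status.
    return subprocess.run((executable, *args), check=True)


if __name__ == "__main__":
    # Demonstrate with the running Python interpreter, which is known to exist.
    result = run_tool(sys.executable, "-c", "pass")
    print(result.returncode)
```

Checking with shutil.which up front means the user sees an actionable message instead of the bare FileNotFoundError that subprocess.run would raise for a missing executable.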

The hardest part was testing. We don't want our tests to require internet access or download massive datasets, so they need to run against small, local fake data. But we also can't get full test coverage unless we actually attempt to "download" the data, and unlike rsync, azcopy doesn't support local-to-local file transfers. My solution was to create a fake azcopy command that can copy local files and to prepend its directory to the PATH so that it is found first. I don't know of a reliable way to test the case where this command isn't available, so we may need to change CI a bit.
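The fake-command approach can be sketched roughly as follows. This is an assumption-laden illustration, not the test fixture from this PR: the helper names `install_fake_tool` and `prepend_to_path` are hypothetical, the fake script assumes arguments of the form `<verb> <source> <destination> [flags...]`, and it relies on a POSIX system where shebang scripts are executable.

```python
import os
import stat
import subprocess
import sys
import tempfile
from pathlib import Path

# Hypothetical stand-in script: ignores the verb and flags, copies src to dst.
FAKE_TOOL = f"""#!{sys.executable}
import shutil, sys
_, verb, src, dst, *flags = sys.argv
shutil.copytree(src, dst, dirs_exist_ok=True)
"""


def install_fake_tool(bin_dir: Path, name: str = "azcopy") -> Path:
    """Write an executable fake command into bin_dir and return its path."""
    bin_dir.mkdir(parents=True, exist_ok=True)
    tool = bin_dir / name
    tool.write_text(FAKE_TOOL)
    tool.chmod(tool.stat().st_mode | stat.S_IXUSR)
    return tool


def prepend_to_path(bin_dir: Path) -> None:
    """Put bin_dir first on PATH so the fake command shadows any real one."""
    os.environ["PATH"] = f"{bin_dir}{os.pathsep}{os.environ.get('PATH', '')}"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "remote").mkdir()
        (root / "remote" / "data.txt").write_text("fake data")
        install_fake_tool(root / "bin")
        prepend_to_path(root / "bin")
        # The "download" is now a purely local copy.
        subprocess.run(
            ["azcopy", "sync", str(root / "remote"), str(root / "local"), "--recursive=true"],
            check=True,
        )
        print((root / "local" / "data.txt").read_text())
```

In a pytest suite, the PATH change would more likely go through `monkeypatch.setenv("PATH", str(bin_dir), prepend=os.pathsep)` so that it is undone after each test.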

Prerequisite for #1830
Closes #1887
Closes #1915

@Haimantika @darkblue-b Once this is reviewed and merged, I could use your help in porting our existing datasets to use this (full list in #1830). Unfortunately, many of the datasets seem to have completely changed their file hierarchy, so some of them may require more than a simple one-function update.

adamjstewart • May 03 '24 17:05