Improve download stability

Open aazuspan opened this issue 3 years ago • 4 comments

The current download system is pretty solid with automated retrying, but the cdsapi package has a more extensive system that should improve download stability. See their implementation for reference.
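
For context, automated retrying of the kind mentioned above is commonly done with exponential backoff and jitter. The sketch below is generic and is not taken from cdsapi or from the current download code; the helper name `retry_with_backoff` and all of its parameters are hypothetical.

```python
import random
import time


def retry_with_backoff(func, max_attempts=5, base_delay=1.0, max_delay=60.0,
                       retry_on=(Exception,), sleep=time.sleep):
    """Call func() until it succeeds or max_attempts is reached.

    Hypothetical helper: the name and defaults are illustrative only.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts:
                # Out of attempts: let the last error propagate.
                raise
            # Double the delay on each failure, capped at max_delay,
            # then scale by a random factor so parallel workers
            # do not all retry at the same moment.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            sleep(delay * random.uniform(0.5, 1.0))
```

Passing `sleep` as an argument keeps the helper easy to test without actually waiting; in real use the default `time.sleep` applies.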

aazuspan avatar Sep 10 '21 00:09 aazuspan

There are a few recent additions to the EE API that may make this easier.

  1. ee.Image.getDownloadURL now accepts a format parameter so we no longer have to deal with zipped GeoTiffs.
  2. ee.data.computePixels allows image downloads without the intermediate URL generation step. I'm not sure what the performance implications are, but it should at least simplify if not speed up downloads. This method seems to accept the same parameters and be subject to the same size restrictions as getDownloadURL. I believe this was previously REST API only, but is now available through the Python client API.
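
As a rough illustration of the first point, requesting an unzipped GeoTIFF could look something like this. It is a sketch only: the wrapper name `geotiff_download_url` and its defaults are hypothetical, and `image` is assumed to be an `ee.Image` from an already initialized Earth Engine session.

```python
def geotiff_download_url(image, region, scale, crs="EPSG:4326"):
    """Return a download URL for an unzipped GeoTIFF of `image`.

    Hypothetical wrapper. `image` is assumed to be an ee.Image and
    `region` an ee.Geometry (or equivalent coordinates).
    """
    params = {
        "region": region,
        "scale": scale,
        "crs": crs,
        # Ask for a plain GeoTIFF rather than the zipped variant.
        "format": "GEO_TIFF",
    }
    return image.getDownloadURL(params)


# Example use (requires an authenticated Earth Engine session):
#   import ee
#   ee.Initialize()
#   url = geotiff_download_url(some_image, some_region, scale=4000)
```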

aazuspan avatar Mar 08 '23 01:03 aazuspan

Just ran some benchmarks on download speed and fsspec seems to have a big advantage over the current requests system. It can also handle concurrent downloads out of the box. That may be useful, but unfortunately I don't think it will be enough to let us drop joblib as a dependency since we'll still need that for grabbing URLs.

I don't love the idea of adding a new dependency, but if it can reduce download times substantially and simplify the download system, I think it's worth adding fsspec.
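
For reference, fetching several URLs through fsspec might be sketched as below. This assumes fsspec's HTTP filesystem (which needs aiohttp installed) and relies on `cat` accepting a list of paths and returning a dict of bytes keyed by path; the function name `download_many` is hypothetical, and this is not the code used for the benchmarks.

```python
def download_many(urls, fs=None):
    """Fetch the contents of several URLs and return {url: bytes}.

    Hypothetical helper. If no filesystem is given, an fsspec HTTP
    filesystem is created (assumes fsspec and aiohttp are installed).
    """
    if fs is None:
        import fsspec  # imported lazily so the dependency stays optional

        fs = fsspec.filesystem("https")
    # Passing a list to cat() lets the filesystem fetch the paths together
    # instead of one call per URL.
    return fs.cat(list(urls))
```

The URLs themselves would still have to be generated first, for example in parallel with joblib as noted above.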

aazuspan avatar Apr 11 '23 02:04 aazuspan

With ee.data.computePixels now available in the Python API (as of 2023-02-15), that will probably be the most straightforward way to grab image data.

It has the same size limitation as other methods, but allows data to be retrieved directly rather than through an intermediate URL, which should be a win for performance, simplicity, and reliability. It would also let us avoid adding fsspec as a dependency and probably drop requests too.

I need to do some benchmarking to make sure there are no downsides, but at the moment this looks like the way to go. Note that as with all direct GEO_TIFF format downloads, it does not currently export band names, which means we unfortunately have to grab them manually with getInfo.
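
A minimal sketch of that workflow follows, assuming `image` is an `ee.Image` and the Earth Engine session is already initialized. The helper names `build_grid` and `download_image` are hypothetical, and the request layout (`expression`, `fileFormat`, `grid`) reflects my understanding of the `computePixels` parameters rather than tested code from this project.

```python
def build_grid(width, height, origin_x, origin_y, scale, crs="EPSG:4326"):
    """Describe the output pixel grid for a computePixels request.

    Hypothetical helper: (origin_x, origin_y) is the upper-left corner
    and `scale` is the pixel size in CRS units.
    """
    return {
        "dimensions": {"width": width, "height": height},
        "affineTransform": {
            "scaleX": scale,
            "shearX": 0,
            "translateX": origin_x,
            "shearY": 0,
            "scaleY": -scale,  # negative: rows run north to south
            "translateY": origin_y,
        },
        "crsCode": crs,
    }


def download_image(image, grid, compute_pixels=None):
    """Return (geotiff_bytes, band_names) for `image`."""
    if compute_pixels is None:
        import ee  # assumes ee.Initialize() has already been called

        compute_pixels = ee.data.computePixels
    request = {"expression": image, "fileFormat": "GEO_TIFF", "grid": grid}
    data = compute_pixels(request)
    # Band names are not carried in the GeoTIFF, so ask for them separately.
    band_names = image.bandNames().getInfo()
    return data, band_names
```

When downloading many images with identical bands, the `getInfo` call could be made once and the result reused.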

aazuspan avatar Jun 27 '23 02:06 aazuspan

A quick-and-dirty benchmark test says computePixels is noticeably faster than downloading with fsspec and the current requests implementation, even for a single image where you have to grab bandNames. With more images, that improvement should scale since bandNames will only need to be retrieved once.

Time to download a single-band GridMET image at native resolution:

| Method | Time |
| --- | --- |
| getDownloadURL + requests | 3.2 s |
| getDownloadURL + fsspec | 2.2 s |
| computePixels | 1.9 s |
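
For anyone wanting to repeat this kind of comparison, a minimal timing harness might look like the following. It is only a sketch and not the script behind the numbers above; the function names `time_call` and `compare` are hypothetical, and each method is assumed to be wrapped in a zero-argument callable.

```python
import time


def time_call(func, repeats=3):
    """Return the best wall-clock time in seconds over `repeats` calls."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        best = min(best, time.perf_counter() - start)
    return best


def compare(methods, repeats=3):
    """Time each named callable and return {name: best_seconds}."""
    return {name: time_call(func, repeats) for name, func in methods.items()}


# Example use, with each download method wrapped in a callable:
#   results = compare({
#       "getDownloadURL + requests": download_with_requests,
#       "computePixels": download_with_compute_pixels,
#   })
```

Taking the best of several runs reduces the effect of one-off network hiccups, though server-side caching can still skew repeated requests for the same image.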

aazuspan avatar Jun 27 '23 04:06 aazuspan