
Implement download retry mechanism

Open mfisher87 opened this issue 1 year ago • 4 comments

We discussed this in a few places now:

https://github.com/nsidc/earthaccess/issues/481#issuecomment-1978680018

https://github.com/nsidc/earthaccess/issues/594#issuecomment-2161829080

Figured it's time for a dedicated issue :grin:

mfisher87 avatar Jun 12 '24 00:06 mfisher87

Any updates on this plan?

I'm trying to process a month of data using the earthaccess tool to grab 1 PACE file at a time, but for some reason Earthdata is giving timeout errors quite often, making it difficult to actually process multiple days of files. I'm not sure if there is an issue with Earthdata, so I've reached out to them (awaiting a response), but I'm also wondering whether the retry option might be available soon. (Or is there at least a way to get an error returned from earthaccess so that I can implement a manual retry after a 30 or 60 second wait?)
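
For reference, a manual retry along those lines could look roughly like the sketch below (untested; it assumes a failed download surfaces as an exception that reaches the caller, which may not hold for every failure mode):

import time

import earthaccess

earthaccess.login(persist=True)
results = earthaccess.search_data(
    short_name='PACE_OCI_L1B_SCI',
    temporal=('2024-05-01', '2024-05-02'),
    count=6,
)

# Hypothetical wrapper: try the download a few times, waiting between
# attempts, and give up (re-raise) only after the last attempt fails.
def download_with_retries(granules, local_path, attempts=3, wait_seconds=30):
    for attempt in range(1, attempts + 1):
        try:
            return earthaccess.download(granules, local_path)
        except Exception as err:  # assumes failures surface as exceptions
            if attempt == attempts:
                raise
            print(f'Attempt {attempt} failed ({err}); retrying in {wait_seconds} s')
            time.sleep(wait_seconds)

files = download_with_retries(results, './data')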

Thanks!

zfasnacht1013 avatar Oct 27 '24 21:10 zfasnacht1013

Hi @zfasnacht1013, could you provide a code snippet that's giving you timeout errors? I would like to bring it to the attention of the OB.DAAC if that's where the problem seems to be. Thanks for reporting!

itcarroll avatar Oct 28 '24 14:10 itcarroll

@itcarroll

It's something as simple as

import earthaccess 

start_date = '2024-05-01 00:00:00'
end_date = '2024-05-01 23:59:00'

min_lon = -120; max_lon = -100; min_lat = 20; max_lat = 40
earthaccess.login(persist=True)
results = earthaccess.search_data(
    short_name='PACE_OCI_L1B_SCI',
    cloud_hosted=True,
    temporal=(start_date, end_date),
    count=400,
    bounding_box=(min_lon, min_lat, max_lon, max_lat),
    version='2',
)

earthaccess.download(results, '')

I don't think it's an OB.DAAC issue, though; I've been having issues with TROPOMI data as well.

The problem is, I'm trying to grab, say, 6 PACE granules at a time. Normally only 1 or 2 fail, but of course then I have gaps. Also, I'm trying to download the files temporarily and then delete them when I'm done with each one, because I don't want to be storing TBs of PACE data locally.
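
In case it helps, a rough sketch of that temporary-download pattern, assuming each granule can be processed independently (process_granule is a placeholder for whatever analysis actually runs):

import tempfile
from pathlib import Path

import earthaccess

def process_granule(path):
    # Placeholder for the actual per-file processing.
    print(f'processing {path}')

earthaccess.login(persist=True)
results = earthaccess.search_data(
    short_name='PACE_OCI_L1B_SCI',
    temporal=('2024-05-01', '2024-05-02'),
    count=6,
)

# Download into a temporary directory that is removed automatically when
# the block exits, so granules never accumulate on disk.
with tempfile.TemporaryDirectory() as tmpdir:
    files = earthaccess.download(results, tmpdir)
    for f in files:
        process_granule(Path(f))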

zfasnacht1013 avatar Oct 28 '24 14:10 zfasnacht1013

Similar to @zfasnacht1013, I tried using earthaccess.search_data / earthaccess.download to download multiple files (~ a month of GPM_MERGIR files in my case). The first time, 3 failed with HTTP error 500. The second time, 1 (of those 3) failed. The third time, all succeeded.

zmoon avatar Oct 28 '24 15:10 zmoon

Until we support a configurable retry mechanism for downloading, here is a workaround (a modification of the code given in a previous comment), which makes use of the tenacity library:

import earthaccess 
import tenacity  # NEW IMPORT

start_date = '2024-05-01 00:00:00'
end_date = '2024-05-01 23:59:00'

min_lon, max_lon, min_lat, max_lat = -120, -100, 20, 40
earthaccess.login(persist=True)

# ----- BEGIN NEW CODE (must appear AFTER calling earthaccess.login)

# Create a retrier function, wrapping the earthaccess.Store._download_file function so
# that it will simply retry each failing download (using exponential backoff to help
# avoid resource contention). By replacing the existing function with the wrapper, when
# we call earthaccess.download, it will use our wrapper to download each file.
always_retry = tenacity.retry(wait=tenacity.wait_random_exponential(multiplier=1, max=60))
tenaciously_download_file = always_retry(earthaccess.__store__._download_file)
earthaccess.__store__._download_file = tenaciously_download_file

# ----- END NEW CODE

results = earthaccess.search_data(
    short_name='PACE_OCI_L1B_SCI',
    cloud_hosted=True,
    temporal=(start_date, end_date),
    count=400,
    bounding_box=(min_lon, min_lat, max_lon, max_lat),
    version='2',
)

earthaccess.download(results, '')
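
If retrying forever is undesirable, tenacity can also cap the number of attempts and retry only on specific exception types. For example, the always_retry lines in the snippet above could be replaced with something like this (adjust the exception class to whatever your failures actually raise):

import requests
import tenacity

# Retry up to 5 times with exponential backoff, but only for requests-level
# errors; any other exception propagates immediately. reraise=True makes the
# final failure raise the original exception rather than a RetryError.
bounded_retry = tenacity.retry(
    wait=tenacity.wait_random_exponential(multiplier=1, max=60),
    stop=tenacity.stop_after_attempt(5),
    retry=tenacity.retry_if_exception_type(requests.exceptions.RequestException),
    reraise=True,
)
earthaccess.__store__._download_file = bounded_retry(earthaccess.__store__._download_file)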

chuckwondo avatar Oct 31 '24 15:10 chuckwondo

@zfasnacht1013, although the workaround above should do the trick, and can also serve as a basis for adding such functionality directly to earthaccess, would you mind elaborating on your use case, if you can?

In general, we want to discourage downloading files in their entirety and instead provide advice on how to perform direct reads, grabbing only the parts of the files that contain the data you need, assuming you don't actually need the full files.
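
As a rough illustration of that direct-read pattern (hedged: the group and variable names below are illustrative guesses for OCI L1B rather than the confirmed layout, and xarray may need the h5netcdf engine installed to open the remote file objects):

import earthaccess
import xarray as xr

earthaccess.login(persist=True)
results = earthaccess.search_data(
    short_name='PACE_OCI_L1B_SCI',
    temporal=('2024-05-01', '2024-05-02'),
    count=1,
)

# Open remote file-like objects (HTTPS, or S3 when running in us-west-2)
# instead of downloading whole granules to disk.
files = earthaccess.open(results)

# Lazily open one granule and pull only the slice you need; only the bytes
# actually accessed are transferred. Group/variable names are assumptions.
ds = xr.open_dataset(files[0], group='observation_data')
subset = ds['rhot_blue'][:10, :, :]  # first 10 bands; indexing is illustrative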

chuckwondo avatar Nov 02 '24 14:11 chuckwondo

@chuckwondo I'm developing research trace gas algorithms for PACE. I'm using the PACE L1B at 1 km and use the full spectra of reflectances, so there's really not much of a way to subset the files before I use them. I need to start scaling up to produce 1-2 months of data to use for validation. Since it's a research product, I'm developing it on NCCS and not in any PACE sandbox.

This is going to be a general theme moving forward, not only with PACE, but other instruments. We are doing something similar with TEMPO and will also need to grab large chunks of data to process and develop our products for validation.

I'm not sure this is something that has been considered much yet at Earthdata. I would assume the ideal scenario for Earthdata is that folks work in AWS for development to limit network transfer, but since AWS is pay-to-play and NCCS is not, with budgets being tight, we are left to develop on NCCS.

This might be something for further discussion, to work out ideas on how to move forward with this kind of use case. Feel free to reach out to me if we should have a meeting to discuss further.

zfasnacht1013 avatar Nov 02 '24 14:11 zfasnacht1013

Closing as implemented in #1061!!

betolink avatar Aug 27 '25 14:08 betolink