Have archivetar not immediately fail if Globus is unavailable temporarily

Open brockpalen opened this issue 1 year ago • 2 comments

Currently, if Globus fails (for example, when its httpd dies, which happens occasionally), archivetar fails immediately.

The desired outcome is for archivetar to keep retrying for some period of time before giving up.

For example, the failure currently looks like this:

Unable to connect to <>:443\\nglobus_xio: System error in connect: Connection refused\\nglobus_xio: A system call failed: Connection refused\\n\n", 'eHotMF73v')

brockpalen, Jan 23 '24 17:01

I'm looking at https://github.com/jd/tenacity to implement some retries. I also have an issue open with the Globus team to see whether they have anything built in or a recommended best practice.
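To make the desired behavior concrete, here is a minimal sketch of "retry for a while, then give up", written with only the standard library rather than tenacity. The names `retry_for`, `total_seconds`, and `wait_seconds` are hypothetical and are not part of archivetar, tenacity, or the Globus SDK. It also assumes the failure surfaces as a Python `ConnectionError`, which may not match how the Globus error actually arrives.

```python
import time


def retry_for(func, *, total_seconds, wait_seconds,
              retry_on=(ConnectionError,),
              sleep=time.sleep, clock=time.monotonic):
    """Call func() until it succeeds or the time budget is used up.

    Hypothetical helper for illustration only.
    """
    deadline = clock() + total_seconds
    while True:
        try:
            return func()
        except retry_on:
            # If waiting again would take us past the deadline,
            # give up and re-raise the most recent error.
            if clock() + wait_seconds > deadline:
                raise
            sleep(wait_seconds)
```

A call such as `retry_for(submit, total_seconds=300, wait_seconds=5)`, where `submit` stands in for whatever function talks to Globus, would keep trying for roughly five minutes. tenacity packages the same idea as decorators with configurable stop and wait strategies.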

brockpalen, Feb 06 '24 16:02

From the Globus team:

The SDK supports timeout and retry customization via the client's .transport attribute, which is an instance of the RequestsTransport class [documentation link].

There are several customization options exposed as attributes, but I think that the following will be helpful in this situation:

    .TRANSIENT_ERROR_STATUS_CODES
    .retry_backoff()
    .max_retries

Looking at the archivetar code, it may be that code like this will accommodate longer retries, and enforce retries on HTTP 404:

# After instantiating the TransferClient
# --------------------------------------

# Add HTTP 404 as a status code that should be retried.
self.tc.transport.TRANSIENT_ERROR_STATUS_CODES += (404, )

# Retry once per second, without any backoff.
self.tc.transport.retry_backoff = lambda *_, **__: 1.0

# Allow up to 100 retries.
# This may result in more than 2 minutes of retries.
self.tc.transport.max_retries = 100

This will result in several minutes of retries before an exception is raised.
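As a rough check on that timing, the sketch below estimates the minimum time spent before the final exception. It assumes one initial attempt followed by `max_retries` retries, with a constant sleep between them. The function `min_retry_window` and the `per_attempt_seconds` figure are hypothetical, and the real time each failed request takes will vary.

```python
def min_retry_window(max_retries, backoff_seconds, per_attempt_seconds=0.0):
    """Lower bound, in seconds, on the time spent before giving up.

    Assumes 1 initial attempt + max_retries retries, with a constant
    sleep of backoff_seconds before each retry (illustrative model only).
    """
    sleep_time = max_retries * backoff_seconds
    request_time = (max_retries + 1) * per_attempt_seconds
    return sleep_time + request_time


# Sleeps alone: 100 retries at 1 second each.
print(min_retry_window(100, 1.0))                             # 100.0
# Adding an assumed 0.25 s per failed request attempt.
print(min_retry_window(100, 1.0, per_attempt_seconds=0.25))   # 125.25
```

Under these assumptions the sleeps alone account for 100 seconds, so the total only passes the two-minute mark once each failed attempt itself takes a noticeable amount of time.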

brockpalen, Feb 08 '24 16:02