AWS Batch "too many requests" errors are not retriable
Hello. Regarding this line and the similar logic in the functions around it:
https://github.com/nextflow-io/nextflow/blob/b099d430aa06b350ff63b6f3ae291dd72f49c779/plugins/nf-amazon/src/main/nextflow/cloud/aws/batch/AwsBatchTaskHandler.groovy#L767
We have seen quite a few TooManyRequests exceptions when Batch is under high load. These should be recoverable and retryable, but Nextflow crashes out because the error code is 429.
Too Many Requests (Service: AWSBatch; Status Code: 429; Error Code: TooManyRequestsException; Request ID: 4ba5587d-f670-4c0e-986f-a24046298d69; Proxy: null)
To my understanding, the AWS Java SDK (used by Nextflow) automatically retries requests that fail with a 429 status code:
https://docs.aws.amazon.com/general/latest/gr/api-retries.html
So if you are getting this error from Nextflow then it means that the request has already been retried at least a few times. I don't think Nextflow currently allows these retry settings to be changed via `nextflow.config`, so it would be good to add those settings so that you can experiment with them.
For now, you can adjust `executor.submitRateLimit` and `executor.pollInterval` so that Nextflow calls the AWS Batch API less frequently.
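As a rough sketch, the two settings could go in `nextflow.config` like this. The values are placeholders chosen for illustration, not recommendations, and would need tuning against your own Batch API limits:

```groovy
// nextflow.config
// Illustrative values only (assumptions, not tested recommendations).
executor {
    // Submit at most 50 jobs every 2 minutes
    submitRateLimit = '50/2min'
    // Check job status less often to reduce polling calls
    pollInterval = '30 sec'
}
```

Lower submit rates and longer poll intervals reduce API pressure at the cost of slower task scheduling and slower detection of finished jobs.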
Thanks, it would be good to have more flexibility here. We are running hundreds of Nextflow instances simultaneously, so rate limiting within each instance only gets us so far.
Some notes after looking into this issue:
- the `azurebatch` executor actually uses a third-party library called `failsafe` to wrap API calls in a retry strategy
- in the AWS Java SDK, the retry settings can be configured via `ClientConfiguration` and `RetryPolicy`
- relevant options include the backoff strategy, max error retry, and retry "mode"
- there are a number of predefined backoff strategies which can all be configured with a base delay and max delay
So I think we can implement some or all of the following options:
- `aws.client.maxErrorRetry` (default 3)
- `aws.client.retryMode` (can be ADAPTIVE, LEGACY, or STANDARD)
- `aws.client.backoffStrategy` (can be equal jitter, exponential, full jitter, or sdk default)
- `aws.client.baseDelay`
- `aws.client.maxDelay`
I don't see any way to provide a numerical value for the jitter. Also not sure if we should support the retry mode.
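To make the backoff strategies named above more concrete, here is a minimal, dependency-free Groovy sketch of how exponential, full-jitter, and equal-jitter delays are commonly computed from a base delay and a max delay. The helper names are hypothetical and this is not the SDK's own code. It is only meant to illustrate that in these schemes the jitter is drawn as a random portion of the exponential delay, which may be why there is no separate numeric jitter setting:

```groovy
// Hypothetical helpers for illustration; not the AWS SDK's implementation.
// All delays are in milliseconds.

// Exponential backoff: baseDelay * 2^retries, capped at maxDelay.
long exponentialDelay(int retries, long baseDelay, long maxDelay) {
    int shift = Math.min(retries, 30)   // keep the shift small for large retry counts
    return Math.min(maxDelay, baseDelay * (1L << shift))
}

// Full jitter: uniform random delay in [0, exponential delay].
long fullJitterDelay(int retries, long baseDelay, long maxDelay, Random rnd) {
    long ceiling = exponentialDelay(retries, baseDelay, maxDelay)
    return (long) (rnd.nextDouble() * (ceiling + 1))
}

// Equal jitter: keep half of the exponential delay, randomize the other half.
long equalJitterDelay(int retries, long baseDelay, long maxDelay, Random rnd) {
    long half = exponentialDelay(retries, baseDelay, maxDelay) >> 1
    return half + (long) (rnd.nextDouble() * (half + 1))
}

// Print the delay each strategy would pick for the first few retries,
// using an assumed base delay of 100 ms and max delay of 20 s.
def rnd = new Random()
(0..4).each { int retries ->
    println "retry ${retries}: exponential=${exponentialDelay(retries, 100, 20000)} ms, " +
            "full jitter=${fullJitterDelay(retries, 100, 20000, rnd)} ms, " +
            "equal jitter=${equalJitterDelay(retries, 100, 20000, rnd)} ms"
}
```

With hundreds of clients hitting the same API, the jittered variants spread retries out in time, whereas plain exponential backoff makes every client retry at the same moments.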
Nextflow has a config option `aws.batch.retryMode` (docs), which respects rate-limiting responses by default. However, this setting is only applied to CLI commands used by tasks and not to the AWS SDK used by Nextflow.
For Nextflow, you should set `AWS_RETRY_MODE=standard` in your launch environment. Let me know if that helps.