AWS batch "too many requests" errors are not retriable

Open jwarwick-delfi opened this issue 1 year ago • 3 comments

Hello. Regarding this line and the similar logic in the functions around it:

https://github.com/nextflow-io/nextflow/blob/b099d430aa06b350ff63b6f3ae291dd72f49c779/plugins/nf-amazon/src/main/nextflow/cloud/aws/batch/AwsBatchTaskHandler.groovy#L767

We have seen quite a few TooManyRequestsException errors when Batch is under high load. They should be recoverable and retried, but Nextflow aborts the run because the status code is 429.

Too Many Requests (Service: AWSBatch; Status Code: 429; Error Code: TooManyRequestsException; Request ID: 4ba5587d-f670-4c0e-986f-a24046298d69; Proxy: null)

jwarwick-delfi, Jul 28 '22 21:07

To my understanding, the AWS Java SDK (used by Nextflow) automatically retries requests that fail with a 429 status code:

https://docs.aws.amazon.com/general/latest/gr/api-retries.html

So if you are getting this error from Nextflow, the request has already been retried at least a few times. I don't think Nextflow currently allows these retry settings to be changed via nextflow.config, so it would be good to add them so that you can experiment.

For now, you can adjust executor.submitRateLimit and executor.pollInterval so that Nextflow calls the AWS Batch API less frequently.

bentsherman, Jul 29 '22 14:07

Thanks, it would be good to have more flexibility here. We are running hundreds of nextflow instances simultaneously, so the rate limiting is only getting us so far.

jwarwick-delfi, Jul 29 '22 19:07

Some notes after looking into this issue:

  • the azurebatch executor actually uses a third-party library called failsafe to wrap API calls in a retry strategy
  • in the AWS Java SDK, the retry settings can be configured via ClientConfiguration and RetryPolicy
  • relevant options include the backoff strategy, max error retry, and retry "mode"
  • there are a number of predefined backoff strategies which can all be configured with a base delay and max delay
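
For readers unfamiliar with the strategy names, here is a minimal sketch of how exponential, full-jitter, and equal-jitter backoff are commonly defined, each driven by a base delay and a max delay. It is an illustration in Python, not the AWS SDK's code: the function names, the millisecond units, and the 100 ms / 20 s values are assumptions chosen for the example.

```python
import random


def exponential_delay(attempt, base_delay_ms, max_delay_ms):
    """Exponential backoff: base * 2**attempt, capped at max_delay_ms."""
    return min(max_delay_ms, base_delay_ms * (2 ** attempt))


def full_jitter_delay(attempt, base_delay_ms, max_delay_ms, rng=random):
    """Full jitter: a uniform random delay between 0 and the exponential cap."""
    cap = exponential_delay(attempt, base_delay_ms, max_delay_ms)
    return rng.uniform(0, cap)


def equal_jitter_delay(attempt, base_delay_ms, max_delay_ms, rng=random):
    """Equal jitter: half of the cap is fixed, the other half is random."""
    cap = exponential_delay(attempt, base_delay_ms, max_delay_ms)
    return cap / 2 + rng.uniform(0, cap / 2)


if __name__ == "__main__":
    # Illustrative values only (assumed): 100 ms base delay, 20 s max delay.
    for attempt in range(6):
        print(
            attempt,
            exponential_delay(attempt, 100, 20_000),
            round(full_jitter_delay(attempt, 100, 20_000)),
            round(equal_jitter_delay(attempt, 100, 20_000)),
        )
```

In both jittered variants the random spread is derived from the capped exponential delay, so the base delay and max delay are the only tunable inputs in this sketch.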

So I think we can implement some or all of the following options:

  • aws.client.maxErrorRetry (default 3)
  • aws.client.retryMode (can be ADAPTIVE, LEGACY, or STANDARD)
  • aws.client.backoffStrategy (can be equal jitter, exponential, full jitter, or SDK default)
  • aws.client.baseDelay
  • aws.client.maxDelay

I don't see any way to provide a numerical value for the jitter. Also not sure if we should support the retry mode.

bentsherman, Aug 02 '22 21:08

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot], Dec 31 '22 21:12

Nextflow has a config option aws.batch.retryMode (docs), which respects rate-limiting responses by default. However, this setting is applied only to the AWS CLI commands used by tasks, not to the AWS SDK used by Nextflow itself.

For Nextflow, you should set AWS_RETRY_MODE=standard in your launch environment. Let me know if that helps.

bentsherman, Aug 24 '23 16:08