elastic-ci-stack-for-aws icon indicating copy to clipboard operation
elastic-ci-stack-for-aws copied to clipboard

Retry s3.GetBucketRegion API

Open keithduncan opened this issue 2 years ago • 9 comments

It has been reported in Slack that this API call can be an intermittent source of job failure.

This API call was added in v5.5.0 as part of supporting cross-region secrets buckets in the elastic-ci-stack-s3-secrets-hooks submodule.

  • [ ] Places to fix handling for 503 Service Unavailable
    • [x] s3secrets-helper, s3.go New https://github.com/buildkite/elastic-ci-stack-s3-secrets-hooks/issues/51
    • [x] git-credentials-s3-secrets
    • [ ] agent artifacts s3.go newS3Client https://github.com/buildkite/agent/issues/1525
  • [x] Improve the error in s3secrets to better describe
    • [x] What was the process doing when it encountered this error

keithduncan avatar Oct 05 '21 04:10 keithduncan

Some guidance from S3 that all 5xx errors are retryable, recommending exponential backoff and up to 10 retries https://aws.amazon.com/premiumsupport/knowledge-center/http-5xx-errors-s3/

keithduncan avatar Oct 12 '21 01:10 keithduncan

I took a look at aws-sdk-go-v2 to see if there had been any improvements in this area. It looks like some 5xx errors are retried by default.

Considering this affects artifact upload/download in the agent too I’m going to trial a migration in the s3secrets-helper to see how it goes.

keithduncan avatar Oct 19 '21 06:10 keithduncan

Yeah nice. I used aws-sdk-go-v2 a few times before it was GA, and it seemed nicer than v1. It was in beta/preview status for ages, I think I missed, or forgot, that it went GA with v1.0.0 on 2021-01-19 👍🏼

pda avatar Oct 19 '21 07:10 pda

I’ve found docs on the AWS CLI v1 and v2 that says they already retry 503 errors so I’ve ticked the git-credentials-s3-secrets todo.

keithduncan avatar Oct 26 '21 04:10 keithduncan

Even at the current default of 3 retries, I've lately been seeing this fail an incredible amount (I just had to rerun a build 5 times); error message looks something like:

~~~ :llama: Setting up elastic stack environment (v5.7.2)

Checking docker

CONTAINER ID   IMAGE     COMMAND   CREATED   STATUS    PORTS     NAMES

Checking disk space

Disk space free: 217G

Inodes free: 125M

Configuring built-in plugins

Secrets plugin enabled

ECR plugin enabled

Docker-login plugin enabled

Discovered current region as "us-east-2"

fatal error: Could not discover the region for bucket "<my-secrets-bucket>": (operation error S3: HeadBucket, exceeded maximum number of attempts, 3, https response error StatusCode: 503, RequestID: <snip>, HostID: <snip>, api error ServiceUnavailable: Service Unavailable)

glittershark avatar Nov 04 '21 19:11 glittershark

:sob: sorry this has proven so unreliable @glittershark, can you confirm whether your case falls into the category of a bucket in the same region as the stack?

While this was added in service of cross-region bucket support, we can’t afford to have it impair same-region support like this, and I apologise.

We can revise this to support a final fallback to the discovered "current" region if dynamic discovery fails. While that won’t be functional in the cross region case, if the API is this flakey the only other option I can think of is requiring a stack parameter for the secret bucket’s region which bypasses dynamic discovery completely.

keithduncan avatar Nov 05 '21 00:11 keithduncan

@keithduncan oh, to be clear, I don't view this as your fault at all - from what I can tell this is entirely AWS being super flaky (unless I'm misunderstanding the issue) - regardless, super grateful for the awesome responsiveness you've had on all the issues in this repo :)

can you confirm whether your case falls into the category of a bucket in the same region as the stack?

Yeah, confirmed both the stack and the bucket are in us-east-2

We can revise this to support a final fallback to the discovered "current" region if dynamic discovery fails. While that won’t be functional in the cross region case, if the API is this flakey the only other option I can think of is requiring a stack parameter for the secret bucket’s region which bypasses dynamic discovery completely.

Personally I'd be totally happy with either of these options - passing an extra parameter is super straightforward, especially since all our stacks are in terraform.

glittershark avatar Nov 05 '21 14:11 glittershark

We've started to see the frequency of this failure escalate quite a bit. We are also in us-east-2 and our secrets bucket is same-region.

geoffharcourt avatar Nov 12 '21 01:11 geoffharcourt

We can revise this to support a final fallback to the discovered "current" region if dynamic discovery fails. While that won’t be functional in the cross region case, if the API is this flakey the only other option I can think of is requiring a stack parameter for the secret bucket’s region which bypasses dynamic discovery completely.

I’ve added both of these and they have landed in https://github.com/buildkite/elastic-ci-stack-for-aws/pull/962 which will be released in v5.7.3 shortly. The changes are minor beyond that change if any one is interested in using it pre-release by updating your stacks to use the master branch built template at https://s3.amazonaws.com/buildkite-aws-stack/master/aws-stack.yml

keithduncan avatar Nov 24 '21 05:11 keithduncan