botocore

NoCredentialsError due to wrongly breaking the retry loop

Open attilasimon72 opened this issue 4 years ago • 3 comments

Description In some cases IMDSFetcher._fetch_metadata_token can ignore your AWS_METADATA_SERVICE_NUM_ATTEMPTS configuration, and you will end up getting NoCredentialsError on the first attempt.

Details Our web app runs on EC2 with role-based authentication and relies on InstanceMetadataFetcher.

We have this config for the metadata service:

AWS_METADATA_SERVICE_TIMEOUT=5 AWS_METADATA_SERVICE_NUM_ATTEMPTS=3

When the load goes high and autoscaling launches a new EC2 instance, the web app on the new instance floods the metadata service with credential requests. (Boto sessions are created on first use and then persist for the lifetime of the workers, but under the high load at startup all the workers are busy getting credentials in parallel.) As a result, a small number of boto sessions fail to get credentials, and due to #1791 they get stuck repeatedly raising NoCredentialsError.

We attempted a trivial workaround by setting the number of retries higher:

AWS_METADATA_SERVICE_TIMEOUT=5 AWS_METADATA_SERVICE_NUM_ATTEMPTS=12

But nothing changed: we got the same chain of NoCredentialsError again.

By adding logs to our code we identified that instead of exhausting all the configured 12 attempts, InstanceMetadataFetcher gave up after the first one.

Looking at the code it was easy to point out the reason in IMDSFetcher._fetch_metadata_token:

If an attempt in the for loop of _fetch_metadata_token ends with ReadTimeoutError, the handler returns None immediately, so the loop exits without exhausting the configured AWS_METADATA_SERVICE_NUM_ATTEMPTS. In our case this happened on the first attempt. The definition of RETRYABLE_HTTP_ERRORS at the top of the module also includes ReadTimeoutError, which indicates it was meant to be retryable.

RETRYABLE_HTTP_ERRORS = (
    ReadTimeoutError, EndpointConnectionError, ConnectionClosedError,
    ConnectTimeoutError,
)
...
    def _fetch_metadata_token(self):
        self._assert_enabled()
        url = self._base_url + self._TOKEN_PATH
        headers = {
            'x-aws-ec2-metadata-token-ttl-seconds': self._TOKEN_TTL,
        }
        self._add_user_agent(headers)
        request = botocore.awsrequest.AWSRequest(
            method='PUT', url=url, headers=headers)
        for i in range(self._num_attempts):
            try:
                response = self._session.send(request.prepare())
                if response.status_code == 200:
                    return response.text
                elif response.status_code in (404, 403, 405):
                    return None
                elif response.status_code in (400,):
                    raise BadIMDSRequestError(request)
            except ReadTimeoutError:
                return None
            except RETRYABLE_HTTP_ERRORS as e:
                logger.debug(
                    "Caught retryable HTTP exception while making metadata "
                    "service request to %s: %s", url, e, exc_info=True)
            except HTTPClientError as e:
                if isinstance(e.kwargs.get('error'), LocationParseError):
                    raise InvalidIMDSEndpointError(endpoint=url, error=e)
                else:
                    raise
        return None

Expected behavior ReadTimeoutError should not break the loop, because it violates the documented behaviour of AWS_METADATA_SERVICE_NUM_ATTEMPTS.
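To make the difference concrete, here is a minimal, self-contained sketch of the loop's control flow. It does not import botocore: the exception classes are stand-ins, and fetch_token, make_flaky_send and the fail_fast_on_read_timeout parameter are hypothetical names used only for illustration, not part of the library.

```python
class ReadTimeoutError(Exception):
    """Stand-in for botocore.exceptions.ReadTimeoutError (illustration only)."""


class ConnectTimeoutError(Exception):
    """Stand-in for botocore.exceptions.ConnectTimeoutError (illustration only)."""


RETRYABLE_HTTP_ERRORS = (ReadTimeoutError, ConnectTimeoutError)


def fetch_token(send, num_attempts, fail_fast_on_read_timeout):
    """Simplified model of the token retry loop.

    Returns a (token, attempts_made) tuple. The fail_fast_on_read_timeout
    flag is hypothetical: True mimics the current early return, False
    mimics the behaviour this issue expects.
    """
    attempts = 0
    for _ in range(num_attempts):
        attempts += 1
        try:
            return send(), attempts
        except ReadTimeoutError:
            if fail_fast_on_read_timeout:
                # Current behaviour: give up on the first read timeout.
                return None, attempts
            # Expected behaviour: fall through and try again.
        except RETRYABLE_HTTP_ERRORS:
            # Other retryable errors always move on to the next attempt.
            pass
    return None, attempts


def make_flaky_send(timeouts_before_success):
    """Build a fake send() that times out N times, then returns a token."""
    remaining = [timeouts_before_success]

    def send():
        if remaining[0] > 0:
            remaining[0] -= 1
            raise ReadTimeoutError("simulated read timeout")
        return "token"

    return send


# One timeout followed by success, with 3 attempts configured.
print(fetch_token(make_flaky_send(1), 3, True))   # (None, 1)
print(fetch_token(make_flaky_send(1), 3, False))  # ('token', 2)
```

With the early return, a single transient timeout ends the fetch no matter how large the attempt count is, which matches what we saw with 12 attempts configured.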

Commit and PR introducing this The ReadTimeoutError handler was added by commit 66ee37d8965c371fcfa3060d8fb9946b3fe15d78 and PR #1895.

Obviously there was a reason for adding the change, but it created a contradiction with the list of RETRYABLE_HTTP_ERRORS, and also violated the documented behaviour of AWS_METADATA_SERVICE_NUM_ATTEMPTS.

Workaround

AWS_METADATA_SERVICE_TIMEOUT=60 AWS_METADATA_SERVICE_NUM_ATTEMPTS=1

Setting the values above sidesteps the early exit on ReadTimeoutError: with a single attempt and a long timeout there are no retries left to be cut short.
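If the environment cannot be set at deployment level, the same values could be applied from Python early in process startup. This is only a sketch: apply_imds_workaround is a hypothetical helper name, and it assumes the variables are in place before botocore first resolves credentials.

```python
import os


def apply_imds_workaround(env=None):
    """Set the workaround values on the given mapping (default: os.environ).

    Hypothetical helper for illustration. It should run before any boto
    session is created so the metadata fetcher picks the values up.
    """
    if env is None:
        env = os.environ
    # One long attempt instead of several short ones.
    env["AWS_METADATA_SERVICE_TIMEOUT"] = "60"
    env["AWS_METADATA_SERVICE_NUM_ATTEMPTS"] = "1"
    return env


# Usage: pass a plain dict to inspect the result without touching os.environ.
settings = apply_imds_workaround({})
print(settings)
```

The trade-off is that a genuinely unreachable metadata service now takes up to 60 seconds to fail instead of 5.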

Related issues Unresolved issues found by this search might be related.

attilasimon72 avatar Feb 18 '21 18:02 attilasimon72

@stealthycoin any chance we could get some clarification? Is the plan to drop timeouts as a retrying cause and change the meaning of AWS_METADATA_SERVICE_NUM_ATTEMPTS? The lack of retrying combined with no exception getting raised downstream (#1791) causes poisoned containers in Kubernetes setups.

JustinTArthur avatar Jun 13 '21 22:06 JustinTArthur

Hi @JustinTArthur,

My apologies for the delayed response. I saw you probably had a chance to look at #1895. This change was made shortly after the rollout of IMDSv2, where changes that we made to the credential-fetching strategy broke users whose deployments were incompatible with IMDSv2 (see https://github.com/boto/botocore/issues/1892, https://github.com/boto/botocore/issues/1897 and https://github.com/aws/aws-cli/issues/4682).

At the time, ReadTimeout errors were retryable, but in each case of incompatibility customers were hitting a ReadTimeoutError, so the thought was to have this fail fast so they could quickly fall back to IMDSv1. Reverting this change would likely break people again.

I've spoken with the team about this and we're willing to consider a flag or something to that effect that would allow ReadTimeout errors to be retry-able, but we're interested in hearing more about what would be most helpful in getting around this.

stobrien89 avatar Sep 23 '21 22:09 stobrien89

https://github.com/boto/botocore/pull/1752/files

made by this

omonimus1 avatar Mar 23 '22 21:03 omonimus1

Greetings! It looks like this issue hasn’t been active in longer than five days. We encourage you to check if this is still an issue in the latest release. In the absence of more information, we will be closing this issue soon. If you find that this is still a problem, please feel free to provide a comment or upvote with a reaction on the initial post to prevent automatic closure. If the issue is already closed, please feel free to open a new one.

github-actions[bot] avatar Mar 07 '24 00:03 github-actions[bot]