aws-sdk-go-v2

InstanceTerminatedWaiter does not retry DescribeInstances properly

ghuang0570 opened this issue 2 years ago • 9 comments

Describe the bug

We are using InstanceTerminatedWaiter to wait for a few ASG-managed instances to reach the "terminated" state after we delete the ASG. However, we discovered that this wait operation often times out with the error exceeded max wait time for InstanceTerminated waiter.

Here are log lines from our code that hits this issue (with LogWaitAttempts enabled):

[2022-09-27T23:18:15Z] I0927 23:18:15.889351   10361 cluster.go:93] waiting for 7 instances to be terminated
[2022-09-27T23:18:15Z] I0927 23:18:15.889432   10361 clientset.go:138] attempting waiter request, attempt count: 1
[2022-09-27T23:18:30Z] I0927 23:18:30.963136   10361 clientset.go:138] attempting waiter request, attempt count: 2
[2022-09-27T23:18:55Z] I0927 23:18:55.725513   10361 clientset.go:138] attempting waiter request, attempt count: 3
[2022-09-27T23:33:15Z] I0927 23:33:15.890142   10361 e2e.go:66] exceeded max wait time for InstanceTerminated waiter

The max timeout is set to 15 minutes; however, we only see 3 attempts, all within the first minute of waiting. The next log line appears 15 minutes later with an error message about the max wait time being exceeded.

We've cross-checked our CloudTrail events and confirmed that the 3 logged attempts were the only DescribeInstances API calls made.

Expected Behavior

We expect the waiter to keep making attempts throughout the 15-minute max timeout and eventually detect that all instances are terminated.

Current Behavior

The InstanceTerminatedWaiter only makes 2 or 3 attempts, then it seems to get stuck and eventually times out. There are a few cases where the InstanceTerminatedWaiter actually worked, but that is happening less frequently for us.

Reproduction Steps

  1. Create an AutoScalingGroup with X instances.
  2. After the ASG has reached the desired capacity, delete the ASG.
  3. Then use InstanceTerminatedWaiter to check that the instances are terminated (a minimal sketch follows below).
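
For reference, a minimal sketch of the flow we run (the ASG name and instance ID below are placeholders, not our real values):

package main

import (
	"context"
	"log"
	"time"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/autoscaling"
	"github.com/aws/aws-sdk-go-v2/service/ec2"
)

func main() {
	cfg, err := config.LoadDefaultConfig(context.TODO())
	if err != nil {
		log.Fatalf("unable to load SDK config, %v", err)
	}

	// Delete the ASG; ForceDelete tells Auto Scaling to terminate its instances.
	asgClient := autoscaling.NewFromConfig(cfg)
	if _, err := asgClient.DeleteAutoScalingGroup(context.TODO(), &autoscaling.DeleteAutoScalingGroupInput{
		AutoScalingGroupName: aws.String("my-asg"), // placeholder name
		ForceDelete:          aws.Bool(true),
	}); err != nil {
		log.Fatalf("delete ASG: %v", err)
	}

	// Wait for the instance IDs collected beforehand to reach "terminated".
	ids := []string{"i-0123456789abcdef0"} // placeholder IDs
	waiter := ec2.NewInstanceTerminatedWaiter(ec2.NewFromConfig(cfg))
	if err := waiter.Wait(context.TODO(), &ec2.DescribeInstancesInput{InstanceIds: ids}, 15*time.Minute); err != nil {
		log.Fatalf("wait for termination: %v", err)
	}
}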

Possible Solution

No response

Additional Information/Context

Here's the code snippet of how we use the waiter:

func enableLogger(o *awsec2.InstanceTerminatedWaiterOptions) {
	o.LogWaitAttempts = true
}

func (c Client) WaitForInstanceTerminated(ctx context.Context, ids []string) error {
	// enableLogger turns on LogWaitAttempts so each waiter attempt is logged.
	waiter := awsec2.NewInstanceTerminatedWaiter(c.EC2API, enableLogger)
	log.V(1).Infof("waiting for %d instances to be terminated", len(ids))
	if err := waiter.Wait(ctx, &awsec2.DescribeInstancesInput{InstanceIds: ids}, 15*time.Minute); err != nil {
		return err
	}
	return nil
}

AWS Go SDK V2 Module Versions Used

github.com/aws/[email protected]

Compiler and Version used

go version go1.19.1

Operating System and version

Debian 11 (bullseye)/amd64

ghuang0570 avatar Sep 28 '22 00:09 ghuang0570

Also, I think this has something to do with the DescribeInstances API call hanging without ever timing out.

ghuang0570 avatar Sep 28 '22 17:09 ghuang0570

Hi @gh-ghuang ,

After investigating this issue, I'm pretty confident this is not an SDK-related bug.

The waiter will hang until a response comes back from the server, whereas the Retryer will retry up to 2 additional times if a retryable error is returned:

Wait Attempt 1
-- API Invoke Attempt #1 => Got an API error that can be retried
-- API Invoke Attempt #2 => Got success response from service

First wait attempt did not find instances in the expected state
Wait Attempt 2
-- API Invoke Attempt #1 => Got an API error that can be retried
-- API Invoke Attempt #2 => Got an API error that can be retried
-- API Invoke Attempt #3 => Got success response from service

Second wait attempt found instances in terminated state
Stop

So the service has returned an error. You are expecting the retryer to retry more times, but what you really want to do is investigate why the service is returning an error in the first place. I suggest you do that either by enabling retry logging:

	cfg, err := config.LoadDefaultConfig(context.TODO(), config.WithRegion("us-east-1"), config.WithClientLogMode(aws.LogRetries))
	if err != nil {
		log.Fatalf("unable to load SDK config, %v", err)
	}

or by putting a break point here and seeing what kind of error the service has returned.
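
(As an aside: if you did want more retry attempts per API call, that is configured on the client's Retryer rather than on the waiter. A minimal sketch, assuming the standard retryer from github.com/aws/aws-sdk-go-v2/aws/retry:)

	cfg, err := config.LoadDefaultConfig(context.TODO(),
		config.WithRetryer(func() aws.Retryer {
			// The standard retryer defaults to 3 attempts per API call; raise it to 5.
			return retry.AddWithMaxAttempts(retry.NewStandard(), 5)
		}),
	)
	if err != nil {
		log.Fatalf("unable to load SDK config, %v", err)
	}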

Also, I think this has something to do with the DescribeInstances API call hanging without ever timing out.

The waiter will hang for up to the amount of time you tell it to (in your case 15 minutes) or until it gets a response from the service API.

Let me know if that helps, Thanks, Ran~

RanVaknin avatar Sep 29 '22 17:09 RanVaknin

Should https://github.com/aws/aws-sdk-go-v2/blob/e4f0cba14e534e673c643841b1cd140c9470d85b/service/ec2/api_op_DescribeInstances.go?rgh-link-date=2022-09-29T17%3A48%3A57Z#L1299 which is used in https://github.com/aws/aws-sdk-go-v2/blob/e4f0cba14e534e673c643841b1cd140c9470d85b/service/ec2/api_op_DescribeInstances.go?rgh-link-date=2022-09-29T17%3A48%3A57Z#L1318 use maxWaitDur as the context timeout for a particular request?

I wonder if it would be better if the per-request timeout were limited to, say, 30s (maybe we could create a per-request ctx). We pass 15 minutes as the maxWaitDur, but a single DescribeInstances call should take much less time than that. I wonder if network issues or something similar are causing the request to never return, leaving the waiter hung on a single request.

rohitagarwal003 avatar Sep 29 '22 22:09 rohitagarwal003

@RanVaknin I think the behavior of a single request hanging during a wait until the maxWaitDur runs out is kind of surprising. The waiter would be more intuitive if each DescribeInstances request had a lower per-request limit.

And what could the underlying issue be that causes the request to take 15 minutes to finish? Is it an expected SLA for this API to take that long?

ghuang0570 avatar Sep 29 '22 22:09 ghuang0570

@gh-ghuang ,

  • Please enable logging and share the logs here so we can have a better idea of what is happening. pass this to your client config: config.WithClientLogMode(aws.LogRetries | aws.LogAttempts)

  • Do you have any custom Retryer implementation on the client used with the waiter?

  • Are you providing any sort of custom http client configurations for timeouts etc?

Thanks, Ran~

RanVaknin avatar Sep 29 '22 23:09 RanVaknin

Please enable logging and share the logs here so we can have a better idea of what is happening. pass this to your client config: config.WithClientLogMode(aws.LogRetries | aws.LogAttempts)

Here are some log traces I found after using aws.LogRetries | aws.LogRequestWithBody | aws.LogResponseWithBody and setting InstanceTerminatedWaiterOptions.LogWaitAttempts = true.

[2022-09-30T20:09:14Z] I0930 20:09:14.712410    7720 clientset.go:138] attempting waiter request, attempt count: 1
[2022-09-30T20:09:29Z] I0930 20:09:29.784497    7720 clientset.go:138] attempting waiter request, attempt count: 2
[2022-09-30T20:09:54Z] I0930 20:09:54.940608    7720 clientset.go:138] attempting waiter request, attempt count: 3
[2022-09-30T20:24:14Z] I0930 20:24:14.713233    7720 e2e.go:66] failed to tear down the cell or account: failed to tear down cell: exceeded max wait time for InstanceTerminated waiter

For some reason, aws.LogRetries | aws.LogRequestWithBody | aws.LogResponseWithBody was NOT producing more log lines than just setting InstanceTerminatedWaiterOptions.LogWaitAttempts = true.

The following is what I got when I only had aws.LogRetries | aws.LogRequestWithBody | aws.LogResponseWithBody enabled:

[2022-09-30T18:55:30Z] I0930 18:55:30.353213   11914 cluster.go:88] terminating all EC2 instances for cluster mgmt
[2022-09-30T19:10:30Z] I0930 19:10:30.856543   11914 e2e.go:66] failed to tear down the cell or account: failed to tear down cell: failed to describe instances: operation error EC2: DescribeInstances, https response error StatusCode: 0, RequestID: , canceled, context deadline

Please note the 15 minutes that elapsed between the two log lines, which is the maxWaitDur.

Do you have any custom Retryer implementation on the client used with the waiter?

No, not used in this case.

Are you providing any sort of custom http client configurations for timeouts etc?

No, we don't have custom http configs, we just use the default.

ghuang0570 avatar Sep 30 '22 20:09 ghuang0570

Hi @gh-ghuang ,

Thanks for your response. This is odd. Are you sure you are enabling logging correctly? Here is a code sample with logging enabled.

package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/ec2"
)

func main() {
	cfg, err := config.LoadDefaultConfig(context.TODO(), config.WithRegion("us-east-1"), config.WithClientLogMode(aws.LogRetries|aws.LogRequestWithBody|aws.LogResponseWithBody))
	if err != nil {
		log.Fatalf("unable to load SDK config, %v", err)
	}

	client := ec2.NewFromConfig(cfg)

	waiter := ec2.NewInstanceTerminatedWaiter(client, func(options *ec2.InstanceTerminatedWaiterOptions) {
		options.LogWaitAttempts = true
	})

	err = waiter.Wait(context.Background(), &ec2.DescribeInstancesInput{InstanceIds: []string{"instanceID"}}, 15*time.Minute)
	if err != nil {
		panic(err)
	}
	fmt.Println("terminated successfully...")
}

It's a shot in the dark without logging, but here are some steps you can try:

No, we don't have custom http configs, we just use the default.

You might want to try configuring your client's HTTP options like so:

	// Note: "http" here is assumed to be "github.com/aws/aws-sdk-go-v2/aws/transport/http"
	// (the SDK's BuildableClient), and "http2" is an alias for the standard "net/http" package.
	client := ec2.NewFromConfig(cfg, func(options *ec2.Options) {
		options.HTTPClient = http.NewBuildableClient().WithDialerOptions(func(dialer *net.Dialer) {
			dialer.Timeout = 5 * time.Second
		}).WithTransportOptions(func(transport *http2.Transport) {
			transport.TLSHandshakeTimeout = 5 * time.Second
		})
	})

Try playing around with the dialer timeout and/or the TLS handshake timeout. Maybe that will help?
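
One more knob worth mentioning: if the hang happens while waiting on the response rather than during connection setup, the dialer/TLS timeouts won't bound it, but an overall client timeout would. A rough sketch, assuming the SDK's BuildableClient exposes WithTimeout (aliased awshttp below):

	// awshttp is "github.com/aws/aws-sdk-go-v2/aws/transport/http"
	client := ec2.NewFromConfig(cfg, func(options *ec2.Options) {
		// Caps the entire round trip: connect, TLS handshake, and waiting for the response.
		options.HTTPClient = awshttp.NewBuildableClient().WithTimeout(30 * time.Second)
	})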

Like I said, until we have logs I can't really tell.

RanVaknin avatar Oct 03 '22 18:10 RanVaknin

[2022-09-30T19:10:30Z] I0930 19:10:30.856543   11914 e2e.go:66] failed to tear down the cell or account: failed to tear down cell: failed to describe instances: operation error EC2: DescribeInstances, https response error StatusCode: 0, RequestID: , canceled, context deadline

Does this log line look like what it should when aws.LogRetries|aws.LogRequestWithBody|aws.LogResponseWithBody is enabled? I got that one log line after OR'ing the three modes together. Without them, I wouldn't even see the RequestID: part.

Is an empty request ID suspicious to you?

Re: client config. I'm not sure we want to set a client config globally like this; we might end up tuning it forever to make all the APIs we use work smoothly. It would be nicer to configure this on a per-request basis. IIRC, the SDK is designed so that context.WithTimeout() serves as the per-request timeout. However, for a waiter (which is expected to apply an individual timeout each time it polls), this pattern doesn't work.
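
For a single call that pattern is straightforward, roughly like this sketch (client and ids stand in for our EC2 client and instance IDs):

	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()
	// The deadline bounds this one DescribeInstances call only.
	_, err := client.DescribeInstances(ctx, &ec2.DescribeInstancesInput{InstanceIds: ids})

But with waiter.Wait there is only one context for the entire wait, so there is no way to express a per-attempt deadline this way.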

ghuang0570 avatar Oct 03 '22 18:10 ghuang0570

Also, I have filed a support ticket and attached the request ID from one of the failure cases. I will update here with findings from that support case.

ghuang0570 avatar Oct 03 '22 19:10 ghuang0570

Also, I have filed a support ticket and attached the request ID from one of the failure cases. I will update here with findings from that support case.

A support ticket with AWS? Those eventually end up at our internal queue so I don't know if that's very helpful.

Can you please share some of your code? I'm not sure why you are not seeing any logs.

Thanks, Ran~

RanVaknin avatar Oct 05 '22 21:10 RanVaknin

Can you please share some of your code? I'm not sure why you are not seeing any logs.

I figured out why the log lines are missing. We have our own logger, and I wasn't setting the right log level.

Please allow me some time to reproduce this. I will share the log lines when I hit this again. Somehow the latency has improved and this is happening less frequently this week. I will keep running it until I hit one.

ghuang0570 avatar Oct 05 '22 23:10 ghuang0570

@RanVaknin

I was able to reproduce this with (LogRetries | LogRequest | LogResponse). See the following log lines for the 2nd and 3rd attempts. The 2nd attempt went through fine, but some instances didn't meet the condition, so we made a 3rd attempt. The 3rd attempt seems to hang while waiting on the response, so only the request is logged.

[2022-10-06T02:39:39Z] I1006 02:39:39.164108    5561 clientset.go:138] attempting waiter request, attempt count: 2
[2022-10-06T02:39:39Z] I1006 02:39:39.164325    5561 clientset.go:138] Request
[2022-10-06T02:39:39Z] POST / HTTP/1.1
[2022-10-06T02:39:39Z] Host: ec2.us-east-1.amazonaws.com
[2022-10-06T02:39:39Z] User-Agent: aws-sdk-go-v2/1.16.16 os/linux lang/go/1.19.1 md/GOOS/linux md/GOARCH/amd64 api/ec2/1.58.0
[2022-10-06T02:39:39Z] Content-Length: 274
[2022-10-06T02:39:39Z] Amz-Sdk-Invocation-Id: 84ab2200-ac9a-40c8-ab1a-fc6158aa01e2
[2022-10-06T02:39:39Z] Amz-Sdk-Request: attempt=1; max=3
[2022-10-06T02:39:39Z] Authorization: <redacted>
[2022-10-06T02:39:39Z] Content-Type: application/x-www-form-urlencoded
[2022-10-06T02:39:39Z] X-Amz-Date: 20221006T023939Z
[2022-10-06T02:39:39Z] X-Amz-Security-Token: <redacted>
[2022-10-06T02:39:39Z] I1006 02:39:39.307642    5561 clientset.go:138] Response
[2022-10-06T02:39:39Z] HTTP/1.1 200 OK
[2022-10-06T02:39:39Z] Transfer-Encoding: chunked
[2022-10-06T02:39:39Z] Cache-Control: no-cache, no-store
[2022-10-06T02:39:39Z] Content-Type: text/xml;charset=UTF-8
[2022-10-06T02:39:39Z] Date: Thu, 06 Oct 2022 02:39:38 GMT
[2022-10-06T02:39:39Z] Server: AmazonEC2
[2022-10-06T02:39:39Z] Strict-Transport-Security: max-age=31536000; includeSubDomains
[2022-10-06T02:39:39Z] Vary: accept-encoding
[2022-10-06T02:39:39Z] X-Amzn-Requestid: fd31281d-6479-42d3-b6bc-0fbd2f45cae2
[2022-10-06T02:39:39Z]
[2022-10-06T02:40:41Z] I1006 02:40:41.192427    5610 clientset.go:138] attempting waiter request, attempt count: 3
[2022-10-06T02:40:41Z] I1006 02:40:41.192703    5610 clientset.go:138] Request
[2022-10-06T02:40:41Z] POST / HTTP/1.1
[2022-10-06T02:40:41Z] Host: ec2.us-east-1.amazonaws.com
[2022-10-06T02:40:41Z] User-Agent: aws-sdk-go-v2/1.16.16 os/linux lang/go/1.19.1 md/GOOS/linux md/GOARCH/amd64 api/ec2/1.58.0
[2022-10-06T02:40:41Z] Content-Length: 274
[2022-10-06T02:40:41Z] Amz-Sdk-Invocation-Id: 3e12c19f-7010-449e-b685-9f51e198e2f1
[2022-10-06T02:40:41Z] Amz-Sdk-Request: attempt=1; max=3
[2022-10-06T02:40:41Z] Authorization: <redacted>
[2022-10-06T02:40:41Z] Content-Type: application/x-www-form-urlencoded
[2022-10-06T02:40:41Z] X-Amz-Date: 20221006T024041Z
[2022-10-06T02:40:41Z] X-Amz-Security-Token: <redacted>
[2022-10-06T02:40:41Z] Accept-Encoding: gzip
[2022-10-06T02:40:41Z]
[2022-10-06T02:55:04Z] I1006 02:55:04.986976    5610 clientset.go:138] request failed with unretryable error https response error StatusCode: 0, RequestID: , canceled, context deadline exceeded

It's definitely hanging, and I don't think the request log here provides enough insight into why the request hangs.

I think that without a per-attempt timeout setting inside the waiter, this will keep happening occasionally, and we end up spending the entire context timeout on a single attempt.

ghuang0570 avatar Oct 06 '22 05:10 ghuang0570

Hi @gh-ghuang ,

What you are asking for is quite difficult for me to test because your resource setup is fairly involved.

In theory you could add a middleware to the last stage of the client request construction (the finalize step). In that middleware we set a 10-second context timeout, so whenever the client makes an API request the attempt is capped at 10 seconds.

	myMiddleware := middleware.FinalizeMiddlewareFunc("MyTestMiddleware",
		func(ctx context.Context, input middleware.FinalizeInput, next middleware.FinalizeHandler,
		) (
			output middleware.FinalizeOutput, metadata middleware.Metadata, err error,
		) {
			// Derive from the incoming ctx so the waiter's overall deadline still applies,
			// but cap this single attempt at 10 seconds.
			ctx2, cancel := context.WithTimeout(ctx, 10*time.Second)
			defer cancel()
			return next.HandleFinalize(ctx2, input)
		})

	client := ec2.NewFromConfig(cfg)

	waiter := ec2.NewInstanceTerminatedWaiter(client, func(options *ec2.InstanceTerminatedWaiterOptions) {
		options.LogWaitAttempts = true
		options.APIOptions = append(options.APIOptions, func(stack *middleware.Stack) error {
			// Append the middleware to the end of the Finalize step.
			return stack.Finalize.Add(myMiddleware, middleware.After)
		})
	})

Let me know if this helps.

Thanks, Ran

RanVaknin avatar Oct 15 '22 08:10 RanVaknin

@gh-ghuang,

By the way, I received your internal ticket. Since we are corresponding here, do you think you could close that one with the support team?

Thanks, Ran~

RanVaknin avatar Oct 15 '22 10:10 RanVaknin

This issue has not received a response in 1 week. If you want to keep this issue open, please just leave a comment below and auto-close will be canceled.

github-actions[bot] avatar Oct 18 '22 00:10 github-actions[bot]

@RanVaknin

I don't think the middleware solution is graceful enough for us, since it changes the behavior for all requests made with that client. In our code base we construct one client and reuse it throughout the entire execution, so effectively this is the same as configuring it at the HTTP client level. I still think the waiter should provide an option to time out each attempt, separate from the overall context timeout.

IIUC, the way the AWS Go SDK uses the context timeout works perfectly fine for a single request. But for a compound operation that may make multiple requests under a single context timeout, not being able to control the per-attempt timeout is very counterintuitive.
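
For reference, these are roughly the knobs the waiter exposes today (a sketch; field names as generated for the EC2 waiter, defaults as I understand them), and none of them bounds a single attempt:

	waiter := awsec2.NewInstanceTerminatedWaiter(client, func(o *awsec2.InstanceTerminatedWaiterOptions) {
		o.MinDelay = 15 * time.Second  // minimum delay between polls (default)
		o.MaxDelay = 120 * time.Second // maximum delay between polls (default)
		o.LogWaitAttempts = true
		// o.Retryable decides whether to keep waiting based on the response or error;
		// nothing here limits how long a single DescribeInstances call may take.
	})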

re: AWS support ticket, I'd like to keep that one open for a little bit.

ghuang0570 avatar Oct 18 '22 16:10 ghuang0570

Hi @gh-ghuang ,

I will discuss it with the team and see if we can add this as a feature request to the backlog. In the spirit of full transparency: since there's a workaround and there is no bandwidth right now for new feature requests, if this is added to the backlog it will only get addressed some time in the future.

Just a friendly reminder: the SDK is an open-source project and we accept pull requests from community members. If you are able to push a fix for the desired behavior yourself, I strongly encourage you to do so.

I will give my update to the technical account manager handling your internal case.

Thank you very much, Ran

RanVaknin avatar Oct 18 '22 17:10 RanVaknin

⚠️COMMENT VISIBILITY WARNING⚠️

Comments on closed issues are hard for our team to see. If you need more assistance, please either tag a team member or open a new issue that references this one. If you wish to keep having a conversation with other community members under this issue feel free to do so.

github-actions[bot] avatar Oct 18 '22 17:10 github-actions[bot]

@RanVaknin I closed the support ticket in favor of moving the discussion here.

Thanks for making this a feature request. I understand that it will sit in the backlog and be addressed in the future, and it's okay for us to wait for the AWS team to add this to the SDK in a future version.

ghuang0570 avatar Oct 18 '22 18:10 ghuang0570

I don't think the middleware solution is graceful enough for us, since it changes the behavior for all requests made with that client.

That's not the case in @RanVaknin's example though - the middleware there is applied to the waiter's APIOptions, not to the whole client. The imposed timeout will only be enforced on API calls made by the waiter, which accomplishes the desired behavior. I recognize this issue has shifted to a feature request for the ability to configure a per-operation timeout in waiters, but it's very much possible to get that behavior today without direct configuration support from us, and it can absolutely be done without applying a blanket timeout to the client.

Much like @RanVaknin said I'm concerned that we're talking past the real issue here i.e. the fact that DescribeInstances apparently decides to not respond for 15 minutes. I don't necessarily take it as a given that it would decide to respond if we timed out prematurely and tried again, though it's a possibility.

@ghuang0570 I realize this issue has sat for a while - was there any followup on this on your end?

lucix-aws avatar Nov 30 '23 14:11 lucix-aws

As an aside, this can be done without having to deal with middleware injection - remember the client itself is an input to the waiter, so we can just wrap one to enforce the timeout:

type timed struct {
    client ec2.DescribeInstancesAPIClient
}

func (t *timed) DescribeInstances(ctx context.Context, in *ec2.DescribeInstancesInput, opts ...func(*ec2.Options)) (*ec2.DescribeInstancesOutput, error) {
    ctx, cancel := context.WithTimeout(ctx, 10*time.Second) // or make it configurable
    defer cancel()

    return t.client.DescribeInstances(ctx, in, opts...)
}

waiter := ec2.NewInstanceTerminatedWaiter(&timed{client: client})

lucix-aws avatar Nov 30 '23 14:11 lucix-aws

This issue has not received a response in 1 week. If you want to keep this issue open, please just leave a comment below and auto-close will be canceled.

github-actions[bot] avatar Dec 03 '23 00:12 github-actions[bot]

⚠️COMMENT VISIBILITY WARNING⚠️

Comments on closed issues are hard for our team to see. If you need more assistance, please either tag a team member or open a new issue that references this one. If you wish to keep having a conversation with other community members under this issue feel free to do so.

github-actions[bot] avatar Dec 07 '23 15:12 github-actions[bot]
