
Should `ResponseError`s be retriable?

Open benesch opened this issue 3 years ago • 8 comments

I just noticed that ResponseErrors are not automatically retried by the SDK. The specific example I was surprised by was an S3 ListObjectsV2 request that failed due to an incomplete body:

ResponseError { err: hyper::Error(Body, Custom { kind: UnexpectedEof, error: IncompleteBody }), raw: Response { inner: Response { status: 200, version: HTTP/1.1, headers: {"content-type": "application/xml; charset=utf-8", "content-length": "1064", "x-amz-bucket-region": "us-east-1", "x-amzn-requestid": "B7K3ZBYJ4GOS6NNXINT08OONX033EREZYSK4I1VPIK1PU0BI9VYR", "access-control-allow-origin": "*", "last-modified": "Mon, 22 Nov 2021 04:56:57 GMT", "x-amz-request-id": "0E15FCA83A3F698A", "x-amz-id-2": "MzRISOwyjmnup0E15FCA83A3F698A7/JypPGXLh0OVFGcJaaO3KW/hRAqKOpIEEp", "accept-ranges": "bytes", "content-language": "en-US", "access-control-allow-methods": "HEAD,GET,PUT,POST,DELETE,OPTIONS,PATCH", "access-control-allow-headers": "authorization,cache-control,content-length,content-md5,content-type,etag,location,x-amz-acl,x-amz-content-sha256,x-amz-date,x-amz-request-id,x-amz-security-token,x-amz-tagging,x-amz-target,x-amz-user-agent,x-amz-version-id,x-amzn-requestid,x-localstack-target,amz-sdk-invocation-id,amz-sdk-request", "access-control-expose-headers": "etag,x-amz-version-id", "connection": "close", "date": "Mon, 22 Nov 2021 04:57:03 GMT", "server": "hypercorn-h11"}, body: SdkBody { inner: Taken, retryable: false } }, properties: SharedPropertyBag(Mutex { data: PropertyBag, poisoned: false, .. }) } }

(FWIW I'm specifically testing retry behavior with injected faults; this isn't an actual error I've observed in production.)

Is it intentional that these aren't automatically retried? I'm not actually sure what the other SDKs do in this situation. I guess it could be a problem for non-idempotent requests, because they'll get duplicated, but that can already happen with requests that are retried after a timeout, where the client doesn't know whether the server actually processed the request already.

benesch · Nov 22 '21

This is a good question. We're investigating on our side to determine how the other SDKs behave and what the correct behavior should be. It's worth noting that it's possible to completely override the retry policy by using the low-level operation API + `with_retry_policy` to drop in a different policy.
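
For illustration, a rough sketch of what dropping in a custom policy could look like. Only `with_retry_policy` comes from the sentence above; the trait name and signature, and the `make_operation`/`call` wiring in the trailing comments, are assumptions that may differ across SDK versions:

    use aws_smithy_types::retry::{ErrorKind, RetryKind};

    // Hypothetical policy that treats every failed attempt as transient (and
    // therefore retryable). The trait path and signature are assumed here.
    #[derive(Clone, Debug)]
    struct RetryEverything;

    impl<T, E> aws_smithy_http::retry::ClassifyResponse<T, E> for RetryEverything {
        fn classify(&self, response: Result<&T, &E>) -> RetryKind {
            match response {
                Ok(_) => RetryKind::Unnecessary,
                Err(_) => RetryKind::Error(ErrorKind::TransientError),
            }
        }
    }

    // Assumed wiring via the low-level operation API (names may not match your
    // SDK version exactly):
    // let op = ListObjectsV2Input::builder()
    //     .bucket("my-bucket")
    //     .build()?
    //     .make_operation(&config)
    //     .await?
    //     .with_retry_policy(RetryEverything);
    // let output = smithy_client.call(op).await?;

A policy like this would retry `ResponseError`s too, at the cost of also retrying permanent failures.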

rcoh · Nov 22 '21

As an aside, I'm curious about the framework / tooling you're using for fault generation; we'd like to use something similar in the SDK.

rcoh · Nov 22 '21

> As an aside, I'm curious about the framework / tooling you're using for fault generation; we'd like to use something similar in the SDK.

Ah yeah, it's not too complicated. It's closer to random chaos testing than it is to principled fault injection. Specifically we test that Materialize's S3 source is resilient to transient network failures by putting Toxiproxy between materialized and localstack. We add a toxic that drops out the connection after various amounts of traffic, sleep a bit, restore the connection, and then verify that Materialize still receives all the data from S3 that it ought to.
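
For anyone wanting to reproduce that kind of setup, here is a minimal sketch of driving Toxiproxy's HTTP admin API directly. This is not the Materialize harness: the ports, proxy name, and LocalStack endpoint are assumptions, and the endpoints and the `limit_data` toxic (which closes the connection after a set number of bytes) come from the Toxiproxy HTTP API documentation. Requires the `reqwest`, `serde_json`, and `tokio` crates.

    use std::time::Duration;

    // Sketch: create a proxy in front of LocalStack, inject a connection-dropping
    // toxic, wait a bit, then remove it again so the connection "heals".
    async fn inject_fault() -> Result<(), reqwest::Error> {
        let http = reqwest::Client::new();
        let admin = "http://localhost:8474"; // assumed Toxiproxy admin address

        // Route S3 traffic through Toxiproxy: clients talk to :4567, and
        // Toxiproxy forwards to LocalStack on :4566 (assumed ports).
        http.post(format!("{admin}/proxies"))
            .json(&serde_json::json!({
                "name": "s3",
                "listen": "0.0.0.0:4567",
                "upstream": "localhost:4566",
                "enabled": true
            }))
            .send()
            .await?;

        // Drop the connection after roughly 32 KiB of response data.
        http.post(format!("{admin}/proxies/s3/toxics"))
            .json(&serde_json::json!({
                "name": "cut_after_32k",
                "type": "limit_data",
                "stream": "downstream",
                "toxicity": 1.0,
                "attributes": { "bytes": 32768 }
            }))
            .send()
            .await?;

        // Let the application hit the fault for a while, then restore the link.
        tokio::time::sleep(Duration::from_secs(5)).await;
        http.delete(format!("{admin}/proxies/s3/toxics/cut_after_32k"))
            .send()
            .await?;
        Ok(())
    }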

benesch · Nov 22 '21

> This is a good question. We're investigating on our side to determine how the other SDKs behave and what the correct behavior should be. It's worth noting that it's possible to completely override the retry policy by using the low-level operation API + `with_retry_policy` to drop in a different policy.

Sounds good! For now we're just blindly retrying any errors that we see with some backoff, which works well enough: https://github.com/MaterializeInc/materialize/blob/dbf6c5df193f785e5bff364b5f21b1957a7c0322/src/dataflow/src/source/s3.rs#L316-L325. The downside is that we'll retry permanent errors a few times, but the upside is that we're 100% guaranteed to catch any transient errors.
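
In the same spirit, a minimal sketch of that kind of blind retry-with-backoff loop. This is a generic stand-in rather than the Materialize code linked above, and the commented-out usage assumes the fluent S3 client:

    use std::future::Future;
    use std::time::Duration;

    // Retry any error up to `max_attempts` times, doubling the delay after each
    // failure. It deliberately does not inspect the error, so permanent failures
    // get retried a few times too, which is the trade-off described above.
    async fn retry_with_backoff<T, E, Fut, F>(mut op: F, max_attempts: u32) -> Result<T, E>
    where
        F: FnMut() -> Fut,
        Fut: Future<Output = Result<T, E>>,
    {
        let mut delay = Duration::from_millis(100);
        let mut attempt = 0u32;
        loop {
            attempt += 1;
            match op().await {
                Ok(value) => return Ok(value),
                Err(err) if attempt >= max_attempts => return Err(err),
                Err(_) => {
                    tokio::time::sleep(delay).await;
                    delay *= 2;
                }
            }
        }
    }

    // Hypothetical usage:
    // let resp = retry_with_backoff(
    //     || client.list_objects_v2().bucket("my-bucket").send(),
    //     5,
    // ).await?;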

benesch · Nov 23 '21

Thanks again for reporting this. This should be retried automatically by the SDK, and we're working on a fix.

rcoh · Nov 26 '21

I think this was fixed in version 0.8.0 by https://github.com/awslabs/smithy-rs/pull/1197

Are you still running into this?

jdisanti · Apr 01 '22

Hmm, I'm not sure—ResponseError isn't covered in the retry policy:

        let (err, response) = match err {
            Ok(_) => return RetryKind::Unnecessary,
            // Service errors fall through to the model-driven classification below.
            Err(SdkError::ServiceError { err, raw }) => (err, raw),
            // Connection-level dispatch failures are inspected and retried when transient.
            Err(SdkError::DispatchFailure(err)) => {
                return if err.is_timeout() || err.is_io() {
                    RetryKind::Error(ErrorKind::TransientError)
                } else if let Some(ek) = err.is_other() {
                    RetryKind::Error(ek)
                } else {
                    RetryKind::UnretryableFailure
                }
            }
            // Everything else, including `ResponseError`, lands in this catch-all
            // and is never retried.
            Err(_) => return RetryKind::UnretryableFailure,
        };

The issue is that `ResponseError` is currently generic, which prevents us from downcasting it. I think we should change the inner type to `ConnectorError`.
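
For concreteness, a sketch of what that could enable in the classifier above, assuming the inner type does become `ConnectorError`. This just mirrors the `DispatchFailure` arm as an illustration; it is not necessarily the change that eventually landed in smithy-rs:

            // Hypothetical new arm: treat response/body errors like dispatch failures,
            // assuming `ResponseError` carries a `ConnectorError`.
            Err(SdkError::ResponseError { err, .. }) => {
                return if err.is_timeout() || err.is_io() {
                    RetryKind::Error(ErrorKind::TransientError)
                } else if let Some(ek) = err.is_other() {
                    RetryKind::Error(ek)
                } else {
                    RetryKind::UnretryableFailure
                }
            }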

rcoh · Apr 01 '22

From discussion with @rcoh: The inner Box<Error> on ResponseError needs to be passed into RetryClassifier.

jdisanti · Jul 20 '22

ResponseErrors are retried as of release-2022-09-21.

jdisanti · Sep 21 '22

⚠️COMMENT VISIBILITY WARNING⚠️

Comments on closed issues are hard for our team to see. If you need more assistance, please either tag a team member or open a new issue that references this one. If you wish to keep having a conversation with other community members under this issue feel free to do so.

github-actions[bot] · Sep 21 '22