
Should `ResponseError`s be retriable?

Open benesch opened this issue 3 years ago • 8 comments

I just noticed that ResponseErrors are not automatically retried by the SDK. The specific example I was surprised by was an S3 ListObjectsV2 request that failed due to an incomplete body:

ResponseError { err: hyper::Error(Body, Custom { kind: UnexpectedEof, error: IncompleteBody }), raw: Response { inner: Response { status: 200, version: HTTP/1.1, headers: {"content-type": "application/xml; charset=utf-8", "content-length": "1064", "x-amz-bucket-region": "us-east-1", "x-amzn-requestid": "B7K3ZBYJ4GOS6NNXINT08OONX033EREZYSK4I1VPIK1PU0BI9VYR", "access-control-allow-origin": "*", "last-modified": "Mon, 22 Nov 2021 04:56:57 GMT", "x-amz-request-id": "0E15FCA83A3F698A", "x-amz-id-2": "MzRISOwyjmnup0E15FCA83A3F698A7/JypPGXLh0OVFGcJaaO3KW/hRAqKOpIEEp", "accept-ranges": "bytes", "content-language": "en-US", "access-control-allow-methods": "HEAD,GET,PUT,POST,DELETE,OPTIONS,PATCH", "access-control-allow-headers": "authorization,cache-control,content-length,content-md5,content-type,etag,location,x-amz-acl,x-amz-content-sha256,x-amz-date,x-amz-request-id,x-amz-security-token,x-amz-tagging,x-amz-target,x-amz-user-agent,x-amz-version-id,x-amzn-requestid,x-localstack-target,amz-sdk-invocation-id,amz-sdk-request", "access-control-expose-headers": "etag,x-amz-version-id", "connection": "close", "date": "Mon, 22 Nov 2021 04:57:03 GMT", "server": "hypercorn-h11"}, body: SdkBody { inner: Taken, retryable: false } }, properties: SharedPropertyBag(Mutex { data: PropertyBag, poisoned: false, .. }) } }

(FWIW I'm specifically testing retry behavior with injected faults; this isn't an actual error I've observed in production.)

Is it intentional that these aren't automatically retried? I'm not actually sure what the other SDKs do in this situation. I guess it could be a problem for non-idempotent requests, because they'll get duplicated, but that can already happen with requests that are retried after a timeout, where the client doesn't know whether the server actually processed the request already.

benesch · Nov 22 '21

This is a good question. We're investigating on our side to determine how the other SDKs behave and what the correct behavior should be. It's worth noting that it's possible to completely override the retry policy by using the low-level operation API + `with_retry_policy` to drop in a different policy.
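
For illustration, a rough sketch of what dropping in a custom policy could look like. Only `with_retry_policy` comes from the sentence above; the trait name and signature, and the `make_operation`/`call` wiring in the trailing comments, are assumptions that may differ across SDK versions:

    use aws_smithy_types::retry::{ErrorKind, RetryKind};

    // Hypothetical policy that treats every failed attempt as transient (and
    // therefore retryable). The trait path and signature are assumed here.
    #[derive(Clone, Debug)]
    struct RetryEverything;

    impl<T, E> aws_smithy_http::retry::ClassifyResponse<T, E> for RetryEverything {
        fn classify(&self, response: Result<&T, &E>) -> RetryKind {
            match response {
                Ok(_) => RetryKind::Unnecessary,
                Err(_) => RetryKind::Error(ErrorKind::TransientError),
            }
        }
    }

    // Assumed wiring via the low-level operation API (names may not match your
    // SDK version exactly):
    // let op = ListObjectsV2Input::builder()
    //     .bucket("my-bucket")
    //     .build()?
    //     .make_operation(&config)
    //     .await?
    //     .with_retry_policy(RetryEverything);
    // let output = smithy_client.call(op).await?;

A policy like this would retry `ResponseError`s too, at the cost of also retrying permanent failures.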

rcoh · Nov 22 '21

As an aside, I'm curious about the framework / tooling you're using for fault generation; we'd like to use something similar in the SDK.

rcoh · Nov 22 '21

> As an aside, I'm curious about the framework / tooling you're using for fault generation; we'd like to use something similar in the SDK.

Ah yeah, it's not too complicated. It's closer to random chaos testing than it is to principled fault injection. Specifically we test that Materialize's S3 source is resilient to transient network failures by putting Toxiproxy between materialized and localstack. We add a toxic that drops out the connection after various amounts of traffic, sleep a bit, restore the connection, and then verify that Materialize still receives all the data from S3 that it ought to.
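
For anyone wanting to reproduce that kind of setup, here is a minimal sketch of driving Toxiproxy's HTTP admin API directly. This is not the Materialize harness: the ports, proxy name, and LocalStack endpoint are assumptions, and the endpoints and the `limit_data` toxic (which closes the connection after a set number of bytes) come from the Toxiproxy HTTP API documentation. Requires the `reqwest`, `serde_json`, and `tokio` crates.

    use std::time::Duration;

    // Sketch: create a proxy in front of LocalStack, inject a connection-dropping
    // toxic, wait a bit, then remove it again so the connection "heals".
    async fn inject_fault() -> Result<(), reqwest::Error> {
        let http = reqwest::Client::new();
        let admin = "http://localhost:8474"; // assumed Toxiproxy admin address

        // Route S3 traffic through Toxiproxy: clients talk to :4567, and
        // Toxiproxy forwards to LocalStack on :4566 (assumed ports).
        http.post(format!("{admin}/proxies"))
            .json(&serde_json::json!({
                "name": "s3",
                "listen": "0.0.0.0:4567",
                "upstream": "localhost:4566",
                "enabled": true
            }))
            .send()
            .await?;

        // Drop the connection after roughly 32 KiB of response data.
        http.post(format!("{admin}/proxies/s3/toxics"))
            .json(&serde_json::json!({
                "name": "cut_after_32k",
                "type": "limit_data",
                "stream": "downstream",
                "toxicity": 1.0,
                "attributes": { "bytes": 32768 }
            }))
            .send()
            .await?;

        // Let the application hit the fault for a while, then restore the link.
        tokio::time::sleep(Duration::from_secs(5)).await;
        http.delete(format!("{admin}/proxies/s3/toxics/cut_after_32k"))
            .send()
            .await?;
        Ok(())
    }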

benesch · Nov 22 '21

> This is a good question. We're investigating on our side to determine how the other SDKs behave and what the correct behavior should be. It's worth noting that it's possible to completely override the retry policy by using the low-level operation API + `with_retry_policy` to drop in a different policy.

Sounds good! For now we're just blindly retrying any errors that we see with some backoff, which works well enough: https://github.com/MaterializeInc/materialize/blob/dbf6c5df193f785e5bff364b5f21b1957a7c0322/src/dataflow/src/source/s3.rs#L316-L325. The downside is that we'll retry permanent errors a few times, but the upside is that we're 100% guaranteed to catch any transient errors.
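
In the same spirit, a minimal sketch of that kind of blind retry-with-backoff loop. This is a generic stand-in rather than the Materialize code linked above, and the commented-out usage assumes the fluent S3 client:

    use std::future::Future;
    use std::time::Duration;

    // Retry any error up to `max_attempts` times, doubling the delay after each
    // failure. It deliberately does not inspect the error, so permanent failures
    // get retried a few times too, which is the trade-off described above.
    async fn retry_with_backoff<T, E, Fut, F>(mut op: F, max_attempts: u32) -> Result<T, E>
    where
        F: FnMut() -> Fut,
        Fut: Future<Output = Result<T, E>>,
    {
        let mut delay = Duration::from_millis(100);
        let mut attempt = 0u32;
        loop {
            attempt += 1;
            match op().await {
                Ok(value) => return Ok(value),
                Err(err) if attempt >= max_attempts => return Err(err),
                Err(_) => {
                    tokio::time::sleep(delay).await;
                    delay *= 2;
                }
            }
        }
    }

    // Hypothetical usage:
    // let resp = retry_with_backoff(
    //     || client.list_objects_v2().bucket("my-bucket").send(),
    //     5,
    // ).await?;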

benesch · Nov 23 '21

Thanks again for reporting this. This should be retried automatically by the SDK, and we're working on a fix.

rcoh · Nov 26 '21

I think this was fixed in version 0.8.0 by https://github.com/awslabs/smithy-rs/pull/1197

Are you still running into this?

jdisanti · Apr 01 '22

Hmm, I'm not sure—ResponseError isn't covered in the retry policy:

        let (err, response) = match err {
            Ok(_) => return RetryKind::Unnecessary,
            // Service errors fall through to the model-driven classification below.
            Err(SdkError::ServiceError { err, raw }) => (err, raw),
            // Connection-level dispatch failures are inspected and retried when transient.
            Err(SdkError::DispatchFailure(err)) => {
                return if err.is_timeout() || err.is_io() {
                    RetryKind::Error(ErrorKind::TransientError)
                } else if let Some(ek) = err.is_other() {
                    RetryKind::Error(ek)
                } else {
                    RetryKind::UnretryableFailure
                }
            }
            // Everything else, including `ResponseError`, lands in this catch-all
            // and is never retried.
            Err(_) => return RetryKind::UnretryableFailure,
        };

The issue is that `ResponseError` is currently generic, which prevents us from downcasting it. I think we should change the inner type to `ConnectorError`.
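
For concreteness, a sketch of what that could enable in the classifier above, assuming the inner type does become `ConnectorError`. This just mirrors the `DispatchFailure` arm as an illustration; it is not necessarily the change that eventually landed in smithy-rs:

            // Hypothetical new arm: treat response/body errors like dispatch failures,
            // assuming `ResponseError` carries a `ConnectorError`.
            Err(SdkError::ResponseError { err, .. }) => {
                return if err.is_timeout() || err.is_io() {
                    RetryKind::Error(ErrorKind::TransientError)
                } else if let Some(ek) = err.is_other() {
                    RetryKind::Error(ek)
                } else {
                    RetryKind::UnretryableFailure
                }
            }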

rcoh · Apr 01 '22

From discussion with @rcoh: The inner Box<Error> on ResponseError needs to be passed into RetryClassifier.

jdisanti · Jul 20 '22

ResponseErrors are retried as of release-2022-09-21.

jdisanti · Sep 21 '22

⚠️COMMENT VISIBILITY WARNING⚠️

Comments on closed issues are hard for our team to see. If you need more assistance, please either tag a team member or open a new issue that references this one. If you wish to keep having a conversation with other community members under this issue feel free to do so.

github-actions[bot] · Sep 21 '22