aws-sdk-rust
aws-sdk-rust copied to clipboard
Should `ResponseError`s be retriable?
I just noticed that ResponseErrors are not automatically retried by the SDK. The specific example I was surprised by was an S3 ListObjectsV2 request that failed due to an incomplete body:
ResponseError { err: hyper::Error(Body, Custom { kind: UnexpectedEof, error: IncompleteBody }), raw: Response { inner: Response { status: 200, version: HTTP/1.1, headers: {"content-type": "application/xml; charset=utf-8", "content-length": "1064", "x-amz-bucket-region": "us-east-1", "x-amzn-requestid": "B7K3ZBYJ4GOS6NNXINT08OONX033EREZYSK4I1VPIK1PU0BI9VYR", "access-control-allow-origin": "*", "last-modified": "Mon, 22 Nov 2021 04:56:57 GMT", "x-amz-request-id": "0E15FCA83A3F698A", "x-amz-id-2": "MzRISOwyjmnup0E15FCA83A3F698A7/JypPGXLh0OVFGcJaaO3KW/hRAqKOpIEEp", "accept-ranges": "bytes", "content-language": "en-US", "access-control-allow-methods": "HEAD,GET,PUT,POST,DELETE,OPTIONS,PATCH", "access-control-allow-headers": "authorization,cache-control,content-length,content-md5,content-type,etag,location,x-amz-acl,x-amz-content-sha256,x-amz-date,x-amz-request-id,x-amz-security-token,x-amz-tagging,x-amz-target,x-amz-user-agent,x-amz-version-id,x-amzn-requestid,x-localstack-target,amz-sdk-invocation-id,amz-sdk-request", "access-control-expose-headers": "etag,x-amz-version-id", "connection": "close", "date": "Mon, 22 Nov 2021 04:57:03 GMT", "server": "hypercorn-h11"}, body: SdkBody { inner: Taken, retryable: false } }, properties: SharedPropertyBag(Mutex { data: PropertyBag, poisoned: false, .. }) } }
(FWIW I'm specifically testing retry behavior with injected faults; this isn't an actual error I've observed in production.)
Is it intentional that these aren't automatically retried? I'm not actually sure what the other SDKs do in this situation. I guess it could be a problem for non-idempotent requests, because they'll get duplicated, but that can already happen with requests that are retried after a timeout, where the client doesn't know whether the server actually processed the request already.
this is a good question. We're investigating on our side to determine how the other SDKs behave and what the correct behavior should be. It's worth noting that it's possible to completely override the retry policy by using the low-level operation API + with_retry_policy to drop in a different policy.
As an aside, I'm curious about the framework / tooling you're using for fault generation, we'd like to use something similar in the SDK.
As an aside, I'm curious about the framework / tooling you're using for fault generation, we'd like to use something similar in the SDK.
Ah yeah, it's not too complicated. It's closer to random chaos testing than it is to principled fault injection. Specifically we test that Materialize's S3 source is resilient to transient network failures by putting Toxiproxy between materialized and localstack. We add a toxic that drops out the connection after various amounts of traffic, sleep a bit, restore the connection, and then verify that Materialize still receives all the data from S3 that it ought to.
this is a good question. We're investigating on our side to determine how the other SDKs behave and what the correct behavior should be. It's worth noting that it's possible to completely override the retry policy by using the low-level operation API +
with_retry_policyto drop in a different policy.
Sounds good! For now we're just blindly retrying any errors that we see with some backoff, which works well enough: https://github.com/MaterializeInc/materialize/blob/dbf6c5df193f785e5bff364b5f21b1957a7c0322/src/dataflow/src/source/s3.rs#L316-L325. The downside is that we'll retry permanent errors a few times, but the upside is that we're 100% guaranteed to catch any transient errors.
thanks again for reporting this. this should be retried automatically by the SDK and we're working on a fix
I think this was fixed in version 0.8.0 by https://github.com/awslabs/smithy-rs/pull/1197
Are you still running into this?
Hmm, I'm not sure—ResponseError isn't covered in the retry policy:
let (err, response) = match err {
Ok(_) => return RetryKind::Unnecessary,
Err(SdkError::ServiceError { err, raw }) => (err, raw),
Err(SdkError::DispatchFailure(err)) => {
return if err.is_timeout() || err.is_io() {
RetryKind::Error(ErrorKind::TransientError)
} else if let Some(ek) = err.is_other() {
RetryKind::Error(ek)
} else {
RetryKind::UnretryableFailure
}
}
Err(_) => return RetryKind::UnretryableFailure,
};
The issue is that ResponseError is currently generic which prevents us from downcasting. I think we should change the inner type to ConnectorError
From discussion with @rcoh: The inner Box<Error> on ResponseError needs to be passed into RetryClassifier.
ResponseErrors are retried as of release-2022-09-21.
⚠️COMMENT VISIBILITY WARNING⚠️
Comments on closed issues are hard for our team to see. If you need more assistance, please either tag a team member or open a new issue that references this one. If you wish to keep having a conversation with other community members under this issue feel free to do so.