aws-sdk-rust
aws-sdk-rust copied to clipboard
Upticks of S3 timeout errors after SDK upgrade
Describe the bug
Recently after SDK upgrade to aws-s3-sdk(>1.14), we notice upward trend of S3 GET timeout errors in production. We already ruled out the issue from #1118 . In our case, the error message is TransientError due to hitting attempt timeout.
There are correlation with connection timeout setting with the number of errors we've seen. There are also correlation with the load that we send to S3 to the number of errors.
Expected Behavior
Our timeout setting is as follow:
connection timeout: default to 3.1
attempt timeout: 800ms
operation timeout: 2.6s
total attempts: 3
We expect S3 request to success during this 2.6s.
Current Behavior
SDK did retry 3 times as we check. But still, we timeout after 3 attempts exhausted.
2024-04-16T22:08:15.844Z DEBUG aws_smithy_runtime::client::retries::strategy::standard: attempt #1 failed with RetryIndicated(RetryableError { kind: TransientError, retry_after: None }); retrying after 47.57774ms
...(backoff 47.57 then 800ms after)
2024-04-16T22:08:16.693Z DEBUG aws_smithy_runtime::client::retries::strategy::standard: attempt #2 failed with RetryIndicated(RetryableError { kind: TransientError, retry_after: None }); retrying after 34.98639ms
...(backoff 34.98 then 800ms after)
2024-04-16T22:08:17.530Z DEBUG aws_smithy_runtime::client::retries::strategy::standard: not retrying because we are out of attempts attempts=3 max_attempts=3
2024-04-16T22:08:17.530Z ERROR {redacted} GetObject request failed for key "{redacted}" with error TimeoutError(TimeoutError { source: MaybeTimeoutError { kind: OperationAttempt, duration: 800ms } })
2024-04-16T22:08:17.530Z ERROR {redacted} Get Object request failed: "request has timed out"
Smithy orchestrator typically emit halting line before the TransientError. We couldn't tell whether connection is established successfully within these 800ms or not, as there is no identifier between hyper logs vs. SDK logs.
2024-04-11T21:26:03.775Z DEBUG invoke{service=s3 operation=GetObject}:try_op:try_attempt: aws_smithy_runtime::client::orchestrator: encountered orchestrator error; halting
Reproduction Steps
We ran load test to benchmark S3 client and we found correlation between connection timeout and Timeout errors. The load test is running at max possible of 200 concurrency of S3 gets.
Run 1: 50ms connection timeout
Total errors: 4697.000
Total transactions: 4896.000, Successful S3 transactions 199
Run 2: 400ms connection timeout
Total errors: 3969.000
Total transactions: 9718.000, Successful S3 transactions 5701
Run 3: 1500ms connection timeout
Total errors: 3000.000
Total transactions: 42531.000, Successful S3 transactions 38419
Run 4: 2000ms connection timeout
Total errors: 3201.000
Total transactions: 33302.000, Successful S3 transactions 29478
Run 5: 3000ms connection timeout
Total errors: 0.000
Total transactions: 128115.000, Successful S3 transactions 118430
Possible Solution
By using linux ss command to observer socket overview. I found that connection created by SDK client to S3 does not have keep alive header. Note 3.5.87.213:https is s3 host as I check from here
ESTAB 0 0 [redacted_ip]:[redacted_port] 3.5.87.213:https users:(("[redacted_user]",pid=3000417,fd=73)) uid:4001 ino:277212367 sk:2b012 cgroup:unreachable:1 <->
whereas a typical connection could look like this, notice the keep alive header:
ESTAB 0 0 [redacted_ip]:[redacted_port] [redacted_ip]:[redacted_port] users:(("[redacted_user]",pid=12886,fd=397)) timer:(keepalive,1.580ms,0) uid:99 ino:407978 sk:161ea cgroup:unreachable:1 <->
I suspect this issue is due to inefficient usage of connection reuse down at hyper layer, i.e. previous active connection are closed by S3 randomly due to the lack of header. But I could be wrong.
We also observe this log line appears consistently before the transient error
State { reading: Init, writing: KeepAlive, keep_alive: Busy }
Additional Information/Context
No response
Version
├── aws-config v1.1.10
│ ├── aws-credential-types v1.1.8
│ │ ├── aws-smithy-async v1.2.1
│ │ ├── aws-smithy-runtime-api v1.3.0
│ │ │ ├── aws-smithy-async v1.2.1 (*)
│ │ │ ├── aws-smithy-types v1.1.8
│ │ ├── aws-smithy-types v1.1.8 (*)
│ ├── aws-runtime v1.1.9
│ │ ├── aws-credential-types v1.1.8 (*)
│ │ ├── aws-sigv4 v1.2.0
│ │ │ ├── aws-credential-types v1.1.8 (*)
│ │ │ ├── aws-smithy-eventstream v0.60.4
│ │ │ │ ├── aws-smithy-types v1.1.8 (*)
│ │ │ ├── aws-smithy-http v0.60.7
│ │ │ │ ├── aws-smithy-eventstream v0.60.4 (*)
│ │ │ │ ├── aws-smithy-runtime-api v1.3.0 (*)
│ │ │ │ ├── aws-smithy-types v1.1.8 (*)
│ │ │ ├── aws-smithy-runtime-api v1.3.0 (*)
│ │ │ ├── aws-smithy-types v1.1.8 (*)
│ │ ├── aws-smithy-async v1.2.1 (*)
│ │ ├── aws-smithy-eventstream v0.60.4 (*)
│ │ ├── aws-smithy-http v0.60.7 (*)
│ │ ├── aws-smithy-runtime-api v1.3.0 (*)
│ │ ├── aws-smithy-types v1.1.8 (*)
│ │ ├── aws-types v1.1.9
│ │ │ ├── aws-credential-types v1.1.8 (*)
│ │ │ ├── aws-smithy-async v1.2.1 (*)
│ │ │ ├── aws-smithy-runtime-api v1.3.0 (*)
│ │ │ ├── aws-smithy-types v1.1.8 (*)
│ ├── aws-sdk-sso v1.19.0
│ │ ├── aws-credential-types v1.1.8 (*)
│ │ ├── aws-runtime v1.1.9 (*)
│ │ ├── aws-smithy-async v1.2.1 (*)
│ │ ├── aws-smithy-http v0.60.7 (*)
│ │ ├── aws-smithy-json v0.60.7
│ │ │ └── aws-smithy-types v1.1.8 (*)
│ │ ├── aws-smithy-runtime v1.2.1
│ │ │ ├── aws-smithy-async v1.2.1 (*)
│ │ │ ├── aws-smithy-http v0.60.7 (*)
│ │ │ ├── aws-smithy-runtime-api v1.3.0 (*)
│ │ │ ├── aws-smithy-types v1.1.8 (*)
│ │ ├── aws-smithy-runtime-api v1.3.0 (*)
│ │ ├── aws-smithy-types v1.1.8 (*)
│ │ ├── aws-types v1.1.9 (*)
│ ├── aws-sdk-ssooidc v1.19.0
│ │ ├── aws-credential-types v1.1.8 (*)
│ │ ├── aws-runtime v1.1.9 (*)
│ │ ├── aws-smithy-async v1.2.1 (*)
│ │ ├── aws-smithy-http v0.60.7 (*)
│ │ ├── aws-smithy-json v0.60.7 (*)
│ │ ├── aws-smithy-runtime v1.2.1 (*)
│ │ ├── aws-smithy-runtime-api v1.3.0 (*)
│ │ ├── aws-smithy-types v1.1.8 (*)
│ │ ├── aws-types v1.1.9 (*)
│ ├── aws-sdk-sts v1.19.0
│ │ ├── aws-credential-types v1.1.8 (*)
│ │ ├── aws-runtime v1.1.9 (*)
│ │ ├── aws-smithy-async v1.2.1 (*)
│ │ ├── aws-smithy-http v0.60.7 (*)
│ │ ├── aws-smithy-json v0.60.7 (*)
│ │ ├── aws-smithy-query v0.60.7
│ │ │ ├── aws-smithy-types v1.1.8 (*)
│ │ ├── aws-smithy-runtime v1.2.1 (*)
│ │ ├── aws-smithy-runtime-api v1.3.0 (*)
│ │ ├── aws-smithy-types v1.1.8 (*)
│ │ ├── aws-smithy-xml v0.60.7
│ │ ├── aws-types v1.1.9 (*)
│ ├── aws-smithy-async v1.2.1 (*)
│ ├── aws-smithy-http v0.60.7 (*)
│ ├── aws-smithy-json v0.60.7 (*)
│ ├── aws-smithy-runtime v1.2.1 (*)
│ ├── aws-smithy-runtime-api v1.3.0 (*)
│ ├── aws-smithy-types v1.1.8 (*)
│ ├── aws-types v1.1.9 (*)
├── aws-credential-types v1.1.8 (*)
├── aws-lc-rs v1.6.4
│ ├── aws-lc-sys v0.14.1
├── aws-sdk-s3 v1.14.0
│ ├── aws-credential-types v1.1.8 (*)
│ ├── aws-runtime v1.1.9 (*)
│ ├── aws-sigv4 v1.2.0 (*)
│ ├── aws-smithy-async v1.2.1 (*)
│ ├── aws-smithy-checksums v0.60.7
│ │ ├── aws-smithy-http v0.60.7 (*)
│ │ ├── aws-smithy-types v1.1.8 (*)
│ ├── aws-smithy-eventstream v0.60.4 (*)
│ ├── aws-smithy-http v0.60.7 (*)
│ ├── aws-smithy-json v0.60.7 (*)
│ ├── aws-smithy-runtime v1.2.1 (*)
│ ├── aws-smithy-runtime-api v1.3.0 (*)
│ ├── aws-smithy-types v1.1.8 (*)
│ ├── aws-smithy-xml v0.60.7 (*)
│ ├── aws-types v1.1.9 (*)
├── aws-sdk-sts v1.19.0 (*)
├── aws-smithy-async v1.2.1 (*)
├── aws-smithy-runtime-api v1.3.0 (*)
├── aws-types v1.1.9 (*)
├── aws-config v1.1.10 (*)
├── aws-sdk-s3 v1.14.0 (*)
├── aws-sdk-sts v1.19.0 (*)
├── aws-types v1.1.9 (*)
### Environment details (OS name and version, etc.)
Linux 5.10.210-178.852.amzn2int.x86_64 #1 SMP Tue Feb 27 17:09:26 UTC 2024 x86_64 GNU/Linux
### Logs
For complete tracing logs, please see internal thread: https://amzn-aws.slack.com/archives/C0188A52Z7X/p1712868708040879?thread_ts=1710278104.906559&cid=C0188A52Z7X
Hey @xizhem, thanks for submitting this issue. I've added it to our backlog.
I found that connection created by SDK client to S3 does not have keep alive header
To clarify, the SDK does not alter the keep-alive in the header, so this must've been happening somewhere else.
Does SDK uses HTTP1 or 2? In hyper documentation, keep-alive looks like enabled by default? https://docs.rs/hyper/latest/hyper/server/conn/http1/struct.Builder.html#method.keep_alive
SDK uses hyper defaults, which is HTTP/1.1 and negotiates to HTTP/2 if the servers wishes to do so.
keep-alive looks like enabled by default?
It is.
Asked the reproducer offline on 4/17 and waiting for reply. Adding a label as such.
Greetings! It looks like this issue hasn’t been active in longer than a week. We encourage you to check if this is still an issue in the latest release. Because it has been longer than a week since the last update on this, and in the absence of more information, we will be closing this issue soon. If you find that this is still a problem, please feel free to provide a comment or add an upvote to prevent automatic closure, or if the issue is already closed, please feel free to open a new one.