aws-sdk-rust icon indicating copy to clipboard operation
aws-sdk-rust copied to clipboard

S3 get_object() call sometimes returns partial response

Open exabytes18 opened this issue 1 year ago • 8 comments
trafficstars

Describe the bug

We are occasionally seeing S3 get_object() calls return incomplete data.

async fn fetch_object(&self, s3_file_reference: &S3FileReference) -> Result<Value> {
    let response =
        self.s3_client
            .get_object()
            .bucket(s3_file_reference.s3_bucket_name.clone())
            .key(s3_file_reference.s3_key.clone())
            .send()
            .await?;

    let bytes = response.body.collect().await?.into_bytes();
    if let Some(content_length) = response.content_length {
        let num_bytes_received = bytes.len() as i64;
        if num_bytes_received != content_length {
            return Err(anyhow!("Downloaded s3://{}/{}; expected content length: {}, got: {}",
                s3_file_reference.s3_bucket_name,
                s3_file_reference.s3_key,
                content_length,
                num_bytes_received));
        }
    }

    Ok(serde_json::from_slice(&bytes)?)
}

Example error:

Downloaded s3://[REDACTED]/[REDACTED]/1709094648119.json; expected content length: 2659912, got: 503174

Expected Behavior

get_object() should return the full object.

Current Behavior

get_object() sometimes returns incomplete data.

Reproduction Steps

I do not have a minimum repro script available right now.

Possible Solution

No response

Additional Information/Context

No response

Version

│   │   ├── aws-config v1.1.7
│   │   │   ├── aws-credential-types v1.1.7
│   │   │   │   ├── aws-smithy-async v1.1.7
│   │   │   │   ├── aws-smithy-runtime-api v1.1.7
│   │   │   │   │   ├── aws-smithy-async v1.1.7 (*)
│   │   │   │   │   ├── aws-smithy-types v1.1.7
│   │   │   │   ├── aws-smithy-types v1.1.7 (*)
│   │   │   ├── aws-runtime v1.1.7
│   │   │   │   ├── aws-credential-types v1.1.7 (*)
│   │   │   │   ├── aws-sigv4 v1.1.7
│   │   │   │   │   ├── aws-credential-types v1.1.7 (*)
│   │   │   │   │   ├── aws-smithy-eventstream v0.60.4
│   │   │   │   │   │   ├── aws-smithy-types v1.1.7 (*)
│   │   │   │   │   ├── aws-smithy-http v0.60.6
│   │   │   │   │   │   ├── aws-smithy-eventstream v0.60.4 (*)
│   │   │   │   │   │   ├── aws-smithy-runtime-api v1.1.7 (*)
│   │   │   │   │   │   ├── aws-smithy-types v1.1.7 (*)
│   │   │   │   │   ├── aws-smithy-runtime-api v1.1.7 (*)
│   │   │   │   │   ├── aws-smithy-types v1.1.7 (*)
│   │   │   │   ├── aws-smithy-async v1.1.7 (*)
│   │   │   │   ├── aws-smithy-eventstream v0.60.4 (*)
│   │   │   │   ├── aws-smithy-http v0.60.6 (*)
│   │   │   │   ├── aws-smithy-runtime-api v1.1.7 (*)
│   │   │   │   ├── aws-smithy-types v1.1.7 (*)
│   │   │   │   ├── aws-types v1.1.7
│   │   │   │   │   ├── aws-credential-types v1.1.7 (*)
│   │   │   │   │   ├── aws-smithy-async v1.1.7 (*)
│   │   │   │   │   ├── aws-smithy-runtime-api v1.1.7 (*)
│   │   │   │   │   ├── aws-smithy-types v1.1.7 (*)
│   │   │   ├── aws-sdk-sso v1.15.0
│   │   │   │   ├── aws-credential-types v1.1.7 (*)
│   │   │   │   ├── aws-runtime v1.1.7 (*)
│   │   │   │   ├── aws-smithy-async v1.1.7 (*)
│   │   │   │   ├── aws-smithy-http v0.60.6 (*)
│   │   │   │   ├── aws-smithy-json v0.60.6
│   │   │   │   │   └── aws-smithy-types v1.1.7 (*)
│   │   │   │   ├── aws-smithy-runtime v1.1.7
│   │   │   │   │   ├── aws-smithy-async v1.1.7 (*)
│   │   │   │   │   ├── aws-smithy-http v0.60.6 (*)
│   │   │   │   │   ├── aws-smithy-runtime-api v1.1.7 (*)
│   │   │   │   │   ├── aws-smithy-types v1.1.7 (*)
│   │   │   │   ├── aws-smithy-runtime-api v1.1.7 (*)
│   │   │   │   ├── aws-smithy-types v1.1.7 (*)
│   │   │   │   ├── aws-types v1.1.7 (*)
│   │   │   ├── aws-sdk-ssooidc v1.15.0
│   │   │   │   ├── aws-credential-types v1.1.7 (*)
│   │   │   │   ├── aws-runtime v1.1.7 (*)
│   │   │   │   ├── aws-smithy-async v1.1.7 (*)
│   │   │   │   ├── aws-smithy-http v0.60.6 (*)
│   │   │   │   ├── aws-smithy-json v0.60.6 (*)
│   │   │   │   ├── aws-smithy-runtime v1.1.7 (*)
│   │   │   │   ├── aws-smithy-runtime-api v1.1.7 (*)
│   │   │   │   ├── aws-smithy-types v1.1.7 (*)
│   │   │   │   ├── aws-types v1.1.7 (*)
│   │   │   ├── aws-sdk-sts v1.15.0
│   │   │   │   ├── aws-credential-types v1.1.7 (*)
│   │   │   │   ├── aws-runtime v1.1.7 (*)
│   │   │   │   ├── aws-smithy-async v1.1.7 (*)
│   │   │   │   ├── aws-smithy-http v0.60.6 (*)
│   │   │   │   ├── aws-smithy-json v0.60.6 (*)
│   │   │   │   ├── aws-smithy-query v0.60.6
│   │   │   │   │   ├── aws-smithy-types v1.1.7 (*)
│   │   │   │   ├── aws-smithy-runtime v1.1.7 (*)
│   │   │   │   ├── aws-smithy-runtime-api v1.1.7 (*)
│   │   │   │   ├── aws-smithy-types v1.1.7 (*)
│   │   │   │   ├── aws-smithy-xml v0.60.6
│   │   │   │   ├── aws-types v1.1.7 (*)
│   │   │   ├── aws-smithy-async v1.1.7 (*)
│   │   │   ├── aws-smithy-http v0.60.6 (*)
│   │   │   ├── aws-smithy-json v0.60.6 (*)
│   │   │   ├── aws-smithy-runtime v1.1.7 (*)
│   │   │   ├── aws-smithy-runtime-api v1.1.7 (*)
│   │   │   ├── aws-smithy-types v1.1.7 (*)
│   │   │   ├── aws-types v1.1.7 (*)
│   │   ├── aws-sdk-s3 v1.17.0
│   │   │   ├── aws-credential-types v1.1.7 (*)
│   │   │   ├── aws-runtime v1.1.7 (*)
│   │   │   ├── aws-sigv4 v1.1.7 (*)
│   │   │   ├── aws-smithy-async v1.1.7 (*)
│   │   │   ├── aws-smithy-checksums v0.60.6
│   │   │   │   ├── aws-smithy-http v0.60.6 (*)
│   │   │   │   ├── aws-smithy-types v1.1.7 (*)
│   │   │   ├── aws-smithy-eventstream v0.60.4 (*)
│   │   │   ├── aws-smithy-http v0.60.6 (*)
│   │   │   ├── aws-smithy-json v0.60.6 (*)
│   │   │   ├── aws-smithy-runtime v1.1.7 (*)
│   │   │   ├── aws-smithy-runtime-api v1.1.7 (*)
│   │   │   ├── aws-smithy-types v1.1.7 (*)
│   │   │   ├── aws-smithy-xml v0.60.6 (*)
│   │   │   ├── aws-types v1.1.7 (*)

Environment details (OS name and version, etc.)

Ubuntu 22.04

Logs

No response

exabytes18 avatar Feb 28 '24 05:02 exabytes18

Thank you for reporting this. Can you enable trace debug logging, reproduce the issue, and then post the logs here? The log will probably be our next best bet without a reproducer.

ysaito1001 avatar Mar 01 '24 18:03 ysaito1001

Was finally able to capture some TRACE-level logs around the time we observe the partial response.

log.txt

exabytes18 avatar Mar 12 '24 23:03 exabytes18

Do you have a sense of what "sometimes" is here? 1%? 0.1%? 0.001%?

Trying to figure out what we might need to try to do to reproduce this on our side.

Are you connecting directly to S3? Is it possible there is an intermediate proxy somewhere that may be truncating the stream?

It seems like at a minimum, the SDK should be doing the validation you're doing so we can return an error if this case is detected.

I'm going to keep digging into this and I'll let you know what I find.

rcoh avatar Mar 13 '24 14:03 rcoh

Also, I'd suggest turning on request checksums in your get_object() call:

        .checksum_mode(aws_sdk_s3::types::ChecksumMode::Enabled)

rcoh avatar Mar 13 '24 15:03 rcoh

Are you doing anything notable when constructing your client? I've been unable to reproduce this failure mode—we actually have existing tests for this behavior.

  • If the connection stays open, it should timeout (or hang if you don't timeout the read as you're doing here).
  • If the connection is closed on the server side, you get an EOF error

From the logs, I can see you are using HTTP 1 which exactly matches the scenario we're testing.

We can add an additional safeguard to error out in this case—I think that would be prudent anyway.

rcoh avatar Mar 13 '24 17:03 rcoh

Do you have a sense of what "sometimes" is here? 1%? 0.1%? 0.001%?

We don't have good metrics on this, but I'll estimate this somewhere at <0.1%.

Are you connecting directly to S3? Is it possible there is an intermediate proxy somewhere that may be truncating the stream?

These processes are running in EC2 and hitting S3 directly. There's very little infrastructure in the way -- no ECS/EKS, no proxies, about as simple as you can get. Additionally, our VPCs and route tables are configured with the S3 Gateway Endpoints.

Notably, I think we typically see these "partial responses" with cross-region S3 calls. E.g. the log I posted above was an EC2 instance in eu-central-1 making a GetObject call to a us-east-1 bucket. This might be significant.

Also, I'd suggest turning on request checksums in your get_object() call

Excellent callout. We'll add this.

Are you doing anything notable when constructing your client?

Nothing specifically. Our code looks like this:

let sdk_config = aws_config::defaults(BehaviorVersion::latest())
    .region(Region::new(region.clone()))
    .load()
    .await;

let s3_client = Client::new(&sdk_config);

Perhaps moderately interesting, we do instantiate a fresh tokio runtime essentially per GetObject attempt. Something analogous to the following:

fn some_synchronous_function_which_is_called_infrequently(&self, s3_file_reference: &S3FileReference) -> Result<Value> {
    tokio::runtime::Builder::new_current_thread()
        .enable_all()
        .build()?
        .block_on(async move {
            self.fetch_object(&s3_file_reference).await
        })
}

I've been unable to reproduce this failure mode

Yeah, our own retry logic is able to resolve the issue. That is, our code makes the same exact GetObject call, using the same S3Client, in the exact same tokio runtime, and the second call succeeds. Thus, I would guess that reproducing it is tricky.

exabytes18 avatar Mar 13 '24 21:03 exabytes18

Perhaps moderately interesting, we do instantiate a fresh tokio runtime essentially per GetObject attempt. Something analogous to the following:

That is almost certainly the problem. Because the rust client contains a connection pool, weird things happen when you drop the runtime that's running the futures.

I'm working with Sean to fix this on Hyper's side but I'm also adding a middleware to the Rust SDK to validate content length.

In the mean time, share the runtime between clients :-)

rcoh avatar Mar 14 '24 17:03 rcoh

Fix here: https://github.com/smithy-lang/smithy-rs/pull/3491

rcoh avatar Mar 14 '24 18:03 rcoh