aws-sdk-java-v2

S3 reads are slower on the V2 SDK than on the V1 SDK

Open HarshitGupta11 opened this issue 1 year ago • 4 comments

Describe the bug

Reads on the new SDK are slower than on the v1 SDK. Profiling a long-running job showed that the new SDK computes a checksum on the InputStream, which slows down reads. The ChecksumValidatingInputStream is initialized by default, and the client config overrides have no effect on it.

    private static <InputT extends SdkRequest, OutputT extends SdkResponse> ExecutionAttributes mergeExecutionAttributeOverrides(
        ExecutionAttributes executionAttributes,
        ExecutionAttributes clientOverrideExecutionAttributes,
        ExecutionAttributes requestOverrideExecutionAttributes) {


        executionAttributes.putAbsentAttributes(requestOverrideExecutionAttributes);
        executionAttributes.putAbsentAttributes(clientOverrideExecutionAttributes);

        return executionAttributes;
    }
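To make the merge order in the snippet above easier to follow, here is a minimal sketch that uses plain maps as a stand-in for ExecutionAttributes. It assumes that putAbsentAttributes only copies keys the target does not already hold, so attributes already set win first, then request overrides, then client overrides. The key names are hypothetical, and whether this precedence is what actually prevents the checksum override from applying is an assumption, not something confirmed in this report.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class OverrideMergeSketch {

    // Stand-in for ExecutionAttributes.putAbsentAttributes (assumed semantics):
    // copy only the keys that the target does not already contain.
    static void putAbsent(Map<String, Object> target, Map<String, Object> source) {
        for (Map.Entry<String, Object> entry : source.entrySet()) {
            target.putIfAbsent(entry.getKey(), entry.getValue());
        }
    }

    // Same order as the snippet above: request overrides first, then client overrides.
    static Map<String, Object> merge(Map<String, Object> attributes,
                                     Map<String, Object> clientOverrides,
                                     Map<String, Object> requestOverrides) {
        putAbsent(attributes, requestOverrides);
        putAbsent(attributes, clientOverrides);
        return attributes;
    }

    public static void main(String[] args) {
        // Hypothetical attribute that is already set before the merge.
        Map<String, Object> attributes = new LinkedHashMap<>();
        attributes.put("checksumValidation", true);

        // Hypothetical client-level override trying to switch it off.
        Map<String, Object> clientOverrides = new LinkedHashMap<>();
        clientOverrides.put("checksumValidation", false);
        clientOverrides.put("clientOnly", "client");

        // Hypothetical request-level override for a key the client also sets.
        Map<String, Object> requestOverrides = new LinkedHashMap<>();
        requestOverrides.put("clientOnly", "request");

        // Prints {checksumValidation=true, clientOnly=request}
        System.out.println(merge(attributes, clientOverrides, requestOverrides));
    }
}
```

In this sketch the client override for the already-present key is silently dropped, and the request override wins over the client override for the key they both set.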

The checksum computation takes around 10% of the overall job time, which means the slowdown of the reads themselves is considerably larger.
For example, with the new SDK:

Download Summary
================
Data size 10,737,418,240 bytes
Download duration 0:03:40.848

With the older SDK:

Download Summary
================
Data size 10,737,418,240 bytes
Download duration 0:02:52.134
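For a quick comparison of the two summaries, the small calculation below converts the reported durations to seconds and derives throughput and relative slowdown. The only inputs are the byte count and durations shown above; the class and method names are made up for illustration.

```java
import java.util.Locale;

public class DownloadComparison {

    // Converts an h:mm:ss.SSS duration into seconds.
    static double seconds(int hours, int minutes, double secs) {
        return hours * 3600 + minutes * 60 + secs;
    }

    // Throughput in MiB per second.
    static double mibPerSecond(long bytes, double durationSeconds) {
        return bytes / (1024.0 * 1024.0) / durationSeconds;
    }

    // How much longer the slow run took, as a percentage of the fast run.
    static double percentSlower(double slowSeconds, double fastSeconds) {
        return (slowSeconds - fastSeconds) / fastSeconds * 100.0;
    }

    public static void main(String[] args) {
        long dataSize = 10_737_418_240L;          // bytes, from both summaries
        double v2 = seconds(0, 3, 40.848);        // new SDK
        double v1 = seconds(0, 2, 52.134);        // older SDK

        System.out.println(String.format(Locale.ROOT,
                "v2: %.1f MiB/s, v1: %.1f MiB/s, v2 took %.1f%% longer",
                mibPerSecond(dataSize, v2), mibPerSecond(dataSize, v1),
                percentSlower(v2, v1)));
    }
}
```

With the numbers above this works out to about 46.4 MiB/s on the new SDK versus about 59.5 MiB/s on the older one, so the new SDK download took roughly 28% longer.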

Attaching the profiling output of a job running the bandwidth command at a scale of 10G. The files are HTML flame graphs and can be opened in Chrome: flamegraph_v1_sdk.txt flamegraph_v2_sdk.txt
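For anyone who wants to see this kind of overhead without an S3 setup, the sketch below times draining an in-memory stream with and without a digest-computing wrapper. It is only a stand-in: it uses the JDK's DigestInputStream with MD5, not the SDK's ChecksumValidatingInputStream, and the choice of MD5, the payload size, and the buffer size are all assumptions. The absolute timings will not match the S3 numbers above.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class ChecksumReadOverhead {

    // Reads the stream to the end and returns the number of bytes read.
    static long drain(InputStream in) throws IOException {
        byte[] buffer = new byte[64 * 1024];
        long total = 0;
        int n;
        while ((n = in.read(buffer)) != -1) {
            total += n;
        }
        return total;
    }

    // Returns the elapsed nanoseconds for draining the payload, optionally
    // through a wrapper that updates an MD5 digest for every byte read.
    static long timeRead(byte[] payload, boolean withChecksum)
            throws IOException, NoSuchAlgorithmException {
        InputStream in = new ByteArrayInputStream(payload);
        if (withChecksum) {
            in = new DigestInputStream(in, MessageDigest.getInstance("MD5"));
        }
        long start = System.nanoTime();
        long read = drain(in);
        long elapsed = System.nanoTime() - start;
        if (read != payload.length) {
            throw new IllegalStateException("short read: " + read);
        }
        return elapsed;
    }

    public static void main(String[] args) throws Exception {
        byte[] payload = new byte[64 * 1024 * 1024]; // 64 MiB of zeros

        // Warm up both paths so JIT compilation does not dominate the result.
        timeRead(payload, false);
        timeRead(payload, true);

        long plain = timeRead(payload, false);
        long checked = timeRead(payload, true);
        System.out.println("plain read:    " + plain / 1_000_000 + " ms");
        System.out.println("with checksum: " + checked / 1_000_000 + " ms");
    }
}
```

Because the source here is memory rather than the network, the checksum share will look far larger than in a real download. The point is only that the wrapper adds per-byte CPU work on the read path.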

Expected Behavior

  1. We should be able to turn off the ChecksumValidatingInputStream for reads.
  2. Regressions like this should be called out in the documentation when core behaviour changes on the service side.
  3. Adding a new feature to existing functions should not cause slowdowns.

Current Behavior

Mentioned Above

Reproduction Steps

  1. Get a Hadoop distribution compatible with the v2 SDK from https://hadoop.apache.org/releases.html
  2. Get an older Hadoop distribution from https://hadoop.apache.org/release/3.3.4.html
  3. Get the latest release of cloudstore from https://github.com/steveloughran/cloudstore/releases
  4. Configure core-site.xml with your S3 access keys.
  5. Run the bandwidth command with the hadoop jar option from both Hadoop versions.

Possible Solution

Mentioned Above.

Additional Information/Context

No response

AWS Java SDK version used

2.21.46

JDK version used

1.8

Operating System and version

CentOS7

HarshitGupta11 Jan 02 '24 09:01

@debora-ito Any update on this?

dyutishb Feb 20 '24 17:02