S3 reads are slower on the V2 SDK than on the V1 SDK
Describe the bug
Reads with the new SDK (v2) are slower than with the v1 SDK. Profiling a long-running job showed that the new SDK computes a checksum on the InputStream, which slows down reads. The ChecksumValidatingInputStream is initialized by default, and the client config overrides do not take effect for it.
    private static <InputT extends SdkRequest, OutputT extends SdkResponse> ExecutionAttributes mergeExecutionAttributeOverrides(
            ExecutionAttributes executionAttributes,
            ExecutionAttributes clientOverrideExecutionAttributes,
            ExecutionAttributes requestOverrideExecutionAttributes) {
        executionAttributes.putAbsentAttributes(requestOverrideExecutionAttributes);
        executionAttributes.putAbsentAttributes(clientOverrideExecutionAttributes);
        return executionAttributes;
    }
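To illustrate the merge order in the snippet above, here is a minimal, self-contained sketch that models the attributes as a plain map. `MergeSketch`, the `putAbsent` helper, and the string keys are hypothetical stand-ins, not SDK classes. The only thing carried over is the assumption that `putAbsentAttributes` adds an attribute only when it is not already present. Under that assumption, a value already set on the execution attributes survives both the request and the client overrides, which would be consistent with the override not taking effect.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the SDK merge logic; attributes are modelled as a plain map.
final class MergeSketch {

    // Assumed semantics of putAbsentAttributes: copy an entry only if the key is not set yet.
    static void putAbsent(Map<String, Object> target, Map<String, Object> lowerPrecedence) {
        lowerPrecedence.forEach(target::putIfAbsent);
    }

    // Same call order as the quoted method: request overrides first, then client overrides.
    static Map<String, Object> merge(Map<String, Object> execution,
                                     Map<String, Object> clientOverrides,
                                     Map<String, Object> requestOverrides) {
        putAbsent(execution, requestOverrides);
        putAbsent(execution, clientOverrides);
        return execution;
    }

    public static void main(String[] args) {
        Map<String, Object> execution = new HashMap<>();
        execution.put("checksumValidation", Boolean.TRUE);        // value already set (hypothetical key)

        Map<String, Object> clientOverrides = new HashMap<>();
        clientOverrides.put("checksumValidation", Boolean.FALSE); // attempted override

        Map<String, Object> merged = merge(execution, clientOverrides, new HashMap<>());

        // The pre-existing value wins, so the client override is ignored.
        System.out.println(merged.get("checksumValidation"));
    }
}
```

If this model matches what the SDK does, the override can only take effect for attributes that have not been populated before the merge runs.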
The checksum computation takes roughly 10% of the overall job time, which means the reads themselves are slowed down by considerably more than that.
For example, with the new SDK:
Download Summary
================
Data size 10,737,418,240 bytes
Download duration 0:03:40.848
With the older SDK:
Download Summary
================
Data size 10,737,418,240 bytes
Download duration 0:02:52.134
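To put the two summaries above side by side, the small sketch below converts them to throughput and relative slowdown. The class and method names are hypothetical; the byte count and durations are taken from the summaries. It works out to roughly 46.4 MiB/s on the new SDK versus 59.5 MiB/s on the older one, so the download took about 28% longer.

```java
import java.time.Duration;

// Hypothetical helper for comparing the two download summaries.
final class DownloadComparison {

    // Data size reported in both summaries (10 GiB).
    static final long BYTES = 10_737_418_240L;

    // Throughput in MiB per second for a transfer of the given size and duration.
    static double mibPerSecond(long bytes, Duration duration) {
        double mib = bytes / (1024.0 * 1024.0);
        double seconds = duration.toMillis() / 1000.0;
        return mib / seconds;
    }

    // How much longer the slower run took, as a percentage of the faster run.
    static double slowdownPercent(Duration slower, Duration faster) {
        return (slower.toMillis() - faster.toMillis()) * 100.0 / faster.toMillis();
    }

    public static void main(String[] args) {
        Duration newSdk = Duration.ofMinutes(3).plusSeconds(40).plusMillis(848); // 0:03:40.848
        Duration oldSdk = Duration.ofMinutes(2).plusSeconds(52).plusMillis(134); // 0:02:52.134

        System.out.printf("new SDK: %.2f MiB/s%n", mibPerSecond(BYTES, newSdk));
        System.out.printf("old SDK: %.2f MiB/s%n", mibPerSecond(BYTES, oldSdk));
        System.out.printf("slowdown: %.1f%%%n", slowdownPercent(newSdk, oldSdk));
    }
}
```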
Attached are the profiling results of a job that ran the bandwidth command at a scale of 10G. They are HTML flame graphs and can be opened in Chrome: flamegraph_v1_sdk.txt flamegraph_v2_sdk.txt
Expected Behavior
- We should be able to turn off the ChecksumValidatingInputStream for reads.
- Regressions like this should be called out in the documentation when core behaviour changes on the service side.
- We should not see slowdowns even when a new feature is added to existing functions.
Current Behavior
Mentioned Above
Reproduction Steps
- Get a Hadoop distribution compatible with the v2 SDK from https://hadoop.apache.org/releases.html
- Get an older Hadoop distribution from https://hadoop.apache.org/release/3.3.4.html
- Get the latest release of cloudstore from https://github.com/steveloughran/cloudstore/releases
- Configure core-site.xml with your S3 access keys
- Run the bandwidth command with the hadoop jar option from both Hadoop versions
Possible Solution
Mentioned Above.
Additional Information/Context
No response
AWS Java SDK version used
2.21.46
JDK version used
1.8
Operating System and version
CentOS7
@debora-ito Any update on this?