aws-sdk-java

GetObjectRequest in S3 should support final bytes as a Range header value

Open bbranan opened this issue 6 years ago • 9 comments

According to https://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.35, a Range header may include a single negative value to indicate the last X bytes in a file should be retrieved. For example bytes=-500 is a valid Range value for the final 500 bytes in a file.

This Range header option is currently supported by S3, as verified through the AWS S3 CLI.

Currently, GetObjectRequest includes setRange(long start) and setRange(long start, long end), which support Range values like bytes=100- and bytes=100-200. However, there is no way to provide a Range value in GetObjectRequest that results in "bytes=-100", even though this is a valid value that S3 already supports.
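To make the three Range forms discussed above concrete, here is a minimal sketch that builds each header value as a plain string. `RangeHeaders` and its method names are hypothetical and are not part of the AWS SDK; the sketch only illustrates the header syntax, including the suffix form that GetObjectRequest cannot currently express.

```java
/** Hypothetical helper (not an AWS SDK class): builds HTTP Range header values. */
final class RangeHeaders {
    private RangeHeaders() {}

    /** From offset {@code start} to the end of the object, e.g. "bytes=100-". */
    static String fromStart(long start) {
        if (start < 0) {
            throw new IllegalArgumentException("start must be >= 0");
        }
        return "bytes=" + start + "-";
    }

    /** Inclusive byte range, e.g. "bytes=100-200". */
    static String between(long start, long end) {
        if (start < 0 || end < start) {
            throw new IllegalArgumentException("need 0 <= start <= end");
        }
        return "bytes=" + start + "-" + end;
    }

    /** Final {@code length} bytes of the object, e.g. "bytes=-500". */
    static String suffix(long length) {
        if (length <= 0) {
            throw new IllegalArgumentException("length must be > 0");
        }
        return "bytes=-" + length;
    }

    public static void main(String[] args) {
        System.out.println(fromStart(100));    // open-ended range
        System.out.println(between(100, 200)); // bounded range
        System.out.println(suffix(500));       // suffix range (the form requested here)
    }
}
```

The first two forms correspond to what setRange(long start) and setRange(long start, long end) produce; only the third has no equivalent setter.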

bbranan avatar Apr 13 '18 18:04 bbranan

Makes sense. We'd have to see if we can do this in a backwards-compatible way. In the meantime, I think you should be able to work around this by doing something like the following.

        GetObjectRequest req = new GetObjectRequest("bucket", "key");
        req.putCustomRequestHeader("Range", "-500");
        amazonS3.getObject(req);

shorea avatar Apr 13 '18 19:04 shorea

The simplest way to be backwards compatible here would likely be to add a new method, perhaps something like setRangeEnd(long end), which results in the expected header value.

Thanks for the workaround. I will use that strategy for now, though I believe the call would need to be:

req.putCustomRequestHeader("Range", "bytes=-500");

bbranan avatar Apr 13 '18 19:04 bbranan

Yes good catch.

shorea avatar Apr 14 '18 00:04 shorea

Using the suggested workaround results in the following error:

com.amazonaws.SdkClientException: Unable to verify integrity of data download.  Client calculated content hash didn't match hash calculated by Amazon S3.  The data may be corrupt.
	com.amazonaws.services.s3.internal.DigestValidationInputStream.validateMD5Digest(DigestValidationInputStream.java:79)
	com.amazonaws.services.s3.internal.DigestValidationInputStream.read(DigestValidationInputStream.java:61)
	com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:72)

By default, when a getObject() request is made, the checksum of the retrieved file is verified against the complete file checksum by the client. Of course, the subset of bytes retrieved with a Range request will not have the expected checksum. When GetObjectRequest.setRange() is used, the checksum validation step is disabled (based on an internal getRange() check). Setting Range as a custom header does not result in the checksum validation being disabled, so it fails consistently.

This update to the workaround makes it work: it first sets the range (thus disabling the checksum check), then overwrites the Range header value with the custom header:

    GetObjectRequest req = new GetObjectRequest("bucket", "key");
    req.setRange(0);
    req.putCustomRequestHeader("Range", "bytes=-500");
    amazonS3.getObject(req);

Unfortunately, this is based on the assumption that the internal implementation will continue to override the Range value with the custom header. That does not seem like a good assumption to make.

bbranan avatar Apr 18 '18 21:04 bbranan

You can disable MD5 checks for GET requests using a system property: https://github.com/aws/aws-sdk-java/blob/master/aws-java-sdk-s3/src/main/java/com/amazonaws/services/s3/internal/SkipMd5CheckStrategy.java#L34

Note: this will disable md5 checks for ALL get requests.
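As a rough sketch of what setting that property might look like: the property name below is an assumption based on the `SkipMd5CheckStrategy` source linked above and should be checked against the SDK version in use. `DisableGetMd5` is a hypothetical class name.

```java
/** Hypothetical helper: turns off GET-side MD5 validation for the whole JVM. */
final class DisableGetMd5 {
    // Assumed property name; verify it against SkipMd5CheckStrategy in your SDK version.
    static final String PROPERTY = "com.amazonaws.services.s3.disableGetObjectMD5Validation";

    private DisableGetMd5() {}

    /** Sets the property; this affects every getObject() call in the process. */
    static void disableGlobally() {
        System.setProperty(PROPERTY, "true");
    }

    public static void main(String[] args) {
        disableGlobally();
        System.out.println(PROPERTY + "=" + System.getProperty(PROPERTY));
    }
}
```

Under the same assumption, the property could also be passed on the command line as `-Dcom.amazonaws.services.s3.disableGetObjectMD5Validation=true`. Either way the setting is process-wide, so full-object downloads lose their MD5 check as well.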

zoewangg avatar Apr 18 '18 22:04 zoewangg

Thanks for the pointer @zoewangg. Unfortunately, the majority of requests I will be making are full-object requests, and I really do want md5 checks to occur for those transfers. I'm just looking for a way to disable the md5 checks specifically for Range-limited requests.

bbranan avatar Apr 19 '18 12:04 bbranan

This will also be useful for file formats like ORC and Parquet that want to read the file footer first.

omalley avatar Dec 06 '19 16:12 omalley

@omalley I have exactly this use case. Did you find an acceptable workaround?

kyprifog avatar Mar 18 '20 03:03 kyprifog

I was able to just pull the content length from the header and then make a second call using that content length to pull the footer, although I am guessing this issue is about being able to do this without making two calls.
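The two-call approach described above can be sketched as follows, assuming the object's length has already been fetched (in SDK v1, for example, via `amazonS3.getObjectMetadata(bucket, key).getContentLength()`). The helper only computes the inclusive offsets that would then be passed to `setRange(start, end)`. `TailRange` and `lastBytes` are hypothetical names, not SDK API.

```java
/** Hypothetical helper: inclusive byte offsets covering the tail of an object. */
final class TailRange {
    final long start; // first byte to fetch (inclusive)
    final long end;   // last byte to fetch (inclusive)

    private TailRange(long start, long end) {
        this.start = start;
        this.end = end;
    }

    /**
     * Offsets of the final {@code count} bytes of an object of {@code contentLength} bytes.
     * If the object is shorter than {@code count}, the whole object is covered.
     */
    static TailRange lastBytes(long contentLength, long count) {
        if (contentLength <= 0) {
            throw new IllegalArgumentException("object is empty; no range to request");
        }
        if (count <= 0) {
            throw new IllegalArgumentException("count must be > 0");
        }
        long start = Math.max(0L, contentLength - count);
        return new TailRange(start, contentLength - 1);
    }

    public static void main(String[] args) {
        TailRange r = lastBytes(1000, 100);
        // With the SDK, these values would go to GetObjectRequest.setRange(r.start, r.end).
        System.out.println("bytes=" + r.start + "-" + r.end);
    }
}
```

Because this path goes through setRange(), the SDK skips the MD5 comparison for the partial download, at the cost of one extra metadata request per object.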

kyprifog avatar Mar 20 '20 15:03 kyprifog

We don't have plans to support this in v1.

We are closing stale v1 issues before going into Maintenance Mode, so if this issue is still relevant in v2 please open a new issue in the v2 repo.

Reference:

  • Announcing end-of-support for AWS SDK for Java v1.x effective December 31, 2025 - blog post

debora-ito avatar Jul 17 '24 01:07 debora-ito

This issue is now closed.

Comments on closed issues are hard for our team to see. If you need more assistance, please open a new issue that references this one.

github-actions[bot] avatar Jul 17 '24 01:07 github-actions[bot]