nodejs-storage icon indicating copy to clipboard operation
nodejs-storage copied to clipboard

CONTENT_DOWNLOAD_MISMATCH with successful file download

Open taiyokato opened this issue 5 years ago • 18 comments

Environment details

  • OS: Windows 10 Pro 1809
  • Node.js version: v10.15.3 LTS
  • npm version: 6.4.1
  • @google-cloud/storage version: 2.5.0

Steps to reproduce

  1. Prepare a bucket with Cloud KMS Customer managed Key encryption. All files are private.
  2. Upload an JPEG image file using:
function UploadFile() {
    storage.bucket("somebucket").upload("somefolder/someimage.jpg", {
        gzip: true,
        destination: "somefolder/someimage.jpg",
    }).then(results => {
        console.log("upload OK");
    })
    .catch(err => {
        console.error("Error: ", err);
    });
}
UploadFile()
// Upload works fine.
  1. Try downloading the img file using:
function DownloadFile() {    
    const file = storage.bucket("somebucket").file("somefolder/someimage.jpg");
    
    file.download({
        destination: "someimage.jpg",
        // validation: false, // Why even disable validation?
    }).then(res => {
        console.log("DL OK");
    }).catch(err => {
        console.error(err); // Throws CONTENT_DOWNLOAD_MISMATCH
    })
}
DownloadFile()
  1. Correct decompressed file is downloaded AND exception is thrown: code=CONTENT_DOWNLOAD_MISMATCH message=The downloaded data did not match the data from the server. To be sure the content is the same, you should download the file again.

I've read through Issue 566, but seems like not a solution.

storage.bucket("somebucket").file("somefolder/someimage.jpg").download({validation: false}); works, but there's no reason to or should disable validation.

To make sure if the hashes are actually a mismatch, I ran a local md5sum check on the downloaded and original image files.

$ md5sum downloaded.jpg original.jpg
a045a2e8e6b8d84aa8a319bcdba05419  downloaded.jpg
a045a2e8e6b8d84aa8a319bcdba05419  original.jpg

Downloaded file is a match to the original file.

BTW, This problem doesn't happen if the image is not gzipped on upload.

Thanks

taiyokato avatar May 13 '19 01:05 taiyokato

Validation error because of crc32c check failure. https://github.com/googleapis/nodejs-storage/blob/61eeb64a6bca194361aacf2a312d27d0e6d63b35/src/file.ts#L1372

  • On upload, storage client calculates crc32c after gzip compression. Google storage service is also calculating crc32c on gzip data and send back crc32c value in upload response for validation.
  • While on download google storage client receiving uncompress data but header x-goog-hash has the crc32c value of gzip compressed data. (same as generated on upload ). Here crc32c value is mismatching because the client is calculating the crc32c value on uncompressed data.

Same error if validation set to md5

const file = storage.bucket("Bucket Name").file("file path on google storage");
file.download({
  destination: "downloaded.jpg",
  validation: 'md5'
}).then(res => {
  console.log("DL OK");
}).catch(err => {
  console.error(err);
})

So if gzip response header not found in download response then we should ignore crc32c check on uncompressed data.

jiren avatar Jun 04 '19 13:06 jiren

@taiyokato I have reproduced this when using already-compressed content, specifically a PNG. In your example, you use a JPEG, which is also already compressed. The Storage docs advise against this due to "undesired behaviors" (see Using gzip on compressed objects).

In this case, the server is responding with the actual image contents without the gzip wrapper. We don't run the decompression, because it isn't compressed data we are receiving. Validation still runs, however, and that means the hashes we compute are not against the gzipped data value, but the actual data.

@jiren Currently, we run validation on all downloads. Can you clarify the conditions we should bypass the validation?

stephenplusplus avatar Jul 19 '19 19:07 stephenplusplus

@stephenplusplus

Ah, that clears the mystery. Thanks for looking into it (and the link).

Should I keep this issue open or close it?

taiyokato avatar Jul 21 '19 09:07 taiyokato

Here is the current condition. https://github.com/googleapis/nodejs-storage/blob/71a4f59343bbe9b0c00ebdf68f6fdf9d214727cc/src/file.ts#L1300-L1303

This condition can be change as per data is compress or not. If the gzip header is present then only validate CRC and MD5.

if (shouldRunValidation && isCompressed) { 
  validateStream = hashStreamValidation({ crc32c, md5 }); 
  throughStreams.push(validateStream); 
} else { 
  crc32c = false 
}

Same thing is implemented in go storage client.

if length != 0 && !res.Uncompressed && !uncompressedByServer(res) {

https://github.com/googleapis/google-cloud-go/blob/71971b35976fc2f904ed2772536790a5458d9996/storage/reader.go#L205

jiren avatar Jul 22 '19 12:07 jiren

@AVaksman I'll hand this over to you for now, as I believe this is blocked on the outcome of your investigation with the Storage team: https://docs.google.com/document/d/1jz91hfD5AJ8ghXnTPSnCuBa4NdkOPq085y9hBBQFu8M. I'm happy to do any implementation work if necessary when the time comes.

stephenplusplus avatar Sep 03 '19 14:09 stephenplusplus

@AVaksman is there a game plan for how to address this?

stephenplusplus avatar Nov 22 '19 20:11 stephenplusplus

@AVaksman @stephenplusplus - this bug has just caused us an issue in our production environment which we have had to workaround, any eta on a fix?

danielwhatmuff avatar Feb 01 '20 14:02 danielwhatmuff

I think what we'd like to end up doing here is skipping validation if the file has the metadata Content-Encoding: gzip and the file is served decompressed by the server.

We would insert that logic after the getMetadata() call, which is when the client makes a second API call to get the checksums from the server.

@frankyn Does this temporary fix make sense?

jkwlui avatar Feb 05 '20 00:02 jkwlui

That sounds like a good plan in the mean time.

Please move forward it and add a note that this is only temporary.

frankyn avatar Feb 05 '20 18:02 frankyn

Opening this so we track the other part of the fix. Will mark as external as it won't likely happen in this repo.

crwilcox avatar Feb 06 '20 23:02 crwilcox

Any eta on a fix for this? Upgraded to 4.3.1 but see the same issue.

danielwhatmuff avatar Feb 13 '20 21:02 danielwhatmuff

We believe the issue should have been fixed with 4.3.1, so that's weird that it's still showing up for you @danielwhatmuff. I'd like to get more information -

Could you help us by providing:

  • sample code that reproduces the issue
  • the file type that's causing the issue?

I appreciate your patience and working with us to resolve this!

jkwlui avatar Feb 18 '20 22:02 jkwlui

@frankyn do you know where this issue stands? I don't think there was anything left for us to do from this library.

stephenplusplus avatar Sep 17 '20 15:09 stephenplusplus

I'm running into this with some uploaded .png and .jpegs. I do not GZIP on upload at all. I tried the MD5 check to see if that cleared it up, but it didn't. The odd thing is that it only breaks when opening multiple read streams for different files to zip them. When I attempt to read the one file in a separate file, it works. For the life of me, the two sets of code appear the same.

wayjake avatar Sep 21 '20 20:09 wayjake

We are still working with the backend team to resolve this issue correctly. We are working on a workaround and will post an update when available.

frankyn avatar Oct 07 '20 17:10 frankyn

Is there an update on this? The issue is affecting us.

kirillgroshkov avatar Sep 22 '21 23:09 kirillgroshkov

News leading to closure of this issue: As of end of this week Cloud Storage API will always respect Accept-Encoding: gzip.

I'm closing this issue as we can now safely validate checksum because GCS respects not decompressing data server side before sending back to customers.

Thanks for your patience

frankyn avatar Nov 19 '21 01:11 frankyn

Spoke to soon, Rolled back the change so we will need to follow-up again when we have an update. Apologies, I jinxed it.

frankyn avatar Nov 19 '21 20:11 frankyn