
AvroParquetWriter write to s3 bucket throws data integrity exception

Open · asfimport opened this issue 3 years ago · 1 comment

 

Hi, we are trying to use org.apache.parquet.avro.AvroParquetWriter

to write a Parquet file to an S3 bucket. The file is successfully written to the bucket, but we

get an exception:

com.amazonaws.SdkClientException: Unable to verify integrity of data upload.

We would like to resolve this exception. The S3 bucket is encrypted with SSE-KMS, not SSE-S3.

 

It appears that the exception is thrown by the code block in the link below:

https://github.com/aws/aws-sdk-java/blob/fd409dee8ae23fb8953e0bb4dbde65536a7e0514/aws-java-sdk-s3/src/main/java/com/amazonaws/services/s3/AmazonS3Client.java#L1876

According to the Amazon documentation, the ETag is not the same as the MD5 checksum when the S3 bucket is encrypted with SSE-KMS:

https://docs.aws.amazon.com/AmazonS3/latest/API/RESTCommonResponseHeaders.html

 

A possible workaround is to pass the MD5 in the request header, or to set a system property so that SkipMd5CheckStrategy.skipClientSideValidationPerPutResponse skips the validation, as indicated in the link below:

https://github.com/aws/aws-sdk-java/blob/99fe75a823d4b02f4e90fa0dda06a1558d5617a1/aws-java-sdk-s3/src/main/java/com/amazonaws/services/s3/internal/SkipMd5CheckStrategy.java#L42
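
The system-property route can be sketched as follows. This is a minimal, illustrative snippet: the class name and method are hypothetical, but the two property names are the real switches read by the AWS SDK v1 SkipMd5CheckStrategy. The properties are JVM-wide, so they must be set before the S3 client performs its first upload, which sidesteps the need to inject anything into AvroParquetWriter itself.

```java
// Sketch (assumed workaround): disable AWS SDK v1 client-side MD5
// validation via system properties. The property names below are the
// ones documented in SkipMd5CheckStrategy; the surrounding class is
// illustrative only.
public class DisableMd5Check {

    /** Sets both "skip MD5 check" properties; returns true when both took effect. */
    public static boolean disableS3Md5Validation() {
        // Must run before the AmazonS3 client's first getObject/putObject,
        // e.g. before AvroParquetWriter opens the s3 destination path.
        System.setProperty("com.amazonaws.services.s3.disableGetObjectMD5Validation", "true");
        System.setProperty("com.amazonaws.services.s3.disablePutObjectMD5Validation", "true");
        return "true".equals(System.getProperty("com.amazonaws.services.s3.disableGetObjectMD5Validation"))
            && "true".equals(System.getProperty("com.amazonaws.services.s3.disablePutObjectMD5Validation"));
    }

    public static void main(String[] args) {
        System.out.println("MD5 validation disabled: " + disableS3Md5Validation());
    }
}
```

Note that this disables the integrity check globally for the JVM, not per-writer, which may or may not be acceptable in your deployment.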

The issue is that I cannot find a proper way to inject such configuration into AvroParquetWriter. Is this possible? If yes, can you show how to do it?

 

Thanks

 

Sean

 

Reporter: sean

Note: This issue was originally created as PARQUET-2146. Please see the migration documentation for further details.

asfimport · May 13 '22 19:05

Steve Loughran / @steveloughran: This error isn't related to server-side encryption, which, other than the ETags, is generally invisible. And ETags are way more complicated than MD5s everywhere.

What I believe it means is that the client application uploaded a block and the validation logic said "not valid".

  • Which S3 connector? EMR s3:, Hadoop s3a, or other?

  • Whose S3 store? AWS S3 or something else?

    This isn't Parquet's problem; it's that of whoever wrote the S3 connector. And if it is the Hadoop one, while you've got the right JIRA server, our response will be "does it still happen on the 3.4.x or 3.3.6 releases?".

    Looking at the v1 SDK, there doesn't seem to be any way to disable this checking programmatically, though you can disable checksum validation on read and write through system properties:

    
    com.amazonaws.services.s3.disableGetObjectMD5Validation
    com.amazonaws.services.s3.disablePutObjectMD5Validation
    

    Please tell us more. If it's through the s3a connector, then this should move to becoming a HADOOP JIRA.

    If it is someone else's, it'll have to be a WONTFIX.
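
The two properties above can also be passed on the JVM command line rather than set in code. A hypothetical launch command (the jar name is a placeholder; the -D property names are the real AWS SDK v1 switches):

```shell
# Placeholder jar name; the -D flags are the SkipMd5CheckStrategy properties.
java \
  -Dcom.amazonaws.services.s3.disableGetObjectMD5Validation=true \
  -Dcom.amazonaws.services.s3.disablePutObjectMD5Validation=true \
  -jar my-parquet-writer.jar
```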

asfimport · May 29 '24 19:05