AvroParquetWriter write to S3 bucket throws data integrity exception
Hi, we are trying to use org.apache.parquet.avro.AvroParquetWriter
to write a Parquet file to an S3 bucket. The file is written to the bucket successfully, but
we get an exception:
com.amazonaws.SdkClientException: Unable to verify integrity of data upload.
The goal is to resolve this exception. Note that the S3 bucket is encrypted with SSE-KMS, not SSE-S3.
It appears that the exception is thrown because of the code blocks in the link below.
According to the Amazon docs, the ETag is not the same as the MD5 digest when the S3 bucket is encrypted with SSE-KMS:
https://docs.aws.amazon.com/AmazonS3/latest/API/RESTCommonResponseHeaders.html
One possible fix is to pass the MD5 in the request header, or to set a system property to disable the validation in skipMd5CheckStrategy.skipClientSideValidationPerPutResponse, as indicated in the link.
The issue is that I cannot find a proper way to inject such configuration into AvroParquetWriter. Is this possible? If yes, could you show how to do it?
Thanks
Sean
Reporter: sean
Note: This issue was originally created as PARQUET-2146. Please see the migration documentation for further details.
Steve Loughran / @steveloughran: This error isn't related to server side encryption, which other than the etags is generally invisible. And etags are way more complicated than md5s everywhere.
What I believe it means is that the client application uploaded a block and the validation logic said "not valid".
- Which S3 connector? EMR s3:, hadoop s3a or other?
- Whose S3 store? AWS S3 or something else?
This isn't parquet's problem, it's that of whoever wrote the s3 connector. And if it is the hadoop one, while you've got the right JIRA server, our response will be one of "does it still happen on the 3.4.x or 3.3.6 releases?".
Looking at the v1 SDK there doesn't seem to be any way to disable this checking programmatically, though you can disable checksum validation on read and write through system properties:
com.amazonaws.services.s3.disableGetObjectMD5Validation
com.amazonaws.services.s3.disablePutObjectMD5Validation
Please tell us more. If it's through the s3a connector then move this to become a HADOOP JIRA.
If it is someone else's it'll have to be a WONTFIX
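For reference, the two system properties named above could be set from application code roughly as follows. This is only a sketch: the class and method names are hypothetical, and it assumes the properties are set in the same JVM that performs the S3 upload, before the writer is created. As far as I recall the v1 SDK only checks whether each property is present, so the value "true" is a convention here rather than a requirement.

```java
public class DisableS3Md5Validation {
    // Property names as quoted in the comment above (AWS SDK for Java v1).
    static final String DISABLE_GET =
            "com.amazonaws.services.s3.disableGetObjectMD5Validation";
    static final String DISABLE_PUT =
            "com.amazonaws.services.s3.disablePutObjectMD5Validation";

    // Hypothetical helper: call it once at startup, before any
    // AvroParquetWriter that targets S3 is built.
    static void disableMd5Validation() {
        System.setProperty(DISABLE_GET, "true");
        System.setProperty(DISABLE_PUT, "true");
    }

    public static void main(String[] args) {
        disableMd5Validation();
        // Print the resulting value so the setting is visible in the logs.
        System.out.println(DISABLE_PUT + "=" + System.getProperty(DISABLE_PUT));
    }
}
```

The same effect should be available without code changes by passing each property as a -D flag on the JVM command line. Either way this switches off a client-side integrity check, so it is a workaround to confirm the diagnosis rather than a fix for the connector.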