spark-redshift
Writing to Redshift when S3 bucket policy requires encryption
I get the following error when I try to write a dataframe to a Redshift table in pyspark using df.write.
.... Caused by: com.amazon.support.exceptions.ErrorException: Amazon Invalid operation: Manifest file is not in correct json format ....
My EMR cluster is set up with a client-side encryption materials provider, and it looks like the manifest file is automatically encrypted. Is there any way around this issue?
Thanks a lot.
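For context, here is a minimal sketch of the kind of write that hits this path. The JDBC URL, table name, tempdir, and IAM role are placeholders rather than values from this report; the options shown are the connector's standard ones (format, url, dbtable, tempdir, aws_iam_role).

# Minimal sketch of a write that exercises the COPY + manifest path (placeholders throughout).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("redshift-write-repro").getOrCreate()
df = spark.range(10)  # any small dataframe works; the failure is in the manifest handling, not the data

# spark-redshift stages the rows plus a JSON manifest under tempdir and then issues a COPY;
# with an EMRFS client-side encryption materials provider, the manifest object is encrypted as well,
# and Redshift then rejects it with "Manifest file is not in correct json format".
(df.write
   .format("com.databricks.spark.redshift")
   .option("url", "jdbc:redshift://<host>:5439/<db>?user=<user>&password=<password>")
   .option("dbtable", "<target_table>")
   .option("tempdir", "s3://<bucket>/temp")
   .option("aws_iam_role", "<iam arn>")
   .mode("append")
   .save())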
I am also having the same issue when I try to unload a Redshift table using the Redshift connector:
error: Unable to upload manifest file - S3ServiceException:Access Denied,Status 403,Error AccessDenied,Rid BEC36007FAA4C22A,ExtRid XDvccdTuoBmu79AkG9IXc3QOjt3IV0LNsFlpdc6vyKWHD82fnGX95rAEhwIP8ape4Q58ENj4KHs=,CanRetry 1
I am using the following version of the connector:
<dependency>
    <groupId>com.databricks</groupId>
    <artifactId>spark-redshift_2.11</artifactId>
    <version>2.0.1</version>
</dependency>
Simple test code below:
val spark = SparkSession.builder().appName("RS Connect").master(masterURL).getOrCreate()
val hadoopConf = spark.sparkContext.hadoopConfiguration
hadoopConf.set("fs.s3.enableServerSideEncryption", "true")
hadoopConf.set("fs.s3a.server-side-encryption-algorithm", "AES256")
val testRedshift = spark.sqlContext.read
  .format("com.databricks.spark.redshift")
  .option("url", jdbcURL)
  .option("tempdir", s3URL)
  .option("query", "select * from svv_table_info")
  .load()
testRedshift.createOrReplaceTempView("testRs")
spark.sql("select * from testRS").show()
@svadakattu can you please check if issue #309 applies to your use case?
Thank you
There are two separate issues here:
- When writing to Redshift (i.e. when using COPY), this library is not properly respecting encryption settings when writing the manifest. This is a known limitation in this library and is our fault, not Amazon's. This is the issue reported here in #321 by @data-compute-geek.
- When reading data from Redshift via UNLOAD, Amazon's documentation seems to imply that if we do not set the ENCRYPTED option then all files written will be encrypted using SSE-S3. While it's true that UNLOAD ENCRYPTED has caveats surrounding encryption of the manifest when using SSE-KMS or client-provided encryption keys, there's nothing which explicitly says that the manifest will not be encrypted if ENCRYPTED is not used. This is the issue reported in #309 by @NeilJonkers and I think it's the problem that @svadakattu is facing.
I have some thoughts on fixing the read path which I'll post in #309 to keep the two discussions cleanly separated.
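For reference on the read path, Redshift's UNLOAD supports server-side KMS encryption via KMS_KEY_ID together with ENCRYPTED, plus MANIFEST for the manifest file. The helper below is a rough, purely illustrative sketch of such a statement; it is not the SQL this connector currently generates, and the function and parameter names are made up for the example.

# Illustrative only: sketch of an UNLOAD statement with SSE-KMS encryption and a manifest.
# This is NOT the SQL that spark-redshift currently emits; the names here are hypothetical.
def build_unload_sql(query, s3_prefix, iam_role, kms_key_id=None):
    sql = f"UNLOAD ('{query}') TO '{s3_prefix}' IAM_ROLE '{iam_role}' MANIFEST"
    if kms_key_id:
        # ENCRYPTED together with KMS_KEY_ID requests server-side encryption with the given KMS key.
        sql += f" KMS_KEY_ID '{kms_key_id}' ENCRYPTED"
    return sql

print(build_unload_sql("select * from svv_table_info",
                       "s3://<bucket>/temp/unload_",
                       "<iam arn>",
                       "<kms key id>"))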
Is there a workaround for this issue in the meantime? I am seeing the same error.
Hi @JoshRosen - do you know if this bug has been fixed? I'm getting the following error in Spark:
-----------------------------------------------
error: S3ServiceException:Access Denied,Status 403,Error AccessDenied,Rid 14221573B79D6B82,ExtRid Lyd7DMBYKe/NyAhsoyeM+XmUo0mTvXX7YU4LofBZWewjt5YFE0zLNc2jmkt7fkMlJamndeaDj00=,CanRetry 1
code: 8001
context: Failed to initialize S3 output stream. S3 path: s3://<bucket>/temp/4618d0ce-615b-4930-b6d8-3173501f0883/12812_part_00
query: 4936961
location: s3_text_unloader.cpp:364
process: padbmaster [pid=29738]
-----------------------------------------------;
Here's how I'm adding options and doing a dataframe read (using Databricks):
sc._jsc.hadoopConfiguration().set("fs.s3a.server-side-encryption-algorithm", "SSE-KMS")
sc._jsc.hadoopConfiguration().set("fs.s3a.enableServerSideEncryption", "true")
sc._jsc.hadoopConfiguration().set("fs.s3a.server-side-encryption.key", "<kms key hex value>")
df = (spark.read
      .format("com.databricks.spark.redshift")
      .option("url", f"jdbc:redshift://database:1234/schema?user=username&password=password&ssl=true")
      .option("query", "select <query here>")
      .option("tempdir", "s3a://<bucket>/temp")
      .option("aws_iam_role", "<iam arn>")
      .load()
      .limit(20))
I loaded the RedshiftJDBC42_no_awssdk_1_2_45_1069.jar library into the Spark cluster. I'd note that I get different errors if I leave off ssl=true in the JDBC connection string, mistype the IAM role, or mistype the tempdir, so I'm pretty sure it's an issue with the SSE-KMS encryption. I verified that this query works through a database query console.
Hi @JoshRosen,
Is there any possibility of passing a KMS key as an option parameter? This KMS key could then be passed to the UNLOAD query, since it is only required when writing to the S3 location. The UNLOAD query could look like:
unload ([query]) to '[S3 path]' iam_role '[IAM]' KMS_KEY_ID '[Key]' encrypted
Also, I see that the Hadoop configuration is only consulted for AWS credentials, so the KMS key is not taken into account when preparing the UNLOAD query.
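If such an option were added, usage might look something like the sketch below. Note that the kms_key_id option name is hypothetical and does not exist in the connector today; the snippet only illustrates the proposal above.

# Hypothetical usage sketch; "kms_key_id" is NOT a real option in this connector today.
df = (spark.read
      .format("com.databricks.spark.redshift")
      .option("url", "jdbc:redshift://<host>:5439/<db>?user=<user>&password=<password>&ssl=true")
      .option("query", "select * from svv_table_info")
      .option("tempdir", "s3a://<bucket>/temp")
      .option("aws_iam_role", "<iam arn>")
      .option("kms_key_id", "<kms key id>")  # hypothetical: would be appended to UNLOAD as KMS_KEY_ID ... ENCRYPTED
      .load())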
Hi, we are trying to use the Databricks Spark connector for Redshift on EMR. On our cluster we are required to encrypt files on S3 using a KMS key. Because of this policy, the read operation in Spark fails when the Redshift driver tries to unload the data to the S3 bucket. I could not find any way to pass a KMS key to the Databricks Spark connector. Please let me know if there is a plan to add options for passing a KMS key and encryption settings to the load operation call in Spark.