spark-redshift
S3 Problem. Need AWS4-HMAC-SHA256 authorization mechanism
I am using:
- Spark 2.1.0 (Python API)
- spark-redshift-3.0.0-preview
- aws-java-sdk-1.7.4
- hadoop-aws-2.7.1
- s3a://
The Redshift cluster and the S3 bucket are in the eu-central-1 region, which supports only Signature Version 4 (AWS4-HMAC-SHA256).
When I do the following:
spark.sparkContext.setSystemProperty("com.amazonaws.services.s3.enableV4", "true")
spark.sparkContext._jsc.hadoopConfiguration().set("fs.s3a.endpoint", "s3.eu-central-1.amazonaws.com")
spark.sparkContext._jsc.hadoopConfiguration().set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
I can read files directly from S3 via s3a without any issues, like this:
test = spark.sparkContext.textFile('s3a://bucket/data')
print(test.take(5))
But using the spark-redshift data source like this:
df = spark.read \
.format("com.databricks.spark.redshift") \
.option("url", "jdbc:redshift://redshifthost:5439/database?user=username&password=pass") \
.option("dbtable", "table name") \
.option("tempdir", "s3a://tmpdata-bucket/") \
.option("aws_iam_role", "arn:aws:iam::123456789000:role/redshift_iam_role") \
.load()
df.show()
throws this error:
Py4JJavaError: An error occurred while calling o352.showString.
: com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 400, AWS Service: Amazon S3, AWS Request ID: 4CCE719987F2C522, AWS Error Code: InvalidRequest, AWS Error Message: The authorization mechanism you have provided is not supported. Please use AWS4-HMAC-SHA256., S3 Extended Request ID: Yk5oU9D9Kq8ObSiKQXnE/XSGKnP6zztmu6h+yuI9J25wCIprlB4vS4mdXLHoj386kvjvr6P6e9g=
at com.amazonaws.http.AmazonHttpClient.handleErrorResponse(AmazonHttpClient.java:798)
at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:421)
at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:232)
at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3528)
at com.amazonaws.services.s3.AmazonS3Client.getObject(AmazonS3Client.java:1111)
at com.amazonaws.services.s3.AmazonS3Client.getObject(AmazonS3Client.java:984)
at com.databricks.spark.redshift.RedshiftRelation.buildScan(RedshiftRelation.scala:148)
at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$11.apply(DataSourceStrategy.scala:336)
at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$11.apply(DataSourceStrategy.scala:336)
at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$pruneFilterProject$1.apply(DataSourceStrategy.scala:384)
at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$pruneFilterProject$1.apply(DataSourceStrategy.scala:383)
at org.apache.spark.sql.execution.datasources.DataSourceStrategy$.pruneFilterProjectRaw(DataSourceStrategy.scala:464)
at org.apache.spark.sql.execution.datasources.DataSourceStrategy$.pruneFilterProject(DataSourceStrategy.scala:379)
at org.apache.spark.sql.execution.datasources.DataSourceStrategy$.apply(DataSourceStrategy.scala:332)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:62)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:62)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:92)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:77)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:74)
at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:157)
at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1336)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:74)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:66)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:92)
at org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:79)
at org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:75)
at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:84)
at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:84)
at org.apache.spark.sql.Dataset.withTypedCallback(Dataset.scala:2791)
at org.apache.spark.sql.Dataset.head(Dataset.scala:2112)
at org.apache.spark.sql.Dataset.take(Dataset.scala:2327)
at org.apache.spark.sql.Dataset.showString(Dataset.scala:248)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:745)
Does anyone have a solution for this?
I'm facing the same issue.
I am also facing the same issue. I am using the Spark Java APIs.
For anyone else facing this issue, here is a workaround in Java (Scala will be much simpler):
import com.amazonaws.auth.AWSCredentialsProvider;
import com.amazonaws.services.s3.AmazonS3Client;
import com.databricks.spark.redshift.DefaultSource;
import com.databricks.spark.redshift.JDBCWrapper;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.sources.BaseRelation;
import scala.Predef;
import scala.Tuple2;
import scala.collection.JavaConverters;
import scala.runtime.AbstractFunction1;

final String endpoint = "<your end point>";

// Custom S3 client factory that pins the region-specific endpoint so requests use AWS4-HMAC-SHA256.
DefaultSource source = new DefaultSource(new JDBCWrapper(),
        new AbstractFunction1<AWSCredentialsProvider, AmazonS3Client>() {
            @Override
            public AmazonS3Client apply(AWSCredentialsProvider provider) {
                AmazonS3Client client = new AmazonS3Client(provider);
                client.setEndpoint(endpoint);
                return client;
            }
        });

SparkSession sparkSession = ...
// configs holds the same options you would pass to spark.read (url, dbtable, tempdir, aws_iam_role, ...)
BaseRelation br = source.createRelation(sparkSession.sqlContext(),
        JavaConverters.mapAsScalaMapConverter(configs).asScala()
                .toMap(Predef.<Tuple2<String, String>>conforms()));
Dataset<Row> df = sparkSession.baseRelationToDataFrame(br);
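Since Scala is mentioned above, here is a minimal sketch of the same workaround in Scala. It rests on a few assumptions: the DefaultSource(JDBCWrapper, s3ClientFactory) constructor used in the Java snippet, an existing SparkSession named spark, and the option values from the original post as placeholders. Note that JDBCWrapper may be package-private (private[redshift]) in the Scala sources; if the compiler complains, put this code in the com.databricks.spark.redshift package (the Java version compiles because Scala package-private members are public in bytecode).

import com.amazonaws.services.s3.AmazonS3Client
import com.databricks.spark.redshift.{DefaultSource, JDBCWrapper}

val endpoint = "<your end point>"

// S3 client factory that pins the eu-central-1 endpoint so requests are signed with AWS4-HMAC-SHA256
val source = new DefaultSource(
  new JDBCWrapper(),
  creds => {
    val client = new AmazonS3Client(creds)
    client.setEndpoint(endpoint)
    client
  })

// Same options as the failing spark.read call above (placeholders)
val configs = Map(
  "url"          -> "jdbc:redshift://redshifthost:5439/database?user=username&password=pass",
  "dbtable"      -> "table name",
  "tempdir"      -> "s3a://tmpdata-bucket/",
  "aws_iam_role" -> "arn:aws:iam::123456789000:role/redshift_iam_role")

val relation = source.createRelation(spark.sqlContext, configs)
val df = spark.baseRelationToDataFrame(relation)
df.show()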
Hey, any solution for this issue? I am also facing the same issue.