S3 Problem. Need AWS4-HMAC-SHA256 authorization mechanism

Open pixelsebi opened this issue 8 years ago • 4 comments

I am using:

  • Spark 2.1.0 (Python API)
  • spark-redshift-3.0.0-preview
  • aws-java-sdk-1.7.4
  • hadoop-aws-2.7.1
  • s3a://

The Redshift cluster and the S3 bucket are in the eu-central-1 region, which only supports the Signature Version 4 (AWS4-HMAC-SHA256) signing mechanism.

When I do the following:

spark.sparkContext.setSystemProperty("com.amazonaws.services.s3.enableV4", "true")
spark.sparkContext._jsc.hadoopConfiguration().set("fs.s3a.endpoint", "s3.eu-central-1.amazonaws.com")
spark.sparkContext._jsc.hadoopConfiguration().set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")

I can read files directly from S3 via s3a without any issues, like this:

test = spark.sparkContext.textFile('s3a://bucket/data')
print(test.take(5))

But using the spark-redshift datasource like:

df = spark.read \
    .format("com.databricks.spark.redshift") \
    .option("url", "jdbc:redshift://redshifthost:5439/database?user=username&password=pass") \
    .option("dbtable", "table name") \
    .option("tempdir", "s3a://tmpdata-bucket/") \
    .option("aws_iam_role", "arn:aws:iam::123456789000:role/redshift_iam_role") \
    .load()
df.show()

throws this error:

Py4JJavaError: An error occurred while calling o352.showString.
: com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 400, AWS Service: Amazon S3, AWS Request ID: 4CCE719987F2C522, AWS Error Code: InvalidRequest, AWS Error Message: The authorization mechanism you have provided is not supported. Please use AWS4-HMAC-SHA256., S3 Extended Request ID: Yk5oU9D9Kq8ObSiKQXnE/XSGKnP6zztmu6h+yuI9J25wCIprlB4vS4mdXLHoj386kvjvr6P6e9g=
	at com.amazonaws.http.AmazonHttpClient.handleErrorResponse(AmazonHttpClient.java:798)
	at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:421)
	at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:232)
	at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3528)
	at com.amazonaws.services.s3.AmazonS3Client.getObject(AmazonS3Client.java:1111)
	at com.amazonaws.services.s3.AmazonS3Client.getObject(AmazonS3Client.java:984)
	at com.databricks.spark.redshift.RedshiftRelation.buildScan(RedshiftRelation.scala:148)
	at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$11.apply(DataSourceStrategy.scala:336)
	at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$11.apply(DataSourceStrategy.scala:336)
	at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$pruneFilterProject$1.apply(DataSourceStrategy.scala:384)
	at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$pruneFilterProject$1.apply(DataSourceStrategy.scala:383)
	at org.apache.spark.sql.execution.datasources.DataSourceStrategy$.pruneFilterProjectRaw(DataSourceStrategy.scala:464)
	at org.apache.spark.sql.execution.datasources.DataSourceStrategy$.pruneFilterProject(DataSourceStrategy.scala:379)
	at org.apache.spark.sql.execution.datasources.DataSourceStrategy$.apply(DataSourceStrategy.scala:332)
	at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:62)
	at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:62)
	at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
	at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:92)
	at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:77)
	at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:74)
	at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
	at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
	at scala.collection.Iterator$class.foreach(Iterator.scala:893)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
	at scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:157)
	at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1336)
	at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:74)
	at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:66)
	at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
	at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:92)
	at org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:79)
	at org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:75)
	at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:84)
	at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:84)
	at org.apache.spark.sql.Dataset.withTypedCallback(Dataset.scala:2791)
	at org.apache.spark.sql.Dataset.head(Dataset.scala:2112)
	at org.apache.spark.sql.Dataset.take(Dataset.scala:2327)
	at org.apache.spark.sql.Dataset.showString(Dataset.scala:248)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:497)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:280)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:214)
	at java.lang.Thread.run(Thread.java:745)

pixelsebi avatar Mar 05 '17 20:03 pixelsebi

Does anyone have a solution for this? I'm facing the same issue.

sudev avatar Aug 21 '17 14:08 sudev

I am also facing the same issue. I am using the Spark Java APIs.

blue4209211 avatar Oct 07 '17 09:10 blue4209211

For anyone else facing this issue, here is a workaround in Java (Scala will be much simpler; see the sketch after this comment). The connector builds its own AmazonS3Client (see RedshiftRelation.buildScan in the trace above), which ignores the fs.s3a.* settings, so the idea is to hand DefaultSource an S3 client factory that pins the regional endpoint.

import com.amazonaws.auth.AWSCredentialsProvider;
import com.amazonaws.services.s3.AmazonS3Client;
import com.databricks.spark.redshift.DefaultSource;
import com.databricks.spark.redshift.JDBCWrapper;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.sources.BaseRelation;
import scala.Predef;
import scala.Tuple2;
import scala.collection.JavaConverters;
import scala.runtime.AbstractFunction1;

final String endpoint = "<your end point>"; // e.g. s3.eu-central-1.amazonaws.com

// Build the data source with an S3 client factory that pins the regional endpoint,
// so requests are signed with AWS4-HMAC-SHA256 for that region.
DefaultSource source = new DefaultSource(new JDBCWrapper(),
        new AbstractFunction1<AWSCredentialsProvider, AmazonS3Client>() {
            @Override
            public AmazonS3Client apply(AWSCredentialsProvider provider) {
                AmazonS3Client client = new AmazonS3Client(provider);
                client.setEndpoint(endpoint);
                return client;
            }
        });

SparkSession sparkSession = ...; // your existing SparkSession
// configs is a java.util.Map<String, String> with the usual options (url, dbtable, tempdir, aws_iam_role).
BaseRelation br = source.createRelation(sparkSession.sqlContext(),
        JavaConverters.mapAsScalaMapConverter(configs).asScala()
                .toMap(Predef.<Tuple2<String, String>>conforms()));
Dataset<Row> df = sparkSession.baseRelationToDataFrame(br);

blue4209211 avatar Oct 07 '17 11:10 blue4209211
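
For readers who want the simpler Scala form mentioned above, here is a rough sketch (untested in this thread). Note that JDBCWrapper and the two-argument DefaultSource constructor are internal to the connector, so depending on the release you may need to place the code in the com.databricks.spark.redshift package for it to compile; RegionalRedshiftReader is just an illustrative name.

// Sketch only: build the connector's DefaultSource with an S3 client factory
// that pins the regional endpoint so requests are signed with AWS4-HMAC-SHA256.
package com.databricks.spark.redshift

import com.amazonaws.auth.AWSCredentialsProvider
import com.amazonaws.services.s3.AmazonS3Client
import org.apache.spark.sql.{DataFrame, SparkSession}

object RegionalRedshiftReader {
  def read(spark: SparkSession, options: Map[String, String], endpoint: String): DataFrame = {
    val source = new DefaultSource(
      new JDBCWrapper(),
      (provider: AWSCredentialsProvider) => {
        val client = new AmazonS3Client(provider)
        client.setEndpoint(endpoint) // e.g. "s3.eu-central-1.amazonaws.com"
        client
      })
    val relation = source.createRelation(spark.sqlContext, options)
    spark.baseRelationToDataFrame(relation)
  }
}

Called with the same options as the original read (url, dbtable, tempdir, aws_iam_role) plus the regional endpoint, it returns a DataFrame backed by the endpoint-aware S3 client.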

Hey, is there any solution for this issue? I am also facing the same problem.

ravi-ranjan25 avatar Aug 08 '21 17:08 ravi-ranjan25