spark-redshift
Cannot connect Spark to S3 in Scala in Zeppelin notebook
Here is the error message. Can anyone help?
java.lang.SecurityException: class "com.amazonaws.auth.DefaultAWSCredentialsProviderChain"'s signer information does not match signer information of other classes in the same package
My Scala code is below:
import org.apache.spark.sql._
val sqlContext = spark
// Get some data from a Redshift table
val df: DataFrame = sqlContext.read
.format("com.databricks.spark.redshift")
.option("url", "jdbc:redshift://******")
.option("aws_iam_role", "******")
.option("dbtable", "******")
.option("tempdir", "s3://******")
.load()
df.printSchema() // This works fine as it prints the df schema
df.take(10) // This errors out and drove me mad.
Below is the full error message.
java.lang.SecurityException: class "com.amazonaws.auth.DefaultAWSCredentialsProviderChain"'s signer information does not match signer information of other classes in the same package
  at java.lang.ClassLoader.checkCerts(ClassLoader.java:898)
  at java.lang.ClassLoader.preDefineClass(ClassLoader.java:668)
  at java.lang.ClassLoader.defineClass(ClassLoader.java:761)
  at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
  at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
  at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
  at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
  at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
  at java.security.AccessController.doPrivileged(Native Method)
  at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
  at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
  at com.databricks.spark.redshift.AWSCredentialsUtils$$anonfun$com$databricks$spark$redshift$AWSCredentialsUtils$$loadFromURI$3.apply(AWSCredentialsUtils.scala:103)
  at com.databricks.spark.redshift.AWSCredentialsUtils$$anonfun$com$databricks$spark$redshift$AWSCredentialsUtils$$loadFromURI$3.apply(AWSCredentialsUtils.scala:103)
  at scala.Option.getOrElse(Option.scala:121)
  at com.databricks.spark.redshift.AWSCredentialsUtils$.com$databricks$spark$redshift$AWSCredentialsUtils$$loadFromURI(AWSCredentialsUtils.scala:101)
  at com.databricks.spark.redshift.AWSCredentialsUtils$$anonfun$load$1.apply(AWSCredentialsUtils.scala:65)
  at com.databricks.spark.redshift.AWSCredentialsUtils$$anonfun$load$1.apply(AWSCredentialsUtils.scala:65)
  at scala.Option.getOrElse(Option.scala:121)
  at com.databricks.spark.redshift.AWSCredentialsUtils$.load(AWSCredentialsUtils.scala:65)
  at com.databricks.spark.redshift.RedshiftRelation.buildScan(RedshiftRelation.scala:90)
  at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$10.apply(DataSourceStrategy.scala:293)
  at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$10.apply(DataSourceStrategy.scala:293)
  at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$pruneFilterProject$1.apply(DataSourceStrategy.scala:338)
  at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$pruneFilterProject$1.apply(DataSourceStrategy.scala:337)
  at org.apache.spark.sql.execution.datasources.DataSourceStrategy.pruneFilterProjectRaw(DataSourceStrategy.scala:393)
  at org.apache.spark.sql.execution.datasources.DataSourceStrategy.pruneFilterProject(DataSourceStrategy.scala:333)
  at org.apache.spark.sql.execution.datasources.DataSourceStrategy.apply(DataSourceStrategy.scala:289)
  at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:63)
  at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:63)
  at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:396)
  at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:396)
  at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:93)
  at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:78)
  at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:75)
  at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:155)
  at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:155)
  at scala.collection.Iterator$class.foreach(Iterator.scala:742)
  at scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
  at scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:155)
  at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1194)
  at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:75)
  at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:67)
  at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:396)
  at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:93)
  at org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:72)
  at org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:68)
  at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:77)
  at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:77)
  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3248)
  at org.apache.spark.sql.Dataset.head(Dataset.scala:2484)
  at org.apache.spark.sql.Dataset.take(Dataset.scala:2698)
  ... 74 elided
I am using this on AWS EMR with Zeppelin installed by default.
I guess the reason is that I haven't configured the AWS credentials. But I don't know how to do that, since Zeppelin runs as a different user, zeppelin, rather than ec2-user. So how do I grant the AWS EMR cluster access to S3?
Here is what I have tried; all of it failed:
- I followed the instructions at https://docs.aws.amazon.com/cli/latest/userguide/cli-config-files.html and put the credential files under both the hadoop and zeppelin users.
- Used .option("temporary_aws_access_key_id", ...) and .option("temporary_aws_secret_access_key", ...) to specify the keys when creating the DataFrame.
- Set the option forward_spark_s3_credentials (a sketch of these two option-based attempts follows this list).
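For reference, a minimal sketch of what those two option-based attempts can look like, with placeholder connection values; the option names come from the spark-redshift README, which, as far as I can tell, treats aws_iam_role, the temporary_* keys, and forward_spark_s3_credentials as mutually exclusive, so only one mechanism is set per read:
// Attempt: pass temporary (STS) credentials explicitly as options
val dfWithKeys = spark.read
  .format("com.databricks.spark.redshift")
  .option("url", "jdbc:redshift://<host>:5439/<db>?user=<user>&password=<password>")
  .option("dbtable", "<table>")
  .option("tempdir", "s3://<bucket>/<prefix>")
  .option("temporary_aws_access_key_id", "<access-key-id>")
  .option("temporary_aws_secret_access_key", "<secret-access-key>")
  .option("temporary_aws_session_token", "<session-token>")
  .load()

// Attempt: forward the S3 credentials Spark itself is configured with
val dfForwarded = spark.read
  .format("com.databricks.spark.redshift")
  .option("url", "jdbc:redshift://<host>:5439/<db>?user=<user>&password=<password>")
  .option("dbtable", "<table>")
  .option("tempdir", "s3://<bucket>/<prefix>")
  .option("forward_spark_s3_credentials", "true")
  .load()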
Hello @iShiBin,
We had a similar problem at work and fixed it by editing the configurations.json used by aws emr create-cluster.
...
{
  "Classification": "zeppelin-env",
  "Properties": {
  },
  "Configurations": [
    {
      "Classification": "export",
      "Properties": {
        "ZEPPELIN_CLASSPATH": "$ZEPPELIN_CLASSPATH:/tmp/your-app.jar",
        "CLASSPATH": "$CLASSPATH:/tmp/your-app.jar",
        ...
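For context: that JSON goes into the file passed to the CLI through its --configurations flag, e.g. aws emr create-cluster --configurations file://./configurations.json (the file name here is just an example), so the extra jar ends up on Zeppelin's classpath when the cluster is created.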
Hope it helps.
@mycaule I am not using it anymore, but I will give this a try if I use it again in the future. Thanks for the feedback though.
Folks, did you ever discover the root cause of this? I am suffering the same issue now in an SBT project.
It may be a problem with the classpath of the JAR you are generating. Here is another fix inside Zeppelin.
%dep
z.load("/path/to/your/app.jar")
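One caveat, going by how Zeppelin's dynamic dependency loading is documented to work: the %dep paragraph has to run before the Spark interpreter starts, so you may need to restart the interpreter before calling z.load.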
Also, please make sure your EC2 machine has sufficient IAM credentials to fetch data from S3.
Thanks for getting back to me. This is not a Zeppelin context; it's a Scala SBT project.
Make sure you use the latest version of the AWS SDK, and that your code authenticating to their service is up to date too.
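In an SBT project, one way to do that (a sketch only, assuming the signer mismatch comes from mixing differently-signed or shaded AWS SDK jars; the version number is just an example) is to pin every AWS SDK artifact to a single version in build.sbt:
// build.sbt (sketch): keep all com.amazonaws artifacts on one version so every
// class in the com.amazonaws.* packages is loaded from consistently signed jars
libraryDependencies ++= Seq(
  "com.amazonaws" % "aws-java-sdk-core" % "1.11.271",
  "com.amazonaws" % "aws-java-sdk-s3"   % "1.11.271",
  "com.amazonaws" % "aws-java-sdk-sts"  % "1.11.271"
)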