
Cannot connect Spark to S3 in Scala in Zeppelin notebook

Open · iShiBin opened this issue 6 years ago • 8 comments

Here is the error message. Can anyone help?

java.lang.SecurityException: class "com.amazonaws.auth.DefaultAWSCredentialsProviderChain"'s signer information does not match signer information of other classes in the same package

My Scala code is below:

import org.apache.spark.sql._

val sqlContext = spark

// Get some data from a Redshift table
val df: DataFrame = sqlContext.read
    .format("com.databricks.spark.redshift")
    .option("url", "jdbc:redshift://******")
    .option("aws_iam_role", "******")
    .option("dbtable", "******")
    .option("tempdir", "s3://******")
    .load()

df.printSchema() // This works fine as it prints the df schema

df.take(10) // This errors out and drove me mad.

Below is the full error message.

java.lang.SecurityException: class "com.amazonaws.auth.DefaultAWSCredentialsProviderChain"'s signer information does not match signer information of other classes in the same package
  at java.lang.ClassLoader.checkCerts(ClassLoader.java:898)
  at java.lang.ClassLoader.preDefineClass(ClassLoader.java:668)
  at java.lang.ClassLoader.defineClass(ClassLoader.java:761)
  at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
  at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
  at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
  at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
  at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
  at java.security.AccessController.doPrivileged(Native Method)
  at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
  at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
  at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
  at com.databricks.spark.redshift.AWSCredentialsUtils$$anonfun$com$databricks$spark$redshift$AWSCredentialsUtils$$loadFromURI$3.apply(AWSCredentialsUtils.scala:103)
  at com.databricks.spark.redshift.AWSCredentialsUtils$$anonfun$com$databricks$spark$redshift$AWSCredentialsUtils$$loadFromURI$3.apply(AWSCredentialsUtils.scala:103)
  at scala.Option.getOrElse(Option.scala:121)
  at com.databricks.spark.redshift.AWSCredentialsUtils$.com$databricks$spark$redshift$AWSCredentialsUtils$$loadFromURI(AWSCredentialsUtils.scala:101)
  at com.databricks.spark.redshift.AWSCredentialsUtils$$anonfun$load$1.apply(AWSCredentialsUtils.scala:65)
  at com.databricks.spark.redshift.AWSCredentialsUtils$$anonfun$load$1.apply(AWSCredentialsUtils.scala:65)
  at scala.Option.getOrElse(Option.scala:121)
  at com.databricks.spark.redshift.AWSCredentialsUtils$.load(AWSCredentialsUtils.scala:65)
  at com.databricks.spark.redshift.RedshiftRelation.buildScan(RedshiftRelation.scala:90)
  at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$10.apply(DataSourceStrategy.scala:293)
  at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$10.apply(DataSourceStrategy.scala:293)
  at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$pruneFilterProject$1.apply(DataSourceStrategy.scala:338)
  at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$pruneFilterProject$1.apply(DataSourceStrategy.scala:337)
  at org.apache.spark.sql.execution.datasources.DataSourceStrategy.pruneFilterProjectRaw(DataSourceStrategy.scala:393)
  at org.apache.spark.sql.execution.datasources.DataSourceStrategy.pruneFilterProject(DataSourceStrategy.scala:333)
  at org.apache.spark.sql.execution.datasources.DataSourceStrategy.apply(DataSourceStrategy.scala:289)
  at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:63)
  at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:63)
  at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:396)
  at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:396)
  at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:93)
  at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:78)
  at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:75)
  at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:155)
  at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:155)
  at scala.collection.Iterator$class.foreach(Iterator.scala:742)
  at scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
  at scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:155)
  at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1194)
  at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:75)
  at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:67)
  at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:396)
  at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:93)
  at org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:72)
  at org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:68)
  at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:77)
  at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:77)
  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3248)
  at org.apache.spark.sql.Dataset.head(Dataset.scala:2484)
  at org.apache.spark.sql.Dataset.take(Dataset.scala:2698)
  ... 74 elided

iShiBin · Jul 23 '18 23:07

I use this on AWS EMR with Zeppelin installed by default.

I guess the reason is that I haven't configured the AWS credentials. But I don't know how to do that, since Zeppelin runs as the zeppelin user rather than ec2-user. So how do I grant the EMR cluster access to S3?

iShiBin · Jul 23 '18 23:07

Here is what I have tried; all of it failed:

  • I followed the instructions at https://docs.aws.amazon.com/cli/latest/userguide/cli-config-files.html and put the credential files in place for both the hadoop and zeppelin users.
  • Used .option("temporary_aws_access_key_id", ...) and .option("temporary_aws_secret_access_key", ...) to pass the keys when creating the DataFrame.
  • Set the forward_spark_s3_credentials option. (A rough sketch of these last two option-based attempts follows this list.)
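
For reference, here is roughly what those option-based attempts looked like. This is only a sketch; the host, database, table, bucket, and environment-variable names are placeholders, not the real values:

// Sketch only: every identifier below is a placeholder.
// Attempt: pass temporary STS credentials explicitly to the data source.
val dfWithKeys = sqlContext.read
    .format("com.databricks.spark.redshift")
    .option("url", "jdbc:redshift://<host>:5439/<db>?user=...&password=...")
    .option("dbtable", "<table>")
    .option("tempdir", "s3://<bucket>/tmp/")
    .option("temporary_aws_access_key_id", sys.env("AWS_ACCESS_KEY_ID"))
    .option("temporary_aws_secret_access_key", sys.env("AWS_SECRET_ACCESS_KEY"))
    .option("temporary_aws_session_token", sys.env("AWS_SESSION_TOKEN")) // required when using temporary keys
    .load()

// Attempt: let spark-redshift forward whatever S3 credentials Spark itself is using.
val dfForwarded = sqlContext.read
    .format("com.databricks.spark.redshift")
    .option("url", "jdbc:redshift://<host>:5439/<db>?user=...&password=...")
    .option("dbtable", "<table>")
    .option("tempdir", "s3://<bucket>/tmp/")
    .option("forward_spark_s3_credentials", "true")
    .load()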

iShiBin · Jul 23 '18 23:07

Hello @iShiBin,

We had a similar problem at work and fixed it by editing the configurations.json used by aws emr create-cluster.

...
  {
    "Classification": "zeppelin-env",
    "Properties": {
    },
    "Configurations": [
    {
      "Classification": "export",
      "Properties": {
        "ZEPPELIN_CLASSPATH": "$ZEPPELIN_CLASSPATH:/tmp/your-app.jar",
        "CLASSPATH": "$CLASSPATH:/tmp/your-app.jar",
     ...

Hope it helps.

mycaule · Dec 28 '18 01:12

@mycaule I am not using it anymore, but I will give it a try if I use it again in the future. Thanks for the feedback though.

iShiBin · Dec 29 '18 05:12

Folks, did you ever discover the root cause of this? I am suffering the same issue now in an SBT project.

kalimist123 · Feb 04 '19 10:02

It may be a problem with the classpath of the JAR you are generating. Here is another fix, from inside Zeppelin:

%dep
z.load("/path/to/your/app.jar")

Also, please make sure your EC2 instances have sufficient IAM permissions to fetch data from S3.
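
If it helps with debugging, here is a small check that resolves credentials the same way spark-redshift does (assuming the AWS Java SDK v1 is on the classpath, as it normally is on EMR). A credentials error points at the IAM side; the same SecurityException points at two differently signed copies of the AWS SDK on the classpath:

// Run in a notebook paragraph or spark-shell on the cluster.
import com.amazonaws.auth.DefaultAWSCredentialsProviderChain

val creds = new DefaultAWSCredentialsProviderChain().getCredentials()
// Print only a prefix of the key id, never the secret.
println(s"Resolved access key id: ${creds.getAWSAccessKeyId.take(4)}****")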

mycaule · Feb 04 '19 10:02

Thanks for getting back to me. This is not a Zeppelin context; it's a Scala SBT project.

kalimist123 · Feb 04 '19 11:02

Make sure you use the latest version of the AWS SDK, and that the code authenticating to their service is up to date too.
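
For an SBT project, the usual shape of that advice is to pin a single AWS SDK version so a second, differently signed copy of the com.amazonaws.* packages cannot land on the runtime classpath. A rough build.sbt sketch; all version numbers here are illustrative assumptions, not recommendations:

// build.sbt fragment; versions are illustrative only, match them to your cluster.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql"         % "2.3.2" % Provided,
  "com.databricks"   %% "spark-redshift"    % "3.0.0-preview1",
  "com.amazonaws"     % "aws-java-sdk-core" % "1.11.271",
  "com.amazonaws"     % "aws-java-sdk-s3"   % "1.11.271"
)

// Force one AWS SDK version across transitive dependencies; mixing a signed and an
// unsigned copy of com.amazonaws.* is what triggers the signer-mismatch SecurityException.
dependencyOverrides ++= Seq(
  "com.amazonaws" % "aws-java-sdk-core" % "1.11.271",
  "com.amazonaws" % "aws-java-sdk-s3"   % "1.11.271"
)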

mycaule · Feb 04 '19 11:02