
com.databricks.spark.redshift.RedshiftFileFormat.supportDataType(Lorg/apache/spark/sql/types/DataType;Z)Z

Open · ianarsenault opened this issue on Dec 12 '18 • 9 comments

I'm attempting to run a simple query against a Redshift table and am receiving the following error.

Exception in thread "main" java.lang.AbstractMethodError: com.databricks.spark.redshift.RedshiftFileFormat.supportDataType(Lorg/apache/spark/sql/types/DataType;Z)Z
	at org.apache.spark.sql.execution.datasources.DataSourceUtils$$anonfun$verifySchema$1.apply(DataSourceUtils.scala:48)
	at org.apache.spark.sql.execution.datasources.DataSourceUtils$$anonfun$verifySchema$1.apply(DataSourceUtils.scala:47)
	at scala.collection.Iterator$class.foreach(Iterator.scala:893)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
	at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
	at org.apache.spark.sql.types.StructType.foreach(StructType.scala:99)
	at org.apache.spark.sql.execution.datasources.DataSourceUtils$.verifySchema(DataSourceUtils.scala:47)
	at org.apache.spark.sql.execution.datasources.DataSourceUtils$.verifyReadSchema(DataSourceUtils.scala:39)
	at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:400)
	at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
	at com.databricks.spark.redshift.RedshiftRelation.buildScan(RedshiftRelation.scala:168)
	at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$10.apply(DataSourceStrategy.scala:293)
	at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$10.apply(DataSourceStrategy.scala:293)
	at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$pruneFilterProject$1.apply(DataSourceStrategy.scala:326)
	at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$pruneFilterProject$1.apply(DataSourceStrategy.scala:325)
	at org.apache.spark.sql.execution.datasources.DataSourceStrategy.pruneFilterProjectRaw(DataSourceStrategy.scala:403)
	at org.apache.spark.sql.execution.datasources.DataSourceStrategy.pruneFilterProject(DataSourceStrategy.scala:321)
	at org.apache.spark.sql.execution.datasources.DataSourceStrategy.apply(DataSourceStrategy.scala:289)
	at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:63)
	at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:63)
	at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
	at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:93)
	at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:78)
	at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2$$anonfun$apply$2.apply(QueryPlanner.scala:75)
	at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
	at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:157)
	at scala.collection.Iterator$class.foreach(Iterator.scala:893)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
	at scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:157)
	at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1336)
	at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:75)
	at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$2.apply(QueryPlanner.scala:67)
	at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
	at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:93)
	at org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:72)
	at org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:68)
	at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:77)
	at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:77)
	at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3360)
	at org.apache.spark.sql.Dataset.head(Dataset.scala:2545)
	at org.apache.spark.sql.Dataset.take(Dataset.scala:2759)
	at org.apache.spark.sql.Dataset.getRows(Dataset.scala:255)
	at org.apache.spark.sql.Dataset.showString(Dataset.scala:292)
	at org.apache.spark.sql.Dataset.show(Dataset.scala:746)
	at org.apache.spark.sql.Dataset.show(Dataset.scala:705)
	at org.apache.spark.sql.Dataset.show(Dataset.scala:714)
	at LogChecks$.main(LogChecks.scala:90)
	at LogChecks.main(LogChecks.scala)
18/12/12 10:14:24 INFO SparkContext: Invoking stop() from shutdown hook
18/12/12 10:14:24 INFO SparkUI: Stopped Spark web UI at http://10.75.8.183:4040
18/12/12 10:14:24 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
18/12/12 10:14:24 INFO MemoryStore: MemoryStore cleared
18/12/12 10:14:24 INFO BlockManager: BlockManager stopped
18/12/12 10:14:24 INFO BlockManagerMaster: BlockManagerMaster stopped
18/12/12 10:14:24 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
18/12/12 10:14:24 INFO SparkContext: Successfully stopped SparkContext
18/12/12 10:14:24 INFO ShutdownHookManager: Shutdown hook called
18/12/12 10:14:24 INFO ShutdownHookManager: Deleting directory /private/var/folders/01/333yn1tn7rb93sspgwp_z_p804ccgt/T/spark-7c91c661-11e5-4b22-8d9f-2afe851dcdd6

CheckRedshift.scala

import org.apache.spark.sql.SparkSession
import org.apache.spark.{SparkConf, SparkContext}

object CheckRedshift {
    def main(args: Array[String]): Unit = {

        val awsAccessKeyId="asdfasdf"
        val awsSecretAccessKey="asdlfjakfalksdf"
        val redshiftDBName = "dbName"
        val redshiftUserId = "*****"
        val redshiftPassword = "*****"
        val redshifturl = "blah-blah.us-west-1.redshift.amazonaws.com:5555"
        val jdbcURL = s"jdbc:redshift://$redshifturl/$redshiftDBName?user=$redshiftUserId&password=$redshiftPassword"
        println(jdbcURL)

        val sc = new SparkContext(new SparkConf().setAppName("SparkSQL").setMaster("local"))

        // Configure SparkContext to communicate with AWS
        val tempS3Dir = "s3a://mys3bucket/temp"
        sc.hadoopConfiguration.set("fs.s3a.access.key", awsAccessKeyId)
        sc.hadoopConfiguration.set("fs.s3a.secret.key", awsSecretAccessKey)

        // Create the SparkSession (the val keeps the legacy sqlContext name)
        val sqlContext = SparkSession.builder.config(sc.getConf).getOrCreate()


        import sqlContext.implicits._

        val myQuery =
            """SELECT * FROM public.redshift_table
                        WHERE date_time >= '2018-01-01'
                        AND date_time < '2018-01-02'"""
        val df = sqlContext.read
          .format("com.databricks.spark.redshift")
          .option("url", jdbcURL)
          .option("tempdir", tempS3Dir)
          .option("forward_spark_s3_credentials", "true")
          .option("query", myQuery)
          .load()
        df.show()


    }
}

build.sbt

name := "log_check"

version := "0.1"

scalaVersion := "2.11.8"

// https://mvnrepository.com/artifact/org.apache.spark/spark-core
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.4.0"

// https://mvnrepository.com/artifact/org.apache.spark/spark-sql
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.0"

libraryDependencies += "org.json4s" %% "json4s-native" % "3.2.9"

//https://github.com/databricks/spark-redshift
resolvers += "jitpack" at "https://jitpack.io"
libraryDependencies += "com.github.databricks" %% "spark-redshift" % "master-SNAPSHOT"

libraryDependencies += "org.apache.hadoop" % "hadoop-aws" % "2.6.0"

libraryDependencies += "com.amazonaws" % "aws-java-sdk" % "1.11.407"

libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.6.0"
libraryDependencies += "net.java.dev.jets3t" % "jets3t" % "0.9.4"

// https://mvnrepository.com/artifact/com.amazon/redshift-jdbc41
resolvers += "redshift" at "http://redshift-maven-repository.s3-website-us-east-1.amazonaws.com/release"
libraryDependencies += "com.amazon.redshift" % "redshift-jdbc4" % "1.2.10.1009"

// Temporary fix for: https://github.com/databricks/spark-redshift/issues/315
dependencyOverrides += "com.databricks" % "spark-avro_2.11" % "3.2.0"

ianarsenault · Dec 12 '18

Same problem here :/

Bennyelg · Jan 27 '19

+1

febinsathar · Feb 04 '19

They have stopped supporting this package because they now provide it only to Databricks customers. Unfortunate, but if we want it, we'll need to maintain it ourselves.

Bennyelg · Feb 05 '19

This issue is related to Spark 2.4. A workaround is to use Spark 2.3.1. Unfortunately, it seems nobody is working on adding support for 2.4; see #418.
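A minimal sketch of that workaround in build.sbt, assuming the rest of the file from the original post stays the same (2.3.1 predates the supportDataType check that Spark 2.4 introduced):

// Pin Spark below 2.4 until the connector is recompiled against 2.4
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.3.1"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.3.1"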

It looks like this fork, https://github.com/Yelp/spark-redshift, added support for Spark 2.4, but I haven't tested it. To be checked.
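One untested way to try it, following the same JitPack pattern the original build.sbt already uses; the com.github.Yelp coordinates are an assumption, so check what JitPack actually publishes for that repository:

resolvers += "jitpack" at "https://jitpack.io"
// Hypothetical JitPack coordinates for the Yelp fork; verify before depending on them
libraryDependencies += "com.github.Yelp" %% "spark-redshift" % "master-SNAPSHOT"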

julienbachmann · Feb 12 '19

I tried building the Yelp fork... still getting this error. I'd appreciate any thoughts on next steps to get this working.

andybrnr · Mar 07 '19

Same issue here

rsilvestre · Mar 13 '19

The Yelp fork is not compatible with 2.4 yet; we're working on it. But it seems that Udemy's fork might be: https://github.com/udemy/spark-redshift

lucagiovagnoli · Apr 18 '19

Recompiling the lib against the latest Spark version should resolve the problem. The latest Spark adds a new abstract method, supportDataType, in FileFormat.scala.
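To see why recompiling matters: with the Scala 2.11 builds used here, trait method bodies are wired into each implementing class at compile time, so a RedshiftFileFormat compiled against Spark 2.3 simply has no supportDataType for Spark 2.4's DataSourceUtils.verifySchema to call, which is exactly what the AbstractMethodError reports. A small diagnostic sketch (not from this thread) to check which case a given classpath is in:

// Probes whether the spark-redshift jar on the classpath was compiled
// against a Spark version that declares supportDataType on FileFormat.
object ProbeSupportDataType {
  def main(args: Array[String]): Unit = {
    val cls = Class.forName("com.databricks.spark.redshift.RedshiftFileFormat")
    val implemented = cls.getDeclaredMethods.exists(_.getName == "supportDataType")
    println(
      if (implemented) "Compiled against Spark 2.4+: supportDataType forwarder is present."
      else "Compiled against Spark 2.3 or earlier: supportDataType is missing; expect AbstractMethodError."
    )
  }
}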

vanhoale · May 16 '19

So, we've started a community edition of spark-redshift that works with Spark 2.4. Feel free to try it out! If you do, your feedback would be very helpful. Pull requests are very welcome too.

https://github.com/spark-redshift-community/spark-redshift
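A sketch of switching a build over to it; the coordinates and data source name below are assumptions on my part, so confirm them against that repository's README:

// Hypothetical coordinates for the community edition; verify in its README
libraryDependencies += "io.github.spark-redshift-community" %% "spark-redshift" % "4.0.0"

and in the reader, the corresponding format name:

.format("io.github.spark_redshift_community.spark.redshift")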

pinging @andybrnr, @Bennyelg, @julienbachmann who were having issues with 2.4

lucagiovagnoli · Jul 02 '19