spark-redshift
Error while using spark-redshift jar
Hi,
I'm getting the error below while using the jar to integrate Redshift with Spark locally.
Exception in thread "main" java.lang.AbstractMethodError: com.databricks.spark.redshift.RedshiftFileFormat.prepareRead(Lorg/apache/spark/sql/SparkSession;Lscala/collection/immutable/Map;Lscala/collection/Seq;)Lscala/collection/immutable/Map;
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:330)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:160)
at com.databricks.spark.redshift.RedshiftRelation.buildScan(RedshiftRelation.scala:168)
at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$3.apply(DataSourceStrategy.scala:141)
at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$3.apply(DataSourceStrategy.scala:141)
at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$pruneFilterProject$1.apply(DataSourceStrategy.scala:184)
at org.apache.spark.sql.execution.datasources.DataSourceStrategy$$anonfun$pruneFilterProject$1.apply(DataSourceStrategy.scala:183)
at org.apache.spark.sql.execution.datasources.DataSourceStrategy$.pruneFilterProjectRaw(DataSourceStrategy.scala:257)
at org.apache.spark.sql.execution.datasources.DataSourceStrategy$.pruneFilterProject(DataSourceStrategy.scala:179)
at org.apache.spark.sql.execution.datasources.DataSourceStrategy$.apply(DataSourceStrategy.scala:137)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:59)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:59)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:60)
at org.apache.spark.sql.catalyst.planning.QueryPlanner.planLater(QueryPlanner.scala:55)
at org.apache.spark.sql.execution.SparkStrategies$SpecialLimits$.apply(SparkStrategies.scala:54)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:59)
at org.apache.spark.sql.catalyst.planning.QueryPlanner$$anonfun$1.apply(QueryPlanner.scala:59)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at org.apache.spark.sql.catalyst.planning.QueryPlanner.plan(QueryPlanner.scala:60)
at org.apache.spark.sql.execution.QueryExecution.sparkPlan$lzycompute(QueryExecution.scala:77)
at org.apache.spark.sql.execution.QueryExecution.sparkPlan(QueryExecution.scala:75)
at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:82)
at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:82)
at org.apache.spark.sql.Dataset.withTypedCallback(Dataset.scala:2462)
at org.apache.spark.sql.Dataset.head(Dataset.scala:1861)
at org.apache.spark.sql.Dataset.take(Dataset.scala:2078)
at org.apache.spark.sql.Dataset.showString(Dataset.scala:240)
at org.apache.spark.sql.Dataset.show(Dataset.scala:533)
at org.apache.spark.sql.Dataset.show(Dataset.scala:493)
at org.apache.spark.sql.Dataset.show(Dataset.scala:502)
at simpleSample.RedshiftToSpark$.main(RedshiftToSpark.scala:53)
at simpleSample.RedshiftToSpark.main(RedshiftToSpark.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:144)
I see that the prepareRead method is not present in RedshiftFileFormat.
Thanks & Regards, Ravi
Which version of Spark are you using? If you're using 2.1.x then I suspect that changes to internal APIs may have broken spark-redshift, in which case we'll need to make a new release.
Actually, looking a little more closely: since this problem relates to prepareRead, I don't think it's a 2.1.x issue, because that method had been completely removed from Spark by that point (see https://github.com/apache/spark/pull/13698). According to https://issues.apache.org/jira/browse/SPARK-15983 that change went into 2.0.
So: are you using a newer version of spark-redshift with Spark 1.x? You'll need to use a 1.x version of this library with Spark 1.x; newer versions won't work with Spark 1.x.
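For reference, a hedged sketch of what pairing the library with your Spark version might look like in an sbt build (the exact versions below are illustrative, not a recommendation):

// Illustrative build.sbt fragment: keep spark-redshift in the major-version
// lane that matches the Spark you run against.
val sparkVersion = "2.1.0"
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql" % sparkVersion % "provided",
  // spark-redshift 2.x/3.x targets Spark 2.x; on Spark 1.x, use a 1.x release
  // of this library instead.
  "com.databricks" %% "spark-redshift" % "2.0.1"
)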
I'm getting the same exception with a different stack trace, and only when I switch from Spark 2.0.1 to Spark 2.1.0 / Hadoop 2.7 / Mesos / spark-redshift_2.11-2.0.1.jar / RedshiftJDBC41-1.1.17.1017.jar:
48f7-81e8-02403dbc2b57-S107): java.lang.AbstractMethodError: org.apache.spark.sql.execution.datasources.OutputWriterFactory.getFileExtension(Lorg/apache/hadoop/mapreduce/TaskAttemptContext;)Ljava/lang/String;
at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.<init>(FileFormatWriter.scala:232)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:182)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:129)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:128)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
I'm getting this error as well with Spark 2.1.0. I've also tried the 3.0.0-preview1 of this library; previously I was using 2.0.0.
java.lang.AbstractMethodError: org.apache.spark.sql.execution.datasources.OutputWriterFactory.getFileExtension(Lorg/apache/hadoop/mapreduce/TaskAttemptContext;)Ljava/lang/String;
at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.<init>(FileFormatWriter.scala:232)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:182)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:129)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:128)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Edit: Here's a slightly bigger stack trace that may help.
17/01/09 22:45:34 ERROR FileFormatWriter: Aborting job null.
org.apache.spark.SparkException: Job aborted due to stage failure: Task 5 in stage 1.0 failed 1 times, most recent failure: Lost task 5.0 in stage 1.0 (TID 6, localhost, executor driver): java.lang.AbstractMethodError: org.apache.spark.sql.execution.datasources.OutputWriterFactory.getFileExtension(Lorg/apache/hadoop/mapreduce/TaskAttemptContext;)Ljava/lang/String;
at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.<init>(FileFormatWriter.scala:232)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:182)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:129)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:128)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1918)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1931)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:1951)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply$mcV$sp(FileFormatWriter.scala:127)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:121)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:121)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:121)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:101)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:114)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:135)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:132)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:113)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:87)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:87)
at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:492)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:215)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:198)
at com.databricks.spark.redshift.RedshiftWriter.unloadData(RedshiftWriter.scala:295)
at com.databricks.spark.redshift.RedshiftWriter.saveToRedshift(RedshiftWriter.scala:392)
at com.databricks.spark.redshift.DefaultSource.createRelation(DefaultSource.scala:108)
at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:426)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:215)
at org.lucidhq.SFRedshiftETL.SFObject.redshiftLoad(SFObject.scala:115)
at org.lucidhq.SFRedshiftETL.SFObject.load(SFObject.scala:256)
at org.lucidhq.SFRedshiftETL.SFRedshiftETL$$anonfun$run$1.apply(main.scala:61)
at org.lucidhq.SFRedshiftETL.SFRedshiftETL$$anonfun$run$1.apply(main.scala:44)
at scala.collection.immutable.List.foreach(List.scala:381)
at org.lucidhq.SFRedshiftETL.SFRedshiftETL$.run(main.scala:44)
at org.lucidhq.SFRedshiftETL.SFRedshiftETL$$anonfun$main$1.apply(main.scala:83)
at org.lucidhq.SFRedshiftETL.SFRedshiftETL$$anonfun$main$1.apply(main.scala:83)
at scala.Option.map(Option.scala:146)
at org.lucidhq.SFRedshiftETL.SFRedshiftETL$.main(main.scala:83)
at org.lucidhq.SFRedshiftETL.SFRedshiftETL.main(main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:738)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.AbstractMethodError: org.apache.spark.sql.execution.datasources.OutputWriterFactory.getFileExtension(Lorg/apache/hadoop/mapreduce/TaskAttemptContext;)Ljava/lang/String;
at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.<init>(FileFormatWriter.scala:232)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:182)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:129)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1$$anonfun$3.apply(FileFormatWriter.scala:128)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
@JoshRosen Any plans to make a new release soon? Seems like it's needed to use this with 2.1.0.
@JoshRosen We hit the same issue: after upgrading from Spark 2.0.2 to Spark 2.1.0, our pipeline started throwing exceptions with the same cause:
Caused by: java.lang.AbstractMethodError: org.apache.spark.sql.execution.datasources.OutputWriterFactory.getFileExtension(Lorg/apache/hadoop/mapreduce/TaskAttemptContext;)Ljava/lang/String;
We are using spark-redshift 2.0.1 with https://s3.amazonaws.com/redshift-downloads/drivers/RedshiftJDBC41-1.1.17.1017.jar
@elyast I hit the same issue using Spark 2.1.0.
I asked this question on Stack Overflow.
Do you see the same issue with Spark 2.0.2? I'm not able to make spark-redshift work on 2.0.2; any help would be appreciated.
Found the root cause: Spark 2.1 added a new method to the interface:
org.apache.spark.sql.execution.datasources.OutputWriterFactory#getFileExtension(context: TaskAttemptContext): String
which is not implemented in spark-avro, hence the AbstractMethodError.
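To make the failure mode concrete, here is a hedged sketch (not the actual spark-avro source; class name and method signatures are assumed from the Spark 2.1 sources) of the shape a Spark 2.1-compatible OutputWriterFactory has to have. A factory compiled against Spark 2.0 has no bytecode for getFileExtension, so the JVM throws AbstractMethodError the first time FileFormatWriter calls it:

import org.apache.hadoop.mapreduce.TaskAttemptContext
import org.apache.spark.sql.execution.datasources.{OutputWriter, OutputWriterFactory}
import org.apache.spark.sql.types.StructType

// Sketch only; not the real spark-avro implementation.
class SketchAvroOutputWriterFactory extends OutputWriterFactory {

  // In spark-avro this returns an Avro-writing OutputWriter; stubbed out here.
  override def newInstance(
      path: String,
      dataSchema: StructType,
      context: TaskAttemptContext): OutputWriter = ???

  // The method Spark 2.1 added. spark-avro 3.2.0 implements it, which is
  // why upgrading that dependency makes the error go away.
  override def getFileExtension(context: TaskAttemptContext): String = ".avro"
}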
Ran into the same issue with Spark 2.1.0. Is there a workaround (besides bumping the Spark version down)?
@apurva-sharma you can build this patch: https://github.com/databricks/spark-avro/pull/206 and replace the spark-avro dependency with that custom version; at least it worked for us.
@elyast thanks for that. I can verify that monkey-patching spark-avro as above worked for me with Spark 2.1.0. It would be great if this gets merged.
@apurva-sharma +1
Looks like spark-avro was fixed. Any updates here?
Any updates on when this issue will be fixed?
^ @JoshRosen
Atm this driver is completely unusable ...
Fixed mine by adding this line to my sbt project's build.sbt:
dependencyOverrides += spark_avro_320
where
val spark_avro_320: ModuleID = "com.databricks" % "spark-avro_2.11" % "3.2.0"
I am using spark-redshift 3, by the way...
Hopefully this library can be actively supported in the long run; it looks like it has not been updated for several months...
I've tried what @hnfmr suggests, but I am still running into this issue.
@mrdmnd To be specific, I am using spark-redshift 3.0.0-preview1, and my build.sbt looks like:
lazy val app = (project in file("app"))
  .settings(commonSettings: _*)
  .settings(
    libraryDependencies += "com.databricks" % "spark-redshift_2.11" % "3.0.0-preview1",
    dependencyOverrides += "com.databricks" % "spark-avro_2.11" % "3.2.0"
  )
BTW, I am using Spark 2.1.0... hope this helps
@elyast Can you please describe what you did? My guess:
- Clone the spark-avro repo and check out the commit of that PR (post-merge).
- Build the jar.
- Use SBT to use this jar. (Do you know how to do this offhand? See the sketch below.)
Thank you!
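In case it helps, a hedged sketch of one way to wire a locally patched spark-avro build into an sbt project (the SNAPSHOT version string below is hypothetical):

// 1. In the patched spark-avro checkout, publish to the local Ivy repository
//    under a distinct version, e.g.:
//      sbt 'set version := "3.2.0-SNAPSHOT"' publishLocal
// 2. In the application's build.sbt, force that version so it wins over the
//    spark-avro pulled in transitively by spark-redshift:
dependencyOverrides += "com.databricks" % "spark-avro_2.11" % "3.2.0-SNAPSHOT"

// Alternatively, sbt picks up jars dropped into the project's lib/ directory
// as unmanaged dependencies, so copying the built jar there also works.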
Also seeing this issue here. @hnfmr's fix is working for me now, but it would be nice to have this properly fixed. Spark is a popular tool and Redshift usage is only going to grow.
The exact workaround was to add the following to my build.sbt file:
// Temporary fix for: https://github.com/databricks/spark-redshift/issues/315
dependencyOverrides += "com.databricks" % "spark-avro_2.11" % "3.2.0"
Yeah, I had a minor typo. Can confirm that this works.
I use Zeppelin to do ETL into Redshift and encountered the same AbstractMethodError.
Configuring the Spark interpreter to exclude com.databricks:spark-avro_2.11:3.0.0 while depending on com.databricks:spark-redshift_2.11:2.0.1, and then specifying an explicit dependency on com.databricks:spark-avro_2.11:3.2.0, works for me.
Thanks a lot!
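For anyone on sbt rather than Zeppelin, a hedged sketch of the equivalent exclude-and-replace approach in build.sbt (versions taken from the comment above):

libraryDependencies ++= Seq(
  // Keep spark-redshift but drop the spark-avro it pulls in transitively...
  ("com.databricks" % "spark-redshift_2.11" % "2.0.1")
    .exclude("com.databricks", "spark-avro_2.11"),
  // ...and depend on the Spark 2.1-compatible spark-avro explicitly.
  "com.databricks" % "spark-avro_2.11" % "3.2.0"
)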
Yes! Just update or replace spark-avro_2.11-3.1.0.jar with spark-avro_2.11-3.2.0.jar and this problem should be solved now.
https://mvnrepository.com/artifact/com.databricks/spark-avro_2.11/3.2.0
Hi, I have the same problem. I am using Spark 2.1.0 and have tried spark-redshift 3.0.0-preview1, 2.0.1, and 2.0.0. All of them give the same error.
java.lang.AbstractMethodError: org.apache.spark.sql.execution.datasources.OutputWriterFactory.getFileExtension(Lorg/apache/hadoop/mapreduce/TaskAttemptContext;)Ljava/lang/String;
at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.
I have the same problem, and I am using code from the Spark 2.2 branch. spark-avro was already spark-avro_2.11-3.2.0.jar.
Caused by: java.lang.AbstractMethodError: org.apache.spark.sql.execution.datasources.OutputWriter.write(Lorg/apache/spark/sql/catalyst/InternalRow;)V
at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:318)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:249)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:247)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1375)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:252)
Any updates on this one? It seems that the underlying dependency (spark-avro_2.11 3.2.0) has resolved this issue. Instead of having everyone depend on the workaround, could the owner release a version that depends on spark-avro 3.2.0?
It seems this issue and repo are getting stale; I would love to have this updated. @JoshRosen would it be possible to open this up to new contributors?
Any updates on this? I'm using this through PySpark and am unable to try the workarounds suggested.
Looks like this issue is going to be fixed in the next version of the spark-avro lib: https://github.com/databricks/spark-avro/pull/242. It was merged to master 8 days ago.