spark-tfrecord

Spark 3.3 support?

Open InvisibleMan1306 opened this issue 3 years ago • 4 comments

Just checking whether the package supports Spark v3.3. I believe it does, since there has been no new Scala version to support since 2.13.

InvisibleMan1306, Dec 25 '22 20:12

We have not tested it on Spark v3.3. It worked for v3.2. My guess is that it will work; please give it a try.

junshi15, Dec 27 '22 14:12

I tried using spark-tfrecord_2.12-0.3.0.jar with Spark 3.3 (albeit a managed version on Azure), and the job fails with this error:

Caused by: java.lang.AbstractMethodError: org.apache.spark.sql.execution.datasources.OutputWriter.path()Ljava/lang/String;
    at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.$anonfun$write$2(FileFormatDataWriter.scala:177)
    at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.$anonfun$write$2$adapted(FileFormatDataWriter.scala:177)
    at scala.collection.immutable.List.foreach(List.scala:431)
    at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.write(FileFormatDataWriter.scala:177)
    at org.apache.spark.sql.execution.datasources.FileFormatDataWriter.writeWithMetrics(FileFormatDataWriter.scala:86)
    at org.apache.spark.sql.execution.datasources.FileFormatDataWriter.writeWithIterator(FileFormatDataWriter.scala:93)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$executeTask$1(FileFormatWriter.scala:337)
    at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1538)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:344)

It appears that https://github.com/linkedin/spark-tfrecord/blob/master/src/main/scala/com/linkedin/spark/datasources/tfrecord/TFRecordOutputWriter.scala doesn't implement the path method of org.apache.spark.sql.execution.datasources.OutputWriter. But this method was already present in Spark 3.2: https://github.com/apache/spark/blob/v3.2.0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/OutputWriter.scala

My PySpark code is pretty rudimentary and writes a DF to a temporary folder:

df.write.format("tfrecord").option("recordType", "Example").save(str(time.time()))
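For anyone trying to reproduce this, the write can be wrapped roughly as follows. This is only a minimal sketch: `describe_runtime` and `write_tfrecord` are hypothetical helper names, `spark` is assumed to be the active SparkSession and `df` any PySpark DataFrame. Only the `format("tfrecord")` / `recordType` / `save` chain comes from the one-liner in this comment. Logging `spark.version` next to the jar name makes a connector/Spark mismatch easier to spot in a report.

```python
import time


def describe_runtime(spark, jar_name):
    """Return a one-line summary of the Spark version and the connector jar in use.

    `spark` is assumed to be a SparkSession (only its `version` attribute is read).
    """
    return f"Spark {spark.version} with {jar_name}"


def write_tfrecord(df, out_dir=None, record_type="Example"):
    """Write `df` in TFRecord format and return the output directory.

    When no directory is given, a timestamp-named one is used, as in the report.
    """
    if out_dir is None:
        out_dir = str(time.time())
    df.write.format("tfrecord").option("recordType", record_type).save(out_dir)
    return out_dir
```

Calling `print(describe_runtime(spark, "spark-tfrecord_2.12-0.3.0.jar"))` before `write_tfrecord(df)` would put both versions at the top of the job log.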

abhinavsarkari, Oct 11 '23 17:10

@abhinavsarkari you may want to try 0.5.x for Spark 3.3, as the README shows:

Version 0.1.x targets Spark 2.3 and Scala 2.11
Version 0.2.x targets Spark 2.4 and both Scala 2.11 and 2.12
Version 0.3.x targets Spark 3.0 and Scala 2.12
Version 0.4.x targets Spark 3.2 and Scala 2.12
Version 0.5.x targets Spark 3.2 and Scala 2.13
Version 0.6.x targets Spark 3.4 and both Scala 2.12 and 2.13
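Since Spark 3.3 is not listed explicitly, one way to read the list is to take the newest release line whose target Spark version is not newer than the running one and whose Scala build matches. The sketch below encodes that reading. It is a heuristic and an assumption, not something the README promises; `RELEASES` and `pick_release` are hypothetical names, and the table simply transcribes the lines above.

```python
# (release line, target Spark (major, minor), supported Scala binary versions),
# transcribed from the README list above.
RELEASES = [
    ("0.1.x", (2, 3), {"2.11"}),
    ("0.2.x", (2, 4), {"2.11", "2.12"}),
    ("0.3.x", (3, 0), {"2.12"}),
    ("0.4.x", (3, 2), {"2.12"}),
    ("0.5.x", (3, 2), {"2.13"}),
    ("0.6.x", (3, 4), {"2.12", "2.13"}),
]


def pick_release(spark_version, scala_version):
    """Return the release line to try first, or None if nothing matches.

    Assumption: the best candidate is the release with the highest target
    Spark version that does not exceed the running Spark version and that
    is built for the running Scala binary version.
    """
    major, minor = (int(part) for part in spark_version.split(".")[:2])
    scala = ".".join(scala_version.split(".")[:2])
    candidates = [
        (target, line)
        for line, target, scalas in RELEASES
        if scala in scalas and target <= (major, minor)
    ]
    if not candidates:
        return None
    # Tuples compare by target Spark version first, so max() picks the closest one.
    return max(candidates)[1]
```

For example, `pick_release("3.3.0", "2.12")` returns `"0.4.x"` and `pick_release("3.3.0", "2.13")` returns `"0.5.x"`. Whether the suggested release actually works on a given cluster still has to be verified by running a write.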

mizhou-in, Oct 11 '23 18:10

Somehow I must have missed this, as Spark 3.3 is not explicitly called out. I didn't try 0.5.x because it targets Scala 2.13 and the Spark I am using is on Scala 2.12. I tried spark-tfrecord_2.12-0.4.0.jar and it worked for me.

abhinavsarkari, Oct 11 '23 20:10