Spark 3.3 support?
Just checking whether the package supports Spark v3.3. I believe it does, since there has been no new Scala version to support after 2.13.
We have not tested it on Spark v3.3, but it worked for v3.2. My guess is that it will work; please give it a try.
I tried using spark-tfrecord_2.12-0.3.0.jar with Spark 3.3 (albeit a managed version on Azure), and the job fails with this error:
Caused by: java.lang.AbstractMethodError: org.apache.spark.sql.execution.datasources.OutputWriter.path()Ljava/lang/String;
    at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.$anonfun$write$2(FileFormatDataWriter.scala:177)
    at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.$anonfun$write$2$adapted(FileFormatDataWriter.scala:177)
    at scala.collection.immutable.List.foreach(List.scala:431)
    at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.write(FileFormatDataWriter.scala:177)
    at org.apache.spark.sql.execution.datasources.FileFormatDataWriter.writeWithMetrics(FileFormatDataWriter.scala:86)
    at org.apache.spark.sql.execution.datasources.FileFormatDataWriter.writeWithIterator(FileFormatDataWriter.scala:93)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$executeTask$1(FileFormatWriter.scala:337)
    at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1538)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:344)
It appears that TFRecordOutputWriter (https://github.com/linkedin/spark-tfrecord/blob/master/src/main/scala/com/linkedin/spark/datasources/tfrecord/TFRecordOutputWriter.scala) doesn't implement the path method of org.apache.spark.sql.execution.datasources.OutputWriter. But this method was already present in Spark 3.2: https://github.com/apache/spark/blob/v3.2.0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/OutputWriter.scala
My PySpark code is pretty rudimentary; it writes a DataFrame to a temporary folder:
    df.write.format("tfrecord").option("recordType", "Example").save(str(time.time()))
@abhinavsarkari you may want to try 0.5.x for Spark 3.3, as the README shows:
Version 0.1.x targets Spark 2.3 and Scala 2.11
Version 0.2.x targets Spark 2.4 and both Scala 2.11 and 2.12
Version 0.3.x targets Spark 3.0 and Scala 2.12
Version 0.4.x targets Spark 3.2 and Scala 2.12
Version 0.5.x targets Spark 3.2 and Scala 2.13
Version 0.6.x targets Spark 3.4 and both Scala 2.12 and 2.13
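Since Spark 3.3 is not listed explicitly, the list above can be turned into a small lookup to get a first guess. This is only a sketch: the table is transcribed from the README lines quoted here, `suggest_release_line` is a hypothetical helper (not part of spark-tfrecord), and the "closest targeted Spark minor at or below the running one" rule is an assumed heuristic, not a compatibility guarantee.

```python
# Release lines transcribed from the README list above:
# (release line, targeted Spark (major, minor), supported Scala binary versions)
RELEASE_LINES = [
    ("0.1.x", (2, 3), {"2.11"}),
    ("0.2.x", (2, 4), {"2.11", "2.12"}),
    ("0.3.x", (3, 0), {"2.12"}),
    ("0.4.x", (3, 2), {"2.12"}),
    ("0.5.x", (3, 2), {"2.13"}),
    ("0.6.x", (3, 4), {"2.12", "2.13"}),
]


def suggest_release_line(spark_version, scala_version):
    """Return the release line to try first, or None if nothing matches.

    Heuristic (an assumption, not a documented rule): same Spark major
    version, same Scala binary version, and the highest targeted Spark
    minor version that does not exceed the running one.
    """
    major, minor = (int(part) for part in spark_version.split(".")[:2])
    scala_binary = ".".join(scala_version.split(".")[:2])

    candidates = [
        (target, line)
        for line, target, scalas in RELEASE_LINES
        if target[0] == major and target[1] <= minor and scala_binary in scalas
    ]
    if not candidates:
        return None
    # max() compares the (major, minor) tuples first, so the closest
    # targeted Spark version wins.
    return max(candidates)[1]
```

With this heuristic, `suggest_release_line("3.3.0", "2.12")` picks the 0.4.x line and `suggest_release_line("3.3.0", "2.13")` picks 0.5.x. Treat the result as a starting point and confirm it with an actual write job.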
Somehow I must have missed this, as Spark 3.3 is not explicitly called out. I didn't try 0.5.x because it targets Scala 2.13 and the Spark I am using is built on Scala 2.12. I tried spark-tfrecord_2.12-0.4.0.jar instead, and it worked for me.
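For anyone else hitting this: the jar names in this thread follow the usual `<artifact>_<Scala binary version>-<library version>.jar` pattern, so the Scala build can be read off the file name before submitting a job. Below is a minimal sketch of that check; `parse_jar_name` is a hypothetical helper, and the regular expression assumes plain numeric versions such as 0.4.0.

```python
import re

# Conventional cross-built jar name: <artifact>_<scala binary>-<version>.jar
JAR_NAME = re.compile(
    r"^(?P<artifact>.+)_(?P<scala>\d+\.\d+)-(?P<version>\d+(?:\.\d+)*)\.jar$"
)


def parse_jar_name(jar_name):
    """Split a jar file name into (artifact, Scala binary version, library version)."""
    match = JAR_NAME.match(jar_name)
    if match is None:
        raise ValueError(f"unrecognized jar name: {jar_name!r}")
    return match.group("artifact"), match.group("scala"), match.group("version")


artifact, scala_binary, library_version = parse_jar_name("spark-tfrecord_2.12-0.4.0.jar")
```

Here `scala_binary` is "2.12" and `library_version` is "0.4.0", which can be compared against the Scala version your Spark build reports and against the release lines in the README.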