Failed to find data source: tensorflow
import com.tencent.mmsearch_x.SparkTool._
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.catalyst.expressions.GenericRow
import org.apache.spark.sql.types._
val path = "./output/test-output.tfrecord"
val testRows: Array[Row] = Array(
new GenericRow(Array[Any](11, 1, 23L, 10.0F, 14.0, List(1.0, 2.0), "r1")),
new GenericRow(Array[Any](21, 2, 24L, 12.0F, 15.0, List(2.0, 2.0), "r2")))
val schema = StructType(List(StructField("id", IntegerType),
StructField("IntegerTypeLabel", IntegerType),
StructField("LongTypeLabel", LongType),
StructField("FloatTypeLabel", FloatType),
StructField("DoubleTypeLabel", DoubleType),
StructField("VectorLabel", ArrayType(DoubleType, true)),
StructField("name", StringType)))
val rdd = spark.sparkContext.parallelize(testRows)
//Save DataFrame as TFRecords
val df: DataFrame = spark.createDataFrame(rdd, schema)
df.show()
df.printSchema()
//.option("recordType", "Example")
df.write.format("tensorflow").save(path) // or df.write.format("tfrecords").save(path)
Both format("tensorflow") and format("tfrecords") fail with the same exception:
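For completeness, once the data source resolves, the records can be read back to verify the round trip. A sketch, assuming the connector jar is on the classpath; note that the short name the connector registers varies by version ("tfrecords" in older releases, "tfrecord" in newer ones):

```scala
// Read the TFRecords back using the same schema defined above.
// "tfrecords" may need to be "tfrecord" depending on the connector version.
val readBack: DataFrame = spark.read
  .format("tfrecords")
  .option("recordType", "Example")
  .schema(schema)
  .load(path)
readBack.show()
```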
Exception in thread "main" java.lang.ClassNotFoundException: Failed to find data source: tensorflow. Please find packages at http://spark.apache.org/third-party-projects.html
	at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:549)
	at org.apache.spark.sql.execution.datasources.DataSource.providingClass$lzycompute(DataSource.scala:86)
	at org.apache.spark.sql.execution.datasources.DataSource.providingClass(DataSource.scala:86)
	at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:470)
	at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:48)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
	at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
	at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
	at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)
	at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:610)
	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:233)
	at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:217)
	at tensorflow.maketfrecord.tfrecord_demo$.main(tfrecord_demo.scala:57)
	at tensorflow.maketfrecord.tfrecord_demo.main(tfrecord_demo.scala)
I finally found that it needs to depend on org.tensorflow:spark-tensorflow-connector, but spark-tensorflow-connector has not been published to Maven yet. Do you have a plan to publish org.tensorflow:spark-tensorflow-connector to Maven?
I added the dependency for spark-tensorflow-connector, but I still have this problem. Did you resolve it?
You need to build the jars manually using mvn install. We are working on making the jars available on the public Maven repo: https://github.com/tensorflow/tensorflow/pull/19188
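For anyone unsure what the manual build looks like, here is a sketch of the steps, assuming the connector sources live under the tensorflow/ecosystem repository (directory layout may differ between versions):

```shell
# Build the tensorflow-hadoop dependency first, then the connector itself.
git clone https://github.com/tensorflow/ecosystem.git
cd ecosystem/hadoop && mvn install -DskipTests
cd ../spark/spark-tensorflow-connector && mvn install -DskipTests
# The resulting jar lands in target/ and in the local ~/.m2 repository.
```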
Can you elaborate a bit on why we need to (and how to) build the jar manually, if we already get the dependency from Maven (see the snippet below)?
libraryDependencies ++= Seq(
  "org.tensorflow" %% "spark-tensorflow-connector" % "1.6.0"
),
resolvers += "Kompics Releases" at "http://kompics.sics.se/maven/repository/"
I am still seeing the error: Exception in thread "main" java.lang.ClassNotFoundException: Failed to find data source: tfrecords.
@LeiG If you already have the jar, you don't need to recompile it. At what point do you get this error: at compile time or at runtime? If it occurs at runtime, you either need to include the --jars argument when running spark-submit or spark-shell, as described in the README, or build an assembly jar for your application with sbt that includes your dependencies.
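For reference, a sketch of what passing the jar at launch looks like; the jar path, version, and class name here are illustrative:

```shell
# Make the connector visible to the driver and executors at runtime.
spark-submit \
  --jars /path/to/spark-tensorflow-connector_2.11-1.6.0.jar \
  --class tensorflow.maketfrecord.tfrecord_demo \
  your-app.jar

# Or, for interactive use:
spark-shell --jars /path/to/spark-tensorflow-connector_2.11-1.6.0.jar
```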
I'm facing the same issue. I built the jar following the instructions, and when I pass the jar to pyspark or spark-shell, the demo code works. But when I build a Maven assembly jar and run it with java -cp the_target.jar my.target.class, it gives me the error:
Exception in thread "main" java.lang.ClassNotFoundException: Failed to find data source: tfrecords. Please find packages at http://spark.apache.org/third-party-projects.html
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:549)
at org.apache.spark.sql.execution.datasources.DataSource.providingClass$lzycompute(DataSource.scala:86)
at org.apache.spark.sql.execution.datasources.DataSource.providingClass(DataSource.scala:86)
at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:470)
at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:48)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:610)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:233)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:217)
at transformation.TFRecordMaker.main(TFRecordMaker.java:28)
Caused by: java.lang.ClassNotFoundException: tfrecords.DefaultSource
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$21$$anonfun$apply$12.apply(DataSource.scala:533)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$21$$anonfun$apply$12.apply(DataSource.scala:533)
at scala.util.Try$.apply(Try.scala:192)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$21.apply(DataSource.scala:533)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$21.apply(DataSource.scala:533)
at scala.util.Try.orElse(Try.scala:84)
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:533)
... 19 more
The problem is resolved. Since passing the connector jar to pyspark and spark-shell works, I wondered why a fat jar with the dependencies bundled does not. It seems that when you run the uber jar, Spark does not load the connector even though it is on the classpath.
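One likely culprit, in case it helps others: Spark resolves data sources by short name through java.util.ServiceLoader, i.e. via the META-INF/services/org.apache.spark.sql.sources.DataSourceRegister file inside the connector jar. Naive fat-jar builds overwrite that file when several dependencies provide one, and the lookup then falls back to the literal class name tfrecords.DefaultSource, which matches the stack trace above. With the maven-shade-plugin, the ServicesResourceTransformer merges the service files instead; a sketch (the plugin version is illustrative):

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <version>3.2.1</version>
  <executions>
    <execution>
      <phase>package</phase>
      <goals><goal>shade</goal></goals>
      <configuration>
        <transformers>
          <!-- Merge META-INF/services entries instead of overwriting them -->
          <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
        </transformers>
      </configuration>
    </execution>
  </executions>
</plugin>
```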
So finally I found a workaround: registering the jar programmatically, like below:
SparkSession ss = SparkSession
    .builder()
    .master(masterUrl)
    .appName(appName)
    .config("hive.metastore.uris", metaUri)
    .config("spark.jars", "location/to/spark-tensorflow-connector_2.11-1.12.0.jar")
    .enableHiveSupport()
    .getOrCreate();
works.