
Failed to find data source: tensorflow

Open rolanyan opened this issue 6 years ago • 8 comments

import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.catalyst.expressions.GenericRow
import org.apache.spark.sql.types._
val path = "./output/test-output.tfrecord"
val testRows: Array[Row] = Array(
  new GenericRow(Array[Any](11, 1, 23L, 10.0F, 14.0, List(1.0, 2.0), "r1")),
  new GenericRow(Array[Any](21, 2, 24L, 12.0F, 15.0, List(2.0, 2.0), "r2")))
val schema = StructType(List(StructField("id", IntegerType),
  StructField("IntegerTypeLabel", IntegerType),
  StructField("LongTypeLabel", LongType),
  StructField("FloatTypeLabel", FloatType),
  StructField("DoubleTypeLabel", DoubleType),
  StructField("VectorLabel", ArrayType(DoubleType, true)),
  StructField("name", StringType)))

val rdd = spark.sparkContext.parallelize(testRows)

//Save DataFrame as TFRecords
val df: DataFrame = spark.createDataFrame(rdd, schema)
df.show()
df.printSchema()

//.option("recordType", "Example")
df.write.format("tensorflow").save(path)  //or df.write.format("tfrecords").save(path)

Both format "tensorflow" and "tfrecords" result in Exception in thread "main" java.lang.ClassNotFoundException: Failed to find data source: tensorflow

rolanyan avatar Oct 28 '17 14:10 rolanyan

Exception in thread "main" java.lang.ClassNotFoundException: Failed to find data source: tensorflow. Please find packages at http://spark.apache.org/third-party-projects.html
        at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:549)
        at org.apache.spark.sql.execution.datasources.DataSource.providingClass$lzycompute(DataSource.scala:86)
        at org.apache.spark.sql.execution.datasources.DataSource.providingClass(DataSource.scala:86)
        at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:470)
        at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:48)
        at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
        at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
        at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
        at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
        at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
        at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)
        at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:610)
        at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:233)
        at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:217)
        at tensorflow.maketfrecord.tfrecord_demo$.main(tfrecord_demo.scala:57)
        at tensorflow.maketfrecord.tfrecord_demo.main(tfrecord_demo.scala)

rolanyan avatar Oct 28 '17 14:10 rolanyan

I finally found that it needs to depend on org.tensorflow.spark-tensorflow-connector, but org.tensorflow.spark-tensorflow-connector has not been published to Maven yet. Do you have plans to publish org.tensorflow.spark-tensorflow-connector to Maven?

rolanyan avatar Oct 28 '17 15:10 rolanyan

I added the dependency for spark-tensorflow-connector, but I still have this problem. Did you resolve it?

seabiscuit08 avatar Apr 17 '18 13:04 seabiscuit08

You need to build the jars manually using mvn install. We are working on making the jars available on the public Maven repo: https://github.com/tensorflow/tensorflow/pull/19188
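A minimal sketch of the manual build, assuming the connector lives in the tensorflow/ecosystem repo under spark/spark-tensorflow-connector (adjust paths if the repo layout differs, and the wildcard jar name is illustrative):

```shell
# Build the spark-tensorflow-connector jar locally with mvn install.
git clone https://github.com/tensorflow/ecosystem.git
cd ecosystem/hadoop
mvn install                 # tensorflow-hadoop is a prerequisite of the connector
cd ../spark/spark-tensorflow-connector
mvn install                 # produces the connector jar under target/
# e.g. target/spark-tensorflow-connector_2.11-*.jar
```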

skavulya avatar May 10 '18 00:05 skavulya

Can you elaborate a bit on why we need to (and how to) build the jar manually, if we already get the dependency from Maven (see the snippet below)?

libraryDependencies
      ++= Seq("org.tensorflow" %% "spark-tensorflow-connector" % "1.6.0"),
    resolvers += "Kompics Releases" at "http://kompics.sics.se/maven/repository/"

I am still seeing the error Exception in thread "main" java.lang.ClassNotFoundException: Failed to find data source: tfrecords.

LeiG avatar Jun 05 '18 20:06 LeiG

@LeiG If you already have the jar, you don't need to recompile it. At what point do you get this error, at compile time or at runtime? If it occurs at runtime, you either need to include the --jars argument when running spark-submit or spark-shell as described in the README, or build an assembly jar with sbt for your application that includes your dependencies.
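For the runtime case, the --jars approach looks roughly like this (the jar path, application jar, and main class are illustrative placeholders):

```shell
# Put the connector jar on the driver and executor classpaths at submit time.
spark-submit \
  --class com.example.TFRecordDemo \
  --jars /path/to/spark-tensorflow-connector_2.11-1.6.0.jar \
  target/my-app.jar

# Or, for interactive use:
spark-shell --jars /path/to/spark-tensorflow-connector_2.11-1.6.0.jar
```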

skavulya avatar Jun 06 '18 12:06 skavulya

I'm facing the same issue. I built the jar following the instructions, and when I use the jar with pyspark or spark-shell it works on the demo code. But when I build the Maven assembly jar and run it with java -cp the_target.jar my.target.class, it gives me this error:

Exception in thread "main" java.lang.ClassNotFoundException: Failed to find data source: tfrecords. Please find packages at http://spark.apache.org/third-party-projects.html
        at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:549)
        at org.apache.spark.sql.execution.datasources.DataSource.providingClass$lzycompute(DataSource.scala:86)
        at org.apache.spark.sql.execution.datasources.DataSource.providingClass(DataSource.scala:86)
        at org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:470)
        at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:48)
        at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:58)
        at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:56)
        at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:74)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:117)
        at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
        at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
        at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
        at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:116)
        at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:92)
        at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:92)
        at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:610)
        at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:233)
        at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:217)
        at transformation.TFRecordMaker.main(TFRecordMaker.java:28)
Caused by: java.lang.ClassNotFoundException: tfrecords.DefaultSource
        at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
        at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$21$$anonfun$apply$12.apply(DataSource.scala:533)
        at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$21$$anonfun$apply$12.apply(DataSource.scala:533)
        at scala.util.Try$.apply(Try.scala:192)
        at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$21.apply(DataSource.scala:533)
        at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$21.apply(DataSource.scala:533)
        at scala.util.Try.orElse(Try.scala:84)
        at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:533)
        ... 19 more

legatoo avatar Dec 18 '18 13:12 legatoo

The problem is resolved. Since passing connector.jar to pyspark and spark-shell works, I wondered why a fat jar with dependencies does not. It seems that when you run the uber jar, Spark does not load the connector even though it is there.


So finally, I found a way to work around it: passing the jar programmatically, like below:

SparkSession ss = SparkSession
                    .builder()
                    .master(masterUrl)
                    .appName(appName)
                    .config("hive.metastore.uris", metaUri)
                    .config("spark.jars", "location/to/spark-tensorflow-connector_2.11-1.12.0.jar")
                    .enableHiveSupport()
                    .getOrCreate();

works.
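A likely root cause for the uber-jar failure, sketched here as an assumption rather than a confirmed diagnosis: Spark discovers data sources via META-INF/services/org.apache.spark.sql.sources.DataSourceRegister, and naive jar merging can overwrite that service file so the connector's entry is lost. If you build the fat jar with the Maven Shade plugin, its ServicesResourceTransformer merges service files instead of clobbering them:

```xml
<!-- Merge META-INF/services files so the connector's
     DataSourceRegister entry survives shading. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <configuration>
    <transformers>
      <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
    </transformers>
  </configuration>
</plugin>
```

With the service file merged, df.write.format("tfrecords") should resolve from the uber jar without passing spark.jars explicitly.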

legatoo avatar Dec 19 '18 02:12 legatoo