
Change spark-tensorflow-connector dependency to be spark 3.0.0 preview

Open WeichenXu123 opened this issue 5 years ago • 17 comments

Change spark-tensorflow-connector to depend on Spark 3.0.0-preview2. To test:

cd $PROJ_HOME/hadoop
mvn clean install  # build tensorflow-hadoop:1.10.0 and install into local repo

cd $PROJ_HOME/spark/spark-tensorflow-connector
mvn clean install

WeichenXu123 avatar Oct 09 '19 03:10 WeichenXu123

Thanks for your pull request. It looks like this may be your first contribution to a Google open source project (if not, look below for help). Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

:memo: Please visit https://cla.developers.google.com/ to sign.

Once you've signed (or fixed any issues), please reply here with @googlebot I signed it! and we'll verify it.


What to do if you already signed the CLA

Individual signers
Corporate signers

ℹ️ Googlers: Go here for more info.

googlebot avatar Oct 09 '19 03:10 googlebot

@googlebot I signed it!

WeichenXu123 avatar Oct 09 '19 04:10 WeichenXu123

CLAs look good, thanks!

ℹ️ Googlers: Go here for more info.

googlebot avatar Oct 09 '19 04:10 googlebot

@jhseu Could you help review it? Thanks! Just run mvn clean install under the spark/spark-tensorflow-connector directory to verify the PR's correctness. Btw, why is there no Jenkins test?

WeichenXu123 avatar Oct 09 '19 04:10 WeichenXu123

When I try to build this, I'm hitting:

[ERROR] Failed to execute goal on project spark-tensorflow-connector_2.11: Could not resolve dependencies for project org.tensorflow:spark-tensorflow-connector_2.11:jar:1.10.0: Could not find artifact org.tensorflow:tensorflow-hadoop:jar:1.10.0 in central (https://repo.maven.apache.org/maven2) -> [Help 1]

It looks like this tries to get a tensorflow-hadoop version which matches the spark-tensorflow-connector version. Is that intentional (given that tensorflow-hadoop is on version 1.14.0, whereas spark-tensorflow-connector is on version 1.10.0)?

jkbradley avatar Oct 10 '19 18:10 jkbradley

@jkbradley Yes, the project version is 1.10.0, so it depends on tensorflow-hadoop:1.10.0.

The default Maven repo only includes tensorflow-hadoop versions >= 1.11, so we need to build it from the hadoop directory first; the commands are:
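A plausible sketch of how such a dependency is declared in the connector's pom (the exact pom contents here are an assumption, not copied from the project): using ${project.version} ties the tensorflow-hadoop dependency to the connector's own version, which is why version 1.10.0 is resolved.

```xml
<!-- Hypothetical fragment: illustrates the version coupling, not the actual pom. -->
<dependency>
  <groupId>org.tensorflow</groupId>
  <artifactId>tensorflow-hadoop</artifactId>
  <!-- Resolves to 1.10.0 because the project version is 1.10.0. -->
  <version>${project.version}</version>
</dependency>
```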

cd $PROJ_HOME/hadoop
mvn clean install  # build tensorflow-hadoop:1.10.0 and install into local repo

cd $PROJ_HOME/spark/spark-tensorflow-connector
mvn clean install

WeichenXu123 avatar Oct 11 '19 14:10 WeichenXu123

Whoops, my bad. I did not realize it's in the same project and is a manually handled dependency. Thanks!

jkbradley avatar Oct 11 '19 22:10 jkbradley

Since this project's CI isn't running, I tested this PR locally. There may be some flakiness in the implementation or tests right now: I ran the tests once (mvn clean install) and hit the following failure, but a second run (mvn test) passed, and a third run (mvn clean install) passed as well.

Failure in LocalWriteSuite:

- should write data locally *** FAILED ***
  org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 0.0 failed 1 times, most recent failure: Lost task 1.0 in stage 0.0 (TID 1, c02w81rbhtd5.attlocal.net, executor driver): java.lang.IllegalStateException: LocalPath /var/folders/y_/_46df7ns1cn8dj_6hrs2fdxm0000gp/T/spark-connector-propagate2230735357410018221 already exists. SaveMode: ErrorIfExists.
	at org.tensorflow.spark.datasources.tfrecords.DefaultSource$.writePartitionLocal(DefaultSource.scala:182)
	at org.tensorflow.spark.datasources.tfrecords.DefaultSource$.mapFun$1(DefaultSource.scala:212)
	at org.tensorflow.spark.datasources.tfrecords.DefaultSource$.$anonfun$writePartitionLocalFun$1(DefaultSource.scala:214)
	at org.tensorflow.spark.datasources.tfrecords.DefaultSource$.$anonfun$writePartitionLocalFun$1$adapted(DefaultSource.scala:214)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$2(RDD.scala:889)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$2$adapted(RDD.scala:889)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:127)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:455)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:458)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:
  at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:1979)
  at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:1967)
  at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:1966)
  at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
  at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
  at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1966)
  at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:946)
  at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:946)
  at scala.Option.foreach(Option.scala:407)
  ...
  Cause: java.lang.IllegalStateException: LocalPath /var/folders/y_/_46df7ns1cn8dj_6hrs2fdxm0000gp/T/spark-connector-propagate2230735357410018221 already exists. SaveMode: ErrorIfExists.
  at org.tensorflow.spark.datasources.tfrecords.DefaultSource$.writePartitionLocal(DefaultSource.scala:182)
  at org.tensorflow.spark.datasources.tfrecords.DefaultSource$.mapFun$1(DefaultSource.scala:212)
  at org.tensorflow.spark.datasources.tfrecords.DefaultSource$.$anonfun$writePartitionLocalFun$1(DefaultSource.scala:214)
  at org.tensorflow.spark.datasources.tfrecords.DefaultSource$.$anonfun$writePartitionLocalFun$1$adapted(DefaultSource.scala:214)
  at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$2(RDD.scala:889)
  at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsWithIndex$2$adapted(RDD.scala:889)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:349)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:313)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
  ...
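The failure above is consistent with ErrorIfExists-style save semantics colliding with a task re-run: if an earlier attempt left the local directory behind, the next attempt refuses to write. A minimal self-contained sketch of that failure mode (this is illustrative code, not the actual connector implementation; writePartitionLocal here is a hypothetical stand-in for the method in the stack trace):

```scala
import java.nio.file.{Files, Path}

object ErrorIfExistsSketch {
  // Hypothetical stand-in for DefaultSource.writePartitionLocal: with
  // ErrorIfExists semantics, any pre-existing path aborts the write.
  def writePartitionLocal(localPath: Path): Unit = {
    if (Files.exists(localPath)) {
      throw new IllegalStateException(
        s"LocalPath $localPath already exists. SaveMode: ErrorIfExists.")
    }
    Files.createDirectories(localPath)
    // ... write TFRecord files into localPath ...
  }

  def main(args: Array[String]): Unit = {
    val partDir = Files
      .createTempDirectory("spark-connector-propagate")
      .resolve("part-00001")
    writePartitionLocal(partDir) // first attempt succeeds
    try {
      writePartitionLocal(partDir) // a retried attempt hits the leftover dir
    } catch {
      case e: IllegalStateException =>
        println(s"retry failed: ${e.getMessage}")
    }
  }
}
```

This would explain why a clean rebuild sometimes passes: whether the temp directory survives between attempts depends on timing and cleanup.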

Also, one nit: the artifact name in the pom should be updated from spark-tensorflow-connector_2.11 to spark-tensorflow-connector_2.12.

jkbradley avatar Oct 15 '19 18:10 jkbradley

I'm not opposed to this, but wouldn't it be better to wait until Spark 3.0.0 is released?

jhseu avatar Oct 15 '19 22:10 jhseu

@jhseu After we verify correctness, we can keep this PR open so there is less work for users who want to try out the Spark 3.0 preview with spark-tensorflow-connector.

mengxr avatar Oct 15 '19 22:10 mengxr

Yeah, I don't mind keeping this open.

jhseu avatar Oct 15 '19 23:10 jhseu

@jkbradley The flaky test is fixed; you could retest it. And the pom artifact is updated to _2.12.

WeichenXu123 avatar Oct 16 '19 09:10 WeichenXu123

@WeichenXu123 Could you explain the test flakiness? Is it relevant to Spark 3.0 upgrade? If not, let's submit another PR so the fix can go in.

mengxr avatar Oct 16 '19 16:10 mengxr

@mengxr Not relevant to Spark 3.0. I created a new PR with some explanation here: https://github.com/tensorflow/ecosystem/pull/144

WeichenXu123 avatar Oct 17 '19 02:10 WeichenXu123

@jhseu If we do not plan to make a new release that is 2.4 compatible, shall we review and merge this PR?

mengxr avatar Mar 31 '20 14:03 mengxr

Hi, we would like to use this library with Spark 2.4 and Scala 2.12.10. Would it be possible to support multiple versions via multiple Maven profiles? I should probably create an issue, but I wanted to ask here as well.
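One way such multi-version support is commonly done is with Maven profiles that override the Spark and Scala versions. A hypothetical sketch (profile ids, property names, and version numbers below are assumptions, not the project's actual pom):

```xml
<!-- Hypothetical profiles: default to Spark 3.0 / Scala 2.12, with an
     opt-in profile for Spark 2.4, selected via mvn -Pspark-2.4. -->
<properties>
  <scala.binary.version>2.12</scala.binary.version>
  <spark.version>3.0.0</spark.version>
</properties>

<profiles>
  <profile>
    <id>spark-2.4</id>
    <properties>
      <spark.version>2.4.5</spark.version>
      <scala.binary.version>2.12</scala.binary.version>
    </properties>
  </profile>
</profiles>
```

The artifactId would then reference ${scala.binary.version} so the published name tracks the selected Scala version.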

vikatskhay avatar Apr 24 '20 07:04 vikatskhay

Now that Spark 3.0.0 is released, I think we also need https://mvnrepository.com/artifact/org.tensorflow/spark-tensorflow-connector_2.12 to be released.

kangnak avatar Jun 22 '20 09:06 kangnak