
How to solve py4j.protocol.Py4JJavaError

Cancerhzc opened this issue 3 years ago · 3 comments

I am running a test using the example at https://github.com/intel-analytics/analytics-zoo/blob/master/pyzoo/zoo/examples/tensorflow/tfpark/keras/keras_dataset.py

Because I am running on a YARN cluster, I changed part of the source code:

#from zoo import init_nncontext
from zoo import init_spark_on_yarn

sc = init_spark_on_yarn(
        hadoop_conf="/home/hzc/hadoop-3.2.2/etc/hadoop",
        conda_name="zoo", # The name of the created conda-env
        num_executors=3,
        executor_cores=3,
        executor_memory="5g",
        driver_memory="8g",
        driver_cores=3
        )
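For context, here is a minimal sketch of how the conda environment referenced by `conda_name="zoo"` could be set up. The environment name and package versions are taken from the environment list reported below in this issue; this is an illustration, not the exact commands I ran.

```shell
# Hypothetical setup for the conda env referenced by conda_name="zoo";
# versions mirror the pip list reported later in this issue.
conda create -y -n zoo python=3.6
conda activate zoo
pip install analytics-zoo==0.9.0 tensorflow==1.15.0
```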

But during training, right after "Cache thread models" is logged, the following error appears:

2021-03-24 15:37:49 ERROR TaskSetManager:70 - Task 1 in stage 5.0 failed 4 times; aborting job
Traceback (most recent call last):
  File "test0323.py", line 96, in <module>
    main(max_epoch)
  File "test0323.py", line 73, in main
    distributed=True)
  File "/home/hzc/anaconda3/envs/zoo/lib/python3.6/site-packages/zoo/tfpark/model.py", line 143, in fit
    self._fit_distributed(x, epochs, **kwargs)
  File "/home/hzc/anaconda3/envs/zoo/lib/python3.6/site-packages/zoo/tfpark/model.py", line 167, in _fit_distributed
    self.tf_optimizer.optimize(MaxEpoch(epochs))
  File "/home/hzc/anaconda3/envs/zoo/lib/python3.6/site-packages/zoo/tfpark/tf_optimizer.py", line 755, in optimize
    validation_method=self.tf_model.val_methods)
  File "/home/hzc/anaconda3/envs/zoo/lib/python3.6/site-packages/zoo/pipeline/estimator/estimator.py", line 168, in train_minibatch
    validation_method)
  File "/home/hzc/anaconda3/envs/zoo/lib/python3.6/site-packages/zoo/common/utils.py", line 133, in callZooFunc
    raise e
  File "/home/hzc/anaconda3/envs/zoo/lib/python3.6/site-packages/zoo/common/utils.py", line 127, in callZooFunc
    java_result = api(*args)
  File "/home/hzc/anaconda3/envs/zoo/lib/python3.6/site-packages/py4j/java_gateway.py", line 1257, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/home/hzc/anaconda3/envs/zoo/lib/python3.6/site-packages/py4j/protocol.py", line 328, in get_return_value
    format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o69.estimatorTrainMiniBatch.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 5.0 failed 4 times, most recent failure: Lost task 1.3 in stage 5.0 (TID 33, slave5, executor 2): java.io.EOFException
    at java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2797)
    at java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:3272)
    at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:932)
    at java.io.ObjectInputStream.<init>(ObjectInputStream.java:394)
    at com.intel.analytics.zoo.tfpark.ModelInfoObjectInputStream.<init>(TFModelBroadcast.scala:175)
    at com.intel.analytics.zoo.tfpark.ModelInfo.readObject(TFModelBroadcast.scala:163)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1185)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2294)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2185)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1665)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:501)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:459)
    at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:75)
    at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$8.apply(TorrentBroadcast.scala:308)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
    at org.apache.spark.broadcast.TorrentBroadcast$.unBlockifyObject(TorrentBroadcast.scala:309)
    at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1$$anonfun$apply$2.apply(TorrentBroadcast.scala:235)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:211)
    at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1326)
    at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:207)
    at org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:66)
    at org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:66)
    at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:96)
    at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
    at com.intel.analytics.zoo.tfpark.TFModelBroadcast.value(TFModelBroadcast.scala:111)
    at com.intel.analytics.bigdl.optim.DistriOptimizer$$anonfun$11$$anonfun$12.apply(DistriOptimizer.scala:612)
    at com.intel.analytics.bigdl.optim.DistriOptimizer$$anonfun$11$$anonfun$12.apply(DistriOptimizer.scala:611)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
    at scala.collection.immutable.Range.foreach(Range.scala:160)
    at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
    at scala.collection.AbstractTraversable.map(Traversable.scala:104)
    at com.intel.analytics.bigdl.optim.DistriOptimizer$$anonfun$11.apply(DistriOptimizer.scala:611)
    at com.intel.analytics.bigdl.optim.DistriOptimizer$$anonfun$11.apply(DistriOptimizer.scala:592)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:801)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:801)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:337)
    at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:335)
    at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1165)
    at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156)
    at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091)
    at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156)
    at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:882)
    at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:335)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:286)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:121)
    at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1889)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1877)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1876)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1876)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:926)
    at scala.Option.foreach(Option.scala:257)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:926)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2110)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2059)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2048)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:737)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2082)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2101)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2126)
    at org.apache.spark.rdd.RDD.count(RDD.scala:1168)
    at com.intel.analytics.bigdl.optim.DistriOptimizer$.com$intel$analytics$bigdl$optim$DistriOptimizer$$initThreadModels(DistriOptimizer.scala:650)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at com.intel.analytics.zoo.pipeline.api.keras.layers.utils.KerasUtils$.invokeMethod(KerasUtils.scala:302)
    at com.intel.analytics.zoo.pipeline.api.keras.layers.utils.KerasUtils$.invokeMethodWithEv(KerasUtils.scala:329)
    at com.intel.analytics.zoo.pipeline.api.keras.models.InternalOptimizerUtil$.initThreadModels(Topology.scala:1043)
    at com.intel.analytics.zoo.pipeline.api.keras.models.InternalDistriOptimizer.train(Topology.scala:1237)
    at com.intel.analytics.zoo.pipeline.api.keras.models.InternalDistriOptimizer.train(Topology.scala:1472)
    at com.intel.analytics.zoo.pipeline.api.keras.models.InternalDistriOptimizer.train(Topology.scala:1145)
    at com.intel.analytics.zoo.pipeline.estimator.Estimator.train(Estimator.scala:190)
    at com.intel.analytics.zoo.pipeline.estimator.python.PythonEstimator.estimatorTrainMiniBatch(PythonEstimator.scala:117)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.EOFException
    at java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2797)
    at java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:3272)
    at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:932)
    at java.io.ObjectInputStream.<init>(ObjectInputStream.java:394)
    at com.intel.analytics.zoo.tfpark.ModelInfoObjectInputStream.<init>(TFModelBroadcast.scala:175)
    at com.intel.analytics.zoo.tfpark.ModelInfo.readObject(TFModelBroadcast.scala:163)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1185)
    at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2294)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2185)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1665)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:501)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:459)
    at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:75)
    at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$8.apply(TorrentBroadcast.scala:308)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
    at org.apache.spark.broadcast.TorrentBroadcast$.unBlockifyObject(TorrentBroadcast.scala:309)
    at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1$$anonfun$apply$2.apply(TorrentBroadcast.scala:235)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:211)
    at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1326)
    at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:207)
    at org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:66)
    at org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:66)
    at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:96)
    at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
    at com.intel.analytics.zoo.tfpark.TFModelBroadcast.value(TFModelBroadcast.scala:111)
    at com.intel.analytics.bigdl.optim.DistriOptimizer$$anonfun$11$$anonfun$12.apply(DistriOptimizer.scala:612)
    at com.intel.analytics.bigdl.optim.DistriOptimizer$$anonfun$11$$anonfun$12.apply(DistriOptimizer.scala:611)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
    at scala.collection.immutable.Range.foreach(Range.scala:160)
    at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
    at scala.collection.AbstractTraversable.map(Traversable.scala:104)
    at com.intel.analytics.bigdl.optim.DistriOptimizer$$anonfun$11.apply(DistriOptimizer.scala:611)
    at com.intel.analytics.bigdl.optim.DistriOptimizer$$anonfun$11.apply(DistriOptimizer.scala:592)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:801)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:801)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:337)
    at org.apache.spark.rdd.RDD$$anonfun$7.apply(RDD.scala:335)
    at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1165)
    at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156)
    at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091)
    at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156)
    at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:882)
    at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:335)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:286)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:121)
    at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    ... 1 more

The following is my environment: Java 1.8.0_271, Hadoop 3.2.2, Scala 2.12.13, Anaconda 3 (Python 3.6.13).

Package              Version
absl-py              0.12.0
analytics-zoo        0.9.0
astor                0.8.1
BigDL                0.12.1
cached-property      1.5.2
certifi              2020.12.5
conda-pack           0.3.1
gast                 0.2.2
google-pasta         0.2.0
grpcio               1.36.1
h5py                 3.1.0
importlib-metadata   3.7.3
Keras-Applications   1.0.8
Keras-Preprocessing  1.1.2
Markdown             3.3.4
numpy                1.19.5
opt-einsum           3.3.0
pip                  21.0.1
protobuf             3.15.6
py4j                 0.10.7
pyspark              2.4.3
setuptools           52.0.0.post20210125
six                  1.15.0
tensorboard          1.15.0
tensorflow           1.15.0
tensorflow-estimator 1.15.1
termcolor            1.1.0
typing-extensions    3.7.4.3
Werkzeug             1.0.1
wheel                0.36.2
wrapt                1.12.1
zipp                 3.4.1

Cancerhzc avatar Mar 24 '21 08:03 Cancerhzc

@jenniew could you help take a look at this? It seems to be related to TFModelBroadcast.

yangw1234 avatar Mar 24 '21 11:03 yangw1234

@Cancerhzc This issue has been fixed recently. Please pip install the latest nightly build wheels into your conda environment and try again. You can refer to https://analytics-zoo.github.io/master/#PythonUserGuide/install/#install-the-latest-nightly-build-wheels-for-pip to download the nightly build wheel.
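For example, something along these lines should work (a sketch only; check the install guide linked above for the exact command or wheel URL, since the nightly-build distribution mechanism may differ):

```shell
# Remove the released versions first so the nightly wheels take effect
pip uninstall -y analytics-zoo bigdl
# Install the latest nightly (pre-release) build of Analytics Zoo
pip install --pre --upgrade analytics-zoo
```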

jenniew avatar Mar 25 '21 01:03 jenniew

@Cancerhzc The issue is fixed; please test again. If you have any more questions, please let us know, or we may close this issue soon.

helenlly avatar Nov 25 '21 06:11 helenlly