
Num of Executors gets changed internally

Open abhaymise opened this issue 9 years ago • 4 comments

I set the number of executors and cluster size to match my cluster, but when I run the training the value gets changed internally, and I get an error saying the expected and available numbers of executors differ.

Is there any special rationale for choosing the executor count and cluster size?

As per the source code, it should just read whatever configuration is passed on the command line.

I have a 4-node cluster; each node has 15 GB RAM, 8 cores, and 1 GPU.

abhaymise avatar Jul 27 '16 12:07 abhaymise

I am not aware of this in the source code, as you also observed. Which scheduler (resource manager) are you using? yarn?

junshi15 avatar Jul 27 '16 19:07 junshi15

This is happening in both cases: with YARN as well as with the Spark standalone manager.

Thanks and regards Abhay Kumar https://in.linkedin.com/in/abhay-kumar-99780458

On Thu, Jul 28, 2016 at 12:35 AM, Jun Shi [email protected] wrote:

I am not aware of this in the source code, as you also observed. Which scheduler (resource manager) are you using? yarn?

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/yahoo/CaffeOnSpark/issues/126#issuecomment-235687224, or mute the thread https://github.com/notifications/unsubscribe-auth/AIWia3zMgIkdAD8hoGRtYQUUVbzcBscgks5qZ6vkgaJpZM4JWI1p .

abhaymise avatar Jul 27 '16 19:07 abhaymise

clusterSize actually gets set in the source code. See my answer in https://github.com/yahoo/CaffeOnSpark/issues/125

junshi15 avatar Jul 27 '16 20:07 junshi15

I changed the settings as per your instructions:

Now I am providing only the number of cores and the memory for each executor, and letting the scheduler decide the number of executors.

With these settings, it decides on 8 executors.
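For what it's worth, 8 is exactly what the standalone scheduler's packing arithmetic would predict here. This is a back-of-the-envelope sketch of Spark standalone behavior as I understand it, not code from CaffeOnSpark; the variable names are illustrative, not Spark config keys:

```shell
# Each worker can host as many executors as both its memory and its cores allow.
WORKER_MEM_GB=15; WORKER_CORES=8     # per-node resources from the cluster above
EXEC_MEM_GB=5;    EXEC_CORES=3       # --executor-memory 5G, --executor-cores 3
BY_MEM=$((   WORKER_MEM_GB / EXEC_MEM_GB ))               # 15/5 = 3
BY_CORES=$(( WORKER_CORES  / EXEC_CORES  ))               #  8/3 = 2
PER_WORKER=$(( BY_MEM < BY_CORES ? BY_MEM : BY_CORES ))   # min(3, 2) = 2
echo $(( PER_WORKER * 4 ))                                # 4 workers -> 8 executors
```

So cores, not memory, are the binding constraint: 2 executors per worker times 4 workers gives the 8 executors observed.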

But when I run the training, it fails again with a NullPointerException.

My settings are as follows:

```shell
spark-submit --master spark://$(hostname):7077 \
  --files cifar10_quick_train_test_hdfs.prototxt,cifar10_quick_solver_hdfs.prototxt \
  --executor-memory 5G \
  --executor-cores 3 \
  --conf spark.driver.extraLibraryPath="${LD_LIBRARY_PATH}" \
  --conf spark.executorEnv.LD_LIBRARY_PATH="${LD_LIBRARY_PATH}" \
  --class com.yahoo.ml.caffe.CaffeOnSpark \
  ${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar \
  -train \
  -features accuracy,loss -label label \
  -conf cifar10_quick_solver_hdfs.prototxt \
  -clusterSize 8 \
  -devices ${DEVICES} \
  -connection ethernet \
  -model hdfs://master:9000/CIFAR/model/cifar10_train_df \
  -output hdfs://master:9000/CIFAR/feature/cifar10_train_df
```

The error log from one of the executors is :

```
INFO CoarseGrainedExecutorBackend: Got assigned task 25
16/07/29 14:10:38 INFO Executor: Running task 7.3 in stage 1.0 (TID 25)
16/07/29 14:10:38 ERROR Executor: Exception in task 7.3 in stage 1.0 (TID 25)
java.lang.NullPointerException
    at com.yahoo.ml.caffe.CaffeOnSpark$$anonfun$train$1.apply$mcVI$sp(CaffeOnSpark.scala:153)
    at com.yahoo.ml.caffe.CaffeOnSpark$$anonfun$train$1.apply(CaffeOnSpark.scala:149)
    at com.yahoo.ml.caffe.CaffeOnSpark$$anonfun$train$1.apply(CaffeOnSpark.scala:149)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    at scala.collection.Iterator$class.foreach(Iterator.scala:727)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
    at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
    at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
    at scala.collection.AbstractIterator.to(Iterator.scala:1157)
    at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
    at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
    at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
    at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
    at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:927)
    at org.apache.spark.rdd.RDD$$anonfun$collect$1$$anonfun$12.apply(RDD.scala:927)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
16/07/29 14:10:39 INFO CoarseGrainedExecutorBackend: Driver commanded a shutdown
16/07/29 14:10:39 ERROR CoarseGrainedExecutorBackend: RECEIVED SIGNAL 15: SIGTERM
```
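One thing worth checking (this is an assumption on my part, not something established in the thread): -clusterSize should match the number of executors Spark actually launches, and in standalone mode that count can be pinned with the real spark-submit flag --total-executor-cores rather than left to the scheduler. With one GPU per node, one executor per worker (4 total) may be the intended layout:

```shell
# Sketch under stated assumptions: in standalone mode Spark launches about
# total-executor-cores / executor-cores executors, spread across the workers.
EXECUTOR_CORES=3
TOTAL_EXECUTOR_CORES=12   # would be passed as: --total-executor-cores 12
CLUSTER_SIZE=$(( TOTAL_EXECUTOR_CORES / EXECUTOR_CORES ))
echo "$CLUSTER_SIZE"      # 4 executors, one per worker; pass -clusterSize 4
```

Whether the NullPointerException actually stems from an executor-count mismatch is not settled here; this only shows how to make the two numbers agree.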

Thanks and regards Abhay Kumar https://in.linkedin.com/in/abhay-kumar-99780458


abhaymise avatar Jul 29 '16 14:07 abhaymise