SparkNet
Error while running CifarApp
When I run the CifarApp on a Spark cluster, the following error comes up:
16/06/08 12:50:04 INFO DAGScheduler: ResultStage 14 (foreach at CifarApp.scala:105) failed in 0.040 s
16/06/08 12:50:04 INFO DAGScheduler: Job 8 failed: foreach at CifarApp.scala:105, took 0.049292 s
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 14.0 failed 1 times, most recent failure: Lost task 1.0 in stage 14.0 (TID 43, localhost): java.lang.ArrayIndexOutOfBoundsException
It looks like the failing line is:
workers.foreach(_ => workerStore.get[CaffeSolver]("solver").trainNet.setWeights(broadcastWeights.value))
It's possible that the lookup workerStore.get[CaffeSolver]("solver") is failing. So perhaps try just
workers.foreach(_ => workerStore.get[CaffeSolver]("solver"))
and see if that succeeds or fails.
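One way to make that check more informative is to turn the lookup into one that reports missing keys instead of throwing. Below is a minimal, hedged sketch: the real SparkNet WorkerStore API may differ, so a plain mutable Map stands in for workerStore here so the snippet runs without Spark or Caffe, and SolverLookupCheck and safeGet are names invented for this example.

```scala
// Minimal sketch: a plain mutable Map stands in for SparkNet's per-worker
// store so this runs standalone. The idea is the same on a real worker:
// a safe lookup that yields None (instead of crashing) when no solver
// was ever installed on that worker.
import scala.collection.mutable

object SolverLookupCheck {
  // Stand-in for the per-worker store (hypothetical; not the real API).
  val workerStore = mutable.Map[String, Any]()

  // Safe lookup: returns None when the key is absent rather than throwing.
  def safeGet[T](key: String): Option[T] =
    workerStore.get(key).map(_.asInstanceOf[T])

  def main(args: Array[String]): Unit = {
    // Simulate a worker that never had a solver installed.
    safeGet[String]("solver") match {
      case Some(_) => println("solver present on this worker")
      case None    => println("solver MISSING on this worker")
    }
  }
}
```

Inside the actual workers.foreach you could log the worker's hostname alongside the MISSING case, which would tell you exactly which executor lacks a net.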
If that is failing, it may be that some worker does not have a net on it. How many nodes are you using? And what are you passing into CifarApp for the number of workers?
@robertnishihara Did that, but the following errors were raised:
F1009 04:25:49.868021 8028 split_layer.cpp:21] Check failed: count_ == top[i]->count() (100 vs. 1000000)
*** Check failure stack trace: ***
F1009 04:25:49.868021 8027 split_layer.cpp:21] Check failed: count_ == top[i]->count() (100 vs. 1000000)
F1009 04:25:49.868021 8029 blob.cpp:21] Check failed: count_ == other.count() (1000000 vs. 100)
*** Check failure stack trace: ***
Aborted