SparkNet
Error while running CifarApp
When I run the CifarApp on a Spark cluster, the following error comes up:
16/06/08 12:50:04 INFO DAGScheduler: ResultStage 14 (foreach at CifarApp.scala:105) failed in 0.040 s
16/06/08 12:50:04 INFO DAGScheduler: Job 8 failed: foreach at CifarApp.scala:105, took 0.049292 s
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 14.0 failed 1 times, most recent failure: Lost task 1.0 in stage 14.0 (TID 43, localhost): java.lang.ArrayIndexOutOfBoundsException
It looks like the failing line is:
workers.foreach(_ => workerStore.get[CaffeSolver]("solver").trainNet.setWeights(broadcastWeights.value))
It's possible that the lookup workerStore.get[CaffeSolver]("solver") is failing. So perhaps try just
workers.foreach(_ => workerStore.get[CaffeSolver]("solver"))
and see if that succeeds or fails.
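One way to make that check more informative is to turn the lookup into one that reports missing keys instead of throwing. Below is a minimal, hedged sketch: the real SparkNet WorkerStore API may differ, so a plain mutable Map stands in for workerStore here so the snippet runs without Spark or Caffe, and SolverLookupCheck and safeGet are names invented for this example.

```scala
// Minimal sketch: a plain mutable Map stands in for SparkNet's per-worker
// store so this runs standalone. The idea is the same on a real worker:
// a safe lookup that yields None (instead of crashing) when no solver
// was ever installed on that worker.
import scala.collection.mutable

object SolverLookupCheck {
  // Stand-in for the per-worker store (hypothetical; not the real API).
  val workerStore = mutable.Map[String, Any]()

  // Safe lookup: returns None when the key is absent rather than throwing.
  def safeGet[T](key: String): Option[T] =
    workerStore.get(key).map(_.asInstanceOf[T])

  def main(args: Array[String]): Unit = {
    // Simulate a worker that never had a solver installed.
    safeGet[String]("solver") match {
      case Some(_) => println("solver present on this worker")
      case None    => println("solver MISSING on this worker")
    }
  }
}
```

Inside the actual workers.foreach you could log the worker's hostname alongside the MISSING case, which would tell you exactly which executor lacks a net.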
If that is failing, it may be that some worker does not have a net on it. How many nodes are you using? And what are you passing into CifarApp for the number of workers?
@robertnishihara Did that, but the following errors were raised:
F1009 04:25:49.868021 8028 split_layer.cpp:21] Check failed: count_ == top[i]->count() (100 vs. 1000000)
*** Check failure stack trace: ***
F1009 04:25:49.868021 8027 split_layer.cpp:21] Check failed: count_ == top[i]->count() (100 vs. 1000000)
F1009 04:25:49.868021 8029 blob.cpp:21] Check failed: count_ == other.count() (1000000 vs. 100)
*** Check failure stack trace: ***
Aborted