
How to use CaffeOnSpark

Open githubier opened this issue 8 years ago • 8 comments

I want to use CaffeNet to train my data. I have used it to train my data on plain Caffe before, but I don't know how to use this model in CaffeOnSpark. The wiki only shows how to train a DNN, so I want to know how to modify the spark-submit command to use the CaffeNet model, or some other way to use it.

githubier avatar Jan 10 '17 08:01 githubier

I modified the spark-submit example from the wiki as below, to train with CaffeNet:

spark-submit --master ${MASTER_URL} \
    --files ${CAFFE_ON_SPARK}/caffe-public/models/bvlc_reference_caffenet/solver.prototxt,${CAFFE_ON_SPARK}/caffe-public/models/bvlc_reference_caffenet/train_val.prototxt \
    --conf spark.cores.max=${TOTAL_CORES} \
    --conf spark.task.cpus=${CORES_PER_WORKER} \
    --conf spark.driver.extraLibraryPath="${LD_LIBRARY_PATH}" \
    --conf spark.executorEnv.LD_LIBRARY_PATH="${LD_LIBRARY_PATH}" \
    --class com.yahoo.ml.caffe.CaffeOnSpark \
    ${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar \
        -train \
        -features accuracy,loss -label label \
        -conf solver.prototxt \
        -clusterSize ${SPARK_WORKER_INSTANCES} \
        -devices 1 \
        -connection ethernet \
        -model file:${CAFFE_ON_SPARK}/data/myself/myself_caffenet.model \
        -output file:${CAFFE_ON_SPARK}/data/myself/myself_result

The solver.prototxt and train_val.prototxt are in ${CAFFE_ON_SPARK}/caffe-public/models/bvlc_reference_caffenet, and my data is in ${CAFFE_ON_SPARK}/data/myself. I would like to know whether my change is right. When I run the command above, it shows me the error below:

17/01/11 14:27:19 INFO spark.SparkContext: Running Spark version 1.5.1
17/01/11 14:27:19 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/01/11 14:27:19 WARN spark.SparkConf: SPARK_WORKER_INSTANCES was detected (set to '1'). This is deprecated in Spark 1.0+.

Please instead use:

  • ./spark-submit with --num-executors to specify the number of executors
  • Or set SPARK_EXECUTOR_INSTANCES
  • spark.executor.instances to configure the number of instances in the spark config.
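(The remedy the warning asks for can be passed straight on the command line; a minimal sketch — the value 3 is illustrative, matching the three workers in this cluster, and is not from the original post:)

```
spark-submit --master ${MASTER_URL} \
    --conf spark.executor.instances=3 \
    ...
```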

17/01/11 14:27:19 INFO spark.SecurityManager: Changing view acls to: master
17/01/11 14:27:19 INFO spark.SecurityManager: Changing modify acls to: master
17/01/11 14:27:19 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(master); users with modify permissions: Set(master)
17/01/11 14:27:20 INFO slf4j.Slf4jLogger: Slf4jLogger started
17/01/11 14:27:20 INFO Remoting: Starting remoting
17/01/11 14:27:20 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://[email protected]:46792]
17/01/11 14:27:20 INFO util.Utils: Successfully started service 'sparkDriver' on port 46792.
17/01/11 14:27:20 INFO spark.SparkEnv: Registering MapOutputTracker
17/01/11 14:27:20 INFO spark.SparkEnv: Registering BlockManagerMaster
17/01/11 14:27:20 INFO storage.DiskBlockManager: Created local directory at /home/master/Downloads/spark_sdk/spark-1.5.1/blockmgr-0a1f42d3-9d6e-441d-b223-0f7b60df7607
17/01/11 14:27:20 INFO storage.MemoryStore: MemoryStore started with capacity 530.3 MB
17/01/11 14:27:20 INFO spark.HttpFileServer: HTTP File server directory is /home/master/Downloads/spark_sdk/spark-1.5.1/spark-df60b6b8-cbc9-45f4-a001-004923b2196e/httpd-1aed9920-fae1-46b8-a4d1-66ffaba49548
17/01/11 14:27:20 INFO spark.HttpServer: Starting HTTP Server
17/01/11 14:27:20 INFO server.Server: jetty-8.y.z-SNAPSHOT
17/01/11 14:27:20 INFO server.AbstractConnector: Started [email protected]:36673
17/01/11 14:27:20 INFO util.Utils: Successfully started service 'HTTP file server' on port 36673.
17/01/11 14:27:20 INFO spark.SparkEnv: Registering OutputCommitCoordinator
17/01/11 14:27:20 INFO server.Server: jetty-8.y.z-SNAPSHOT
17/01/11 14:27:20 INFO server.AbstractConnector: Started [email protected]:4040
17/01/11 14:27:20 INFO util.Utils: Successfully started service 'SparkUI' on port 4040.
17/01/11 14:27:20 INFO ui.SparkUI: Started SparkUI at http://192.168.1.102:4040
17/01/11 14:27:20 INFO spark.SparkContext: Added JAR file:/home/master/Downloads/spark_sdk/CaffeOnSpark/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar at http://192.168.1.102:36673/jars/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar with timestamp 1484116040832
17/01/11 14:27:20 INFO util.Utils: Copying /home/master/Downloads/spark_sdk/CaffeOnSpark/caffe-public/models/bvlc_reference_caffenet/solver.prototxt to /home/master/Downloads/spark_sdk/spark-1.5.1/spark-df60b6b8-cbc9-45f4-a001-004923b2196e/userFiles-c70014b3-28ad-4b30-919f-35815b830b2f/solver.prototxt
17/01/11 14:27:20 INFO spark.SparkContext: Added file file:/home/master/Downloads/spark_sdk/CaffeOnSpark/caffe-public/models/bvlc_reference_caffenet/solver.prototxt at http://192.168.1.102:36673/files/solver.prototxt with timestamp 1484116040918
17/01/11 14:27:20 INFO util.Utils: Copying /home/master/Downloads/spark_sdk/CaffeOnSpark/caffe-public/models/bvlc_reference_caffenet/train_val.prototxt to /home/master/Downloads/spark_sdk/spark-1.5.1/spark-df60b6b8-cbc9-45f4-a001-004923b2196e/userFiles-c70014b3-28ad-4b30-919f-35815b830b2f/train_val.prototxt
17/01/11 14:27:20 INFO spark.SparkContext: Added file file:/home/master/Downloads/spark_sdk/CaffeOnSpark/caffe-public/models/bvlc_reference_caffenet/train_val.prototxt at http://192.168.1.102:36673/files/train_val.prototxt with timestamp 1484116040924
17/01/11 14:27:20 WARN metrics.MetricsSystem: Using default name DAGScheduler for source because spark.app.id is not set.
17/01/11 14:27:21 INFO client.AppClient$ClientEndpoint: Connecting to master spark://master:7077...
17/01/11 14:27:21 INFO cluster.SparkDeploySchedulerBackend: Connected to Spark cluster with app ID app-20170111142721-0004
17/01/11 14:27:21 INFO client.AppClient$ClientEndpoint: Executor added: app-20170111142721-0004/0 on worker-20170111005531-192.168.1.104-50421 (192.168.1.104:50421) with 2 cores
17/01/11 14:27:21 INFO cluster.SparkDeploySchedulerBackend: Granted executor ID app-20170111142721-0004/0 on hostPort 192.168.1.104:50421 with 2 cores, 1024.0 MB RAM
17/01/11 14:27:21 INFO client.AppClient$ClientEndpoint: Executor added: app-20170111142721-0004/1 on worker-20170111005518-192.168.1.103-41380 (192.168.1.103:41380) with 1 cores
17/01/11 14:27:21 INFO cluster.SparkDeploySchedulerBackend: Granted executor ID app-20170111142721-0004/1 on hostPort 192.168.1.103:41380 with 1 cores, 1024.0 MB RAM
17/01/11 14:27:21 INFO client.AppClient$ClientEndpoint: Executor added: app-20170111142721-0004/2 on worker-20170111135519-192.168.1.102-38652 (192.168.1.102:38652) with 1 cores
17/01/11 14:27:21 INFO cluster.SparkDeploySchedulerBackend: Granted executor ID app-20170111142721-0004/2 on hostPort 192.168.1.102:38652 with 1 cores, 1024.0 MB RAM
17/01/11 14:27:21 INFO client.AppClient$ClientEndpoint: Executor updated: app-20170111142721-0004/0 is now LOADING
17/01/11 14:27:21 INFO client.AppClient$ClientEndpoint: Executor updated: app-20170111142721-0004/2 is now LOADING
17/01/11 14:27:21 INFO client.AppClient$ClientEndpoint: Executor updated: app-20170111142721-0004/1 is now LOADING
17/01/11 14:27:21 INFO client.AppClient$ClientEndpoint: Executor updated: app-20170111142721-0004/0 is now RUNNING
17/01/11 14:27:21 INFO client.AppClient$ClientEndpoint: Executor updated: app-20170111142721-0004/1 is now RUNNING
17/01/11 14:27:21 INFO client.AppClient$ClientEndpoint: Executor updated: app-20170111142721-0004/2 is now RUNNING
17/01/11 14:27:21 INFO util.Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 44782.
17/01/11 14:27:21 INFO netty.NettyBlockTransferService: Server created on 44782
17/01/11 14:27:21 INFO storage.BlockManagerMaster: Trying to register BlockManager
17/01/11 14:27:21 INFO storage.BlockManagerMasterEndpoint: Registering block manager 192.168.1.102:44782 with 530.3 MB RAM, BlockManagerId(driver, 192.168.1.102, 44782)
17/01/11 14:27:21 INFO storage.BlockManagerMaster: Registered BlockManager
17/01/11 14:27:22 INFO cluster.SparkDeploySchedulerBackend: Registered executor: AkkaRpcEndpointRef(Actor[akka.tcp://[email protected]:38869/user/Executor#-249835689]) with ID 1
17/01/11 14:27:22 INFO storage.BlockManagerMasterEndpoint: Registering block manager 192.168.1.103:37499 with 530.3 MB RAM, BlockManagerId(1, 192.168.1.103, 37499)
17/01/11 14:27:23 INFO cluster.SparkDeploySchedulerBackend: Registered executor: AkkaRpcEndpointRef(Actor[akka.tcp://[email protected]:38675/user/Executor#1450319069]) with ID 2
17/01/11 14:27:23 INFO storage.BlockManagerMasterEndpoint: Registering block manager 192.168.1.102:38538 with 530.3 MB RAM, BlockManagerId(2, 192.168.1.102, 38538)
17/01/11 14:27:23 INFO cluster.SparkDeploySchedulerBackend: Registered executor: AkkaRpcEndpointRef(Actor[akka.tcp://[email protected]:39565/user/Executor#-1870584609]) with ID 0
17/01/11 14:27:23 INFO cluster.SparkDeploySchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 1.0
17/01/11 14:27:23 INFO storage.BlockManagerMasterEndpoint: Registering block manager 192.168.1.104:59533 with 530.3 MB RAM, BlockManagerId(0, 192.168.1.104, 59533)

Exception in thread "main" java.io.FileNotFoundException: solver.prototxt (No such file or directory)
	at java.io.FileInputStream.open(Native Method)
	at java.io.FileInputStream.<init>(FileInputStream.java:146)
	at java.io.FileInputStream.<init>(FileInputStream.java:101)
	at java.io.FileReader.<init>(FileReader.java:58)
	at com.yahoo.ml.jcaffe.Utils.GetSolverParam(Utils.java:14)
	at com.yahoo.ml.caffe.Config.protoFile_$eq(Config.scala:64)
	at com.yahoo.ml.caffe.Config.<init>(Config.scala:366)
	at com.yahoo.ml.caffe.CaffeOnSpark$.main(CaffeOnSpark.scala:34)
	at com.yahoo.ml.caffe.CaffeOnSpark.main(CaffeOnSpark.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:672)
	at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

17/01/11 14:27:23 INFO spark.SparkContext: Invoking stop() from shutdown hook
17/01/11 14:27:23 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/metrics/json,null}
17/01/11 14:27:23 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/stage/kill,null}
17/01/11 14:27:23 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/api,null}
17/01/11 14:27:23 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/,null}
17/01/11 14:27:23 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/static,null}
17/01/11 14:27:23 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/threadDump/json,null}
17/01/11 14:27:23 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/threadDump,null}
17/01/11 14:27:23 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/json,null}
17/01/11 14:27:23 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors,null}
17/01/11 14:27:23 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/environment/json,null}
17/01/11 14:27:23 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/environment,null}
17/01/11 14:27:23 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage/rdd/json,null}
17/01/11 14:27:23 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage/rdd,null}
17/01/11 14:27:23 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage/json,null}
17/01/11 14:27:23 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage,null}
17/01/11 14:27:23 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/pool/json,null}
17/01/11 14:27:23 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/pool,null}
17/01/11 14:27:23 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/stage/json,null}
17/01/11 14:27:23 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/stage,null}
17/01/11 14:27:23 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/json,null}
17/01/11 14:27:23 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages,null}
17/01/11 14:27:23 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs/job/json,null}
17/01/11 14:27:23 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs/job,null}
17/01/11 14:27:23 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs/json,null}
17/01/11 14:27:23 INFO handler.ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs,null}
17/01/11 14:27:23 INFO ui.SparkUI: Stopped Spark web UI at http://192.168.1.102:4040
17/01/11 14:27:23 INFO scheduler.DAGScheduler: Stopping DAGScheduler
17/01/11 14:27:23 INFO cluster.SparkDeploySchedulerBackend: Shutting down all executors
17/01/11 14:27:23 INFO cluster.SparkDeploySchedulerBackend: Asking each executor to shut down
17/01/11 14:27:23 INFO spark.MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
17/01/11 14:27:23 INFO storage.MemoryStore: MemoryStore cleared
17/01/11 14:27:23 INFO storage.BlockManager: BlockManager stopped
17/01/11 14:27:23 INFO storage.BlockManagerMaster: BlockManagerMaster stopped
17/01/11 14:27:23 INFO scheduler.OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
17/01/11 14:27:23 INFO spark.SparkContext: Successfully stopped SparkContext
17/01/11 14:27:23 INFO util.ShutdownHookManager: Shutdown hook called
17/01/11 14:27:23 INFO util.ShutdownHookManager: Deleting directory /home/master/Downloads/spark_sdk/spark-1.5.1/spark-df60b6b8-cbc9-45f4-a001-004923b2196e

Can someone help me?

githubier avatar Jan 11 '17 06:01 githubier

Were you able to run the examples in the wiki? Your command appears to be correct, but Spark was complaining about not being able to find solver.prototxt.
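(Editor's note: the stack trace shows the driver opening solver.prototxt with a plain java.io.FileReader, i.e. relative to the driver's working directory, so "-conf solver.prototxt" only resolves if the file is visible from the directory spark-submit is launched in. A quick pre-submit sanity check can catch this; the sketch below builds a throwaway layout to stand in for a real ${CAFFE_ON_SPARK} tree — the paths and solver contents are illustrative, not the asker's real files:)

```shell
# Illustration only: throwaway tree standing in for ${CAFFE_ON_SPARK}.
CAFFE_ON_SPARK="$(mktemp -d)"
mkdir -p "${CAFFE_ON_SPARK}/data"
printf 'net: "train_val.prototxt"\nmax_iter: 100\n' > "${CAFFE_ON_SPARK}/data/solver.prototxt"

SOLVER="${CAFFE_ON_SPARK}/data/solver.prototxt"

# 1. The solver must exist locally, or --files has nothing to distribute
#    and the driver's FileReader fails exactly as in the trace above.
if [ -f "$SOLVER" ]; then echo "solver found"; else echo "solver missing"; fi

# 2. The net: entry inside the solver is opened too, so check what it
#    points at; a bare basename works when that file is shipped via --files.
grep -o 'net *: *"[^"]*"' "$SOLVER"
```

Running spark-submit from a directory containing solver.prototxt (or giving -conf an absolute path) should avoid this particular FileNotFoundException.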

junshi15 avatar Jan 11 '17 23:01 junshi15

Sure, I have run the examples successfully. But I don't know why this command fails, since solver.prototxt is present at the path given in my command.

githubier avatar Jan 12 '17 02:01 githubier

Glad to know the examples worked for you. I don't know why your command failed.

junshi15 avatar Jan 12 '17 05:01 junshi15

Thank you for your help. I moved my solver.prototxt and train_val.prototxt to ${CAFFE_ON_SPARK}/data/, so the spark-submit command is:

spark-submit --master ${MASTER_URL} \
    --files ${CAFFE_ON_SPARK}/data/solver.prototxt,${CAFFE_ON_SPARK}/data/train_val.prototxt \
    --conf spark.cores.max=${TOTAL_CORES} \
    --conf spark.task.cpus=${CORES_PER_WORKER} \
    --conf spark.driver.extraLibraryPath="${LD_LIBRARY_PATH}" \
    --conf spark.executorEnv.LD_LIBRARY_PATH="${LD_LIBRARY_PATH}" \
    --class com.yahoo.ml.caffe.CaffeOnSpark \
    ${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar \
        -train \
        -features accuracy,loss -label label \
        -conf solver.prototxt \
        -clusterSize ${SPARK_WORKER_INSTANCES} \
        -devices 1 \
        -connection ethernet \
        -model file:${CAFFE_ON_SPARK}/myself_caffenet.model \
        -output file:${CAFFE_ON_SPARK}/myself_result

However, there is still an error:

17/01/11 20:49:34 ERROR caffe.DataSource$: source_class must be defined for input data layer:Data
Exception in thread "main" java.lang.NullPointerException
	at com.yahoo.ml.caffe.CaffeOnSpark.train(CaffeOnSpark.scala:103)
	at com.yahoo.ml.caffe.CaffeOnSpark$.main(CaffeOnSpark.scala:40)
	at com.yahoo.ml.caffe.CaffeOnSpark.main(CaffeOnSpark.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:672)
	at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
17/01/11 20:49:34 INFO spark.SparkContext: Invoking stop() from shutdown hook

It makes me sad; I don't know why it throws the NullPointerException.

githubier avatar Jan 12 '17 05:01 githubier

Should I use the caffenet_train_net.prototxt in ${CAFFE_ON_SPARK}/data instead of the train_val.prototxt in ${CAFFE_ON_SPARK}/caffe-public/models/bvlc_reference_caffenet? And should I change the mean in caffenet_train_net.prototxt?

githubier avatar Jan 12 '17 07:01 githubier

You did not define source_class. Depending on your source data format, you need to tell CaffeOnSpark about it; see e.g. https://github.com/yahoo/CaffeOnSpark/blob/master/data/lenet_cos_train_test.prototxt#L10-L12
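(Editor's note for readers hitting the same NullPointerException: the fix is to declare the reader class on the input data layer of the net prototxt. A rough sketch of the kind of declaration involved — consult the linked lines for the exact field placement; com.yahoo.ml.caffe.LMDB is CaffeOnSpark's LMDB reader and assumes your training data is in LMDB format, and the elided fields depend on your net:)

```
layer {
  name: "data"
  type: "MemoryData"
  source_class: "com.yahoo.ml.caffe.LMDB"
  ...
}
```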

junshi15 avatar Jan 13 '17 15:01 junshi15

Hello, I'm new to CaffeOnSpark. I have the same question: how do I use my own model to detect images? Have you solved it? Could you give me an example? Thanks!

rosszh avatar Apr 21 '17 07:04 rosszh