CaffeOnSpark
Training time on a cluster is higher than training on one machine
I have four g2.8xlarge AWS nodes in my cluster. I converted image data of around 35 GB to LMDB and then to a DataFrame.
I started the training with the following configuration:
spark-submit --master spark://$(hostname):7077 \
    --files /home/ubuntu/caffe/CaffeOnSpark/data/hdfs_data/WALNET/GOOGLE/DF/data_dataframe_train_val.prototxt,/home/ubuntu/caffe/CaffeOnSpark/data/hdfs_data/WALNET/GOOGLE/DF/data_dataframe_quick_solver.prototxt \
    --conf spark.scheduler.maxRegisteredResourcesWaitingTime=30s \
    --conf spark.executor.memory=12g \
    --conf spark.cores.max=${TOTAL_CORES} \
    --conf spark.task.cpus=${CORES_PER_WORKER} \
    --conf spark.driver.extraLibraryPath="${LD_LIBRARY_PATH}" \
    --conf spark.executorEnv.LD_LIBRARY_PATH="${LD_LIBRARY_PATH}" \
    --class com.yahoo.ml.caffe.CaffeOnSpark \
    ${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar \
    -train \
    -test \
    -conf /home/ubuntu/caffe/CaffeOnSpark/data/hdfs_data/WALNET/GOOGLE/DF/data_dataframe_quick_solver.prototxt \
    -clusterSize ${SPARK_WORKER_INSTANCES} \
    -devices ${DEVICES} \
    -connection ethernet \
    -model hdfs://master:9000/data/googlenet/DF/model/data_df_2.model \
    -output hdfs://master:9000/data/googlenet/DF/features/data_df_test_result2
Can you please help me debug this?
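(For reference, the exact values I used may differ, but on four g2.8xlarge nodes, each with 4 GPUs and 32 vCPUs, the environment variables in the command above would typically be set along these lines:)

export SPARK_WORKER_INSTANCES=4        # one Spark worker per g2.8xlarge node (assumption)
export DEVICES=4                       # each g2.8xlarge exposes 4 GPUs
export CORES_PER_WORKER=32             # each g2.8xlarge has 32 vCPUs
export TOTAL_CORES=$((CORES_PER_WORKER * SPARK_WORKER_INSTANCES))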
@abhaymise Did you check the GPU utilization while training?
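For example, you could poll nvidia-smi on each worker while the job is running (the host names below are placeholders):

# Poll GPU utilization and memory every 5 seconds on each worker node
for host in worker1 worker2 worker3 worker4; do
  ssh "$host" 'nvidia-smi --query-gpu=index,utilization.gpu,memory.used --format=csv -l 5' &
done
wait

If utilization stays low, the GPUs are likely starved by data loading or network synchronization rather than compute.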