CaffeOnSpark
Training time on a cluster is higher than training on one machine
I have four g2.8xlarge AWS nodes in my cluster. I converted image data of around 35 GB to LMDB and then to a DataFrame.
I started the training with the following configuration:
spark-submit --master spark://$(hostname):7077 \
    --files /home/ubuntu/caffe/CaffeOnSpark/data/hdfs_data/WALNET/GOOGLE/DF/data_dataframe_train_val.prototxt,/home/ubuntu/caffe/CaffeOnSpark/data/hdfs_data/WALNET/GOOGLE/DF/data_dataframe_quick_solver.prototxt \
    --conf spark.scheduler.maxRegisteredResourcesWaitingTime=30s \
    --conf spark.executor.memory=12g \
    --conf spark.cores.max=${TOTAL_CORES} \
    --conf spark.task.cpus=${CORES_PER_WORKER} \
    --conf spark.driver.extraLibraryPath="${LD_LIBRARY_PATH}" \
    --conf spark.executorEnv.LD_LIBRARY_PATH="${LD_LIBRARY_PATH}" \
    --class com.yahoo.ml.caffe.CaffeOnSpark \
    ${CAFFE_ON_SPARK}/caffe-grid/target/caffe-grid-0.1-SNAPSHOT-jar-with-dependencies.jar \
    -train \
    -test \
    -conf /home/ubuntu/caffe/CaffeOnSpark/data/hdfs_data/WALNET/GOOGLE/DF/data_dataframe_quick_solver.prototxt \
    -clusterSize ${SPARK_WORKER_INSTANCES} \
    -devices ${DEVICES} \
    -connection ethernet \
    -model hdfs://master:9000/data/googlenet/DF/model/data_df_2.model \
    -output hdfs://master:9000/data/googlenet/DF/features/data_df_test_result2
Can you please help me debug this?
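(For reference, the exact values I used may differ, but on four g2.8xlarge nodes, each with 4 GPUs and 32 vCPUs, the environment variables in the command above would typically be set along these lines:)

export SPARK_WORKER_INSTANCES=4        # one Spark worker per g2.8xlarge node (assumption)
export DEVICES=4                       # each g2.8xlarge exposes 4 GPUs
export CORES_PER_WORKER=32             # each g2.8xlarge has 32 vCPUs
export TOTAL_CORES=$((CORES_PER_WORKER * SPARK_WORKER_INSTANCES))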
@abhaymise Did you check the GPU utilization while training?
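For example, you could poll nvidia-smi on each worker while the job is running (the host names below are placeholders):

# Poll GPU utilization and memory every 5 seconds on each worker node
for host in worker1 worker2 worker3 worker4; do
  ssh "$host" 'nvidia-smi --query-gpu=index,utilization.gpu,memory.used --format=csv -l 5' &
done
wait

If utilization stays low, the GPUs are likely starved by data loading or network synchronization rather than compute.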