Jun Shi

Results 115 comments of Jun Shi

The error says "Requested # of executors: 2 actual # of executors:1", so only 1 executor was available. Some of the cluster or job settings are probably incorrect. Check CORES_PER_WORKER, TOTAL_CORES, etc.
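As a rough sanity check, here is a minimal sketch of the arithmetic behind those settings. It assumes each executor takes a fixed number of cores out of a total core budget; the helper name is hypothetical, and this is a simplification rather than Spark's actual scheduling logic.

```python
def expected_executors(total_cores: int, cores_per_worker: int) -> int:
    """Rough estimate of how many executors a core budget can hold.

    Hypothetical helper: assumes every executor takes exactly
    cores_per_worker cores out of a budget of total_cores.
    """
    if cores_per_worker <= 0:
        raise ValueError("cores_per_worker must be positive")
    return total_cores // cores_per_worker


# Example values are assumptions chosen to mirror the error message above.
requested = 2
actual = expected_executors(total_cores=1, cores_per_worker=1)
if actual < requested:
    print(f"Requested {requested} executors but settings allow only {actual}")
```

If the estimate is lower than the cluster size you request, the core settings are the first thing to revisit.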

First, distributed training does not help in all cases. As you add more and more nodes to the cluster, communication cost increases. This is especially true if your model is...

This is synchronous training. The speed is limited by the slowest executor.
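To illustrate the point, a minimal sketch under the assumption that every executor must finish its batch before the next step starts. The function names and timings are hypothetical, not measurements.

```python
def sync_step_time(executor_seconds):
    """Time for one synchronous step: everyone waits for the slowest executor."""
    if not executor_seconds:
        raise ValueError("need at least one executor timing")
    return max(executor_seconds)


def images_per_second(batch_size, executor_seconds):
    """Cluster throughput when each executor processes batch_size images per step."""
    total_images = batch_size * len(executor_seconds)
    return total_images / sync_step_time(executor_seconds)


# Hypothetical per-step timings in seconds; one straggler dominates the step.
timings = [1.0, 1.0, 4.0]
print(sync_step_time(timings))
print(images_per_second(100, timings))
```

A single slow executor lowers throughput for the whole cluster, so it is worth checking per-executor timings before adding more nodes.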

Bandwidth, latency, etc., depending on your network.

If you fix the batch size in the prototxt file but increase the number of executors, you process more images per batch. It is not clear that you will get better accuracy....
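The arithmetic can be sketched as follows, assuming the batch size in the prototxt applies per device and that every executor runs the same number of devices. The function and parameter names are hypothetical.

```python
def effective_batch_size(batch_size: int, num_executors: int,
                         devices_per_executor: int = 1) -> int:
    """Images consumed per synchronous step across the whole cluster.

    Assumes batch_size is the per-device value from the prototxt file.
    """
    return batch_size * num_executors * devices_per_executor


# Same prototxt batch size, growing cluster: the global batch grows with it.
for executors in (1, 2, 4):
    print(executors, effective_batch_size(100, executors))
```

Since the global batch grows with the cluster, the learning rate and iteration count tuned for one executor may no longer be appropriate.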

CaffeOnSpark uses the protobuf 2.5 API, which is incompatible with protobuf 2.6. Try downgrading protobuf to 2.5.

Yes, you can test with an existing model. See https://github.com/yahoo/CaffeOnSpark/wiki/GetStarted_EC2. You just need to remove the "-train -persistent" options.

    hadoop fs -rm -f /cifar10.model.h5 /cifar10_features_result
    spark-submit --master ${MASTER_URL} \
        --files cifar10_quick_solver.prototxt,cifar10_quick_train_test.prototxt,mean.binaryproto...

"LMDB" is a data format. To use it, you need to change "source_class" in lenet_memory_train_test.prototxt. We do not recommend "LMDB" for large data sets since it is not a distributed data...

CaffeOnSpark will copy the entire LMDB file to all executors, since we cannot really partition it without reading it first, as opposed to a DataFrame or SequenceFile, where you can...

You can either install it on each node, or install it on the node where you launch your job and use spark-submit to ship the whole package to executors. In...