Jun Shi comments

Results: 115 comments by Jun Shi

something wrong with standalone cluster

The error says "Requested # of executors: 2 actual # of executors:1". Only 1 executor was available, so some of the cluster or job settings may be incorrect. Check CORES_PER_WORKER, TOTAL_CORES, etc.
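As a rough illustration of the kind of mismatch this error points at, the sketch below compares the requested executor count with the number of executors the core settings can host. The values are hypothetical, and the relationship "available executors = TOTAL_CORES / CORES_PER_WORKER" is an assumption about a typical standalone setup, not something confirmed by the error message.

```shell
# Hypothetical settings; only REQUESTED_EXECUTORS comes from the error text.
REQUESTED_EXECUTORS=2   # "Requested # of executors: 2"
CORES_PER_WORKER=1      # assumed value
TOTAL_CORES=1           # assumed value, deliberately too small

# Assumption: each executor needs CORES_PER_WORKER cores out of TOTAL_CORES.
AVAILABLE_EXECUTORS=$((TOTAL_CORES / CORES_PER_WORKER))

if [ "$AVAILABLE_EXECUTORS" -lt "$REQUESTED_EXECUTORS" ]; then
  STATUS="mismatch"
else
  STATUS="ok"
fi
echo "requested=$REQUESTED_EXECUTORS available=$AVAILABLE_EXECUTORS status=$STATUS"
```

If the status is a mismatch, raising TOTAL_CORES or lowering the requested cluster size would be the first things to try.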

Training getting slower with more Spark executors?

First, distributed training does not help in all cases. As you add more and more nodes to the cluster, communication cost increases. This is especially true if your model is...

Training getting slower with more Spark executors?

This is synchronous training. The speed is limited by the slowest executor.
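To make the point concrete, here is a minimal sketch with made-up per-executor timings: in a synchronous step every executor waits for the others, so the step takes as long as the slowest one. The numbers are hypothetical and not measurements from any cluster.

```shell
# Hypothetical per-executor times for one training step, in milliseconds.
STEP_TIMES_MS="120 135 410 128"

# The synchronous step time is the maximum over all executors.
SLOWEST_MS=0
for t in $STEP_TIMES_MS; do
  if [ "$t" -gt "$SLOWEST_MS" ]; then
    SLOWEST_MS=$t
  fi
done
echo "synchronous step time: ${SLOWEST_MS} ms"
```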

Training getting slower with more Spark executors?

Bandwidth, latency, etc., depending on your network.

Training getting slower with more Spark executors?

If you fix the batch size in the prototxt file but increase the number of executors, you process more images per batch. It is not clear you will get better accuracy....
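The arithmetic behind this can be sketched as follows, assuming each executor processes its own batch of the configured size in every step. The batch size of 100 is a hypothetical value, not one taken from a specific prototxt file.

```shell
# Hypothetical batch_size from the prototxt file.
BATCH_SIZE=100

# Assumption: every executor processes BATCH_SIZE images per step,
# so the effective batch grows linearly with the executor count.
for EXECUTORS in 1 2 4; do
  EFFECTIVE_BATCH=$((BATCH_SIZE * EXECUTORS))
  echo "executors=$EXECUTORS effective_batch=$EFFECTIVE_BATCH images"
done
```

To keep the effective batch constant while adding executors, the per-executor batch size in the prototxt would have to shrink accordingly.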

Error while "make build" --[caffe-grid Failure]

CaffeOnSpark uses the protobuf 2.5 API, which is incompatible with protobuf 2.6. Try downgrading protobuf to 2.5.
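One quick way to see which protobuf compiler is installed is to look at the output of `protoc --version`. The helper name `is_protobuf_2_5` below is hypothetical, and the check only covers the `protoc` binary on the PATH; it may not reflect the protobuf library version the build actually links against.

```shell
# Succeeds if the given "protoc --version" output reports a 2.5.x release.
is_protobuf_2_5() {
  case "$1" in
    "libprotoc 2.5"|"libprotoc 2.5."*) return 0 ;;
    *) return 1 ;;
  esac
}

# Only query protoc if it is actually installed.
if command -v protoc >/dev/null 2>&1; then
  PROTOC_VERSION="$(protoc --version)"
  if is_protobuf_2_5 "$PROTOC_VERSION"; then
    echo "OK: $PROTOC_VERSION"
  else
    echo "Found $PROTOC_VERSION; a 2.5.x release is expected"
  fi
else
  echo "protoc not found on PATH"
fi
```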

How to run CaffeOnSpark with a pre-existing model?

Yes, you can test with an existing model. See https://github.com/yahoo/CaffeOnSpark/wiki/GetStarted_EC2. You just need to remove the "-train -persistent" options:

    hadoop fs -rm -f /cifar10.model.h5 /cifar10_features_result
    spark-submit --master ${MASTER_URL} \
        --files cifar10_quick_solver.prototxt,cifar10_quick_train_test.prototxt,mean.binaryproto...