
Training getting slower with more Spark executors?

Open apli opened this issue 8 years ago • 8 comments

I'm training with the cifar10 dataset following the steps in GetStarted_yarn:

with 1 executor, it takes 7 minutes,
with 2, it takes 12 minutes,
with 4, it takes 23 minutes.

apli avatar Apr 08 '17 10:04 apli

First, distributed training does not help in all cases. As you add more and more nodes to the cluster, communication cost increases. This is especially true if your model is large.

Second, you did not mention the batch size. Maybe you were comparing apples and oranges. Let's say you set batch size = 32. With 4 executors (and 1 GPU per executor), you are effectively getting a batch size of 4*32 = 128, so the 4-node cluster has 4X the workload of the 1-node cluster. If you set batch size = 32 for the single node and batch size = 8 for the 4-node cluster, then it is a fair comparison. But in the latter case, communication becomes the bottleneck, since the GPUs are likely idle most of the time, waiting to be fed.

junshi15 avatar Apr 10 '17 16:04 junshi15
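To make the arithmetic in the comment above concrete, here is a minimal sketch of the effective-batch-size calculation. The function name and the specific numbers (32, 8, 4 executors) are illustrative, following junshi15's example, and are not part of the CaffeOnSpark API.

```python
# Effective batch size in synchronous data-parallel training:
# each executor processes its own prototxt batch_size per step.

def effective_batch_size(per_executor_batch, num_executors):
    """Total images consumed per training step across the cluster."""
    return per_executor_batch * num_executors

# Same per-executor batch on 1 vs. 4 executors: 4x the work per step.
print(effective_batch_size(32, 1))  # 32
print(effective_batch_size(32, 4))  # 128 -> not a fair comparison

# Fair comparison: keep the effective batch size constant.
print(effective_batch_size(8, 4))   # 32  -> fair, but GPUs may starve
```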

Thanks, @junshi15. I think I get what you mean. Anyway, with more executors I could set a bigger batch size (the effective batch size = batch size * number of executors) to make full use of the GPUs compared to a single node. Is that correct?

apli avatar Apr 12 '17 07:04 apli

Another question: if I have two executors (1 GPU per executor), and the GPU of one is idle while the other is busy, does the time cost of training depend mainly on the training time of the busy executor, without considering communication?

apli avatar Apr 12 '17 07:04 apli

This is synchronous training. The speed is limited by the slowest executor.

junshi15 avatar Apr 12 '17 21:04 junshi15
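A rough way to picture the point above: in synchronous training every executor must finish its forward/backward pass before gradients can be exchanged, so the step waits for the slowest one. The sketch below is a toy cost model with made-up timings, not CaffeOnSpark code.

```python
# Toy model of one synchronous training step: the step cannot complete
# until the slowest executor finishes, then gradients are exchanged.

def sync_step_time(per_executor_compute_times, comm_time):
    """Wall-clock time of one step with synchronous gradient exchange."""
    return max(per_executor_compute_times) + comm_time

# Hypothetical timings (seconds): one busy executor dominates the step.
print(sync_step_time([0.05, 0.40], comm_time=0.02))  # 0.42
```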

What's the main factor that affects the communication, the bandwidth?

apli avatar Apr 20 '17 11:04 apli

Bandwidth, latency, etc., depending on your network.

junshi15 avatar Apr 20 '17 21:04 junshi15
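A common first-order cost model for a gradient exchange, not specific to CaffeOnSpark, combines both factors mentioned above: time ≈ latency + message size / bandwidth. The model size, link speed, and latency below are illustrative assumptions.

```python
# First-order cost model for exchanging gradients once per step:
# transfer_time ~= latency + message_bytes / bandwidth.

def transfer_time(message_bytes, bandwidth_bytes_per_s, latency_s):
    return latency_s + message_bytes / bandwidth_bytes_per_s

# Hypothetical example: ~60 MB of float32 gradients (a mid-sized model)
# over a 1 Gb/s link (~125 MB/s) with 0.5 ms latency.
grad_bytes = 15_000_000 * 4  # 15M parameters * 4 bytes each
print(transfer_time(grad_bytes, 125e6, 0.0005))  # ~0.48 s per exchange
```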

Just to clarify: does the accuracy improve when I don't decrease the batch size but increase the number of executors? If I understand correctly, more batches are processed then. Or is there any other measurable "benefit" when I don't decrease the batch size?

mumlax avatar Nov 07 '17 10:11 mumlax

If you fix the batch size in the prototxt file but increase the number of executors, you process more images per batch. It is not clear you will get better accuracy; there are many things you need to tune. Folks at Facebook managed to do just that.

junshi15 avatar Nov 07 '17 14:11 junshi15
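One example of the kind of tuning alluded to above (and commonly associated with large-batch training work at Facebook) is scaling the learning rate with the effective batch size. This is a hedged sketch of that heuristic; the base values are illustrative and nothing here comes from this thread or from CaffeOnSpark itself.

```python
# One common heuristic when the effective batch size grows: scale the
# base learning rate linearly with the batch-size multiplier (usually
# combined with a warmup period). Illustrative values only.

def scaled_learning_rate(base_lr, base_batch, effective_batch):
    return base_lr * (effective_batch / base_batch)

# Going from batch 32 on one executor to 128 across four executors.
print(scaled_learning_rate(0.01, base_batch=32, effective_batch=128))  # 0.04
```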