Distributed training is slow compared to single machine

Open · saikishor opened this issue 7 years ago · 2 comments

I am trying to run distributed training on models from the TensorFlow Object Detection API. Distributed training works, but it is much slower than training on a single machine: the per-step duration is considerably higher in the distributed case. I am using Hadoop HDFS to feed the data to all hosts.
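
For reference, one quick way to check whether pulling the records off HDFS is itself the bottleneck is to time the input pipeline alone, with no model attached. This is only a rough TF 1.x sketch; the hdfs:// path, batch size and step count below are placeholders, not my actual configuration:

```python
# Rough timing of the HDFS input path alone (TF 1.x), independent of the model.
# The hdfs:// pattern, batch size and step count are placeholders.
import time
import tensorflow as tf

files = tf.gfile.Glob("hdfs://namenode:8020/data/train-*.tfrecord")
dataset = tf.data.TFRecordDataset(files)   # raw serialized records only
dataset = dataset.batch(24).prefetch(2)
next_batch = dataset.make_one_shot_iterator().get_next()

with tf.Session() as sess:
    start = time.time()
    for _ in range(100):
        sess.run(next_batch)               # no training graph, just reading
    elapsed = time.time() - start
    print("input-only throughput: %.1f records/sec" % (100 * 24 / elapsed))
```

If the input-only throughput is already far below what a ~0.3 sec/step run would need, the slowdown is likely in reading from HDFS rather than in the distributed graph itself.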

From Distributed Training (All machines have more or less same step duration):

INFO:tensorflow:global step 1247: loss = 3.1943 (1.972 sec/step)

INFO:tensorflow:global step 1251: loss = 2.5009 (1.218 sec/step)

INFO:tensorflow:global step 1254: loss = 2.2030 (1.746 sec/step)

INFO:tensorflow:global step 1257: loss = 1.9776 (2.222 sec/step)

INFO:tensorflow:global step 1262: loss = 2.6990 (1.161 sec/step)

INFO:tensorflow:global step 1264: loss = 1.9501 (1.682 sec/step)

INFO:tensorflow:global step 1268: loss = 1.6125 (1.288 sec/step)

INFO:tensorflow:global step 1271: loss = 2.0434 (1.223 sec/step)

From single machine training:

INFO:tensorflow:global step 236: loss = 4.1127 (0.319 sec/step)

INFO:tensorflow:global step 237: loss = 3.0208 (0.316 sec/step)

INFO:tensorflow:global step 238: loss = 3.2838 (0.367 sec/step)

INFO:tensorflow:global step 239: loss = 2.9822 (0.324 sec/step)

INFO:tensorflow:global step 240: loss = 2.9753 (0.322 sec/step)

INFO:tensorflow:global step 249: loss = 2.8071 (0.318 sec/step)

INFO:tensorflow:global step 250: loss = 3.4335 (0.328 sec/step)

INFO:tensorflow:global step 251: loss = 4.3550 (0.322 sec/step)

Why is there such a large difference between distributed and single-machine training? Is there a way to resolve this?
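
For context, my understanding is that the usual TF 1.x distributed setup here is between-graph replication with parameter servers, roughly like the sketch below. The hostnames, ports, task index, toy loss and learning rate are all placeholders, not the actual job:

```python
# Minimal sketch of TF 1.x between-graph replication (parameter server + workers).
# Hostnames, ports, task index and the toy loss are placeholders.
import tensorflow as tf

cluster = tf.train.ClusterSpec({
    "ps":     ["ps0.example.com:2222"],
    "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
})
job_name, task_index = "worker", 0          # set per host, e.g. from TF_CONFIG
server = tf.train.Server(cluster, job_name=job_name, task_index=task_index)

if job_name == "ps":
    server.join()                           # parameter server only serves variables
else:
    # Variables are placed on the ps task; ops run on this worker.
    with tf.device(tf.train.replica_device_setter(
            worker_device="/job:worker/task:%d" % task_index, cluster=cluster)):
        global_step = tf.train.get_or_create_global_step()
        # A toy loss stands in for the detection model's loss.
        w = tf.get_variable("w", shape=[10], initializer=tf.zeros_initializer())
        loss = tf.reduce_mean(tf.square(w - 1.0))
        train_op = tf.train.GradientDescentOptimizer(0.001).minimize(
            loss, global_step=global_step)

    # Every sess.run(train_op) ships gradients to the ps task and pulls the
    # updated variables back over the network.
    with tf.train.MonitoredTrainingSession(
            master=server.target,
            is_chief=(task_index == 0),
            hooks=[tf.train.StopAtStepHook(last_step=1000)]) as sess:
        while not sess.should_stop():
            sess.run(train_op)
```

In this asynchronous pattern every step involves a network round trip to the parameter server, so per-step time depends heavily on network bandwidth and latency; that round trip is a common reason distributed steps end up several times slower than local ones.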

saikishor · Nov 22 '17 13:11

Can you share your code with us?

lingyuhaunti · Nov 03 '18 15:11

I have the same problem. When I use a Spark cluster with two or more nodes, training takes the same amount of time as with a single node. I don't understand why. Could someone please answer this?

DanyOrtu97 · Dec 19 '20 13:12