
No Speedup for Multiple GPUs?

Open npowell88 opened this issue 8 years ago • 3 comments

I just switched from a 1-GPU machine to an 8-GPU AWS instance, with everything else the same. The log shows that TensorFlow finds the additional GPUs, but there is no significant speedup from using them: on 1 GPU it took about 150 seconds per 100 steps, and it's still about the same on the bigger machine, as shown below. Is there something else I need to do to enable a speedup? This is with the google/seq2seq neural machine translation tutorial.

```
INFO:tensorflow:loss = 0.115634, step = 186203 (143.421 sec)
INFO:tensorflow:Saving checkpoints for 186303 into /home/ubuntu/models/nmt_tutorial/large/model.ckpt.
INFO:tensorflow:global_step/sec: 0.693705
INFO:tensorflow:loss = 0.22715, step = 186303 (144.154 sec)
```

npowell88 avatar May 31 '17 21:05 npowell88

I'm not part of this project, but multiple GPUs don't speed up an algorithm unless it has been explicitly parallelized to take advantage of a multi-GPU environment.

The TensorFlow GPU guide at https://www.tensorflow.org/tutorials/using_gpu has a section on using multiple GPUs.
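The gist is that you place pieces of the graph on devices yourself and combine the results. Here's a minimal sketch in the style of that guide's example (the shapes, device names, and two-GPU split are just illustrative, not the seq2seq project's actual code):

```python
import tensorflow as tf

# Build one independent "tower" of work per GPU.
towers = []
for d in ['/gpu:0', '/gpu:1']:
    with tf.device(d):
        a = tf.truncated_normal([100, 200])
        b = tf.truncated_normal([200, 300])
        towers.append(tf.matmul(a, b))

# Combine the per-GPU results on the CPU.
with tf.device('/cpu:0'):
    total = tf.add_n(towers)

# allow_soft_placement falls back gracefully if a device is missing.
with tf.Session(config=tf.ConfigProto(allow_soft_placement=True)) as sess:
    print(sess.run(total))
```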

I hope this helps!

kevinpCroat avatar Jun 12 '17 17:06 kevinpCroat

Hi Kevin, I saw that tutorial, but I didn't see any specific part of the code that deals with parallelization. In my case, without any GPU specification, all 8 GPUs get used, yet performance is almost the same as with only one GPU... Sad.
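As far as I understand, TensorFlow grabs memory on every visible GPU by default, but that alone doesn't mean ops are actually placed on all of them. A quick, illustrative way to check is device-placement logging:

```python
import tensorflow as tf

# A tiny graph with no explicit device assignment.
a = tf.truncated_normal([100, 100])
b = tf.matmul(a, a)

# log_device_placement prints which device each op was assigned to.
config = tf.ConfigProto(log_device_placement=True)
with tf.Session(config=config) as sess:
    sess.run(b)
```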

cocosci avatar Sep 07 '17 03:09 cocosci

Say, for example:

```python
import tensorflow as tf

c = []
for d in ['/gpu:5', '/gpu:6']:
    with tf.device(d):
        # Weights for a chain of matmuls on this device.
        a = tf.Variable(tf.truncated_normal([200, 500], dtype=tf.float32))
        b = tf.Variable(tf.truncated_normal([500, 300], dtype=tf.float32))
        b2 = tf.Variable(tf.truncated_normal([300, 400], dtype=tf.float32))
        b3 = tf.Variable(tf.truncated_normal([400, 1000], dtype=tf.float32))
        b4 = tf.Variable(tf.truncated_normal([1000, 500], dtype=tf.float32))
        b5 = tf.Variable(tf.truncated_normal([500, 100], dtype=tf.float32))
        b6 = tf.Variable(tf.truncated_normal([100, 500], dtype=tf.float32))
        # Each result feeds the next matmul, so the chain is sequential
        # within a device, but the two devices' chains are independent.
        res1 = tf.matmul(a, b)
        res2 = tf.matmul(res1, b2)
        res3 = tf.matmul(res2, b3)
        res4 = tf.matmul(res3, b4)
        res5 = tf.matmul(res4, b5)
        res6 = tf.matmul(res5, b6)
        c.append(res6)
```

The task won't be allocated to gpu:6 until the task for gpu:5 is completed, right? In that case, it's still sequential....
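One way I could test this: the Python loop above only builds the graph, so if TensorFlow really schedules independent subgraphs concurrently at run time, two big matmuls on two GPUs should take about as long as one. An illustrative timing sketch (sizes made up):

```python
import time
import tensorflow as tf

# Two independent matmuls, one per device; no data dependency between them.
results = []
for d in ['/gpu:5', '/gpu:6']:
    with tf.device(d):
        a = tf.truncated_normal([4000, 4000])
        results.append(tf.matmul(a, a))

with tf.Session(config=tf.ConfigProto(allow_soft_placement=True)) as sess:
    sess.run(results)  # warm-up run
    start = time.time()
    sess.run(results)  # both subgraphs can execute concurrently
    print('elapsed: %.3f sec' % (time.time() - start))
```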

cocosci avatar Sep 07 '17 03:09 cocosci