Poor performance in distributed running when batch_size is small (synthetic input)
Envs:
- Python: 2.7
- TensorFlow: 1.8.0
- CUDA: 9.0
- CuDNN: 7.0
- benchmarks commit id: fc993da280312ab65210e7e80bb6fa7f7489182e
- benchmarks commit date: Wed May 16 16:54:07 2018
- CPU: 48 * Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz
- GPU: 4 * K80
- Network: Ether, Speed: 10000Mb/s
- DMA: N Y N N Y N N N N N N Y N N Y N
On an NVIDIA P100 GPU it takes about 0.33 s per step, so I set batch_size to 12 to simulate the P100 on the K80; a step then takes roughly the same amount of time.
single node, 1 GPU
TensorFlow: 1.8
Model: resnet50
Dataset: imagenet (synthetic)
Mode: training
SingleSess: False
Batch size: 12 global
12.0 per device
Num batches: 100
Num epochs: 0.00
Devices: ['/gpu:0']
Data format: NCHW
Layout optimizer: False
Optimizer: sgd
Variables: parameter_server
==========
Step Img/sec total_loss
1 images/sec: 43.0 +/- 0.0 (jitter = 0.0) 7.949
10 images/sec: 42.9 +/- 0.1 (jitter = 0.1) 8.401
20 images/sec: 42.9 +/- 0.0 (jitter = 0.1) 8.277
30 images/sec: 42.9 +/- 0.0 (jitter = 0.1) 8.327
40 images/sec: 42.9 +/- 0.0 (jitter = 0.1) 7.884
50 images/sec: 42.9 +/- 0.0 (jitter = 0.2) 8.615
60 images/sec: 42.8 +/- 0.0 (jitter = 0.2) 8.247
70 images/sec: 42.8 +/- 0.0 (jitter = 0.2) 8.403
80 images/sec: 42.7 +/- 0.0 (jitter = 0.3) 8.437
90 images/sec: 42.7 +/- 0.0 (jitter = 0.3) 8.449
100 images/sec: 42.7 +/- 0.0 (jitter = 0.3) 7.793
----------------------------------------------------------------
total images/sec: 42.67
----------------------------------------------------------------
single node, 4 GPUs
TensorFlow: 1.8
Model: resnet50
Dataset: imagenet (synthetic)
Mode: training
SingleSess: False
Batch size: 48 global
12.0 per device
Num batches: 100
Num epochs: 0.00
Devices: ['/gpu:0', '/gpu:1', '/gpu:2', '/gpu:3']
Data format: NCHW
Layout optimizer: False
Optimizer: sgd
Variables: replicated
AllReduce: None
==========
Step Img/sec total_loss
1 images/sec: 167.8 +/- 0.0 (jitter = 0.0) 8.382
10 images/sec: 168.4 +/- 0.2 (jitter = 1.0) 8.358
20 images/sec: 164.6 +/- 2.1 (jitter = 1.2) 8.446
30 images/sec: 163.4 +/- 1.7 (jitter = 1.8) 8.157
40 images/sec: 163.1 +/- 1.4 (jitter = 2.5) 8.124
50 images/sec: 162.4 +/- 1.4 (jitter = 2.7) 8.295
60 images/sec: 163.3 +/- 1.2 (jitter = 2.1) 8.275
70 images/sec: 163.8 +/- 1.1 (jitter = 2.0) 8.251
80 images/sec: 164.2 +/- 0.9 (jitter = 1.9) 8.072
90 images/sec: 164.5 +/- 0.8 (jitter = 1.8) 8.238
100 images/sec: 164.0 +/- 0.8 (jitter = 1.9) 8.497
----------------------------------------------------------------
total images/sec: 163.95
----------------------------------------------------------------
2 ps + 2 workers, 1 GPU each worker
TensorFlow: 1.8
Model: resnet50
Dataset: imagenet (synthetic)
Mode: training
SingleSess: False
Batch size: 24 global
12.0 per device
Num batches: 100
Num epochs: 0.00
Devices: ['/job:worker/task:0/gpu:0']
Data format: NCHW
Layout optimizer: False
Optimizer: sgd
Variables: distributed_replicated
Sync: True
==========
Step Img/sec total_loss
1 images/sec: 40.4 +/- 0.0 (jitter = 0.0) 7.949
10 images/sec: 39.3 +/- 0.4 (jitter = 0.4) 8.401
20 images/sec: 39.3 +/- 0.3 (jitter = 0.8) 8.276
30 images/sec: 39.6 +/- 0.2 (jitter = 0.5) 8.327
40 images/sec: 39.8 +/- 0.2 (jitter = 0.4) 7.884
50 images/sec: 39.7 +/- 0.1 (jitter = 0.4) 8.609
60 images/sec: 39.8 +/- 0.1 (jitter = 0.4) 8.242
70 images/sec: 39.5 +/- 0.2 (jitter = 0.4) 8.407
80 images/sec: 39.6 +/- 0.2 (jitter = 0.4) 8.438
90 images/sec: 39.6 +/- 0.2 (jitter = 0.4) 8.455
100 images/sec: 39.6 +/- 0.2 (jitter = 0.4) 7.785
----------------------------------------------------------------
total images/sec: 79.08
----------------------------------------------------------------
2 ps + 2 workers, 4 GPUs each worker
TensorFlow: 1.8
Model: resnet50
Dataset: imagenet (synthetic)
Mode: training
SingleSess: False
Batch size: 96 global
12.0 per device
Num batches: 100
Num epochs: 0.01
Devices: ['/job:worker/task:0/gpu:0', '/job:worker/task:0/gpu:1', '/job:worker/task:0/gpu:2', '/job:worker/task:0/gpu:3']
Data format: NCHW
Layout optimizer: False
Optimizer: sgd
Variables: distributed_replicated
Sync: True
==========
Step Img/sec total_loss
1 images/sec: 146.1 +/- 0.0 (jitter = 0.0) 8.382
10 images/sec: 130.4 +/- 3.2 (jitter = 0.9) 8.358
20 images/sec: 131.7 +/- 2.0 (jitter = 3.4) 8.446
30 images/sec: 131.1 +/- 1.5 (jitter = 3.9) 8.154
40 images/sec: 130.8 +/- 1.3 (jitter = 4.7) 8.129
50 images/sec: 130.9 +/- 1.2 (jitter = 4.7) 8.287
60 images/sec: 130.1 +/- 1.2 (jitter = 5.4) 8.286
70 images/sec: 130.4 +/- 1.1 (jitter = 5.5) 8.239
80 images/sec: 130.5 +/- 1.0 (jitter = 4.8) 8.073
90 images/sec: 131.3 +/- 1.0 (jitter = 5.5) 8.249
100 images/sec: 131.2 +/- 1.0 (jitter = 6.0) 8.489
----------------------------------------------------------------
total images/sec: 262.25
----------------------------------------------------------------
I have tried all the settings for variable_update and all_reduce_spec; the results above are the best. So, according to the results:
| Configuration | Speed-Up |
| --- | --- |
| single node (4 GPUs) | 0.9606 |
| 2 workers (1 GPU each worker) | 0.9266 |
| 2 workers (4 GPUs each worker) | 0.7683 |
Both multiple GPUs on a single node and a single GPU per worker across multiple workers get good performance (speedup greater than 0.92), but multiple GPUs across multiple workers get poor performance. Any idea on this?
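For reference, the speed-up numbers in the table are just the measured aggregate throughput divided by the number of GPUs times the single-GPU throughput; a minimal sketch of that calculation, with the values copied from the logs above:

```python
# Speed-up = aggregate images/sec / (num_gpus * single-GPU images/sec).
SINGLE_GPU_IPS = 42.67  # 1 node, 1 GPU, batch_size=12 (from the first log)

runs = [
    ('single node (4 GPUs)',            163.95, 4),
    ('2 workers (1 GPU each worker)',    79.08, 2),
    ('2 workers (4 GPUs each worker)',  262.25, 8),
]

for name, total_ips, num_gpus in runs:
    speedup = total_ips / (num_gpus * SINGLE_GPU_IPS)
    print('%-32s %.4f' % (name, speedup))  # 0.9606, 0.9266, 0.7683
```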
BTW, I did some tests on all-reduce and also got poor performance; the result below is the best I got with 2 workers (4 GPUs each worker).
TensorFlow: 1.8
Model: resnet50
Dataset: imagenet (synthetic)
Mode: training
SingleSess: True
Batch size: 96 global
12.0 per device
Num batches: 100
Num epochs: 0.01
Devices: ['job:worker/task0/gpu:0', 'job:worker/task0/gpu:1', 'job:worker/task0/gpu:2', 'job:worker/task0/gpu:3', 'job:worker/task1/gpu:0', 'job:worker/task1/gpu:1', 'job:worker/task1/gpu:2', 'job:worker/task1/gpu:3']
Data format: NCHW
Layout optimizer: False
Optimizer: sgd
Variables: distributed_all_reduce
AllReduce: pscpu:32k:xring
Sync: True
==========
Step Img/sec total_loss
1 images/sec: 162.9 +/- 0.0 (jitter = 0.0) 8.523
10 images/sec: 155.1 +/- 2.3 (jitter = 6.3) 8.278
20 images/sec: 154.7 +/- 1.9 (jitter = 4.8) 8.012
30 images/sec: 152.3 +/- 1.9 (jitter = 8.0) 8.361
40 images/sec: 152.7 +/- 1.5 (jitter = 8.8) 8.324
50 images/sec: 153.3 +/- 1.4 (jitter = 11.0) 8.451
60 images/sec: 154.0 +/- 1.2 (jitter = 9.8) 8.278
70 images/sec: 153.2 +/- 1.1 (jitter = 11.0) 8.327
80 images/sec: 153.2 +/- 1.0 (jitter = 11.0) 8.299
90 images/sec: 153.2 +/- 1.0 (jitter = 11.0) 8.230
100 images/sec: 153.5 +/- 0.9 (jitter = 10.4) 8.151
----------------------------------------------------------------
total images/sec: 153.39
----------------------------------------------------------------
Try --batch_size=32 or 64. To check whether it's actually using the bandwidth between the GPUs, use the NVIDIA Visual Profiler. Weird that it's showing so little on a synthetic dataset.
--batch_size=64 gets better performance (speedup = 0.9443), as expected, because each step takes longer and there is more time for gradients and variables to transfer. But high-performance hardware such as the NVIDIA P100 takes less time to compute the forward and backward passes, so the same problem would remain there. In my situation the network bandwidth is 10 Gbps, which I think is enough to cover the transmission (about 100 MB of gradients and variables for ResNet-50).
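A rough back-of-the-envelope check of that claim, as a sketch (the ~100 MB payload and 10 Gbps link are the numbers quoted above; protocol overhead and link contention are ignored):

```python
# Estimate the raw wire time for ResNet-50 gradients/variables on a 10 Gbps link.
PAYLOAD_BYTES = 100e6   # ~100 MB of gradients and variables (quoted above)
LINK_BPS = 10e9         # 10 Gbps Ethernet

one_way = PAYLOAD_BYTES * 8 / LINK_BPS   # push gradients OR pull variables
print('one-way transfer : %.3f s' % one_way)         # ~0.080 s
print('push + pull      : %.3f s' % (2 * one_way))   # ~0.160 s
# Compare with the measured ~0.28-0.33 s of compute per step at batch_size=12.
```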
single node, 1 GPU
TensorFlow: 1.8
Model: resnet50
Dataset: imagenet (synthetic)
Mode: training
SingleSess: False
Batch size: 64 global
64.0 per device
Num batches: 100
Num epochs: 0.00
Devices: ['/gpu:0']
Data format: NCHW
Layout optimizer: False
Optimizer: sgd
Variables: replicated
AllReduce: None
==========
Step Img/sec total_loss
1 images/sec: 55.5 +/- 0.0 (jitter = 0.0) 8.264
10 images/sec: 55.4 +/- 0.0 (jitter = 0.2) 8.258
20 images/sec: 55.4 +/- 0.0 (jitter = 0.2) 8.150
30 images/sec: 55.3 +/- 0.0 (jitter = 0.1) 8.235
40 images/sec: 55.4 +/- 0.0 (jitter = 0.1) 8.183
50 images/sec: 55.4 +/- 0.0 (jitter = 0.1) 8.342
60 images/sec: 55.4 +/- 0.0 (jitter = 0.1) 8.318
70 images/sec: 55.4 +/- 0.0 (jitter = 0.1) 8.385
80 images/sec: 55.5 +/- 0.0 (jitter = 0.2) 8.204
90 images/sec: 55.5 +/- 0.0 (jitter = 0.2) 8.310
100 images/sec: 55.5 +/- 0.0 (jitter = 0.3) 8.474
----------------------------------------------------------------
total images/sec: 55.54
----------------------------------------------------------------
2 ps + 2 workers (4 GPUs each worker)
TensorFlow: 1.8
Model: resnet50
Dataset: imagenet (synthetic)
Mode: training
SingleSess: False
Batch size: 512 global
64.0 per device
Num batches: 100
Num epochs: 0.04
Devices: ['/job:worker/task:0/gpu:0', '/job:worker/task:0/gpu:1', '/job:worker/task:0/gpu:2', '/job:worker/task:0/gpu:3']
Data format: NCHW
Layout optimizer: False
Optimizer: sgd
Variables: distributed_replicated
Sync: True
==========
Step Img/sec total_loss
1 images/sec: 211.3 +/- 0.0 (jitter = 0.0) 8.326
10 images/sec: 210.2 +/- 0.9 (jitter = 1.0) 8.348
20 images/sec: 208.9 +/- 0.7 (jitter = 3.5) 8.352
30 images/sec: 209.2 +/- 0.6 (jitter = 2.8) 8.358
40 images/sec: 209.2 +/- 0.5 (jitter = 2.6) 8.243
50 images/sec: 209.3 +/- 0.4 (jitter = 2.5) 8.246
60 images/sec: 209.5 +/- 0.3 (jitter = 2.1) 8.332
70 images/sec: 209.5 +/- 0.3 (jitter = 2.1) 8.288
80 images/sec: 209.4 +/- 0.3 (jitter = 2.0) 8.124
90 images/sec: 209.5 +/- 0.3 (jitter = 1.9) 8.181
100 images/sec: 209.5 +/- 0.3 (jitter = 1.9) 8.146
----------------------------------------------------------------
total images/sec: 418.91
----------------------------------------------------------------
What I can't understand is why a single node with 4 GPUs and 2 nodes with a single GPU each both get about 0.92 speedup, while 2 nodes with 4 GPUs each get only 0.7683 speedup (--batch_size=12).
I failed to upload my trace file for an unknown reason; you can reproduce my results in a similar environment. I notice that the jitter with batch_size=12 (around 10.0) is much larger than with batch_size=64 (around 2.0). What's the reason for this?
BTW, why is there no MEMCPY timeline in the distributed tracing file?
Currently we are not actively working on distributed performance, although we plan on doing so in the future. In general, low batch sizes are slower, and the faster the GPU, the bigger difference there is between high batch sizes and low batch sizes.
@cryptox31 @reedwm Look at this: if I sleep 2 seconds after each session run, the per-step performance gets better.
Code (excerpt; the two added lines are marked):

```python
def benchmark_one_step(...):
    ...
    if image_producer is not None:
        image_producer.notify_image_consumption()
    train_time = time.time() - start_time
    # ADDED: sleep 2 seconds after each session run. train_time is already
    # computed above, so the sleep is excluded from the per-step numbers.
    time.sleep(2)
    # END ADDED
    step_train_times.append(train_time)
    ...
```
Log:
TensorFlow: 1.8
Model: resnet50
Dataset: imagenet (synthetic)
Mode: training
SingleSess: False
Batch size: 96 global
12.0 per device
Num batches: 100
Num epochs: 0.01
Devices: ['/job:worker/task:0/gpu:0', '/job:worker/task:0/gpu:1', '/job:worker/task:0/gpu:2', '/job:worker/task:0/gpu:3']
Data format: NCHW
Layout optimizer: False
Optimizer: sgd
Variables: distributed_replicated
Sync: True
==========
Step Img/sec total_loss
1 images/sec: 164.7 +/- 0.0 (jitter = 0.0) 8.383
10 images/sec: 156.5 +/- 1.4 (jitter = 6.5) 8.358
20 images/sec: 155.5 +/- 1.0 (jitter = 4.4) 8.445
30 images/sec: 155.6 +/- 0.8 (jitter = 5.3) 8.155
40 images/sec: 155.7 +/- 0.7 (jitter = 6.1) 8.125
50 images/sec: 155.7 +/- 0.6 (jitter = 6.1) 8.291
60 images/sec: 156.2 +/- 0.6 (jitter = 5.8) 8.296
70 images/sec: 155.6 +/- 0.6 (jitter = 6.3) 8.245
80 images/sec: 155.3 +/- 0.6 (jitter = 6.1) 8.069
90 images/sec: 155.3 +/- 0.6 (jitter = 5.9) 8.246
100 images/sec: 155.3 +/- 0.5 (jitter = 5.9) 8.499
----------------------------------------------------------------
total images/sec: 41.54
----------------------------------------------------------------
The total images/sec is lower because it takes my sleep time into account. According to the per-step images/sec, the total images/sec should be around 155 * 2 = 310, which means the speedup is 310 / 42.67 / 8 = 0.908. Without the sleep after each session run, the speedup is only 0.7683. Somehow the 2-second sleep affects the performance of each session run. Why is that? (Notice that the jitter in the sleep version is much lower than in the no-sleep version.)
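A minimal sketch of that back-of-the-envelope adjustment (the ~155 images/sec per worker comes from the per-step numbers in the log above):

```python
# With time.sleep(2) in benchmark_one_step, the reported total images/sec
# includes the sleep, so estimate throughput from the per-step numbers instead.
SINGLE_GPU_IPS = 42.67
PER_WORKER_IPS = 155.0             # per-step images/sec reported by each worker
NUM_WORKERS, NUM_GPUS = 2, 8

effective_ips = PER_WORKER_IPS * NUM_WORKERS            # ~310 images/sec
speedup = effective_ips / (SINGLE_GPU_IPS * NUM_GPUS)   # ~0.908
print('effective images/sec: %.1f, speedup: %.3f' % (effective_ips, speedup))
```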
BTW:
- In a single node (1 GPU), the FPS is 42.67, so each step takes 12 / 42.67 = 0.281 s.
- In a single node (4 GPUs), the FPS is 163.95, so each step takes 12 * 4 / 163.95 = 0.293 s, i.e. an extra 0.293 - 0.281 = 0.012 s for going multi-GPU.
- In 2 nodes (1 GPU each), the FPS is 79.08, so each step takes 12 * 2 / 79.08 = 0.3035 s, i.e. an extra 0.3035 - 0.281 = 0.0225 s for going multi-worker.
So a step in 2 nodes (4 GPUs each) should take around 0.281 + 0.012 + 0.0225 = 0.3155 s, which means the FPS should be around 12 * 8 / 0.3155 = 304.3 (in the no-sleep version the FPS is 262.25; in the sleep version it is about 310). Any idea? The sketch below reproduces this arithmetic.
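The same arithmetic as a minimal sketch (the throughputs are the measured totals from the logs above; the assumption is that the multi-GPU and multi-worker overheads are simply additive):

```python
# Decompose the per-step time (batch_size=12 per device) into single-GPU
# compute plus multi-GPU and multi-worker overheads, then predict 2 x 4 GPUs.
BATCH = 12.0

t_1gpu     = BATCH * 1 / 42.67    # ~0.281 s  (1 node, 1 GPU)
t_4gpu     = BATCH * 4 / 163.95   # ~0.293 s  (1 node, 4 GPUs)
t_2workers = BATCH * 2 / 79.08    # ~0.303 s  (2 workers, 1 GPU each)

multi_gpu_overhead    = t_4gpu - t_1gpu       # ~0.012 s
multi_worker_overhead = t_2workers - t_1gpu   # ~0.022 s

t_expected   = t_1gpu + multi_gpu_overhead + multi_worker_overhead  # ~0.315 s
fps_expected = BATCH * 8 / t_expected                               # ~304 images/sec
print('expected step time: %.4f s, expected FPS: %.1f' % (t_expected, fps_expected))
# Measured: 262.25 images/sec without the sleep, ~310 with the sleep.
```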
I cannot upload my trace file, but if someone reproduces my results, you'll find that at the end of each step the RecvOps for variables from ps0 -> worker0 and ps1 -> worker1 finish early (at ~290 ms), while some of the RecvOps for variables from ps1 -> worker0 and ps0 -> worker1 finish late (at ~330 ms). There is no other op between 290 ms and 330 ms.
(All RecvOps from worker -> ps finish early.)
(I used variable_update=distributed_replicated, so variables are downloaded from ps to worker at the end of each step.)
My network bandwidth is 10 Gbps, which I think is enough to cover the transmission of gradients and variables. What confuses me is that in the sleep version (sleep 2 seconds after each step), all RecvOps finish early, at ~290 ms. Does this have anything to do with my network? Can anyone reproduce my results?
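For anyone trying to reproduce and inspect this, a minimal, self-contained sketch of dumping a per-step Chrome trace with stock TF 1.x APIs (the toy matmul graph is just a stand-in for the real training op; this is separate from whatever tracing options the benchmark script itself exposes):

```python
import tensorflow as tf
from tensorflow.python.client import timeline

# Trivial stand-in graph; replace y with the real training op to trace a step.
x = tf.Variable(tf.random_normal([1024, 1024]))
y = tf.matmul(x, x)

run_options  = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(y, options=run_options, run_metadata=run_metadata)

# Write a Chrome trace for the traced step (open it at chrome://tracing).
tl = timeline.Timeline(run_metadata.step_stats)
with open('timeline_step.json', 'w') as f:
    f.write(tl.generate_chrome_trace_format(show_memory=True))
```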
@reedwm Can anyone help explain the difference between the sleep version and the no-sleep version?