Poor performance in distributed running when batch_size is small (synthetic input)
Envs:
- Python: 2.7
- TensorFlow: 1.8.0
- CUDA: 9.0
- CuDNN: 7.0
- benchmarks commit id: fc993da280312ab65210e7e80bb6fa7f7489182e
- benchmarks commit date: Wed May 16 16:54:07 2018
- CPU: 48 * Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz
- GPU: 4 * K80
- Network: Ether, Speed: 10000Mb/s
- DMA: N Y N N Y N N N N N N Y N N Y N
On an NVIDIA P100 GPU it takes about 0.33 s per step, so I set batch_size to 12 to simulate the P100 on the K80; a step then takes roughly the same amount of time.
single node, 1 GPU
TensorFlow: 1.8
Model: resnet50
Dataset: imagenet (synthetic)
Mode: training
SingleSess: False
Batch size: 12 global
12.0 per device
Num batches: 100
Num epochs: 0.00
Devices: ['/gpu:0']
Data format: NCHW
Layout optimizer: False
Optimizer: sgd
Variables: parameter_server
==========
Step Img/sec total_loss
1 images/sec: 43.0 +/- 0.0 (jitter = 0.0) 7.949
10 images/sec: 42.9 +/- 0.1 (jitter = 0.1) 8.401
20 images/sec: 42.9 +/- 0.0 (jitter = 0.1) 8.277
30 images/sec: 42.9 +/- 0.0 (jitter = 0.1) 8.327
40 images/sec: 42.9 +/- 0.0 (jitter = 0.1) 7.884
50 images/sec: 42.9 +/- 0.0 (jitter = 0.2) 8.615
60 images/sec: 42.8 +/- 0.0 (jitter = 0.2) 8.247
70 images/sec: 42.8 +/- 0.0 (jitter = 0.2) 8.403
80 images/sec: 42.7 +/- 0.0 (jitter = 0.3) 8.437
90 images/sec: 42.7 +/- 0.0 (jitter = 0.3) 8.449
100 images/sec: 42.7 +/- 0.0 (jitter = 0.3) 7.793
----------------------------------------------------------------
total images/sec: 42.67
----------------------------------------------------------------
single node, 4 GPUs
TensorFlow: 1.8
Model: resnet50
Dataset: imagenet (synthetic)
Mode: training
SingleSess: False
Batch size: 48 global
12.0 per device
Num batches: 100
Num epochs: 0.00
Devices: ['/gpu:0', '/gpu:1', '/gpu:2', '/gpu:3']
Data format: NCHW
Layout optimizer: False
Optimizer: sgd
Variables: replicated
AllReduce: None
==========
Step Img/sec total_loss
1 images/sec: 167.8 +/- 0.0 (jitter = 0.0) 8.382
10 images/sec: 168.4 +/- 0.2 (jitter = 1.0) 8.358
20 images/sec: 164.6 +/- 2.1 (jitter = 1.2) 8.446
30 images/sec: 163.4 +/- 1.7 (jitter = 1.8) 8.157
40 images/sec: 163.1 +/- 1.4 (jitter = 2.5) 8.124
50 images/sec: 162.4 +/- 1.4 (jitter = 2.7) 8.295
60 images/sec: 163.3 +/- 1.2 (jitter = 2.1) 8.275
70 images/sec: 163.8 +/- 1.1 (jitter = 2.0) 8.251
80 images/sec: 164.2 +/- 0.9 (jitter = 1.9) 8.072
90 images/sec: 164.5 +/- 0.8 (jitter = 1.8) 8.238
100 images/sec: 164.0 +/- 0.8 (jitter = 1.9) 8.497
----------------------------------------------------------------
total images/sec: 163.95
----------------------------------------------------------------
2 ps + 2 workers, 1 GPU each worker
TensorFlow: 1.8
Model: resnet50
Dataset: imagenet (synthetic)
Mode: training
SingleSess: False
Batch size: 24 global
12.0 per device
Num batches: 100
Num epochs: 0.00
Devices: ['/job:worker/task:0/gpu:0']
Data format: NCHW
Layout optimizer: False
Optimizer: sgd
Variables: distributed_replicated
Sync: True
==========
Step Img/sec total_loss
1 images/sec: 40.4 +/- 0.0 (jitter = 0.0) 7.949
10 images/sec: 39.3 +/- 0.4 (jitter = 0.4) 8.401
20 images/sec: 39.3 +/- 0.3 (jitter = 0.8) 8.276
30 images/sec: 39.6 +/- 0.2 (jitter = 0.5) 8.327
40 images/sec: 39.8 +/- 0.2 (jitter = 0.4) 7.884
50 images/sec: 39.7 +/- 0.1 (jitter = 0.4) 8.609
60 images/sec: 39.8 +/- 0.1 (jitter = 0.4) 8.242
70 images/sec: 39.5 +/- 0.2 (jitter = 0.4) 8.407
80 images/sec: 39.6 +/- 0.2 (jitter = 0.4) 8.438
90 images/sec: 39.6 +/- 0.2 (jitter = 0.4) 8.455
100 images/sec: 39.6 +/- 0.2 (jitter = 0.4) 7.785
----------------------------------------------------------------
total images/sec: 79.08
----------------------------------------------------------------
2 ps + 2 workers, 4 GPUs each worker
TensorFlow: 1.8
Model: resnet50
Dataset: imagenet (synthetic)
Mode: training
SingleSess: False
Batch size: 96 global
12.0 per device
Num batches: 100
Num epochs: 0.01
Devices: ['/job:worker/task:0/gpu:0', '/job:worker/task:0/gpu:1', '/job:worker/task:0/gpu:2', '/job:worker/task:0/gpu:3']
Data format: NCHW
Layout optimizer: False
Optimizer: sgd
Variables: distributed_replicated
Sync: True
==========
Step Img/sec total_loss
1 images/sec: 146.1 +/- 0.0 (jitter = 0.0) 8.382
10 images/sec: 130.4 +/- 3.2 (jitter = 0.9) 8.358
20 images/sec: 131.7 +/- 2.0 (jitter = 3.4) 8.446
30 images/sec: 131.1 +/- 1.5 (jitter = 3.9) 8.154
40 images/sec: 130.8 +/- 1.3 (jitter = 4.7) 8.129
50 images/sec: 130.9 +/- 1.2 (jitter = 4.7) 8.287
60 images/sec: 130.1 +/- 1.2 (jitter = 5.4) 8.286
70 images/sec: 130.4 +/- 1.1 (jitter = 5.5) 8.239
80 images/sec: 130.5 +/- 1.0 (jitter = 4.8) 8.073
90 images/sec: 131.3 +/- 1.0 (jitter = 5.5) 8.249
100 images/sec: 131.2 +/- 1.0 (jitter = 6.0) 8.489
----------------------------------------------------------------
total images/sec: 262.25
----------------------------------------------------------------
I have tried all the settings for variable_update and all_reduce_spec; the results above are the best. So, according to the results:
| Configuration | Speed-Up |
| --- | --- |
| single node (4 GPUs) | 0.9606 |
| 2 workers (1 GPU each worker) | 0.9266 |
| 2 workers (4 GPUs each worker) | 0.7683 |
Both multiple GPUs on a single node and a single GPU per worker across multiple workers get good performance (speedup greater than 0.92), but multiple GPUs across multiple workers get poor performance. Any idea on this?
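For reference, the speed-up numbers in the table are just the measured aggregate throughput divided by the number of GPUs times the single-GPU throughput; a minimal sketch of that calculation, with the values copied from the logs above:

```python
# Speed-up = aggregate images/sec / (num_gpus * single-GPU images/sec).
SINGLE_GPU_IPS = 42.67  # 1 node, 1 GPU, batch_size=12 (from the first log)

runs = [
    ('single node (4 GPUs)',            163.95, 4),
    ('2 workers (1 GPU each worker)',    79.08, 2),
    ('2 workers (4 GPUs each worker)',  262.25, 8),
]

for name, total_ips, num_gpus in runs:
    speedup = total_ips / (num_gpus * SINGLE_GPU_IPS)
    print('%-32s %.4f' % (name, speedup))  # 0.9606, 0.9266, 0.7683
```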
BTW, I did some tests on all-reduce and also got poor performance; the result below is the best I got with 2 workers (4 GPUs each worker).
TensorFlow: 1.8
Model: resnet50
Dataset: imagenet (synthetic)
Mode: training
SingleSess: True
Batch size: 96 global
12.0 per device
Num batches: 100
Num epochs: 0.01
Devices: ['job:worker/task0/gpu:0', 'job:worker/task0/gpu:1', 'job:worker/task0/gpu:2', 'job:worker/task0/gpu:3', 'job:worker/task1/gpu:0', 'job:worker/task1/gpu:1', 'job:worker/task1/gpu:2', 'job:worker/task1/gpu:3']
Data format: NCHW
Layout optimizer: False
Optimizer: sgd
Variables: distributed_all_reduce
AllReduce: pscpu:32k:xring
Sync: True
==========
Step Img/sec total_loss
1 images/sec: 162.9 +/- 0.0 (jitter = 0.0) 8.523
10 images/sec: 155.1 +/- 2.3 (jitter = 6.3) 8.278
20 images/sec: 154.7 +/- 1.9 (jitter = 4.8) 8.012
30 images/sec: 152.3 +/- 1.9 (jitter = 8.0) 8.361
40 images/sec: 152.7 +/- 1.5 (jitter = 8.8) 8.324
50 images/sec: 153.3 +/- 1.4 (jitter = 11.0) 8.451
60 images/sec: 154.0 +/- 1.2 (jitter = 9.8) 8.278
70 images/sec: 153.2 +/- 1.1 (jitter = 11.0) 8.327
80 images/sec: 153.2 +/- 1.0 (jitter = 11.0) 8.299
90 images/sec: 153.2 +/- 1.0 (jitter = 11.0) 8.230
100 images/sec: 153.5 +/- 0.9 (jitter = 10.4) 8.151
----------------------------------------------------------------
total images/sec: 153.39
----------------------------------------------------------------
Try --batch_size=32 or 64. To check whether it's actually using the bandwidth between the GPUs, use the NVIDIA Visual Profiler. Weird that it's showing so little on a synthetic dataset.
--batch_size=64 gets better performance (speedup = 0.9443), as expected, because each step takes longer and there is more time for gradients and variables to transfer. But high-performance hardware such as the NVIDIA P100 takes less time to compute the forward and backward passes, so the same problem would remain there. In my situation the network bandwidth is 10 Gbps, which I think is enough to cover the transmission (about 100 MB of gradients and variables for ResNet-50).
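A rough back-of-the-envelope check of that claim, as a sketch (the ~100 MB payload and 10 Gbps link are the numbers quoted above; protocol overhead and link contention are ignored):

```python
# Estimate the raw wire time for ResNet-50 gradients/variables on a 10 Gbps link.
PAYLOAD_BYTES = 100e6   # ~100 MB of gradients and variables (quoted above)
LINK_BPS = 10e9         # 10 Gbps Ethernet

one_way = PAYLOAD_BYTES * 8 / LINK_BPS   # push gradients OR pull variables
print('one-way transfer : %.3f s' % one_way)         # ~0.080 s
print('push + pull      : %.3f s' % (2 * one_way))   # ~0.160 s
# Compare with the measured ~0.28-0.33 s of compute per step at batch_size=12.
```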
single node, 1 GPU
TensorFlow: 1.8
Model: resnet50
Dataset: imagenet (synthetic)
Mode: training
SingleSess: False
Batch size: 64 global
64.0 per device
Num batches: 100
Num epochs: 0.00
Devices: ['/gpu:0']
Data format: NCHW
Layout optimizer: False
Optimizer: sgd
Variables: replicated
AllReduce: None
==========
Step Img/sec total_loss
1 images/sec: 55.5 +/- 0.0 (jitter = 0.0) 8.264
10 images/sec: 55.4 +/- 0.0 (jitter = 0.2) 8.258
20 images/sec: 55.4 +/- 0.0 (jitter = 0.2) 8.150
30 images/sec: 55.3 +/- 0.0 (jitter = 0.1) 8.235
40 images/sec: 55.4 +/- 0.0 (jitter = 0.1) 8.183
50 images/sec: 55.4 +/- 0.0 (jitter = 0.1) 8.342
60 images/sec: 55.4 +/- 0.0 (jitter = 0.1) 8.318
70 images/sec: 55.4 +/- 0.0 (jitter = 0.1) 8.385
80 images/sec: 55.5 +/- 0.0 (jitter = 0.2) 8.204
90 images/sec: 55.5 +/- 0.0 (jitter = 0.2) 8.310
100 images/sec: 55.5 +/- 0.0 (jitter = 0.3) 8.474
----------------------------------------------------------------
total images/sec: 55.54
----------------------------------------------------------------
2 ps + 2 workers (4 GPUs each worker)
TensorFlow: 1.8
Model: resnet50
Dataset: imagenet (synthetic)
Mode: training
SingleSess: False
Batch size: 512 global
64.0 per device
Num batches: 100
Num epochs: 0.04
Devices: ['/job:worker/task:0/gpu:0', '/job:worker/task:0/gpu:1', '/job:worker/task:0/gpu:2', '/job:worker/task:0/gpu:3']
Data format: NCHW
Layout optimizer: False
Optimizer: sgd
Variables: distributed_replicated
Sync: True
==========
Step Img/sec total_loss
1 images/sec: 211.3 +/- 0.0 (jitter = 0.0) 8.326
10 images/sec: 210.2 +/- 0.9 (jitter = 1.0) 8.348
20 images/sec: 208.9 +/- 0.7 (jitter = 3.5) 8.352
30 images/sec: 209.2 +/- 0.6 (jitter = 2.8) 8.358
40 images/sec: 209.2 +/- 0.5 (jitter = 2.6) 8.243
50 images/sec: 209.3 +/- 0.4 (jitter = 2.5) 8.246
60 images/sec: 209.5 +/- 0.3 (jitter = 2.1) 8.332
70 images/sec: 209.5 +/- 0.3 (jitter = 2.1) 8.288
80 images/sec: 209.4 +/- 0.3 (jitter = 2.0) 8.124
90 images/sec: 209.5 +/- 0.3 (jitter = 1.9) 8.181
100 images/sec: 209.5 +/- 0.3 (jitter = 1.9) 8.146
----------------------------------------------------------------
total images/sec: 418.91
----------------------------------------------------------------
What I can't understand is why a single node with 4 GPUs and 2 nodes with a single GPU each both get about 0.92 speedup, while 2 nodes with 4 GPUs each get only 0.7683 speedup (--batch_size=12).
I failed to upload my trace file for an unknown reason; you can reproduce my results in a similar environment. I notice that the jitter with batch_size=12 (around 10.0) is much larger than with batch_size=64 (around 2.0). What's the reason for this?
BTW, why is there no MEMCPY timeline in the distributed tracing file?
Currently we are not actively working on distributed performance, although we plan on doing so in the future. In general, low batch sizes are slower, and the faster the GPU, the bigger difference there is between high batch sizes and low batch sizes.
@cryptox31 @reedwm Look at this: if I sleep 2 seconds after each session run, the per-step performance gets better.
Code (excerpt; the two added lines are marked):

```python
def benchmark_one_step(...):
    ...
    if image_producer is not None:
        image_producer.notify_image_consumption()
    train_time = time.time() - start_time
    # ADDED: sleep 2 seconds after each session run. train_time is already
    # computed above, so the sleep is excluded from the per-step numbers.
    time.sleep(2)
    # END ADDED
    step_train_times.append(train_time)
    ...
```
Log:
TensorFlow: 1.8
Model: resnet50
Dataset: imagenet (synthetic)
Mode: training
SingleSess: False
Batch size: 96 global
12.0 per device
Num batches: 100
Num epochs: 0.01
Devices: ['/job:worker/task:0/gpu:0', '/job:worker/task:0/gpu:1', '/job:worker/task:0/gpu:2', '/job:worker/task:0/gpu:3']
Data format: NCHW
Layout optimizer: False
Optimizer: sgd
Variables: distributed_replicated
Sync: True
==========
Step Img/sec total_loss
1 images/sec: 164.7 +/- 0.0 (jitter = 0.0) 8.383
10 images/sec: 156.5 +/- 1.4 (jitter = 6.5) 8.358
20 images/sec: 155.5 +/- 1.0 (jitter = 4.4) 8.445
30 images/sec: 155.6 +/- 0.8 (jitter = 5.3) 8.155
40 images/sec: 155.7 +/- 0.7 (jitter = 6.1) 8.125
50 images/sec: 155.7 +/- 0.6 (jitter = 6.1) 8.291
60 images/sec: 156.2 +/- 0.6 (jitter = 5.8) 8.296
70 images/sec: 155.6 +/- 0.6 (jitter = 6.3) 8.245
80 images/sec: 155.3 +/- 0.6 (jitter = 6.1) 8.069
90 images/sec: 155.3 +/- 0.6 (jitter = 5.9) 8.246
100 images/sec: 155.3 +/- 0.5 (jitter = 5.9) 8.499
----------------------------------------------------------------
total images/sec: 41.54
----------------------------------------------------------------
The total images/sec is lower because it takes my sleep time into account. According to the per-step images/sec, the total images/sec should be around 155 * 2 = 310, which means the speedup is 310 / 42.67 / 8 = 0.908. Without the sleep after each session run, the speedup is only 0.7683. Somehow the 2-second sleep affects the performance of each session run. Why is that? (Notice that the jitter in the sleep version is much lower than in the no-sleep version.)
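A minimal sketch of that back-of-the-envelope adjustment (the ~155 images/sec per worker comes from the per-step numbers in the log above):

```python
# With time.sleep(2) in benchmark_one_step, the reported total images/sec
# includes the sleep, so estimate throughput from the per-step numbers instead.
SINGLE_GPU_IPS = 42.67
PER_WORKER_IPS = 155.0             # per-step images/sec reported by each worker
NUM_WORKERS, NUM_GPUS = 2, 8

effective_ips = PER_WORKER_IPS * NUM_WORKERS            # ~310 images/sec
speedup = effective_ips / (SINGLE_GPU_IPS * NUM_GPUS)   # ~0.908
print('effective images/sec: %.1f, speedup: %.3f' % (effective_ips, speedup))
```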
BTW:
- In a single node (1 GPU), the FPS is 42.67, so each step takes 12 / 42.67 = 0.281 s.
- In a single node (4 GPUs), the FPS is 163.95, so each step takes 12 * 4 / 163.95 = 0.293 s, i.e. an extra 0.293 - 0.281 = 0.012 s for going multi-GPU.
- In 2 nodes (1 GPU each), the FPS is 79.08, so each step takes 12 * 2 / 79.08 = 0.3035 s, i.e. an extra 0.3035 - 0.281 = 0.0225 s for going multi-worker.
So a step in 2 nodes (4 GPUs each) should take around 0.281 + 0.012 + 0.0225 = 0.3155 s, which means the FPS should be around 12 * 8 / 0.3155 = 304.3 (in the no-sleep version the FPS is 262.25; in the sleep version it is about 310). Any idea? The sketch below reproduces this arithmetic.
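The same arithmetic as a minimal sketch (the throughputs are the measured totals from the logs above; the assumption is that the multi-GPU and multi-worker overheads are simply additive):

```python
# Decompose the per-step time (batch_size=12 per device) into single-GPU
# compute plus multi-GPU and multi-worker overheads, then predict 2 x 4 GPUs.
BATCH = 12.0

t_1gpu     = BATCH * 1 / 42.67    # ~0.281 s  (1 node, 1 GPU)
t_4gpu     = BATCH * 4 / 163.95   # ~0.293 s  (1 node, 4 GPUs)
t_2workers = BATCH * 2 / 79.08    # ~0.303 s  (2 workers, 1 GPU each)

multi_gpu_overhead    = t_4gpu - t_1gpu       # ~0.012 s
multi_worker_overhead = t_2workers - t_1gpu   # ~0.022 s

t_expected   = t_1gpu + multi_gpu_overhead + multi_worker_overhead  # ~0.315 s
fps_expected = BATCH * 8 / t_expected                               # ~304 images/sec
print('expected step time: %.4f s, expected FPS: %.1f' % (t_expected, fps_expected))
# Measured: 262.25 images/sec without the sleep, ~310 with the sleep.
```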
I cannot upload my trace file, but if someone reproduces my results, you'll find that at the end of each step the RecvOps for variables from ps0 -> worker0 and ps1 -> worker1 finish early (at ~290 ms), while some of the RecvOps for variables from ps1 -> worker0 and ps0 -> worker1 finish late (at ~330 ms). There is no other op between 290 ms and 330 ms.
(All RecvOps from worker -> ps finish early.)
(I used variable_update=distributed_replicated, so variables are downloaded from ps to worker at the end of each step.)
My network bandwidth is 10 Gbps, which I think is enough to cover the transmission of gradients and variables. What confuses me is that in the sleep version (sleep 2 seconds after each step), all RecvOps finish early, at ~290 ms. Does this have anything to do with my network? Can anyone reproduce my results?
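For anyone trying to reproduce and inspect this, a minimal, self-contained sketch of dumping a per-step Chrome trace with stock TF 1.x APIs (the toy matmul graph is just a stand-in for the real training op; this is separate from whatever tracing options the benchmark script itself exposes):

```python
import tensorflow as tf
from tensorflow.python.client import timeline

# Trivial stand-in graph; replace y with the real training op to trace a step.
x = tf.Variable(tf.random_normal([1024, 1024]))
y = tf.matmul(x, x)

run_options  = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(y, options=run_options, run_metadata=run_metadata)

# Write a Chrome trace for the traced step (open it at chrome://tracing).
tl = timeline.Timeline(run_metadata.step_stats)
with open('timeline_step.json', 'w') as f:
    f.write(tl.generate_chrome_trace_format(show_memory=True))
```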
@reedwm Can anyone help explain the difference between the sleep version and the no-sleep version?