benchmark
Optimize the performance of the seq2seq model on GPU
Initial performance
- Test date: August 8, 2019
- Tested by: @Xreki

Profiling results
-------------------------> Profiling Report <-------------------------
Note! This Report merge all thread info into one.
Place: All
Time unit: ms
Sorted by total time in descending order in the same thread
Event Calls Total CPU Time (Ratio) GPU Time (Ratio) Min. Max. Ave. Ratio.
recurrent_grad 20 3334.05 3130.058004 (0.938815) 203.995332 (0.061185) 27.1338 260.926 166.703 0.308526
recurrent 20 1497.97 1452.955293 (0.969947) 45.018247 (0.030053) 16.0935 110.292 74.8987 0.138619
elementwise_add_grad 4795 1377.74 168.491482 (0.122296) 1209.244141 (0.877704) 0.026835 1.0058 0.287328 0.127493
matmul_grad 3196 844.682 311.706298 (0.369022) 532.976110 (0.630978) 0.087976 13.3952 0.264294 0.0781651
matmul 3196 526.856 173.302980 (0.328938) 353.553345 (0.671062) 0.036523 8.69278 0.164849 0.0487542
sum 12389 503.771 391.555491 (0.777248) 112.215826 (0.222752) 0.021764 6.88414 0.0406628 0.0466179
GpuMemcpyAsync(same_gpu):GPU->GPU 18475 334.162 303.616585 (0.908592) 30.544991 (0.091408) 0.013014 0.120487 0.0180872 0.0309226
elementwise_mul_grad 6734 256.62 238.810646 (0.930602) 17.808921 (0.069398) 0.027794 7.31021 0.038108 0.023747
elementwise_mul 6864 203.549 187.715197 (0.922212) 15.833627 (0.077788) 0.020651 1.87285 0.0296545 0.018836
elementwise_add 4795 164.711 153.425376 (0.931485) 11.285194 (0.068515) 0.021858 1.79897 0.0343505 0.015242
concat 3794 164.522 137.125469 (0.833479) 27.396308 (0.166521) 0.029323 0.125735 0.0433637 0.0152245
rnn_memory_helper_grad 5244 140.187 140.106174 (0.999421) 0.081113 (0.000579) 0.020427 0.150314 0.0267329 0.0129726
rnn_memory_helper 5244 134.466 134.422311 (0.999676) 0.043546 (0.000324) 0.02023 0.131774 0.0256418 0.0124432
concat_grad 2340 114.213 102.730713 (0.899469) 11.481954 (0.100531) 0.032481 0.15419 0.0488088 0.010569
sigmoid_grad 4362 113.874 107.131844 (0.940796) 6.741797 (0.059204) 0.020312 0.086206 0.0261058 0.0105376
sigmoid 4362 108.18 101.646670 (0.939606) 6.533444 (0.060394) 0.019966 0.132063 0.0248006 0.0100108
elementwise_sub 2362 79.7406 75.840477 (0.951089) 3.900169 (0.048911) 0.027971 0.095837 0.0337598 0.00737903
tanh_grad 2908 76.4149 71.989046 (0.942082) 4.425817 (0.057918) 0.021056 0.090353 0.0262775 0.00707127
split 1454 75.8946 60.162568 (0.792712) 15.732015 (0.207288) 0.043571 0.630367 0.0521971 0.00702312
elementwise_sub_grad 2352 71.542 68.192457 (0.953180) 3.349583 (0.046820) 0.022459 0.168436 0.0304175 0.00662035
tanh 2908 67.9909 63.220004 (0.929830) 4.770911 (0.070170) 0.019094 0.128054 0.0233806 0.00629174
reshape2 1762 67.5811 67.563221 (0.999736) 0.017841 (0.000264) 0.012027 0.104218 0.0383547 0.00625381
dropout 1454 67.5326 52.996526 (0.784755) 14.536027 (0.215245) 0.038525 0.110662 0.046446 0.00624932
dropout_grad 1454 50.5008 47.855241 (0.947613) 2.645600 (0.052387) 0.027463 0.624477 0.0347324 0.00467324
fill_constant 1496 41.2608 38.647031 (0.936653) 2.613757 (0.063347) 0.019868 0.110909 0.0275807 0.00381819
GpuMemcpyAsync:CPU->GPU 2684 38.2639 33.139905 (0.866089) 5.123964 (0.133911) 0.008246 0.077719 0.0142563 0.00354086
softmax 433 37.6717 32.780017 (0.870150) 4.891664 (0.129850) 0.075404 0.126331 0.0870016 0.00348606
transpose2_grad 906 29.4507 26.412712 (0.896845) 3.037988 (0.103155) 0.023196 0.156133 0.0325063 0.00272531
unsqueeze2 433 29.4025 29.319538 (0.997179) 0.082947 (0.002821) 0.059813 0.128764 0.0679041 0.00272084
transpose2 916 28.7084 25.577694 (0.890949) 3.130678 (0.109051) 0.023465 0.092804 0.031341 0.00265661
softmax_grad 433 27.5446 23.813278 (0.864535) 3.731316 (0.135465) 0.05221 0.123997 0.0636134 0.00254892
squeeze2_grad 433 27.0135 26.944857 (0.997458) 0.068676 (0.002542) 0.054414 0.133838 0.0623869 0.00249978
eager_deletion 870 26.7456 26.736183 (0.999646) 0.009456 (0.000354) 0.002324 1.38676 0.0307421 0.00247499
squeeze2 433 25.2393 25.181926 (0.997725) 0.057410 (0.002275) 0.050768 0.135624 0.0582895 0.00233559
unsqueeze2_grad 433 25.0274 24.968753 (0.997655) 0.058683 (0.002345) 0.050575 0.119195 0.0578001 0.00231599
softmax_with_cross_entropy 10 18.02 2.409284 (0.133700) 15.610725 (0.866300) 0.722111 2.15961 1.802 0.00166753
adam 130 14.9652 5.707370 (0.381375) 9.257877 (0.618625) 0.034774 0.354246 0.115117 0.00138485
reduce_sum 140 10.8105 8.748800 (0.809285) 2.061724 (0.190715) 0.037435 0.162729 0.077218 0.00100038
square 130 6.64256 3.823993 (0.575681) 2.818564 (0.424319) 0.022074 0.138738 0.0510966 0.000614688
scale 260 6.51412 6.142491 (0.942950) 0.371628 (0.057050) 0.019617 0.263 0.0250543 0.000602803
lookup_table_grad 20 5.0277 1.516919 (0.301712) 3.510781 (0.698288) 0.131618 0.334729 0.251385 0.000465253
softmax_with_cross_entropy_grad 10 5.01725 0.894024 (0.178190) 4.123227 (0.821810) 0.222553 0.583161 0.501725 0.000464286
sequence_mask 30 4.32902 4.122014 (0.952182) 0.207004 (0.047818) 0.109838 0.224241 0.144301 0.000400598
slice_grad 80 3.75367 3.079189 (0.820314) 0.674482 (0.179686) 0.028986 0.145544 0.0469209 0.000347357
slice 110 3.24341 3.036975 (0.936351) 0.206440 (0.063649) 0.012624 0.076516 0.0294856 0.000300139
lookup_table 20 3.21076 0.940230 (0.292837) 2.270531 (0.707163) 0.056535 0.242628 0.160538 0.000297117
fill_constant_batch_size_like 50 1.41506 1.316025 (0.930012) 0.099038 (0.069988) 0.023255 0.042394 0.0283013 0.000130947
TensorCopy:GPU->CPU 30 1.08744 1.044846 (0.960830) 0.042595 (0.039170) 0.03174 0.046723 0.036248 0.000100629
GpuMemcpySync:GPU->CPU 30 0.990879 0.899023 (0.907298) 0.091856 (0.092702) 0.029195 0.04386 0.0330293 9.16939e-05
TensorCopy:CPU->GPU 30 0.986733 0.953919 (0.966745) 0.032814 (0.033255) 0.027728 0.046333 0.0328911 9.13102e-05
Fetch 10 0.946421 0.866478 (0.915531) 0.079943 (0.084469) 0.079943 0.133364 0.0946421 8.75798e-05
GpuMemcpySync:CPU->GPU 30 0.904039 0.824342 (0.911843) 0.079697 (0.088157) 0.024434 0.042919 0.0301346 8.36579e-05
reduce_sum_grad 10 0.671832 0.578052 (0.860412) 0.093780 (0.139588) 0.063234 0.076912 0.0671832 6.21699e-05
reduce_mean 10 0.660142 0.576509 (0.873311) 0.083633 (0.126689) 0.054697 0.114415 0.0660142 6.10882e-05
elementwise_max 10 0.558056 0.495460 (0.887832) 0.062596 (0.112168) 0.047459 0.07431 0.0558056 5.16413e-05
reshape2_grad 30 0.51679 0.500414 (0.968312) 0.016376 (0.031688) 0.013623 0.02873 0.0172263 4.78227e-05
GpuMemcpyAsync:GPU->CPU 10 0.466456 0.409467 (0.877826) 0.056989 (0.122174) 0.039485 0.063719 0.0466456 4.31649e-05
FastThreadedSSAGraphExecutorPrepare 10 0.453106 0.396860 (0.875866) 0.056246 (0.124134) 0.039105 0.056246 0.0453106 4.19295e-05
elementwise_div 10 0.446607 0.392265 (0.878323) 0.054342 (0.121677) 0.041335 0.060296 0.0446607 4.13281e-05
reduce_mean_grad 10 0.436627 0.371762 (0.851441) 0.064865 (0.148559) 0.040098 0.056068 0.0436627 4.04045e-05
sqrt 10 0.428603 0.366501 (0.855106) 0.062102 (0.144894) 0.034099 0.062782 0.0428603 3.9662e-05
Scale LossGrad 10 0.371738 0.334618 (0.900145) 0.037120 (0.099855) 0.032202 0.048616 0.0371738 3.43999e-05
shape 30 0.36791 0.342755 (0.931627) 0.025155 (0.068373) 0.008387 0.028157 0.0122637 3.40456e-05
TensorCopy:GPU->GPU 30 0.06052 0.058630 (0.968771) 0.001890 (0.031229) 0.001675 0.002818 0.00201733 5.60039e-06
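A quick way to act on a report like the one above is to filter for ops whose time is dominated by the CPU side (framework/kernel-launch overhead rather than GPU work) — these are the fusion candidates discussed below. The helper below is a hypothetical sketch, not part of Paddle; it assumes the whitespace-separated row layout shown in the report (Event, Calls, Total, CPU ms, (ratio), GPU ms, (ratio), ...).

```python
import re

def parse_profile_line(line):
    """Parse one event row of a profiler report in the layout above.

    Returns a dict with timing fields, or None if the line does not
    look like an event row (e.g. headers, notes).
    """
    m = re.match(
        r"(\S+)\s+(\d+)\s+([\d.]+)\s+([\d.]+)\s+\(([\d.]+)\)"
        r"\s+([\d.]+)\s+\(([\d.]+)\)",
        line,
    )
    if m is None:
        return None
    return {
        "event": m.group(1),
        "calls": int(m.group(2)),
        "total_ms": float(m.group(3)),
        "cpu_ms": float(m.group(4)),
        "cpu_ratio": float(m.group(5)),
        "gpu_ms": float(m.group(6)),
        "gpu_ratio": float(m.group(7)),
    }

def cpu_bound_ops(report, min_total_ms=50.0, min_cpu_ratio=0.9):
    """Return events that spend most of their time on the CPU side.

    High CPU ratio on a nominally-GPU op usually means launch or
    framework overhead dominates -- a hint that the op is too small
    and a candidate for fusion.
    """
    rows = (parse_profile_line(l) for l in report.splitlines())
    return [r["event"] for r in rows if r is not None
            and r["total_ms"] >= min_total_ms
            and r["cpu_ratio"] >= min_cpu_ratio]
```

Applied to the report above, this surfaces exactly the ops called out in the analysis below: the `elementwise_*`, `sigmoid`/`tanh`, and `rnn_memory_helper*` rows all have CPU ratios above 0.9 despite running on the GPU.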
Timeline analysis
- Step overview:
- Does the model use two StaticRNN structures?
- Significant GPU idle time is clearly visible.
- At the start of recurrent, Operators must be created and the ExecutorPrepareContext prepared; each step uses its own step_scope, in which Variables must be created.
- GPU utilization is low both in the preparation phase and inside the StaticRNN.
- These are all elementwise computations and could be fused; a general mechanism supporting this kind of fusion is under development.
- rnn_memory_helper introduces many GPU <-> GPU memory copies; this should be addressed at the design level.
- recurrent_grad contains a synchronization point that causes long waits.
- Gradient aggregation uses many sum_op instances (a PR optimizing this is already in progress).
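To illustrate the elementwise-fusion point: a gate computation such as `sigmoid(x) * tanh(h) + b`, executed as separate ops, launches one kernel per op and materializes a full-size temporary after each step, so the data is read and written several times. A fused kernel makes a single pass. The NumPy sketch below (hypothetical names, not the Paddle fusion pass) contrasts the two forms; the explicit per-element loop stands in for the body of a single generated GPU kernel.

```python
import numpy as np

def gate_unfused(x, h, b):
    """Separate elementwise ops: each line corresponds to one kernel
    launch and one full-size temporary array (extra memory traffic)."""
    s = 1.0 / (1.0 + np.exp(-x))   # sigmoid
    t = np.tanh(h)                 # tanh
    m = s * t                      # elementwise_mul
    return m + b                   # elementwise_add

def gate_fused(x, h, b):
    """Fused form: one traversal of the data, no intermediates.
    The Python loop models the per-element body of a fused kernel."""
    out = np.empty_like(x)
    for i in np.ndindex(x.shape):
        out[i] = 1.0 / (1.0 + np.exp(-x[i])) * np.tanh(h[i]) + b[i]
    return out
```

Both forms compute the same result; the win on GPU comes from collapsing four kernel launches and three temporaries into one launch with no temporaries, which is exactly why the `elementwise_*`, `sigmoid`, and `tanh` rows above show CPU-dominated (launch-overhead-dominated) time.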