The design and optimization of API Benchmark

Open Xreki opened this issue 5 years ago • 1 comments

Nov 27 '19 08:11 Xreki

Program = feed + abs + fetch

Profile data

------------------------->     Profiling Report     <-------------------------

Place: All
Time unit: ms
Sorted by total time in descending order in the same thread

Event                               Calls       Total       CPU Time (Ratio)        GPU Time (Ratio)        Min.        Max.        Ave.        Ratio.
thread0::GpuMemcpySync:GPU->CPU     10          65.3952     39.898246 (0.610110)    25.496945 (0.389890)    6.46307     6.75515     6.53952     0.434411
thread0::fetch                      10          42.5865     37.686449 (0.884939)    4.900031 (0.115061)     4.15256     4.90003     4.25865     0.282896
thread0::TensorCopySync:GPU->CPU    10          41.8076     37.494616 (0.896837)    4.313012 (0.103163)     4.13309     4.31301     4.18076     0.277722
thread0::abs                        10          0.6688      0.450827 (0.674083)     0.217973 (0.325917)     0.052712    0.134069    0.06688     0.00444274
thread0::feed                       10          0.079468    0.064448 (0.810993)     0.015020 (0.189007)     0.005744    0.01502     0.0079468   0.000527895
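As a quick sanity check on the report, the sketch below recomputes the last column. It assumes that `Ratio.` is each event's `Total` divided by the sum of all listed totals. That reading is inferred from the numbers, not taken from the profiler's documentation, and the variable names are illustrative.

```python
# Total time (ms) per event, copied from the profiling report above.
totals_ms = {
    "GpuMemcpySync:GPU->CPU": 65.3952,
    "fetch": 42.5865,
    "TensorCopySync:GPU->CPU": 41.8076,
    "abs": 0.6688,
    "feed": 0.079468,
}

# Assumption: "Ratio." = event total / sum of the totals of all listed events.
grand_total_ms = sum(totals_ms.values())
ratios = {name: total / grand_total_ms for name, total in totals_ms.items()}

for name, ratio in ratios.items():
    print(f"{name:<26} {ratio:.6f}")
```

Under that assumption the recomputed values agree with the report: the GPU->CPU copy and fetch events account for nearly all of the recorded time, while `abs` itself is under 0.5%.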

{
  name: "abs",
  device: "GPU",
  precision: { stable: "True", diff: 0.00000 },
  speed: { repeat: 10, start: 1, end: 9, total: 5.08994, feed: 0.00000, compute: 0.00000, fetch: 0.00000 }
}

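The CPU->GPU transfer of the feed data starts when the feed data is set in the Executor; it does not happen inside the feed op.
The GPU->CPU transfer of the fetch data happens inside the fetch op. After the GPU operation at the bottom finishes, the cuda_api layer still takes a long time.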
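The observation about time remaining on the host side after the GPU work ends can be cross-checked against the CPU/GPU split in the profiling report. This is a minimal sketch, assuming the `(Ratio)` next to `CPU Time` is CPU time divided by CPU time plus GPU time. The helper name `cpu_share` is hypothetical.

```python
# (CPU time, GPU time) in ms, copied from the profiling report in this issue.
events_ms = {
    "GpuMemcpySync:GPU->CPU": (39.898246, 25.496945),
    "fetch": (37.686449, 4.900031),
    "TensorCopySync:GPU->CPU": (37.494616, 4.313012),
}


def cpu_share(cpu_ms, gpu_ms):
    """Fraction of an event's CPU + GPU time attributed to the CPU side."""
    return cpu_ms / (cpu_ms + gpu_ms)


shares = {name: cpu_share(*times) for name, times in events_ms.items()}

for name, share in shares.items():
    print(f"{name:<26} CPU share = {share:.6f}")
```

For `fetch` and `TensorCopySync:GPU->CPU`, close to 90% of the recorded time falls on the CPU side under this reading, which is in line with the note above that the host-side layer keeps running well after the GPU copy itself has finished.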
