The design and optimization of API Benchmark
Program = feed + abs + fetch
- Profile data
-------------------------> Profiling Report <-------------------------
Place: All
Time unit: ms
Sorted by total time in descending order in the same thread
Event Calls Total CPU Time (Ratio) GPU Time (Ratio) Min. Max. Ave. Ratio.
thread0::GpuMemcpySync:GPU->CPU 10 65.3952 39.898246 (0.610110) 25.496945 (0.389890) 6.46307 6.75515 6.53952 0.434411
thread0::fetch 10 42.5865 37.686449 (0.884939) 4.900031 (0.115061) 4.15256 4.90003 4.25865 0.282896
thread0::TensorCopySync:GPU->CPU 10 41.8076 37.494616 (0.896837) 4.313012 (0.103163) 4.13309 4.31301 4.18076 0.277722
thread0::abs 10 0.6688 0.450827 (0.674083) 0.217973 (0.325917) 0.052712 0.134069 0.06688 0.00444274
thread0::feed 10 0.079468 0.064448 (0.810993) 0.015020 (0.189007) 0.005744 0.01502 0.0079468 0.000527895
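The columns of the report can be cross-checked with a little arithmetic. The sketch below is only an illustration, not part of the benchmark tool: it copies the five rows by hand and assumes that Total is CPU Time plus GPU Time, that Ave. is Total divided by Calls, and that the last Ratio column is each event's Total divided by the sum of all Totals. The names `ROWS` and `recompute` are hypothetical.

```python
# Rows copied by hand from the profiling report:
# (event, calls, total_ms, cpu_ms, gpu_ms, reported_ratio)
ROWS = [
    ("GpuMemcpySync:GPU->CPU", 10, 65.3952, 39.898246, 25.496945, 0.434411),
    ("fetch", 10, 42.5865, 37.686449, 4.900031, 0.282896),
    ("TensorCopySync:GPU->CPU", 10, 41.8076, 37.494616, 4.313012, 0.277722),
    ("abs", 10, 0.6688, 0.450827, 0.217973, 0.00444274),
    ("feed", 10, 0.079468, 0.064448, 0.015020, 0.000527895),
]


def recompute(rows):
    """Recompute the derived columns under the assumptions stated above."""
    grand_total = sum(row[2] for row in rows)
    result = {}
    for name, calls, total, cpu, gpu, _reported in rows:
        result[name] = {
            "cpu_plus_gpu": cpu + gpu,      # assumed to equal Total
            "cpu_ratio": cpu / total,       # the (Ratio) next to CPU Time
            "ave": total / calls,           # the Ave. column
            "ratio": total / grand_total,   # the last Ratio. column
        }
    return result


if __name__ == "__main__":
    for name, derived in recompute(ROWS).items():
        print(f"{name:26s} ave={derived['ave']:.5f} ratio={derived['ratio']:.6f}")
```

Under these assumptions the recomputed values agree with the reported columns to within rounding, which suggests the Ratio column is a share of summed event time rather than of wall-clock time.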
name: "abs",
device: "GPU",
precision: { stable: "True", diff: 0.00000 },
speed: { repeat: 10, start: 1, end: 9, total: 5.08994, feed: 0.00000, compute: 0.00000, fetch: 0.00000 }
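The CPU->GPU transfer of the feed data already starts when the Executor sets the feed data; it does not happen inside the feed op.
The GPU->CPU transfer of the fetch data happens inside the fetch op. After the GPU operation at the bottom finishes, the cuda_api layer still takes a long time.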
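The fetch-side observation can be made concrete with the numbers from the profiling report. This is a rough sketch, assuming that `TensorCopySync:GPU->CPU` is nested inside `fetch` (the note implies this, but the report does not state it); the constant and function names are hypothetical.

```python
# Values copied by hand from the profiling report (ms, summed over 10 calls).
CALLS = 10
FETCH_TOTAL = 42.5865   # thread0::fetch, Total
COPY_TOTAL = 41.8076    # thread0::TensorCopySync:GPU->CPU, Total
COPY_CPU = 37.494616    # CPU Time of the copy
COPY_GPU = 4.313012     # GPU Time of the copy


def fetch_breakdown():
    """Split the fetch cost into its copy share and the copy's CPU/GPU sides."""
    return {
        # Share of fetch spent in the synchronous tensor copy
        # (assumes the copy is nested inside fetch).
        "copy_share_of_fetch": COPY_TOTAL / FETCH_TOTAL,
        # Share of the copy attributed to CPU-side (host API) time.
        "cpu_side_share_of_copy": COPY_CPU / COPY_TOTAL,
        "cpu_side_ms_per_call": COPY_CPU / CALLS,
        "gpu_side_ms_per_call": COPY_GPU / CALLS,
    }


if __name__ == "__main__":
    for key, value in fetch_breakdown().items():
        print(f"{key}: {value:.4f}")
```

Read this way, nearly all of the fetch time is the synchronous copy, and most of the copy is host-side time rather than GPU time, which is consistent with the long cuda_api span noted above.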