Tengine: on the Orange Pi 4 (RK3399) board, why does yolov4-tiny run slower on the GPU than on the CPU?

Dear Tengine team: sorry to bother you. I have a question that I hope you can help with.

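■ Problem
On the Orange Pi 4 (RK3399) board, I wrote a test program that runs yolov4-tiny on the GPU. It turns out that inference on the GPU (OpenCL) is actually slower than on the CPU (NEON). Why is that? Is it because Tengine-Lite does not yet fully support yolov4-tiny? Or could you suggest some ways to debug this?
Note: "classification" and "mobilenet_ssd" behave as expected (with the GPU, inference is faster than with the CPU alone).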

■ Hardware environment
・CPU: RK3399 (2x Cortex-A72 + 4x Cortex-A53)
・GPU: Mali-T864

■ Software environment
・Tengine-Lite, latest version (commit id: ef85a11c2baf5dc29c380dad36e43fc5d41b6594)
・ACL (Arm Compute Library) v20.05 (commit id: 6a7771e46)

■ Steps
① Add a test program, tm_yolov4_tiny_acl_fp32.cpp, that runs yolov4-tiny on the GPU:
    $ cd Tengine
    $ cp examples/tm_yolov4_tiny.cpp examples/tm_yolov4_tiny_acl_fp32.cpp
    $ vim examples/tm_yolov4_tiny_acl_fp32.cpp
  ・Before:
    graph_t graph = create_graph(nullptr, "tengine", model_file);
  ・After:
    context_t acl_context = create_context("acl", 1);
    add_context_device(acl_context, "ACL");
    graph_t graph = create_graph(acl_context, "tengine", model_file);
② Add tm_yolov4_tiny_acl_fp32.cpp to examples/CMakeLists.txt.
③ Build.

■ yolov4-tiny results
The CPU-only run has the lowest average time.

<Detailed results>
① CPU only: average time 371.62 ms
    $ ./tm_yolov4_tiny -m ~/models/yolov4-tiny.tmfile -i ~/pictures/ssd_dog.jpg -r 10
    ===========================================================
    tengine-lite library version: 1.5-dev
    Repeat 10 times, thread 1, avg time 371.62 ms, max_time 423.39 ms, min_time 360.14 ms
    --------------------------------------
    detection num: 3
    16: 87%, [ 136, 205, 319, 542], dog
    7: 81%, [ 463, 79, 703, 170], truck
    1: 61%, [ 72, 100, 576, 479], bicycle
    ===========================================================

② CPU + GPU: average time 395.78 ms
    $ ./tm_yolov4_tiny_acl_fp32 -m ~/models/yolov4-tiny.tmfile -i ~/pictures/ssd_dog.jpg -r 10
    ===========================================================
    tengine-lite library version: 1.5-dev
    ACL initialized
    ACL initialized
    ACL initialized
    ACL initialized
    ACL initialized
    Repeat 10 times, thread 1, avg time 395.78 ms, max_time 1089.57 ms, min_time 307.71 ms
    --------------------------------------
    detection num: 3
    16: 87%, [ 136, 205, 319, 542], dog
    7: 81%, [ 463, 79, 703, 170], truck
    1: 61%, [ 72, 100, 576, 479], bicycle
    Segmentation fault
    ===========================================================
  ★ A separate problem also shows up here: the call to postrun_graph() crashes with "Segmentation fault".

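■ Additional notes
When I test the stock examples "classification" and "mobilenet_ssd", the results are as expected (with the GPU, inference is faster than with the CPU alone).
1. "tm_classification" and "tm_classification_acl"
  ① CPU only: average time 202.44 ms
  ② CPU + GPU: average time 86.05 ms
2. "tm_mobilenet" and "tm_mobilenet_ssd"
  ① CPU only: average time 255.29 ms
  ② CPU + GPU: average time 193.82 ms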
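
One way to debug the timing question is to measure with a few untimed warm-up iterations first, so that any one-time cost on the GPU path does not land in the average. The sketch below is a generic harness, not Tengine code: `TimingStats`, `time_runs`, and the `warmup` parameter are assumptions introduced here, and `run_once` stands in for whatever the example calls inside its timing loop.

```cpp
#include <algorithm>
#include <chrono>
#include <functional>
#include <numeric>
#include <vector>

// Hypothetical result type: summary of the timed iterations only.
struct TimingStats {
    double avg_ms;
    double min_ms;
    double max_ms;
};

// Run `run_once` `warmup` times without timing, then `repeat` times with timing.
TimingStats time_runs(const std::function<void()>& run_once, int warmup, int repeat) {
    for (int i = 0; i < warmup; ++i) {
        run_once();  // untimed: absorbs one-time setup cost, if any
    }

    std::vector<double> samples_ms;
    samples_ms.reserve(repeat > 0 ? repeat : 0);
    for (int i = 0; i < repeat; ++i) {
        const auto start = std::chrono::steady_clock::now();
        run_once();
        const auto stop = std::chrono::steady_clock::now();
        samples_ms.push_back(
            std::chrono::duration<double, std::milli>(stop - start).count());
    }

    if (samples_ms.empty()) {
        return {0.0, 0.0, 0.0};  // no timed iterations requested
    }
    const double total = std::accumulate(samples_ms.begin(), samples_ms.end(), 0.0);
    return {total / samples_ms.size(),
            *std::min_element(samples_ms.begin(), samples_ms.end()),
            *std::max_element(samples_ms.begin(), samples_ms.end())};
}
```

In the yolov4-tiny example, `run_once` would presumably wrap the inference call in the timing loop. Comparing the CPU and ACL builds with, say, `warmup = 2` and `repeat = 10` would show whether the gap remains once the first iterations are excluded.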

Jan 26 '22 10:01 zhangweiwahaha042

Could you try the OpenCL backend?

Feb 02 '22 15:02 BUG1989

"Could you try the OpenCL backend?"

Hello Tengine team, thank you very much for your reply.

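What exactly does "using the OpenCL backend" involve? If possible, could you explain the steps? Sorry to ask, but my understanding is that my current setup already uses OpenCL as the backend, so I am a little confused.
・In detail: first, I set the Tengine-Lite hardware backend to "ACL". Then, when Tengine-Lite runs, "ACL" automatically calls the OpenCL API to drive the GPU, which is how yolov4-tiny ends up running on the GPU.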

Feb 04 '22 00:02 zhangweiwahaha042
