
Explanation for performance gap

Open · ThomasDebrunner opened this issue 4 years ago · 4 comments

I am curious about performance measurements / theoretical performance numbers. The often-stated theoretical performance of the VideoCore IV is 24 GFLOPS.

The author of py-videocore manages to get to 8.32 GFLOPS with hand-optimized code: https://qiita.com/9_ties/items/e0fdd165c1c7df6bb8ee

The fastest claimed measurement with clpeak using VC4CL is also just above 8 GFLOPS. On my Raspberry Pi, I measure about 6.3 GFLOPS.

So even synthetic benchmarks and hand-optimized code only reach about one third of the theoretical performance. For desktop GPUs, clpeak mostly reports roughly the same performance as stated by the manufacturer. Where does this large performance gap come from?
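For reference, I assume the 24 GFLOPS figure comes from counting both ALUs of all 12 QPUs at the nominal 250 MHz clock. A quick back-of-the-envelope check (the numbers are the commonly cited ones, not measurements):

```c
#include <stdio.h>

/* Back-of-the-envelope peak for the VideoCore IV, assuming the commonly
 * cited configuration: 12 QPUs, 4 physical SIMD lanes per QPU per cycle
 * (the 16-wide virtual SIMD executes over 4 cycles), an add ALU plus a
 * mul ALU per lane, and a 250 MHz core clock. */
int main(void) {
    const double qpus  = 12.0;
    const double lanes = 4.0;    /* physical lanes per QPU per cycle */
    const double alus  = 2.0;    /* add ALU + mul ALU */
    const double clock = 250e6;  /* Hz, nominal */

    double peak = qpus * lanes * alus * clock;              /* FLOP/s */
    printf("theoretical peak: %.1f GFLOPS\n", peak / 1e9);  /* -> 24.0 */
    return 0;
}
```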

ThomasDebrunner · Jun 15 '20, 14:06

I achieved at most 13.62 GFLOP/s with a large number of loop iterations in FlopsCL and with float16. One of the important aspects is to balance kernel length against the number of iterations.

I have already done many measurements, but will publish them in October at the latest.
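To illustrate what I mean by balancing kernel length and iterations, here is a rough sketch of the kind of kernel such peak-FLOPS benchmarks use (the kernel name, the mad() chain and the FLOP counting are illustrative, not copied from FlopsCL; float16 matches the QPUs' 16-element vector width):

```c
// OpenCL C (a C dialect) sketch of a clpeak/FlopsCL-style peak-FLOPS kernel.
// The body is a chain of dependent mad() calls on float16 vectors; the body
// length and the loop trip count can be traded against each other.
__kernel void peak_flops_f16(__global float *out, float seed, int iters)
{
    float16 x = (float16)(seed);
    float16 y = (float16)((float)get_global_id(0));

    for (int i = 0; i < iters; ++i) {
        // each mad() on a float16 counts as 32 FLOPs (16 mul + 16 add)
        y = mad(x, y, x);
        y = mad(y, x, y);
        y = mad(x, y, y);
        y = mad(y, y, x);
    }

    // fold the vector down so the compiler cannot discard the work
    float8 a = y.lo + y.hi;
    float4 b = a.lo + a.hi;
    float2 c = b.lo + b.hi;
    out[get_global_id(0)] = c.x + c.y;
}
```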

pfoof · Jun 15 '20, 15:06

So one big factor is the ALUs: you only get the full 24 GFLOPS if you utilize both ALUs in every clock cycle! Since the multiplication ALU only supports a small set of opcodes, it is definitely not utilized that much.
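As a toy example of what "both ALUs in every cycle" means (whether VC4C actually packs these two operations into a single instruction is up to its scheduler, so take this as a sketch, not a guarantee):

```c
// OpenCL C sketch: each iteration contains one addition and one multiplication
// that do not depend on each other, so the add ALU and the mul ALU could in
// principle both retire a result every cycle. A loop consisting only of
// additions (or only of multiplications) leaves one ALU idle and therefore
// cannot exceed half of the 24 GFLOPS peak.
__kernel void dual_alu_demo(__global float *out, float a, float b, int iters)
{
    float s = a;  // accumulator for the add ALU
    float p = b;  // accumulator for the mul ALU
    for (int i = 0; i < iters; ++i) {
        s = s + a;  // add-ALU work
        p = p * b;  // independent mul-ALU work
    }
    out[get_global_id(0)] = s + p;
}
```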

And of course the other problem is memory bandwidth: compared to the fairly high compute power, the memory interfaces are very slow.
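A rough roofline-style way to see this (the bandwidth value below is just an assumed example, not a measurement of the Raspberry Pi's SDRAM interface):

```c
#include <stdio.h>

/* Roofline-style estimate: how many FLOPs a kernel must perform per byte of
 * global memory traffic before it stops being memory bound. The peak is the
 * theoretical 24 GFLOPS; the bandwidth is only an assumed example value. */
int main(void) {
    const double peak_flops = 24e9; /* FLOP/s, theoretical */
    const double bandwidth  = 1e9;  /* B/s, assumed example, not measured */

    double break_even = peak_flops / bandwidth; /* FLOPs per byte */
    printf("a kernel needs more than %.0f FLOPs per byte of memory traffic\n"
           "to be compute bound at this bandwidth\n", break_even);
    return 0;
}
```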

And as @pfoof hinted (I think), overly large kernel code (or branches skipping over too many instructions) might also lead to cache misses when fetching instructions. But I don't have any numbers for that.

doe300 · Jun 15 '20, 16:06

Hey @doe300, I couldn't find any other way to contact you, and I would like to share the research from my master's thesis on VC4CL: https://www.researchgate.net/publication/346000679_Performance-energy_energy_benchmarking_of_selected_parallel_programming_platforms_with_OpenCL

pfoof · Nov 18 '20, 13:11

@pfoof, very interesting read, thanks for sharing!

I would have hoped the Raspberry Pi would fare better in terms of power vs. computation, but I guess I just have to try to improve the performance :wink:

I definitely have to look at your thesis in more detail, especially at the detailed benchmarks, result interpretations and comparisons between Raspberry Pi CPU and GPU performance! One thing I can already take away: the result of section 4.4 (Fibonacci adder) suggests that instruction cache misses (or instruction fetching in general) have a far greater performance impact than I thought. Definitely something I should take a look at.

doe300 · Nov 18 '20, 15:11