Question about variable definitions in main_ocl.cpp
Hi bro. What do the variables "elements per workitem" and "workitem fusion degree" defined in the function mean?
More or less, they both express the amount of work assigned to each workitem. However, they differ in the ordering of the replicated operations and in the memory access patterns they apply. So, you might want to experiment by changing both values.
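To make the difference concrete, here is a minimal host-side sketch of how two such parameters could map a workitem to global element indices. It is an illustration under assumed conventions, not the actual kernel in main_ocl.cpp: the function name `indices_for_workitem`, the parameter names, and the choice of strides (inner loop strided by the workgroup size, outer loop strided by the whole grid chunk) are all hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical illustration only; the real kernel may index differently.
// Returns, in visiting order, the global element indices touched by one workitem.
// Assumed layout:
//  - inner loop ("elements per workitem"): consecutive elements of a workitem are
//    group_size apart, so neighbouring workitems touch neighbouring addresses;
//  - outer loop ("fusion degree"): the whole inner pattern is repeated, shifted by
//    the number of elements the entire grid covers in one pass.
std::vector<std::size_t> indices_for_workitem(std::size_t group_id,
                                              std::size_t local_id,
                                              std::size_t group_size,
                                              std::size_t num_groups,
                                              std::size_t elements_per_wi,
                                              std::size_t fusion_degree) {
    assert(local_id < group_size && group_id < num_groups);

    const std::size_t inner_stride = group_size;
    const std::size_t outer_stride = num_groups * group_size * elements_per_wi;
    const std::size_t base = group_id * group_size * elements_per_wi + local_id;

    std::vector<std::size_t> out;
    out.reserve(elements_per_wi * fusion_degree);
    for (std::size_t k = 0; k < fusion_degree; ++k) {        // outer: fused repetitions
        for (std::size_t j = 0; j < elements_per_wi; ++j) {  // inner: strided elements
            out.push_back(base + j * inner_stride + k * outer_stride);
        }
    }
    return out;
}
```

With these assumed strides, both parameters multiply the work per workitem by the same factor, but raising one or the other changes which addresses are visited back to back, which is the kind of difference that can matter for memory coalescing and for how the compiler unrolls the loops.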
What is the overall design idea? Will there be any differences between GPUs of different architectures?
Certainly, these parameters can have a different impact on different GPU architectures. In fact, they were first introduced to address the different optimizations needed on NVIDIA and AMD GPUs. Even the compiler plays a significant role here, as it can lead to different patterns in the generated code. So, you can run your own experiments to optimize these values, or keep the defaults if you don't want to focus on a specific architecture.
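If you do want to tune the values for one device, a simple exhaustive sweep is usually enough. The sketch below is only an assumption about how such an experiment could be organized: `Config`, `pick_best`, and the `measure` callback are hypothetical names, and how the two values actually reach the kernel (compile-time define or runtime argument) is left entirely to the callback you supply.

```cpp
#include <functional>
#include <vector>

// Hypothetical result record for one tested parameter pair.
struct Config {
    unsigned elements_per_wi;
    unsigned fusion_degree;
    double score;  // e.g. measured throughput; higher is better
};

// Tries every combination of the candidate values and keeps the best one.
// `measure` is a user-supplied callback that runs the benchmark with the given
// pair and returns its score; it is an assumption, not part of main_ocl.cpp.
Config pick_best(const std::vector<unsigned>& elements_candidates,
                 const std::vector<unsigned>& fusion_candidates,
                 const std::function<double(unsigned, unsigned)>& measure) {
    Config best{0, 0, -1.0};
    for (unsigned e : elements_candidates) {
        for (unsigned f : fusion_candidates) {
            const double s = measure(e, f);
            if (s > best.score) {
                best = Config{e, f, s};
            }
        }
    }
    return best;
}
```

Since the best pair can change with the GPU, the driver, and the compiler version, the sweep would need to be repeated for each setup you care about.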
Why did the results I measured not reach the theoretical value? What could be the reason for this?