clpeak
[src] use CL_PROFILING_COMMAND_END as latency time
CL_PROFILING_COMMAND_END - CL_PROFILING_COMMAND_QUEUED is the real kernel latency.
Is it more accurate to test kernel latency with CL_PROFILING_COMMAND_END - CL_PROFILING_COMMAND_QUEUED and an extremely small kernel? I see a difference of more than 20us on several ARM Mali GPU devices.
Thanks. I agree with the small kernel part. I am seeing more latency on CPU platforms like pocl. How can CL_PROFILING_COMMAND_END - CL_PROFILING_COMMAND_QUEUED give better accuracy than measuring up to CL_PROFILING_COMMAND_START?
Because kernel launch latency consists of pre-launch latency, post-launch latency, and other execution latency. CL_PROFILING_COMMAND_START - CL_PROFILING_COMMAND_QUEUED only covers the pre-launch part, not the post-launch part. CL_PROFILING_COMMAND_END - CL_PROFILING_COMMAND_QUEUED includes both. With an extremely small kernel, the real kernel execution time is almost zero, so it barely affects the result.
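To make the difference between the two metrics concrete, here is a minimal sketch. It assumes the three timestamps have already been read (in nanoseconds) with clGetEventProfilingInfo; the struct and function names are hypothetical and not taken from clpeak.

```cpp
#include <cstdint>

// Profiling timestamps of one kernel event, in nanoseconds.
// Hypothetical helper type; the values would come from clGetEventProfilingInfo.
struct KernelTimestamps {
    std::uint64_t queued;  // CL_PROFILING_COMMAND_QUEUED
    std::uint64_t start;   // CL_PROFILING_COMMAND_START
    std::uint64_t end;     // CL_PROFILING_COMMAND_END
};

// START - QUEUED: covers only what happens before the kernel starts.
inline std::uint64_t preLaunchNs(const KernelTimestamps& t) {
    return t.start - t.queued;
}

// END - QUEUED: covers everything from enqueue until the kernel finishes.
inline std::uint64_t queuedToEndNs(const KernelTimestamps& t) {
    return t.end - t.queued;
}

// The part that the START-based metric leaves out (END - START).
inline std::uint64_t missedByStartMetricNs(const KernelTimestamps& t) {
    return queuedToEndNs(t) - preLaunchNs(t);
}
```

With made-up timestamps of 1000, 21000 and 46000 ns, the START-based metric reports 20000 ns while the END-based one reports 45000 ns, so the two differ by exactly END - START.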
From https://stackoverflow.com/questions/39924433/opencl-events-ambiguity it seems to me that CL_PROFILING_COMMAND_START - CL_PROFILING_COMMAND_SUBMIT is the pre-execution latency. CL_PROFILING_COMMAND_COMPLETE was added in OpenCL 2.0. I'm guessing CL_PROFILING_COMMAND_COMPLETE - CL_PROFILING_COMMAND_END is the post-execution latency.
There may also be a lower bound on CL_PROFILING_COMMAND_END - CL_PROFILING_COMMAND_START, which might be another form of latency.
So CL_PROFILING_COMMAND_COMPLETE - CL_PROFILING_COMMAND_SUBMIT on a very small kernel may be a way to measure the latency.
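A rough sketch of that breakdown, assuming an OpenCL 2.0+ device that reports CL_PROFILING_COMMAND_COMPLETE and timestamps already fetched in nanoseconds. All type and function names here are hypothetical, and taking the minimum over several runs is my own assumption for filtering out noise, not something clpeak does.

```cpp
#include <cstdint>
#include <vector>

// Timestamps of one kernel event, in nanoseconds (hypothetical helper type).
struct EventTimes {
    std::uint64_t submit;    // CL_PROFILING_COMMAND_SUBMIT
    std::uint64_t start;     // CL_PROFILING_COMMAND_START
    std::uint64_t end;       // CL_PROFILING_COMMAND_END
    std::uint64_t complete;  // CL_PROFILING_COMMAND_COMPLETE (OpenCL 2.0+)
};

struct LatencyBreakdown {
    std::uint64_t preNs;    // START - SUBMIT
    std::uint64_t execNs;   // END - START
    std::uint64_t postNs;   // COMPLETE - END
    std::uint64_t totalNs;  // COMPLETE - SUBMIT
};

// Split one event into the three parts discussed above.
inline LatencyBreakdown breakdown(const EventTimes& t) {
    return {t.start - t.submit, t.end - t.start,
            t.complete - t.end, t.complete - t.submit};
}

// Smallest COMPLETE - SUBMIT over several runs; returns 0 for an empty input.
inline std::uint64_t minSubmitToCompleteNs(const std::vector<EventTimes>& runs) {
    bool first = true;
    std::uint64_t best = 0;
    for (const EventTimes& t : runs) {
        const std::uint64_t total = t.complete - t.submit;
        if (first || total < best) {
            best = total;
            first = false;
        }
    }
    return best;
}
```

For a very small kernel, execNs should be close to its lower bound, so totalNs is dominated by the pre- and post-execution parts.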