compute-runtime
compute-runtime copied to clipboard
OpenCL profile timers and non-blocking calls, wrong values?
I am trying to get a simple example with Zero-Copy working for the Intel Integrated graphics.
See code: https://github.com/jjfumero/opencl-tests/blob/master/saxpyZeroCopyIntelGraphics/saxpy.cpp
When I breakdown the execution to check COPY_IN, KERNEL_TIME and COPY_OUT, the sum of these three timers is higher than the total time.
Here an output example:
Iteration: 10
Write : 65167084 << Supposed to be zero-copy?
X : 31359083
Reading : 882
C++ total: 3.27652e+07
(65167084 + 31359083 + 882) > Total Time = 3.27652e+07.
This only happens when running on the Intel Integrated Graphics. The Intel CPU runtime for OpenCL, as well as AMD or NVIDIA report reasonable timers.
However, If I made the data tranfer blocking, the sum of all OpenCL timers is less than the total time as follows:
int writeBuffer() {
clEnqueueWriteBuffer(commandQueue, ddA, CL_TRUE, 0, elements * sizeof(double), A, 0, NULL, &writeEvent1);
clEnqueueWriteBuffer(commandQueue, ddB, CL_TRUE, 0, elements * sizeof(double), B, 0, NULL, &writeEvent2);
}
Write : 1394
X : 32825749
Reading : 694
C++ total: 3.5109e+07
Is there any issue reporting the timers and using non blocking calls?
Drivers:
I have check the application using the following drivers:
Name: Intel(R) Gen9 HD Graphics NEO
Vendor: Intel(R) Corporation
Device OpenCL C version: OpenCL C 2.0
Driver version: 19.34.13959
Profile: FULL_PROFILE
Version: OpenCL 2.1 NEO
And this one
Name: Intel(R) Gen9 HD Graphics NEO
Vendor: Intel(R) Corporation
Device OpenCL C version: OpenCL C 2.0
Driver version: 19.23.13131
Profile: FULL_PROFILE
Version: OpenCL 2.1 NEO
Both versions I obtain the same problem when using non-blocking calls on the integrapted graphics.
I think you may be running into special case of WriteBuffer where driver optimizes the whole call and there is no GPU execution. In such case we have issue as we do not have proper source from GPU about profiling counters and indeed profiling may be wrong.
Can you measure from host side using CPU timers how long this call takes? (follow it with clFlush)
@MichalMrozek thank you for your quick reply.
If I measure with CPU timers:
int writeBuffer() {
auto time1 = chrono::high_resolution_clock::now();
clEnqueueWriteBuffer(commandQueue, ddA, CL_FALSE, 0, elements * sizeof(double), A, 0, NULL, &writeEvent1);
clFlush(commandQueue);
auto time2 = chrono::high_resolution_clock::now();
clEnqueueWriteBuffer(commandQueue, ddB, CL_FALSE, 0, elements * sizeof(double), B, 0, NULL, &writeEvent2);
clFlush(commandQueue);
auto time3 = chrono::high_resolution_clock::now();
double total1 = chrono::duration_cast<chrono::nanoseconds>(time2 - time1).count();
double total2 = chrono::duration_cast<chrono::nanoseconds>(time3 - time2).count();
cout << "COPY IN 1: " << total1 << endl;
cout << "COPY IN 2: " << total2 << endl;
cout << "COPY IN TOTAL: " << (total1 + total2) << endl;
}
I get this:
COPY IN 1: 174462
COPY IN 2: 21889
COPY IN TOTAL: 196351
Result is correct
Iteration: 10
Write : 68344436
X : 33035999
Reading : 840
C++ total: 3.43531e+07
If I use clFinish
instead, I get this:
COPY IN 1: 183641
COPY IN 2: 33142
COPY IN TOTAL: 216783
Result is correct
Iteration: 10
Write : 80125596
X : 38684083 << Execution time seems to be pretty consistent
Reading : 738 << Reading time seems to be pretty consistent
C++ total: 4.02466e+07
OpenCL timers ~68-80ms. CPU Timers ~0.12-0.21ms.
My takeaway here is profile timers for the intel integrated graphics are not working. But, I think the kernel is actually running on the GPU, since I am forcing to run on a GPU type.
We need to investigate this further. I am unable to provide an ETA at this time.
This should've been resolved in recent driver versions. If this or simlar problem persist, please re-open or file a new issue.