compute-runtime OpenCL profile timers and non-blocking calls, wrong values?

I am trying to get a simple example with Zero-Copy working for the Intel Integrated graphics.

See code: https://github.com/jjfumero/opencl-tests/blob/master/saxpyZeroCopyIntelGraphics/saxpy.cpp

When I breakdown the execution to check COPY_IN, KERNEL_TIME and COPY_OUT, the sum of these three timers is higher than the total time.

Here an output example:

Iteration: 10
Write    : 65167084   << Supposed to be zero-copy?
X        : 31359083
Reading  : 882
C++ total: 3.27652e+07

(65167084 + 31359083 + 882) > Total Time = 3.27652e+07.

This only happens when running on the Intel Integrated Graphics. The Intel CPU runtime for OpenCL, as well as AMD or NVIDIA report reasonable timers.

However, If I made the data tranfer blocking, the sum of all OpenCL timers is less than the total time as follows:

int writeBuffer() {
	clEnqueueWriteBuffer(commandQueue, ddA, CL_TRUE, 0, elements * sizeof(double), A, 0, NULL, &writeEvent1);
	clEnqueueWriteBuffer(commandQueue, ddB, CL_TRUE, 0, elements * sizeof(double), B, 0, NULL, &writeEvent2);
}

Write    : 1394
X        : 32825749
Reading  : 694
C++ total: 3.5109e+07

Is there any issue reporting the timers and using non blocking calls?

Drivers:

I have check the application using the following drivers:

  Name:					 Intel(R) Gen9 HD Graphics NEO
  Vendor:					 Intel(R) Corporation
  Device OpenCL C version:			 OpenCL C 2.0 
  Driver version:				 19.34.13959
  Profile:					 FULL_PROFILE
 Version:					 OpenCL 2.1 NEO

And this one

Name:						 Intel(R) Gen9 HD Graphics NEO
  Vendor:					 Intel(R) Corporation
  Device OpenCL C version:			 OpenCL C 2.0 
  Driver version:				 19.23.13131
  Profile:					 FULL_PROFILE
  Version:					 OpenCL 2.1 NEO

Both versions I obtain the same problem when using non-blocking calls on the integrapted graphics.

Oct 08 '19 08:10 jjfumero

I think you may be running into special case of WriteBuffer where driver optimizes the whole call and there is no GPU execution. In such case we have issue as we do not have proper source from GPU about profiling counters and indeed profiling may be wrong.

Can you measure from host side using CPU timers how long this call takes? (follow it with clFlush)

Oct 08 '19 10:10 MichalMrozek

@MichalMrozek thank you for your quick reply.

If I measure with CPU timers:

int writeBuffer() {
	auto time1 = chrono::high_resolution_clock::now();
	clEnqueueWriteBuffer(commandQueue, ddA, CL_FALSE, 0, elements * sizeof(double), A, 0, NULL, &writeEvent1);
	clFlush(commandQueue);
	auto time2 = chrono::high_resolution_clock::now();
	clEnqueueWriteBuffer(commandQueue, ddB, CL_FALSE, 0, elements * sizeof(double), B, 0, NULL, &writeEvent2);
	clFlush(commandQueue);
	auto time3 = chrono::high_resolution_clock::now();
	double total1 = chrono::duration_cast<chrono::nanoseconds>(time2 - time1).count();
	double total2 = chrono::duration_cast<chrono::nanoseconds>(time3 - time2).count();
	cout << "COPY IN 1: " << total1 << endl;
	cout << "COPY IN 2: " << total2 << endl;
	cout << "COPY IN TOTAL: " << (total1 + total2) << endl;
}

I get this:

COPY IN 1: 174462
COPY IN 2: 21889
COPY IN TOTAL: 196351

Result is correct
Iteration: 10
Write    : 68344436  
X        : 33035999
Reading  : 840
C++ total: 3.43531e+07

If I use clFinish instead, I get this:

COPY IN 1: 183641
COPY IN 2: 33142
COPY IN TOTAL: 216783

Result is correct
Iteration: 10
Write    : 80125596
X        : 38684083    << Execution time seems to be pretty consistent
Reading  : 738         << Reading time seems to be pretty consistent
C++ total: 4.02466e+07

OpenCL timers ~68-80ms. CPU Timers ~0.12-0.21ms.

My takeaway here is profile timers for the intel integrated graphics are not working. But, I think the kernel is actually running on the GPU, since I am forcing to run on a GPU type.

Oct 08 '19 12:10 jjfumero

We need to investigate this further. I am unable to provide an ETA at this time.

Oct 18 '19 11:10 AdamCetnerowski

This should've been resolved in recent driver versions. If this or simlar problem persist, please re-open or file a new issue.

Nov 24 '22 08:11 AdamCetnerowski

compute-runtime compute-runtime copied to clipboard

OpenCL profile timers and non-blocking calls, wrong values?

Drivers:

compute-runtime
compute-runtime copied to clipboard