llvm
llvm copied to clipboard
can we use time difference between two events?
copy from https://docs.oneapi.io/versions/latest/iface/event.html
| Descriptor | Return type | Description |
|---|---|---|
| command_submit | cl_ulong | Time in nanoseconds when command_group was submitted |
| command_start | cl_ulong | Time in nanoseconds when command_group started execution |
| command_end | cl_ulong | Time in nanoseconds when command_group finished execution |
"Time in nanoseconds" gives me the sense that it is an absolute time (it doesn't matter what the starting point is), so, I think it makes sense to use the time difference between two events as below pseudocode.
ycl::property_list properties{sycl::property::queue::enable_profiling(), sycl::property::queue::in_order()};
auto q = sycl::queue(sycl::gpu_selector(), properties);
...
q.wait();
auto e1 = q.ext_oneapi_submit_barrier();
// ... do what we need to queue workloads into sycl queue
foo();
auto e2 = q.ext_oneapi_submit_barrier();
e2.wait();
elapsed_time = e2.get_profiling_info<sycl::info::event_profiling::command_submit>() -
e1.get_profiling_info<sycl::info::event_profiling::command_submit>()
is elapsed_time (in pseudocode) the GPU time to execute foo()? thanks.
I wrote some code to verify it, but looks that some values are not as expected.
#include <CL/sycl.hpp>
#include <unistd.h>
void print_event_profiling(const sycl::event& event, const char* prefix = "NULL")
{
auto end =
event.get_profiling_info<sycl::info::event_profiling::command_end>();
auto start =
event.get_profiling_info<sycl::info::event_profiling::command_start>();
auto submit =
event.get_profiling_info<sycl::info::event_profiling::command_submit>();
std::cout << prefix << " submit: " << submit / 1.0e6 << " ms" << "\t\t";
std::cout << "start: " << start / 1.0e6 << " ms" << "\t";
std::cout << "end: " << end / 1.0e6 << " ms" << std::endl;
}
sycl::event add_empty_kernel(sycl::queue q)
{
// return q.ext_oneapi_submit_barrier(); // the same result to use this line
auto cgf = [&](sycl::handler& cgh) { cgh.single_task([=]() {}); };
return q.submit(cgf);
}
int main() {
sycl::property_list properties{sycl::property::queue::enable_profiling(), sycl::property::queue::in_order()};
auto q = sycl::queue(sycl::gpu_selector(), properties);
std::cout
<< " Platform: "
<< q.get_device().get_platform().get_info<sycl::info::platform::name>()
<< std::endl;
const int num_ints = 1024 * 1024;
const size_t num_bytes = num_ints * sizeof(int);
// Alloc memory on device
auto data = sycl::malloc_device<int>(num_ints, q);
q.memset(data, 0x12, num_bytes).wait();
auto e1 = add_empty_kernel(q);
print_event_profiling(e1, "e1");
auto eventa = q.submit([&](sycl::handler &h) {
h.parallel_for(num_ints, [=](cl::sycl::item<1> item) {
int idx = item.get_id(0);
for (int i = 0; i < 1000; ++i)
data[idx] = int(sycl::sin(float(data[idx]))) + idx;
});
});
eventa.wait();
//sleep(1);
print_event_profiling(eventa, "ea");
//sleep(2);
auto e2 = add_empty_kernel(q);
//sleep(3);
print_event_profiling(e2, "e2");
auto eventb = q.submit([&](sycl::handler &h) {
h.parallel_for(num_ints, [=](cl::sycl::item<1> item) {
int idx = item.get_id(0);
for (int i = 0; i < 10000; ++i)
data[idx] = int(sycl::sin(float(data[idx]))) + idx;
});
});
eventb.wait();
print_event_profiling(eventb, "eb");
auto e3 = add_empty_kernel(q);
print_event_profiling(e3, "e3");
auto e4 = add_empty_kernel(q);
print_event_profiling(e4, "e4");
sycl::free(data, q);
}
$ dpcpp event_profiling.cpp
$ ./a.out
Platform: Intel(R) Level-Zero
e1 submit: 0 ms start: 0.02432 ms end: 0.03408 ms
ea submit: 0 ms start: 0.29888 ms end: 27.0397 ms
e2 submit: 0 ms start: 27.0723 ms end: 27.077 ms
eb submit: 0 ms start: 27.4264 ms end: 295.89 ms
e3 submit: 0 ms start: 295.963 ms end: 295.967 ms
e4 submit: 0 ms start: 295.999 ms end: 296.003 ms
Looks that the answer is yes for my question "is elapsed_time (in pseudocode) the GPU time to execute foo()? " according to the start and end values above, is it correct?
And I have two other questions.
-
why every event's submit is 0 ms? "submit 0 ms " can conclude that every event is starting from a new reset clock since it is zero. But, the start and end values show that the events are using a same clock (no reset). How to understand it?
-
what's the exact meaning of submit/start/end time? I added several 'sleep(seconds)' functions in the source code, it is a host blocking function, it will cause the GPU pipeline idle for some time (the sleep time is much much bigger than the kernel execution time). No matter the GPU is idle or not, the wall time is elapsed, and my expectation of the event start/end time will be impacted by the sleep time. But, the result shows that the submit/start/end time does not change no matter how I add 'sleep' in the source code. How to understand it?
thanks
anyone can help? thanks
This looks like a limitation of level zero API/plug-in. See https://github.com/intel/llvm/blob/sycl/sycl/plugins/level_zero/pi_level_zero.cpp#L5686-L5688.
I suppose if you switch to OpenCL back-end you should see expected results.
thanks @bader.
I tried the same code twice on another machine for level zero backend and opencl backend, see below.
Looks that the opencl backend has another explanation of the time, what is the time 0 for opencl backend? thanks.
$ time ./a.out
Platform: Intel(R) Level-Zero
e1 submit: 0 ms start: 0.08496 ms end: 0.11904 ms
ea submit: 0 ms start: 0.67808 ms end: 134.3 ms
e2 submit: 0 ms start: 134.361 ms end: 134.382 ms
eb submit: 0 ms start: 135.161 ms end: 1473.67 ms
e3 submit: 0 ms start: 1473.77 ms end: 1473.79 ms
e4 submit: 0 ms start: 1473.84 ms end: 1473.86 ms
real 0m2.467s
user 0m1.161s
sys 0m0.734s
$ time ./a.out
Platform: Intel(R) Level-Zero
e1 submit: 0 ms start: 0.08336 ms end: 0.1176 ms
ea submit: 0 ms start: 0.6288 ms end: 134.242 ms
e2 submit: 0 ms start: 134.304 ms end: 134.324 ms
eb submit: 0 ms start: 134.893 ms end: 1469.54 ms
e3 submit: 0 ms start: 1469.64 ms end: 1469.66 ms
e4 submit: 0 ms start: 1469.72 ms end: 1469.74 ms
real 0m2.333s
user 0m1.188s
sys 0m0.704s
$ export SYCL_DEVICE_FILTER=opencl
$ time ./a.out
Platform: Intel(R) OpenCL HD Graphics
e1 submit: 4.19932e+07 ms start: 4.19932e+07 ms end: 4.19932e+07 ms
ea submit: 4.19932e+07 ms start: 4.19932e+07 ms end: 4.19933e+07 ms
e2 submit: 4.19933e+07 ms start: 4.19933e+07 ms end: 4.19933e+07 ms
eb submit: 4.19933e+07 ms start: 4.19933e+07 ms end: 4.1994e+07 ms
e3 submit: 4.1994e+07 ms start: 4.1994e+07 ms end: 4.1994e+07 ms
e4 submit: 4.1994e+07 ms start: 4.1994e+07 ms end: 4.1994e+07 ms
real 0m1.699s
user 0m0.783s
sys 0m0.442s
$ time ./a.out
Platform: Intel(R) OpenCL HD Graphics
e1 submit: 4.19967e+07 ms start: 4.19967e+07 ms end: 4.19967e+07 ms
ea submit: 4.19967e+07 ms start: 4.19967e+07 ms end: 4.19968e+07 ms
e2 submit: 4.19968e+07 ms start: 4.19968e+07 ms end: 4.19968e+07 ms
eb submit: 4.19968e+07 ms start: 4.19968e+07 ms end: 4.19975e+07 ms
e3 submit: 4.19975e+07 ms start: 4.19975e+07 ms end: 4.19975e+07 ms
e4 submit: 4.19975e+07 ms start: 4.19975e+07 ms end: 4.19975e+07 ms
real 0m1.678s
user 0m0.791s
sys 0m0.436s
what is the time 0 for opencl backend?
Sorry, I don't get what do you mean here. Could you add more context, please?
what is the starting time (the origin of the time line) for opencl backend, is it the time when the machine is power on? thanks.
You can find the description of returned values in OpenCL spec: https://registry.khronos.org/OpenCL/specs/3.0-unified/html/OpenCL_API.html#profiling-operations. DPC++ runtime returns to user values reported by OpenCL API as is.
thanks @bader ,
OpenCL spec says: A 64-bit value that describes the current device time counter in nanoseconds when the command identified by event ..., and so time 0 means device time counter is zero, that should be the time when the machine is power on.
btw, for the level zero backend, does the DPC++ runtime returns the same meaning as OpenCL backend? If yes, there should be a bug, see my log above. If not, DPC++ spec needs to add description for the difference.
As I mentioned in my first comment - it's known issue of level zero backend.
0 means device time counter is zero, that should be the time when the machine is power on.
I'm not sure if this is the right interpretation of absolution value. Moreover, I think the primary use case for this numbers is computing the difference between them, i.e. you can rely on relative values.
As I mentioned in my first comment - it's known issue of level zero backend.
it is a known issue for submit time of level zero backend. My concern is that the start/end time does not align with the opencl backend. It is a new issue.
i.e. you can rely on relative values.
yes, thanks.
and, I think it is not fully correct for level zero backend.
In my first comment, I mentioned that "sleep(seconds)" does not change any event value for the level zero backend, also see below, the 'diff' does not change if we add 'sleep(10)' or not. So, the relative value of different events is not right.
e1 = ...
e1.wait();
query e1
sleep(10)
e2 = ...
e2.wait
query e2
diff = e2.end_time - e1.end_time
To verify it, I tried the opencl backend, the 'diff' includes the time of 'sleep(10)', the relative value of different events is correct, it is expected.
sleep modifies the execution time of host application. SYCL events return time for things happening on the device, which are executed asynchronously to the host, so adding sleep like this has no effect of values return by profiling operations.
yes, and so I added event.wait() for explicit sync in the code, and so sleep consumes the time at both host side and device side, and so the device time counter should be changed with sleep, and the returned values should be changed.
Quite possible. I see that Level Zero might return different values for different device types, but I'm not sure if the plug-in takes this into account.
[out] Returns the resolution of device timer used for profiling, timestamps, etc. When stype==ZE_STRUCTURE_TYPE_DEVICE_PROPERTIES the units are in nanoseconds. When stype==ZE_STRUCTURE_TYPE_DEVICE_PROPERTIES_1_2 units are in cycles/sec
@smaslov-intel, does L0 plug-in return time in nanoseconds for both stype values ZE_STRUCTURE_TYPE_DEVICE_PROPERTIES and ZE_STRUCTURE_TYPE_DEVICE_PROPERTIES_1_2?
i verified that event.get_profiling_info returns value in nanoseconds on my system for level zero backend.
I would say that there's a bug of level zero backend when the case is CPU bound (I mimic it with sleep in my code), while the behavior of opencl backend is expected. Then, all the profilers based on event.get_profiling_info might not be correct.
#include <CL/sycl.hpp>
#include <unistd.h>
sycl::event add_empty_kernel(sycl::queue q)
{
// return q.ext_oneapi_submit_barrier(); // the same
auto cgf = [&](sycl::handler& cgh) { cgh.single_task([=]() {}); };
return q.submit(cgf);
}
int main() {
sycl::property_list properties{sycl::property::queue::enable_profiling(), sycl::property::queue::in_order()};
auto q = sycl::queue(sycl::gpu_selector(), properties);
std::cout
<< " Platform: "
<< q.get_device().get_platform().get_info<sycl::info::platform::name>()
<< std::endl;
const int num_ints = 1024 * 1024;
const size_t num_bytes = num_ints * sizeof(int);
// Alloc memory on device
auto data = sycl::malloc_device<int>(num_ints, q);
q.memset(data, 0x12, num_bytes).wait();
auto e1 = add_empty_kernel(q);
e1.wait();
auto e1_end =
e1.get_profiling_info<sycl::info::event_profiling::command_end>();
auto wall_begin = std::chrono::steady_clock::now();
sleep(10);
auto e2 = q.submit([&](sycl::handler &h) {
h.parallel_for(num_ints, [=](cl::sycl::item<1> item) {
int idx = item.get_id(0);
for (int i = 0; i < 100000; ++i)
data[idx] = int(sycl::sin(float(data[idx]))) + idx;
});
});
e2.wait();
auto wall_end = std::chrono::steady_clock::now();
auto e2_end =
e2.get_profiling_info<sycl::info::event_profiling::command_end>();
auto event_diff = (e2_end - e1_end) / 1e6;
auto wall_diff = std::chrono::duration_cast<std::chrono::milliseconds>(wall_end - wall_begin).count();
std::cout << "event time elapsed " << event_diff << " ms" << std::endl;
std::cout << "on wall time elapsed " << wall_diff << " ms" << std::endl;
sycl::free(data, q);
return 0;
}
$ ./a.out
Platform: Intel(R) Level-Zero
event time elapsed 2969.74 ms
on wall time elapsed 12987 ms
$ SYCL_DEVICE_FILTER=opencl ./a.out
Platform: Intel(R) OpenCL HD Graphics
event time elapsed 11343 ms
on wall time elapsed 11344 ms