
Can we use the time difference between two events?

guoyejun opened this issue 3 years ago · 14 comments

Copied from https://docs.oneapi.io/versions/latest/iface/event.html:

Descriptor     | Return type | Description
command_submit | cl_ulong    | Time in nanoseconds when command_group was submitted
command_start  | cl_ulong    | Time in nanoseconds when command_group started execution
command_end    | cl_ulong    | Time in nanoseconds when command_group finished execution

"Time in nanoseconds" gives me the sense that it is an absolute time (it doesn't matter what the starting point is), so, I think it makes sense to use the time difference between two events as below pseudocode.

sycl::property_list properties{sycl::property::queue::enable_profiling(), sycl::property::queue::in_order()};
auto q = sycl::queue(sycl::gpu_selector(), properties);
...
q.wait();

auto e1 = q.ext_oneapi_submit_barrier();

// ... do what we need to queue workloads into sycl queue
foo();

auto e2 = q.ext_oneapi_submit_barrier();
e2.wait();

elapsed_time = e2.get_profiling_info<sycl::info::event_profiling::command_submit>() -
               e1.get_profiling_info<sycl::info::event_profiling::command_submit>();

Is elapsed_time (in the pseudocode) the GPU time to execute foo()? Thanks.

guoyejun avatar Aug 02 '22 11:08 guoyejun

I wrote some code to verify it, but it looks like some of the values are not as expected.

#include <CL/sycl.hpp>
#include <iostream>
#include <unistd.h>

void print_event_profiling(const sycl::event& event, const char* prefix = "NULL")
{
  auto end =
      event.get_profiling_info<sycl::info::event_profiling::command_end>();
  auto start =
      event.get_profiling_info<sycl::info::event_profiling::command_start>();
  auto submit =
      event.get_profiling_info<sycl::info::event_profiling::command_submit>();

  std::cout << prefix << " submit: " << submit / 1.0e6 << " ms" << "\t\t";
  std::cout << "start: " << start / 1.0e6 << " ms" << "\t";
  std::cout << "end: " << end / 1.0e6 << " ms" << std::endl;

}

sycl::event add_empty_kernel(sycl::queue q)
{
  // return q.ext_oneapi_submit_barrier();  // using this line gives the same result
  auto cgf = [&](sycl::handler& cgh) { cgh.single_task([=]() {}); };
  return q.submit(cgf);
}

int main() {
  sycl::property_list properties{sycl::property::queue::enable_profiling(), sycl::property::queue::in_order()};
  auto q = sycl::queue(sycl::gpu_selector(), properties);

  std::cout
      << "  Platform: "
      << q.get_device().get_platform().get_info<sycl::info::platform::name>()
      << std::endl;

  const int num_ints = 1024 * 1024;
  const size_t num_bytes = num_ints * sizeof(int);

  // Alloc memory on device
  auto data = sycl::malloc_device<int>(num_ints, q);
  q.memset(data, 0x12, num_bytes).wait();

  auto e1 = add_empty_kernel(q);
  print_event_profiling(e1, "e1");

  auto eventa = q.submit([&](sycl::handler &h) {
      h.parallel_for(num_ints, [=](cl::sycl::item<1> item) {
          int idx = item.get_id(0);
          for (int i = 0; i < 1000; ++i)
            data[idx] = int(sycl::sin(float(data[idx]))) + idx;
      });
  });
  eventa.wait();
  //sleep(1);
  print_event_profiling(eventa, "ea");
  //sleep(2);

  auto e2 = add_empty_kernel(q);
  //sleep(3);
  print_event_profiling(e2, "e2");

  auto eventb = q.submit([&](sycl::handler &h) {
      h.parallel_for(num_ints, [=](cl::sycl::item<1> item) {
          int idx = item.get_id(0);
          for (int i = 0; i < 10000; ++i)
            data[idx] = int(sycl::sin(float(data[idx]))) + idx;
      });
  });
  eventb.wait();
  print_event_profiling(eventb, "eb");

  auto e3 = add_empty_kernel(q);
  print_event_profiling(e3, "e3");

  auto e4 = add_empty_kernel(q);
  print_event_profiling(e4, "e4");

  sycl::free(data, q);
}
$ dpcpp event_profiling.cpp
$ ./a.out
  Platform: Intel(R) Level-Zero
e1 submit: 0 ms         start: 0.02432 ms       end: 0.03408 ms
ea submit: 0 ms         start: 0.29888 ms       end: 27.0397 ms
e2 submit: 0 ms         start: 27.0723 ms       end: 27.077 ms
eb submit: 0 ms         start: 27.4264 ms       end: 295.89 ms
e3 submit: 0 ms         start: 295.963 ms       end: 295.967 ms
e4 submit: 0 ms         start: 295.999 ms       end: 296.003 ms

It looks like the answer to my question "is elapsed_time (in pseudocode) the GPU time to execute foo()?" is yes, according to the start and end values above. Is that correct?

And I have two other questions.

  1. Why is every event's submit time 0 ms? A submit time of 0 ms would suggest that every event starts from a freshly reset clock, but the start and end values show that the events share the same clock (no reset). How should I understand this?

  2. What is the exact meaning of the submit/start/end times? I added several 'sleep(seconds)' calls in the source code; sleep is a host-blocking function, so it leaves the GPU pipeline idle for some time (the sleep time is much, much larger than the kernel execution time). Whether or not the GPU is idle, the wall time still elapses, so I expected the event start/end times to be shifted by the sleep time. But the results show that the submit/start/end times do not change no matter where I add 'sleep' in the source code. How should I understand this?

thanks

guoyejun avatar Aug 02 '22 11:08 guoyejun

Can anyone help? Thanks.

guoyejun avatar Aug 09 '22 02:08 guoyejun

This looks like a limitation of the Level Zero API/plug-in. See https://github.com/intel/llvm/blob/sycl/sycl/plugins/level_zero/pi_level_zero.cpp#L5686-L5688.

I suppose that if you switch to the OpenCL back-end you should see the expected results.
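
If it helps, here is a minimal sketch of picking an OpenCL GPU device programmatically instead of relying on the default selector; the pick_opencl_gpu helper and the platform-name match are illustrative assumptions, not part of the DPC++ API. The SYCL_DEVICE_FILTER=opencl environment variable (used later in this thread) achieves the same at run time without code changes.

#include <CL/sycl.hpp>
#include <iostream>
#include <string>

// Illustrative helper (not a DPC++ API): return a GPU device whose platform
// name mentions "OpenCL", falling back to the default GPU device otherwise.
sycl::device pick_opencl_gpu() {
  for (const auto& d : sycl::device::get_devices(sycl::info::device_type::gpu)) {
    auto platform = d.get_platform().get_info<sycl::info::platform::name>();
    if (platform.find("OpenCL") != std::string::npos)
      return d;
  }
  return sycl::device(sycl::gpu_selector());
}

int main() {
  sycl::property_list properties{sycl::property::queue::enable_profiling(),
                                 sycl::property::queue::in_order()};
  auto q = sycl::queue(pick_opencl_gpu(), properties);
  std::cout << "  Platform: "
            << q.get_device().get_platform().get_info<sycl::info::platform::name>()
            << std::endl;
}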

bader avatar Aug 09 '22 16:08 bader

thanks @bader.

I tried the same code twice on another machine with the Level Zero backend and the OpenCL backend; see below.

It looks like the OpenCL backend interprets the time differently. What does time 0 mean for the OpenCL backend? Thanks.

$ time ./a.out
  Platform: Intel(R) Level-Zero
e1 submit: 0 ms         start: 0.08496 ms       end: 0.11904 ms
ea submit: 0 ms         start: 0.67808 ms       end: 134.3 ms
e2 submit: 0 ms         start: 134.361 ms       end: 134.382 ms
eb submit: 0 ms         start: 135.161 ms       end: 1473.67 ms
e3 submit: 0 ms         start: 1473.77 ms       end: 1473.79 ms
e4 submit: 0 ms         start: 1473.84 ms       end: 1473.86 ms

real    0m2.467s
user    0m1.161s
sys     0m0.734s

$ time ./a.out
  Platform: Intel(R) Level-Zero
e1 submit: 0 ms         start: 0.08336 ms       end: 0.1176 ms
ea submit: 0 ms         start: 0.6288 ms        end: 134.242 ms
e2 submit: 0 ms         start: 134.304 ms       end: 134.324 ms
eb submit: 0 ms         start: 134.893 ms       end: 1469.54 ms
e3 submit: 0 ms         start: 1469.64 ms       end: 1469.66 ms
e4 submit: 0 ms         start: 1469.72 ms       end: 1469.74 ms

real    0m2.333s
user    0m1.188s
sys     0m0.704s

$ export SYCL_DEVICE_FILTER=opencl
$ time ./a.out
  Platform: Intel(R) OpenCL HD Graphics
e1 submit: 4.19932e+07 ms               start: 4.19932e+07 ms   end: 4.19932e+07 ms
ea submit: 4.19932e+07 ms               start: 4.19932e+07 ms   end: 4.19933e+07 ms
e2 submit: 4.19933e+07 ms               start: 4.19933e+07 ms   end: 4.19933e+07 ms
eb submit: 4.19933e+07 ms               start: 4.19933e+07 ms   end: 4.1994e+07 ms
e3 submit: 4.1994e+07 ms                start: 4.1994e+07 ms    end: 4.1994e+07 ms
e4 submit: 4.1994e+07 ms                start: 4.1994e+07 ms    end: 4.1994e+07 ms

real    0m1.699s
user    0m0.783s
sys     0m0.442s


$ time ./a.out
  Platform: Intel(R) OpenCL HD Graphics
e1 submit: 4.19967e+07 ms               start: 4.19967e+07 ms   end: 4.19967e+07 ms
ea submit: 4.19967e+07 ms               start: 4.19967e+07 ms   end: 4.19968e+07 ms
e2 submit: 4.19968e+07 ms               start: 4.19968e+07 ms   end: 4.19968e+07 ms
eb submit: 4.19968e+07 ms               start: 4.19968e+07 ms   end: 4.19975e+07 ms
e3 submit: 4.19975e+07 ms               start: 4.19975e+07 ms   end: 4.19975e+07 ms
e4 submit: 4.19975e+07 ms               start: 4.19975e+07 ms   end: 4.19975e+07 ms

real    0m1.678s
user    0m0.791s
sys     0m0.436s

guoyejun avatar Aug 10 '22 01:08 guoyejun

What does time 0 mean for the OpenCL backend?

Sorry, I don't understand what you mean here. Could you add more context, please?

bader avatar Aug 10 '22 07:08 bader

What is the starting time (the origin of the timeline) for the OpenCL backend? Is it the time when the machine was powered on? Thanks.

guoyejun avatar Aug 10 '22 07:08 guoyejun

You can find the description of the returned values in the OpenCL spec: https://registry.khronos.org/OpenCL/specs/3.0-unified/html/OpenCL_API.html#profiling-operations. The DPC++ runtime returns the values reported by the OpenCL API to the user as-is.
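
For reference, this is roughly what those queries look like at the raw OpenCL level (a sketch only; it assumes event is a completed cl_event from a command-queue created with CL_QUEUE_PROFILING_ENABLE). All four values are samples of the same device time counter, in nanoseconds.

#include <CL/cl.h>
#include <cstdio>

// Sketch: read the raw OpenCL profiling counters for a completed event.
// Assumes 'event' comes from a queue created with CL_QUEUE_PROFILING_ENABLE.
void print_opencl_profiling(cl_event event) {
  cl_ulong queued = 0, submit = 0, start = 0, end = 0;
  clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_QUEUED,
                          sizeof(queued), &queued, nullptr);
  clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_SUBMIT,
                          sizeof(submit), &submit, nullptr);
  clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_START,
                          sizeof(start), &start, nullptr);
  clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_END,
                          sizeof(end), &end, nullptr);
  std::printf("queued=%llu submit=%llu start=%llu end=%llu (ns)\n",
              (unsigned long long)queued, (unsigned long long)submit,
              (unsigned long long)start, (unsigned long long)end);
}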

bader avatar Aug 10 '22 07:08 bader

thanks @bader ,

The OpenCL spec says: "A 64-bit value that describes the current device time counter in nanoseconds when the command identified by event ...", so time 0 means the device time counter is zero, which should be the time when the machine was powered on.

By the way, for the Level Zero backend, does the DPC++ runtime return values with the same meaning as the OpenCL backend? If yes, there should be a bug (see my log above). If not, the DPC++ spec needs to describe the difference.

guoyejun avatar Aug 10 '22 08:08 guoyejun

As I mentioned in my first comment, it's a known issue of the Level Zero backend.

time 0 means the device time counter is zero, which should be the time when the machine was powered on.

I'm not sure if this is the right interpretation of the absolute value. Moreover, I think the primary use case for these numbers is computing the difference between them, i.e. you can rely on relative values.
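
As a minimal sketch of that relative-value usage (assuming e1 and e2 are completed events from the same profiling-enabled queue, so their timestamps come from the same device clock):

#include <CL/sycl.hpp>
#include <iostream>

// Sketch: consume the profiling values only as differences.
void report(const sycl::event& e1, const sycl::event& e2) {
  auto end1   = e1.get_profiling_info<sycl::info::event_profiling::command_end>();
  auto start2 = e2.get_profiling_info<sycl::info::event_profiling::command_start>();
  auto end2   = e2.get_profiling_info<sycl::info::event_profiling::command_end>();

  // Time the second kernel spent executing on the device.
  std::cout << "e2 execution: " << (end2 - start2) / 1.0e6 << " ms\n";
  // Device-clock distance between the end of e1 and the end of e2.
  std::cout << "e1.end -> e2.end: " << (end2 - end1) / 1.0e6 << " ms\n";
}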

bader avatar Aug 10 '22 08:08 bader

As I mentioned in my first comment, it's a known issue of the Level Zero backend.

It is a known issue for the submit time of the Level Zero backend. My concern is that the start/end times do not align with the OpenCL backend; that is a new issue.

i.e. you can rely on relative values.

yes, thanks.

And I think it is not fully correct for the Level Zero backend.

In my first comment, I mentioned that "sleep(seconds)" does not change any event value for the Level Zero backend. See also below: the 'diff' does not change whether we add 'sleep(10)' or not. So the relative values of different events are not right.

e1 = ...
e1.wait();
query e1

sleep(10)

e2 = ...
e2.wait();
query e2

diff = e2.end_time - e1.end_time

To verify this, I tried the OpenCL backend: there the 'diff' does include the time of 'sleep(10)', so the relative values of different events are correct, as expected.

guoyejun avatar Aug 10 '22 08:08 guoyejun

sleep modifies the execution time of the host application. SYCL events return times for things happening on the device, which execute asynchronously to the host, so adding sleep like this has no effect on the values returned by the profiling operations.

bader avatar Aug 10 '22 08:08 bader

Yes, and that is why I added event.wait() for explicit synchronization in the code. With that, sleep consumes time on both the host side and the device side, so the device time counter should advance during the sleep and the returned values should change.

guoyejun avatar Aug 10 '22 08:08 guoyejun

Quite possible. I see that Level Zero might return different values for different device types, but I'm not sure if the plug-in takes this into account.

[out] Returns the resolution of device timer used for profiling, timestamps, etc. When stype==ZE_STRUCTURE_TYPE_DEVICE_PROPERTIES the units are in nanoseconds. When stype==ZE_STRUCTURE_TYPE_DEVICE_PROPERTIES_1_2 units are in cycles/sec

@smaslov-intel, does L0 plug-in return time in nanoseconds for both stype values ZE_STRUCTURE_TYPE_DEVICE_PROPERTIES and ZE_STRUCTURE_TYPE_DEVICE_PROPERTIES_1_2?
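
For illustration, a minimal sketch of the underlying Level Zero query, assuming hDevice is a valid, already-initialized ze_device_handle_t and that the header is available as <level_zero/ze_api.h>; whether the plug-in converts between the two stype variants is exactly the open question.

#include <level_zero/ze_api.h>
#include <cstdio>

// Sketch: query the device timer resolution that profiling timestamps are
// based on. With ZE_STRUCTURE_TYPE_DEVICE_PROPERTIES the reported
// timerResolution is in nanoseconds; with the _1_2 variant it is cycles/sec.
void print_timer_resolution(ze_device_handle_t hDevice) {
  ze_device_properties_t props = {};
  props.stype = ZE_STRUCTURE_TYPE_DEVICE_PROPERTIES;
  if (zeDeviceGetProperties(hDevice, &props) == ZE_RESULT_SUCCESS)
    std::printf("timerResolution (DEVICE_PROPERTIES): %llu\n",
                (unsigned long long)props.timerResolution);

  ze_device_properties_t props12 = {};
  props12.stype = ZE_STRUCTURE_TYPE_DEVICE_PROPERTIES_1_2;
  if (zeDeviceGetProperties(hDevice, &props12) == ZE_RESULT_SUCCESS)
    std::printf("timerResolution (DEVICE_PROPERTIES_1_2): %llu\n",
                (unsigned long long)props12.timerResolution);
}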

bader avatar Aug 10 '22 09:08 bader

I verified that event.get_profiling_info returns values in nanoseconds on my system for the Level Zero backend.

I would say that there is a bug in the Level Zero backend when the case is CPU-bound (I mimic this with sleep in my code), while the behavior of the OpenCL backend is as expected. If so, all profilers based on event.get_profiling_info might not be correct.

#include <CL/sycl.hpp>
#include <chrono>
#include <iostream>
#include <unistd.h>

sycl::event add_empty_kernel(sycl::queue q)
{
  // return q.ext_oneapi_submit_barrier();  // using this line gives the same result
  auto cgf = [&](sycl::handler& cgh) { cgh.single_task([=]() {}); };
  return q.submit(cgf);
}

int main() {
  sycl::property_list properties{sycl::property::queue::enable_profiling(), sycl::property::queue::in_order()};
  auto q = sycl::queue(sycl::gpu_selector(), properties);

  std::cout
      << "  Platform: "
      << q.get_device().get_platform().get_info<sycl::info::platform::name>()
      << std::endl;

  const int num_ints = 1024 * 1024;
  const size_t num_bytes = num_ints * sizeof(int);

  // Alloc memory on device
  auto data = sycl::malloc_device<int>(num_ints, q);
  q.memset(data, 0x12, num_bytes).wait();

  auto e1 = add_empty_kernel(q);
  e1.wait();
  auto e1_end =
      e1.get_profiling_info<sycl::info::event_profiling::command_end>();

  auto wall_begin = std::chrono::steady_clock::now();
  sleep(10);
  auto e2 = q.submit([&](sycl::handler &h) {
      h.parallel_for(num_ints, [=](cl::sycl::item<1> item) {
          int idx = item.get_id(0);
          for (int i = 0; i < 100000; ++i)
            data[idx] = int(sycl::sin(float(data[idx]))) + idx;
      });
  });
  e2.wait();
  auto wall_end = std::chrono::steady_clock::now();
  auto e2_end =
      e2.get_profiling_info<sycl::info::event_profiling::command_end>();

  auto event_diff = (e2_end - e1_end) / 1e6;
  auto wall_diff = std::chrono::duration_cast<std::chrono::milliseconds>(wall_end - wall_begin).count();
  std::cout << "event time elapsed " << event_diff << " ms" << std::endl;
  std::cout << "on wall time elapsed " << wall_diff << " ms" << std::endl;

  sycl::free(data, q);
  return 0;
}
$ ./a.out
  Platform: Intel(R) Level-Zero
event time elapsed    2969.74 ms
on wall time elapsed 12987 ms

$ SYCL_DEVICE_FILTER=opencl ./a.out
  Platform: Intel(R) OpenCL HD Graphics
event time elapsed   11343 ms
on wall time elapsed 11344 ms

guoyejun avatar Aug 11 '22 02:08 guoyejun