Slow Kernel Launch Times

Open FreddieWitherden opened this issue 2 years ago • 21 comments

Using clpeak with runtime 22.43.24595 on the integrated GPU of an i7-12700H CPU under Linux, I find the kernel launch latency to be 42.46 us. This is around 10 times slower than can be expected from a recent discrete AMD/NVIDIA GPU connected over PCIe. In general, one would expect integrated GPUs to have an advantage here, since the CPU and GPU share an LLC.
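To make the measurement concrete, here is a minimal sketch of one way to time launch latency with OpenCL event profiling (queued-to-start for an empty kernel). This mirrors the idea behind clpeak's test but is not its exact implementation; error checking is omitted for brevity:

```c
/* Minimal sketch: kernel launch latency via OpenCL event profiling.
 * Illustrative only -- not clpeak's exact method; no error checking. */
#include <CL/cl.h>
#include <stdio.h>

int main(void)
{
    cl_platform_id plat;
    cl_device_id dev;
    clGetPlatformIDs(1, &plat, NULL);
    clGetDeviceIDs(plat, CL_DEVICE_TYPE_GPU, 1, &dev, NULL);

    cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL);
    cl_queue_properties props[] = { CL_QUEUE_PROPERTIES,
                                    CL_QUEUE_PROFILING_ENABLE, 0 };
    cl_command_queue q = clCreateCommandQueueWithProperties(ctx, dev, props, NULL);

    /* an empty kernel, so the timing is dominated by launch overhead */
    const char *src = "__kernel void nop() { }";
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
    clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
    cl_kernel k = clCreateKernel(prog, "nop", NULL);

    size_t gws = 1;
    cl_event ev;
    clEnqueueNDRangeKernel(q, k, 1, NULL, &gws, NULL, 0, NULL, &ev);
    clFinish(q);

    cl_ulong queued, start;
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_QUEUED,
                            sizeof queued, &queued, NULL);
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START,
                            sizeof start, &start, NULL);
    printf("launch latency: %.2f us\n", (start - queued) / 1000.0);
    return 0;
}
```

Which profiling timestamps are compared (queued vs. submit) affects the absolute number, so treat this as illustrative rather than a drop-in replacement for clpeak.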

FreddieWitherden avatar Jan 09 '23 21:01 FreddieWitherden

Integrated GPUs go through the KMD and GuC for submissions, hence the times you observe are within the expected range.

MichalMrozek avatar Jan 10 '23 12:01 MichalMrozek

Is this a hardware limitation? On Linux, both AMD and NVIDIA have moved to userspace submission, so launching a kernel is reduced to a memcpy plus an atomic write. Besides massively reducing latency, this also avoids the need for clFlush calls to get submitted work items running.
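For illustration, here is a minimal sketch of what user-mode submission amounts to, assuming a hypothetical command ring and doorbell register mapped into the process; the names and layout are illustrative, not any real driver's API:

```c
/* Hypothetical user-mode submission: commands are written into a
 * ring buffer mapped into the process, then a doorbell register is
 * poked with an atomic store -- no system call on the fast path. */
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

struct ring {
    uint8_t          *buf;       /* command buffer, mapped into userspace */
    size_t            size;      /* ring size in bytes                    */
    _Atomic uint64_t *doorbell;  /* doorbell register, also mapped        */
    uint64_t          wptr;      /* software write pointer                */
};

/* Assumes, for brevity, that a packet never straddles the ring's
 * wrap-around point. */
static void submit(struct ring *r, const void *pkt, size_t len)
{
    memcpy(r->buf + (r->wptr % r->size), pkt, len);
    r->wptr += len;
    /* release ordering publishes the packet before the doorbell */
    atomic_store_explicit(r->doorbell, r->wptr, memory_order_release);
}
```

The point is that the fast path involves no user-to-kernel transition, whereas the KMD/GuC path mentioned above adds a syscall and a scheduler hop per submission.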

For an A770M I've observed launch times on the order of 15 us on the same system; better, but still over three times higher than what I'm used to seeing.

FreddieWitherden avatar Jan 10 '23 13:01 FreddieWitherden

On an A770M the launch time should be around 8 us. What is your operating system?

"Is this a hardware limitation? On Linux, both AMD and NVIDIA have moved to userspace submission so launching a kernel is reduced to a memcpy + atomic. In addition to massively reducing latency this also avoids the need for clFlush calls to get submitted work items running."

It is not a hardware limitation; integrated parts are also capable of direct submission. The code is not ready to be enabled, though: it requires VM_BIND support in the Linux kernel, which is not the case for integrated parts.

MichalMrozek avatar Jan 11 '23 11:01 MichalMrozek

On an A770M the launch time should be around 8 us. What is your operating system?

Those measurements were taken on Linux with a 6.1 kernel. I'd repeat them, although I currently cannot get OpenCL to work on my A770 (it works fine as a display adapter, though).

FreddieWitherden avatar Jan 11 '23 13:01 FreddieWitherden

Those measurements were taken on Linux with a 6.1 kernel. I'd repeat them, although I currently cannot get OpenCL to work on my A770 (it works fine as a display adapter, though).

Best Intel dGPU support is currently in backport kernels: https://github.com/intel-gpu/intel-gpu-i915-backports

Binary packages for those are available from Intel repository: https://dgpu-docs.intel.com/installation-guides/index.html

Intel dGPU support enabled in the v6.2(-rc1) upstream Linux kernel is still lagging somewhat behind those.

And older upstream kernel versions are missing even more features, besides needing the force-probe option even to recognize Intel dGPUs.

eero-t avatar Jan 16 '23 17:01 eero-t

Having upgraded to 6.2, I now have numbers for my A770M and the integrated GPU. Specifically, the A770M clocks in at 99.43 us and the i7-12700H is still there at ~45 us.

These numbers from clpeak are consistent with my own real-world (but still launch-latency-sensitive) test cases, where the integrated GPU is outperforming my A770M despite the A770M having a huge advantage in execution resources and bandwidth.

FreddieWitherden avatar Feb 20 '23 20:02 FreddieWitherden

Having upgraded to 6.2, I now have numbers for my A770M and the integrated GPU. Specifically, the A770M clocks in at 99.43 us and the i7-12700H is still there at ~45 us.

Is this the startup timing for a cold start (single run), or an average of warm ones (a tight loop of runs)?
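To make the distinction concrete, here is a sketch of separating cold-start cost from the warm average, reusing the OpenCL setup from the earlier snippet (the profiling-enabled queue `q` and no-op kernel `k`; the WARMUP/RUNS counts are arbitrary illustrative choices):

```c
/* Sketch: cold vs. warm launch latency. Assumes a profiling-enabled
 * queue `q` and a no-op kernel `k` as in the earlier snippet. */
#include <CL/cl.h>
#include <stdio.h>

static void measure_warm(cl_command_queue q, cl_kernel k)
{
    enum { WARMUP = 16, RUNS = 256 };
    size_t gws = 1;
    double total_us = 0.0;

    for (int i = 0; i < WARMUP + RUNS; i++) {
        cl_event ev;
        cl_ulong queued, start;

        clEnqueueNDRangeKernel(q, k, 1, NULL, &gws, NULL, 0, NULL, &ev);
        clWaitForEvents(1, &ev);
        clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_QUEUED,
                                sizeof queued, &queued, NULL);
        clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START,
                                sizeof start, &start, NULL);
        /* the first iterations absorb one-time costs (JIT, paging,
         * frequency ramp-up); only warm runs go into the average */
        if (i >= WARMUP)
            total_us += (start - queued) / 1000.0;
        clReleaseEvent(ev);
    }
    printf("warm average: %.2f us over %d runs\n", total_us / RUNS, RUNS);
}
```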

If you run some other (lightweight) workload on the same GPU in the background (so that the GPU frequency is up when your test starts), will the dGPU perform better than the iGPU?

These numbers from clpeak are consistent with my own real-world (but still launch-latency-sensitive) test cases, where the integrated GPU is outperforming my A770M despite the A770M having a huge advantage in execution resources and bandwidth.

Please check with intel_gpu_top (from your distro's intel-gpu-tools package) what frequencies both of these GPUs are running at during your test cases. Your workloads may be so lightweight for the dGPU that it runs at a low frequency, but heavy enough to keep the iGPU at a high frequency.

If this is the case, the issue is kernel/FW power management rather than the compute driver / GPU job submission.

eero-t avatar Feb 21 '23 09:02 eero-t

So clpeak does appear to try to fully saturate the GPU when determining launch latency, and it uses multiple iterations:

https://github.com/krrishnarraj/clpeak/blob/master/src/kernel_latency.cpp#L11

and the results I get are reproducible.

As for my real-world test case, the clocks are pinned at ~2050 MHz according to intel_gpu_top, with an [unknown] engine being busy ~14.5% of the time. The Render/3D engine is at ~0.3% (I use the A770M as my display adaptor).

FreddieWitherden avatar Feb 21 '23 12:02 FreddieWitherden

Could you give utilization & frequency values for both of the GPUs whose latency you are comparing?

You can select which card is shown with the -d option, like this: intel_gpu_top -d drm:/dev/dri/card1.

eero-t avatar Feb 27 '23 13:02 eero-t

For my integrated GPU the clock speed is fixed throughout the entire case (which takes ~6 minutes or so) at ~1400 MHz, give or take 5 MHz. The utilisation here is from the Render/3D engine, which is steady at ~40%, give or take a percent.

FreddieWitherden avatar Feb 27 '23 13:02 FreddieWitherden

For my integrated GPU the clock speed is fixed throughout the entire case (which takes ~6 minutes or so) at ~1400 MHz, give or take 5 MHz. The utilisation here is from the Render/3D engine, which is steady at ~40%, give or take a percent.

As for my real-world test case, the clocks are pinned at ~2050 MHz according to intel_gpu_top, with an [unknown] engine being busy ~14.5% of the time. The Render/3D engine is at ~0.3% (I use the A770M as my display adaptor).

Thanks!

So although clpeak is not able to utilize the GPU fully (which is a separate problem), at least both are running at full speed, i.e. things are comparable.

PS. On iGPUs, Render/3D is the pipeline with the shader cores, which can act in 3D, compute, or media mode, whereas Arc has a separate compute engine pipeline. If you build the latest IGT version from upstream, it shows that [unknown] is the Compute engine: https://gitlab.freedesktop.org/drm/igt-gpu-tools

eero-t avatar Feb 27 '23 14:02 eero-t

Hello FreddieWitherden and BA8F0D39

GitHub issue #600 is still open even though answers have already been delivered along the way. Let's summarize this topic and move toward closure. For the best (lowest) latency on the Arc/Flex dGFX family, please consider using the Intel OEM kernel, which can be obtained by following these instructions: https://dgpu-docs.intel.com/installation-guides/index.html. The latest version is available as a package (https://dgpu-docs.intel.com/releases/stable_555_20230124.html), and the source code to build on your own is at https://github.com/intel-gpu/intel-gpu-i915-backports. The OEM kernel driver contains a significant change to the dispatch model, which allows the Compute-Runtime drivers to enable low-latency submission. Direct submission is not available in the upstream generic kernel, as mentioned for 6.1/6.2.

With the upstream 6.1/6.2 generic Linux kernel, the compute-runtime driver uses the legacy submission model; it is the same unified path for integrated and discrete Gfx devices. With legacy submission, longer kernel launch latency timings are expected.
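To illustrate the difference: under the legacy model, every batch crosses into the kernel through an ioctl before the GuC schedules it. A minimal sketch (buffer and relocation setup omitted; see the i915 uAPI header for the full struct):

```c
/* Legacy submission in a nutshell: one user->kernel transition per
 * batch via the execbuffer ioctl, versus a doorbell write under
 * direct submission. Struct setup is abbreviated. */
#include <sys/ioctl.h>
#include <drm/i915_drm.h>

static int submit_legacy(int drm_fd, struct drm_i915_gem_execbuffer2 *eb)
{
    return ioctl(drm_fd, DRM_IOCTL_I915_GEM_EXECBUFFER2, eb);
}
```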

BartusW avatar Apr 03 '23 12:04 BartusW

Okay, sounds good. Are there any plans to upstream the direct submission paths to the mainline Linux kernel?

FreddieWitherden avatar Apr 03 '23 13:04 FreddieWitherden

At this moment there is no plan to upstream the VM-Bind capability into the 6.2 generic kernel.

BartusW avatar Apr 03 '23 14:04 BartusW

@BartusW The last stable release was on 2023-01-24. Will there be an update?

BA8F0D39 avatar Apr 03 '23 19:04 BA8F0D39

Are you asking about VM-Bind (the answer is covered above) or just asking in general about kernel launch latency changes?

BartusW avatar Apr 04 '23 09:04 BartusW

@BartusW I mean, will the kernel packages at https://repositories.intel.com/graphics/ be updated?

BA8F0D39 avatar Apr 06 '23 02:04 BA8F0D39

At this moment there is no plan to upstream the VM-Bind capability into the 6.2 generic kernel.

Will the new dispatch model be added to the upstream i915 and Xe KMDs in the future? Or is it an OEM-exclusive feature?

nyanmisaka avatar Apr 07 '23 10:04 nyanmisaka

At this moment there is no plan to upstream the VM-Bind capability into the 6.2 generic kernel.

Will the new dispatch model be added to the upstream i915 and Xe KMDs in the future? Or is it an OEM-exclusive feature?

The Xe kernel driver is based on VM_BIND. For more info, see:

  • Merge plan (April 2023): https://lore.kernel.org/dri-devel/[email protected]/
  • Initial submission (Dec 2022): https://patchwork.freedesktop.org/series/112188/

eero-t avatar Jun 16 '23 12:06 eero-t

I've read somewhere that there will be no VM_BIND in i915, which will speed up the upstreaming of Xe KMD.

It has become clear that we have a long way towards fully featured implementation of VM_BIND in i915. Examples of the many challenges include integration with display, integration with userspace drivers, a rewrite of all the i915 IGTs to support execbuf3, alignment with DRM GPU VA manager[1] etc.

We are stopping further VM_BIND upstreaming efforts in i915 so we can accelerate the merge plan for the new drm/xe driver[2] which has been designed for VM_BIND from the beginning.

https://lists.freedesktop.org/archives/intel-gfx/2023-April/324237.html

nyanmisaka avatar Jun 16 '23 12:06 nyanmisaka

(Disclaimer: I'm not a driver developer, just a spectator, so this is just my clueless observation.)

Doing a major architectural change in the i915 kernel driver seems a practical impossibility to me, because it supports 1.5 decades of different Intel GPU HW; I think more than what's listed on this page: https://en.wikipedia.org/wiki/Intel_Graphics_Technology

That span covers multiple generations of user-space compute (OpenCL & SYCL), media, and 3D (OpenGL/GLES & Vulkan) drivers (maybe 10 different driver versions?), which would all need to be tested and validated, some on very old / hard-to-get HW, and after a large change like that, there would likely still be quite a few bugs that users would find only later on. And I do not see the alternative, of the kernel driver dropping support for anything older than what the last user-space driver generation supports (i.e. Broadwell and up), being accepted by users either.

eero-t avatar Jun 16 '23 13:06 eero-t