compute-runtime
Slow Kernel Launch Times
Using clpeak with runtime 22.43.24595 on the integrated GPU of an i7-12700H CPU under Linux, I find the kernel launch latency to be 42.46 us. This is around 10 times slower than can be expected from a recent discrete GPU from AMD/NVIDIA connected over PCIe. In general, one would expect integrated GPUs to have an advantage here, since the CPU and GPU share an LLC.
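For reference, launch latency of this kind is usually measured by timing a tight loop of trivial kernel launches. Below is a minimal sketch of such a microbenchmark; it is not clpeak's actual code (clpeak uses OpenCL event profiling, so its numbers will not match exactly), device selection is simplified to "first GPU of the first platform", and error handling is omitted:

```cpp
#define CL_TARGET_OPENCL_VERSION 300
#include <CL/cl.h>
#include <chrono>
#include <cstdio>

int main() {
    // Trivial kernel: the work itself is negligible, so the loop below mostly
    // measures submission + completion overhead.
    const char *src = "__kernel void nop() { }";

    cl_platform_id platform;
    clGetPlatformIDs(1, &platform, nullptr);
    cl_device_id device;
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, nullptr);

    cl_int err;
    cl_context ctx = clCreateContext(nullptr, 1, &device, nullptr, nullptr, &err);
    cl_command_queue queue = clCreateCommandQueueWithProperties(ctx, device, nullptr, &err);

    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, nullptr, &err);
    clBuildProgram(prog, 1, &device, "", nullptr, nullptr);
    cl_kernel kernel = clCreateKernel(prog, "nop", &err);

    const size_t gws = 1;
    const int iters = 10000;

    // Warm-up so the kernel is resident and the GPU has a chance to clock up.
    for (int i = 0; i < 100; ++i)
        clEnqueueNDRangeKernel(queue, kernel, 1, nullptr, &gws, nullptr, 0, nullptr, nullptr);
    clFinish(queue);

    // Timed loop: each iteration submits and waits, so the average includes
    // the full host-side submit -> execute -> completion round trip.
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i) {
        clEnqueueNDRangeKernel(queue, kernel, 1, nullptr, &gws, nullptr, 0, nullptr, nullptr);
        clFinish(queue);
    }
    auto t1 = std::chrono::steady_clock::now();

    double us = std::chrono::duration<double, std::micro>(t1 - t0).count() / iters;
    std::printf("average launch + completion latency: %.2f us\n", us);
    return 0;
}
```

Built with something like `g++ bench.cpp -lOpenCL`, this gives a rough per-launch figure comparable in spirit to the clpeak number quoted above.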
Integrated GPUs go through the KMD and GuC for submissions, hence the times you observe are within the expected range.
Is this a hardware limitation? On Linux, both AMD and NVIDIA have moved to userspace submission, so launching a kernel is reduced to a memcpy + atomic. In addition to massively reducing latency, this also avoids the need for clFlush calls to get submitted work items running.
For an A770M I've observed launch times on the order of ~15 us on the same system; better, but still over three times higher than what I'm used to seeing.
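To illustrate the clFlush point for readers following along: with a driver that batches submissions through the KMD, an enqueue may only be recorded by the runtime rather than started on the GPU. A hedged sketch of the usual pattern (the queue and kernel here are placeholders, not anything from clpeak or compute-runtime):

```cpp
#define CL_TARGET_OPENCL_VERSION 300
#include <CL/cl.h>

// Overlap CPU work with a previously enqueued kernel by flushing explicitly.
void launch_and_overlap(cl_command_queue queue, cl_kernel kernel, size_t gws) {
    clEnqueueNDRangeKernel(queue, kernel, 1, nullptr, &gws, nullptr, 0, nullptr, nullptr);

    // Without userspace/direct submission, the enqueue above may only be
    // batched by the runtime; clFlush asks it to actually submit the work now.
    clFlush(queue);

    // ... independent CPU work can run here while the GPU executes ...

    // Block until the submitted work completes.
    clFinish(queue);
}
```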
On the A770M the launch time should be around 8 us. What is your operating system?
"Is this a hardware limitation? On Linux, both AMD and NVIDIA have moved to userspace submission so launching a kernel is reduced to a memcpy + atomic. In addition to massively reducing latency this also avoids the need for clFlush calls to get submitted work items running."
It is not a hardware limitation, integrated parts are also capable of having direct submission. The code is not ready for enabling though, it requires VM_BIND to be enabled in Linux Kernel, which is not the case for integrated parts.
Those measurements were taken on Linux with a 6.1 kernel. I'd repeat them, although I currently cannot get OpenCL to work on my A770 (it works fine as a display adapter).
Best Intel dGPU support is currently in backport kernels: https://github.com/intel-gpu/intel-gpu-i915-backports
Binary packages for those are available from the Intel repository: https://dgpu-docs.intel.com/installation-guides/index.html
Intel dGPU support enabled in the v6.2(-rc1) upstream Linux kernel is still somewhat lagging behind those.
And older upstream kernel versions are missing even more features, besides needing the force-probe option even to recognize Intel dGPUs.
Having upgraded to 6.2, I now have numbers for my A770M and the integrated GPU. Specifically, the A770M clocks in at 99.43 us and the i7-12700H's iGPU is still at ~45 us.
These numbers from clpeak are consistent with my own real-world (but still launch-latency-sensitive) test cases, where the integrated GPU outperforms my A770M despite the latter's huge advantage in execution resources and bandwidth.
Is this startup timing for cold start (single run), or average of warm ones (tight loop of runs)?
If you run some other (lightweight) workload for the same GPU in the background (so that GPU frequency is up when your test starts), will dGPU perform better than iGPU?
Please check with intel_gpu_top (from your distro's intel-gpu-tools package) at what frequencies both of these GPUs are running during your test cases. Your workloads may be so lightweight for the dGPU that it runs at a low frequency, but heavy enough to keep the iGPU at a high frequency.
If this is the case, the issue is with kernel / FW power management rather than with the compute driver / GPU job submission.
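One way to test that hypothesis is to keep the dGPU busy with a trivial background load while re-running the latency test in another process. A minimal sketch of such a "spinner", written for this discussion (not taken from the thread), using the same simplified device selection as the earlier sketch:

```cpp
#define CL_TARGET_OPENCL_VERSION 300
#include <CL/cl.h>

// Endless lightweight load: keeps the GPU clocked up without saturating it.
int main() {
    const char *src = "__kernel void spin(__global int *p) { p[get_global_id(0)] += 1; }";

    cl_platform_id platform;
    clGetPlatformIDs(1, &platform, nullptr);
    cl_device_id device;
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, nullptr);

    cl_int err;
    cl_context ctx = clCreateContext(nullptr, 1, &device, nullptr, nullptr, &err);
    cl_command_queue queue = clCreateCommandQueueWithProperties(ctx, device, nullptr, &err);

    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, nullptr, &err);
    clBuildProgram(prog, 1, &device, "", nullptr, nullptr);
    cl_kernel kernel = clCreateKernel(prog, "spin", &err);

    const size_t gws = 64;
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, gws * sizeof(cl_int), nullptr, &err);
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);

    // Run until killed (Ctrl+C); each iteration is tiny, but frequent enough
    // to keep the GPU out of its lowest power states.
    for (;;) {
        clEnqueueNDRangeKernel(queue, kernel, 1, nullptr, &gws, nullptr, 0, nullptr, nullptr);
        clFinish(queue);
    }
}
```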
So clpeak does appear to try to fully saturate the GPU when determining launch latency, and it uses multiple iterations (https://github.com/krrishnarraj/clpeak/blob/master/src/kernel_latency.cpp#L11), and the results I get are reproducible.
As for my real-world test case, the clocks are pinned at ~2050 MHz according to intel_gpu_top, with an [unknown] engine being busy ~14.5% of the time. The Render/3D engine is at ~0.3% (I use the A770M as my display adaptor).
Could you give utilization & freq values for both GPUs whose latency you are comparing?
You can select which card is shown with the -d option, like this: intel_gpu_top -d drm:/dev/dri/card1
For my integrated GPU the clock speed is fixed throughout the entire test case (which takes ~6 minutes or so) at ~1400 MHz, give or take 5 MHz. The utilisation here is from the Render/3D engine, which is fixed at about ~40%, give or take a percent.
Thanks!
So although clpeak is not able to utilize the GPU fully (which is a separate problem), at least both are running at full speed, i.e. things are comparable.
PS. On iGPUs, Render/3D is the pipeline with the shader cores, which can act in 3D, compute or media mode, whereas Arc has a separate compute engine pipeline. If you build the latest IGT version from upstream, it shows that [unknown] is the Compute engine: https://gitlab.freedesktop.org/drm/igt-gpu-tools
Hello FreddieWitherden and BA8F0D39,
GitHub issue #600 is still open even though the answers have already been delivered along the way. Let's summarize this topic and move towards closure. For the best (lowest) latency on the Arc/Flex dGFX family, please consider using the Intel OEM kernel, which can be obtained by following these instructions: https://dgpu-docs.intel.com/installation-guides/index.html (latest version as a package: https://dgpu-docs.intel.com/releases/stable_555_20230124.html, and source code to build on your own: https://github.com/intel-gpu/intel-gpu-i915-backports). The OEM kernel driver contains a significant change to the dispatch model, which allows the Compute-Runtime driver to enable low-latency submission. Direct submission is not available in the upstream generic kernel, as mentioned for 6.1/6.2.
With the upstream 6.1/6.2 generic Linux kernel, the compute-runtime driver uses the legacy submission model, which is the same unified path for integrated and discrete Gfx devices. With legacy submission, longer kernel launch latencies are expected.
Okay sounds good. Are there any plans to upstream the direct submission paths to the mainline Linux kernel?
At this moment there is no plan to upstream VM-Bind capability into 6.2 generic kernel.
@BartusW The last stable release was on 2023-01-24. Will there be an update?
Are you asking about VM-Bind (the answer is covered above) or just asking in general about kernel launch latency changes?
@BartusW I mean, will the kernel packages at https://repositories.intel.com/graphics/ be updated?
Will the new dispatch model be added into the upstream i915 and Xe KMD in the future? Or is it an OEM exclusive feature?
Xe kernel driver is based on VM_BIND. For more info, see:
- Merge plan (April 2023): https://lore.kernel.org/dri-devel/[email protected]/
- Initial submission (Dec 2022): https://patchwork.freedesktop.org/series/112188/
I've read somewhere that there will be no VM_BIND in i915, which will speed up the upstreaming of Xe KMD.
It has become clear that we have a long way towards fully featured implementation of VM_BIND in i915. Examples of the many challenges include integration with display, integration with userspace drivers, a rewrite of all the i915 IGTs to support execbuf3, alignment with DRM GPU VA manager[1] etc.
We are stopping further VM_BIND upstreaming efforts in i915 so we can accelerate the merge plan for the new drm/xe driver[2] which has been designed for VM_BIND from the beginning.
https://lists.freedesktop.org/archives/intel-gfx/2023-April/324237.html
(Disclaimer: I'm not a driver developer, just a spectator, so this is just my clueless observation.)
Doing a major architectural change to the i915 kernel driver seems a practical impossibility to me, because it supports 1.5 decades of different Intel GPU HW, I think more than what's listed on this page: https://en.wikipedia.org/wiki/Intel_Graphics_Technology
That span covers multiple generations of user-space compute (OpenCL & SYCL), media, and 3D (OpenGL/GLES & Vulkan) drivers (maybe 10 different driver versions?), which would all need to be tested and validated, some on very old / hard-to-get HW, and after a large change like that there would likely still be quite a few bugs that users would find only later on. And I do not see the alternative, the kernel driver dropping support for anything older than what the last user-space driver generation supports (i.e. Broadwell and up), being accepted by users either.