Gemini Lake clpeak fails on Ubuntu 22.04 latest kernel

Open looi opened this issue 2 years ago • 9 comments

System: Dell Wyse 5070 with an Intel Celeron J4105 (Gemini Lake). Intel Compute Runtime 23.30.26918.9, installed following the official instructions.

What works

  • Ubuntu 20.04 kernel 5.4: Works out of the box
  • Ubuntu 22.04 kernel 5.15: Doesn't work out of the box, but works with i915.enable_hangcheck=0 i915.request_timeout_ms=100000, see https://github.com/intel/compute-runtime/issues/497
  • Without the above params, clpeak hangs with the following kernel logs:
[  101.920695] Fence expiration time out i915-0000:00:02.0:clpeak[969]:454!
[  101.920754] Fence expiration time out i915-0000:00:02.0:clpeak[969]:452!
[  101.920760] Fence expiration time out i915-0000:00:02.0:clpeak[969]:450!
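
To check whether the workaround parameters above actually reached the running kernel, a sketch like the following may help. It assumes the parameters were passed on the kernel command line (readable at /proc/cmdline on Linux). The function names are illustrative and not part of any Intel tool, and the plain whitespace split does not handle quoted values.

```python
from pathlib import Path

# Parameters reported in this issue as the kernel 5.15 workaround.
EXPECTED = {
    "i915.enable_hangcheck": "0",
    "i915.request_timeout_ms": "100000",
}


def parse_cmdline(cmdline):
    """Split a kernel command line into {name: value}; bare flags map to ''."""
    params = {}
    for token in cmdline.split():
        name, _, value = token.partition("=")
        params[name] = value
    return params


def missing_workaround_params(cmdline, expected=EXPECTED):
    """Return the expected parameters that are absent or set to another value."""
    actual = parse_cmdline(cmdline)
    return {k: v for k, v in expected.items() if actual.get(k) != v}


def read_running_cmdline(path="/proc/cmdline"):
    """Read the command line of the running kernel (Linux only)."""
    return Path(path).read_text()
```

Calling `missing_workaround_params(read_running_cmdline())` on the affected machine returns an empty dict when both parameters are set as described.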

What doesn't work

  • Ubuntu 22.04 kernel 6.2
  • clpeak fails, and produces error clFinish (-5) with kernel logs (note that these logs were never seen with kernel 5.15):
[   70.897984] i915 0000:00:02.0: [drm] Resetting rcs0 for preemption time out
[   70.898106] i915 0000:00:02.0: [drm] clpeak[910] context reset due to GPU hang
[   70.907884] i915 0000:00:02.0: [drm] GPU HANG: ecode 9:1:e757fefe, in clpeak [910]
...
[  170.596540] i915 0000:00:02.0: [drm] Resetting rcs0 for CS error
[  170.596659] i915 0000:00:02.0: [drm] clpeak[910] context reset due to GPU hang
[  170.606125] i915 0000:00:02.0: [drm] GPU HANG: ecode 9:1:ac045407, in clpeak [910]
  • Setting the params that fixed 5.15 (i915.enable_hangcheck=0 i915.request_timeout_ms=100000), or increasing the timeout to 200000, doesn't appear to help.
  • Vulkan compute (vkpeak) works fine. So this appears to be a problem specific to Intel Compute Runtime.
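
For comparing the two kernels, it can be handy to pull the i915 events out of a saved dmesg capture. The sketch below is written only against the log lines quoted in this issue, so the patterns are an assumption and may not match messages from other kernel versions. The names are illustrative.

```python
import re

# Patterns derived from the log lines quoted in this issue (an assumption:
# other kernel versions may word these messages differently).
PATTERNS = [
    ("fence_timeout", re.compile(
        r"Fence expiration time out .*:(?P<proc>[^:\[]+)\[(?P<pid>\d+)\]:(?P<fence>[0-9a-f]+)!")),
    ("engine_reset", re.compile(
        r"\[drm\] Resetting (?P<engine>\w+) for (?P<reason>.+)$")),
    ("gpu_hang", re.compile(
        r"\[drm\] GPU HANG: ecode (?P<ecode>[0-9a-f:]+), in (?P<proc>\S+) \[(?P<pid>\d+)\]")),
]


def parse_i915_events(log_text):
    """Return one dict per recognised i915 log line, in order of appearance."""
    events = []
    for line in log_text.splitlines():
        line = line.strip()
        for kind, pattern in PATTERNS:
            match = pattern.search(line)
            if match:
                events.append({"kind": kind, **match.groupdict()})
                break  # at most one event per line
    return events
```

Feeding it the 5.15 and 6.2 captures separately makes the difference visible: the 5.15 log only yields `fence_timeout` events, while the 6.2 log yields `engine_reset` and `gpu_hang` events.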

looi, Sep 28 '23 04:09

I have the same problem on both Celeron J4125 and N4000. I am seeing the following error in GPU inference in OpenVINO 2023.1.0.

terminate called after throwing an instance of 'InferenceEngine::GeneralError'
  what(): [ GENERAL_ERROR ] Check 'false' failed at src/plugins/intel_gpu/src/plugin/program.cpp:401:
GPU program build failed!
[GPU] clWaitForEvents, error code: -14

Note that Linux kernel 5.15 does not have the problem.
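
For reference, the numeric codes in this thread (-5 from clFinish, -14 from clWaitForEvents) map to standard OpenCL error names. A minimal lookup sketch follows. The table is deliberately partial, and the helper name is illustrative.

```python
# Partial table of standard OpenCL error codes, limited to a few that are
# relevant when a queue or event fails at runtime.
CL_ERROR_NAMES = {
    0: "CL_SUCCESS",
    -5: "CL_OUT_OF_RESOURCES",
    -6: "CL_OUT_OF_HOST_MEMORY",
    -11: "CL_BUILD_PROGRAM_FAILURE",
    -14: "CL_EXEC_STATUS_ERROR_FOR_EVENTS_IN_WAIT_LIST",
}


def cl_error_name(code):
    """Translate a numeric OpenCL status into its symbolic name, if known."""
    return CL_ERROR_NAMES.get(code, f"unknown OpenCL error ({code})")
```

Neither name points at a specific root cause: -14 only says that an event being waited on finished with an error, which is consistent with the GPU context being reset underneath the application.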

Ar-Ray-code, Dec 08 '23 02:12

I heard about this issue a year ago, but I no longer have a GLK device. This is a kernel regression, because Gemini Lake (GLK) only fails when using the new kernel.

@looi @Ar-Ray-code Better to file an issue in drm/intel. https://gitlab.freedesktop.org/drm/intel/-/issues/?label_name%5B%5D=Community

nyanmisaka, Dec 08 '23 14:12

I don't think this is necessarily a kernel regression, because as I have stated above, Vulkan compute works fine.

Personally, I have switched to using Vulkan. The performance is comparable (especially when making proper use of Vulkan subgroups), but more importantly, it seems to be much more stable on both Windows and Linux. Intel Compute Runtime / OpenCL has weird issues like this one. Vulkan also seems to work much better on non-Intel GPUs, especially NVIDIA, which refuses to support basic features like subgroups and half-precision floats in OpenCL. So I feel like Vulkan is the future and OpenCL is dying anyway.

looi, Dec 16 '23 00:12

> What works: Ubuntu 22.04 kernel 5.15: Doesn't work out of the box, but works with i915.enable_hangcheck=0
>
> What doesn't work: Ubuntu 22.04 kernel 6.2

Your input suggests this is a kernel regression. The only difference between the working and failing setups is the kernel version, so bisecting the commits between the two should find the culprit.

This isn't the first time I've seen an i915 regression; last time it even broke both Vulkan compute and OpenCL.
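
The idea behind bisecting can be sketched as a binary search. In practice `git bisect` does this over the kernel commit history. The function below is only an illustration: it assumes an ordered list of candidates (oldest first) with a single good-to-bad transition, and the `is_bad` callback is hypothetical, standing in for "boot this kernel, run clpeak, check for the failure".

```python
def first_bad(candidates, is_bad):
    """Find the first failing entry in an ordered list of candidates.

    Assumes candidates[0] is known good, candidates[-1] is known bad, and
    that everything after the first bad entry is also bad.
    """
    lo, hi = 0, len(candidates) - 1  # lo: last known good, hi: first known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(candidates[mid]):
            hi = mid  # the breakage is at mid or earlier
        else:
            lo = mid  # the breakage is after mid
    return candidates[hi]
```

Each test halves the remaining range, so even a long history between 5.15 and 6.2 needs only a logarithmic number of build-and-boot cycles.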

nyanmisaka, Dec 16 '23 00:12

I agree that a kernel change broke Intel Compute Runtime. I guess whether or not it's a kernel regression is a subjective question depending on what exactly caused the breakage. Maybe Intel Compute Runtime is making incorrect assumptions about i915 or relying on undefined behavior, in which case it would not be a kernel regression. Given that Vulkan compute still works, I think that is a likely possibility.

looi, Dec 16 '23 00:12