compute-runtime

GPU hangs when using GLK

Open cdecker08 opened this issue 1 year ago • 5 comments

Users have been seeing errors when tonemapping with ffmpeg on kernels 5.17 or newer. I have attached the log from one such user, captured with PrintDebugMessages and PrintIoctlEntries enabled. The line that caught my eye is "ERROR: GPU HANG detected!". The user is running Ubuntu 22.04.4 LTS (kernel 6.5.0).

log.txt

cdecker08 avatar Jul 23 '24 19:07 cdecker08

Is this with the 5.17 version of the i915 kernel GPU driver (which would be very old), or with a newer i915 DKMS? If the latter, which version?

eero-t avatar Aug 23 '24 11:08 eero-t

How would I figure that out?

cdecker08 avatar Aug 23 '24 11:08 cdecker08

I missed that the log was from a 6.5 kernel, which is much newer. dkms status tells whether any DKMS kernel packages are installed, and the related DEB packages can be listed with dpkg -l '*dkms*'.
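Another way to tell an in-tree i915 from a DKMS build is to look at where the module file lives. This is only a sketch: it assumes the usual Ubuntu layout, where DKMS modules are installed under updates/dkms, and i915_origin is a hypothetical helper name, not part of any tool.

```shell
# Classify an i915 module path. Assumption: DKMS builds are installed under
# an "updates/dkms" directory (the usual Ubuntu layout), in-tree modules
# under "kernel/". Other distributions may use different directories.
i915_origin() {
    case "$1" in
        */updates/dkms/*) echo "dkms" ;;
        */kernel/*)       echo "in-tree" ;;
        *)                echo "unknown" ;;
    esac
}

# On a real system:
#   uname -r                                   # running kernel version
#   i915_origin "$(modinfo -F filename i915)"  # where the i915 module comes from
```

If this prints "dkms", the version reported by dkms status is the one worth quoting in the bug report.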

Note that there are quite a lot of reasons why there may be GPU hangs:

  • As FFmpeg does far more media than compute operations, the hang could be due to an issue in media driver operation rather than a compute one: https://github.com/intel/media-driver/issues/
  • Or the issue is with driver interaction => it would help to find the minimal operation that triggers the hang
  • Or it's a kernel driver issue: https://drm.pages.freedesktop.org/intel-docs/how-to-file-i915-bugs.html
  • Or the tonemapping operation is just so slow on GLK that it triggers the kernel's GPU hang check timer

The last one can be checked by greatly increasing the hang timer, or by disabling it completely, to see whether the operation actually completes when given enough time: https://www.intel.com/content/www/us/en/docs/oneapi/installation-guide-linux/2023-2/gpu-disable-hangcheck.html

Note: if you disable the hang check completely, make sure you have remote access to the computer in case it turns out to be a real GPU hang!
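For a quick, non-persistent experiment, the hang check can be toggled through the i915 module parameter, which as far as I recall is what the linked guide does. A minimal sketch, assuming the parameter is exposed at /sys/module/i915/parameters/enable_hangcheck; show_hangcheck, set_hangcheck and the HANGCHECK_PARAM variable are hypothetical names used only for illustration.

```shell
# Path of the i915 hangcheck module parameter (assumed location).
# Overridable through the environment so the helpers can be dry-run on a plain file.
HANGCHECK_PARAM="${HANGCHECK_PARAM:-/sys/module/i915/parameters/enable_hangcheck}"

# Print the current setting (Y or N).
show_hangcheck() {
    cat "$HANGCHECK_PARAM"
}

# Enable (Y) or disable (N) the hang check.
# Needs root on a real system; the change does not survive a reboot.
set_hangcheck() {
    case "$1" in
        Y|N) printf '%s\n' "$1" > "$HANGCHECK_PARAM" ;;
        *)   echo "usage: set_hangcheck Y|N" >&2; return 1 ;;
    esac
}
```

The idea would be to run set_hangcheck N as root before the FFmpeg test and set_hangcheck Y afterwards. If the tonemapping then completes, the operation was merely slow rather than truly hung.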

Unless the operation is just too slow for the default hang timer value, I think it's best to report this against the media driver, as it's an FFmpeg use-case.

eero-t avatar Aug 23 '24 12:08 eero-t

Of the timing values under sysfs, I think the hang timer is the "heartbeat" one:

$ head /sys/class/drm/card0/engine/*/*heart*_ms
==> /sys/class/drm/card0/engine/bcs0/heartbeat_interval_ms <==
2500

==> /sys/class/drm/card0/engine/rcs0/heartbeat_interval_ms <==
2500

==> /sys/class/drm/card0/engine/vcs0/heartbeat_interval_ms <==
2500

==> /sys/class/drm/card0/engine/vecs0/heartbeat_interval_ms <==
2500

Does dmesg tell which of the above 3 GLK engine types (copy, 3D/compute, video) is non-responsive for too long?
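To try a longer interval on all engines at once, something like the following could be used. It is a sketch only: set_heartbeat_ms is a hypothetical helper name, 10000 is an arbitrary example value, and the card0 default simply mirrors the listing above.

```shell
# Write a new heartbeat interval (in ms) to every engine of one card.
#   $1: interval in milliseconds
#   $2: card directory (defaults to card0, as in the listing above)
# Needs root on a real system; the values are reset on reboot.
set_heartbeat_ms() {
    ms="$1"
    card="${2:-/sys/class/drm/card0}"
    for f in "$card"/engine/*/heartbeat_interval_ms; do
        [ -e "$f" ] || continue          # skip if the glob matched nothing
        printf '%s\n' "$ms" > "$f" && echo "$f -> $ms"
    done
}

# On a real system (as root):
#   set_heartbeat_ms 10000
#   dmesg | grep -i 'gpu hang'    # look for i915 hang reports after the test
```

The i915 hang messages in the kernel log are where the affected engine would be expected to show up, which is why the dmesg check is included.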

eero-t avatar Aug 23 '24 12:08 eero-t

Ouch, I just noticed this from the log: vaapi=vaapi:/dev/dri/renderD128,driver=i965

Which is: https://packages.ubuntu.com/jammy/i965-va-driver

I.e. the legacy driver for HW before GLK, instead of a media driver that is still supported by Intel:

  • https://packages.ubuntu.com/jammy/intel-media-va-driver
  • https://packages.ubuntu.com/jammy/intel-media-va-driver-non-free
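To check which VA-API driver a given FFmpeg device string selects, and to retest with the media driver instead, a sketch like this may help. va_driver_from_device is a hypothetical helper name; iHD is the driver name used by the Intel media driver packages listed above.

```shell
# Extract the "driver=" value from an FFmpeg hardware device string.
# Prints nothing if the string has no driver option.
va_driver_from_device() {
    printf '%s\n' "$1" | sed -n 's/.*driver=\([^,]*\).*/\1/p'
}

# On a real system:
#   va_driver_from_device 'vaapi=vaapi:/dev/dri/renderD128,driver=i965'
#   LIBVA_DRIVER_NAME=iHD vainfo    # check that the media driver loads
# and in the FFmpeg command line, use driver=iHD instead of driver=i965.
```

If vainfo fails with iHD, one of the two packages above is probably not installed.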

eero-t avatar Aug 23 '24 12:08 eero-t

@cdecker08 Could you retest using the latest release for the GLK platform? https://github.com/intel/compute-runtime/releases/tag/24.35.30872.36

JablonskiMateusz avatar Aug 06 '25 06:08 JablonskiMateusz

Hi @cdecker08,

We’d like to know if this issue is still affecting you. If so, please provide an update or any additional information. Otherwise, we’ll close this issue after 30 days of inactivity. Your feedback is appreciated!

kgibala avatar Oct 15 '25 09:10 kgibala