open-gpu-kernel-modules

nv_drm_atomic_commit can silently fail due to down_interruptible in nvkms_ioctl_common being interrupted, causing the Jay wayland compositor to hang

Open khyperia opened this issue 7 months ago • 2 comments

NVIDIA Open GPU Kernel Modules Version

570.133.07

Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.

  • [ ] I confirm that this does not happen with the proprietary driver package.

Operating System and Version

Arch Linux

Kernel Release

6.14.2-arch1-1

Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.

  • [x] I am running on a stable kernel release.

Hardware: GPU

NVIDIA GeForce RTX 2080

Describe the bug

When running the Jay wayland compositor, the screen freezes after a few seconds/minutes of use.

To Reproduce

Run the Jay wayland compositor. Wait a few minutes (move the mouse around/etc. until it freezes)

Bug Incidence

Always

nvidia-bug-report.log.gz

nvidia-bug-report.log.gz

More Info

Long story short, this down_interruptible call being interrupted is the cause of the issue. Replacing it with while ((status = down_interruptible(&nvkms_lock)) == -EINTR); or just down(&nvkms_lock); (and removing the status variable) fully fixes the issue for me and makes Jay fully stable (I am writing this post in Jay on nvidia at the moment).

https://github.com/NVIDIA/open-gpu-kernel-modules/blob/41595798880394d9c6fa5e76b37ffc1bc428cfb2/kernel-open/nvidia-modeset/nvidia-modeset-linux.c#L1266
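To make the retry idea concrete, here is a minimal userland sketch of it. It is not driver code: `fake_sem`, `fake_down_interruptible`, and `fake_down_retry` are hypothetical stand-ins invented for illustration, and the "interruption" is simulated with a counter rather than real signals.

```c
#include <errno.h>

/* Hypothetical stand-in for a kernel semaphore (NOT the real struct semaphore). */
struct fake_sem {
    int count;              /* available permits */
    int pending_interrupts; /* how many upcoming acquire attempts get "interrupted" */
};

/* Mimics the return convention of down_interruptible():
 * 0 on success, -EINTR when the wait was interrupted. */
static int fake_down_interruptible(struct fake_sem *s)
{
    if (s->pending_interrupts > 0) {
        s->pending_interrupts--;
        return -EINTR;      /* lock was NOT taken */
    }
    s->count--;             /* a real semaphore would sleep here if count were 0 */
    return 0;
}

/* Retry for as long as the attempt is interrupted, so the caller
 * only ever sees a successful acquisition. */
static int fake_down_retry(struct fake_sem *s)
{
    int status;

    while ((status = fake_down_interruptible(s)) == -EINTR)
        ; /* try again */
    return status;
}
```

With `struct fake_sem s = { 1, 3 };`, a single `fake_down_interruptible(&s)` call fails with `-EINTR` and leaves the lock untaken, while `fake_down_retry(&s)` absorbs the remaining interruptions and returns 0. In the real kernel a pending signal stays pending across retries, so such a loop may busy-wait until the lock is free; that is one reason plain `down()` is probably the cleaner of the two options.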

The call stack:

In reverse order:

  • down_interruptible is interrupted, returning -EINTR
  • which causes KmsFlip to fail (and print "NVKMS_IOCTL_FLIP ioctl failed"), returning false (instead of bubbling up -EINTR)
  • which causes applyModeSetConfig to return false
  • which makes nv_drm_atomic_apply_modeset_config return -EINVAL
  • causing nv_drm_atomic_commit to fail (and print "Failed to apply atomic modeset. Error code: -22"). However, nv_drm_atomic_commit still returns 0 (success) to userland.
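The chain above can be sketched as a small self-contained simulation showing where the original -EINTR gets lost. All `sim_*` functions are hypothetical stand-ins that only mirror the return-value handling described in the list; they are not the driver's actual code or signatures.

```c
#include <errno.h>
#include <stdbool.h>

/* All sim_* names are hypothetical stand-ins, not the driver's functions. */

/* Step 1: the lock acquisition is interrupted and reports -EINTR. */
static int sim_ioctl_common(bool interrupted)
{
    return interrupted ? -EINTR : 0;
}

/* Step 2: the flip helper reduces any error to a boolean, losing -EINTR. */
static bool sim_kms_flip(bool interrupted)
{
    return sim_ioctl_common(interrupted) == 0;
}

/* Step 3: the apply step maps "false" to a generic -EINVAL (-22 on Linux). */
static int sim_apply_modeset_config(bool interrupted)
{
    return sim_kms_flip(interrupted) ? 0 : -EINVAL;
}

/* Step 4: the commit path records the failure but still reports success. */
static int sim_atomic_commit(bool interrupted, int *logged_error)
{
    int ret = sim_apply_modeset_config(interrupted);

    if (ret != 0)
        *logged_error = ret; /* the real driver prints an error message here */
    return 0;                /* the caller (userland) never sees the failure */
}
```

Calling `sim_atomic_commit(true, &logged)` returns 0 while `logged` ends up as `-EINVAL`: the caller has no way to tell that the flip never happened, and no way to learn that the root cause was an interruption rather than an invalid configuration.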

Some comments: In nv_drm_atomic_commit, it states:

/*
 * nv_drm_atomic_commit_internal() must not return failure after
 * calling drm_atomic_helper_swap_state().
 */

(presumably, nv_drm_atomic_commit used to be called nv_drm_atomic_commit_internal). However, after calling drm_atomic_helper_swap_state, it calls nv_drm_atomic_apply_modeset_config, which eventually calls down_interruptible, which can obviously "fail" (be interrupted). My guess is that nvkms_ioctl_common was never intended to be called in such a strict error-handling context, hence the use of down_interruptible rather than down.

Additionally, the author of Jay said that this is likely an interaction with io_uring, with respect to signals and interruption.

I originally filed this issue with Jay. Some comments from the author of Jay and additional context can be found here: https://github.com/mahkoh/jay/issues/425

khyperia • Apr 20 '25 10:04

Thank you for the detailed bug report and analysis. This issue is being tracked internally via bug 5236368.

AlexGoinsNV • Apr 22 '25 01:04

Thank you for looking into this!

Just an update: the author of Jay committed https://github.com/mahkoh/jay/pull/441 to the Jay project, which makes this issue significantly less likely to reproduce (an interrupt from io_uring is less likely to happen), although the underlying issue still needs to be fixed. So, to reproduce it easily, use a version of Jay from before that commit; I believe 1.10.0 is the latest release before it. Hopefully you can get a repro using your own internal code and not through Jay, though!

khyperia • Apr 25 '25 16:04

This is marked as fixed in the new beta driver

https://www.nvidia.com/en-us/drivers/details/251355/

C0rn3j • Aug 04 '25 21:08

@khyperia could you revert the PR you linked and test if the issue is fixed?

mahkoh • Aug 05 '25 04:08

With Jay downpatched to the commit before https://github.com/mahkoh/jay/pull/441, the issue is fixed when using the AUR nvidia-open-beta package, which is on 580.65.06. Just to confirm the test is valid: it is still broken on the nvidia-open package, which is on the old 575.64.05. So, yay!

khyperia • Aug 05 '25 08:08

Thanks!

mahkoh • Aug 05 '25 08:08

Thanks for confirming!

mtijanic • Aug 05 '25 11:08