open-gpu-kernel-modules
nv_drm_atomic_commit can silently fail due to down_interruptible in nvkms_ioctl_common being interrupted, causing the Jay wayland compositor to hang
NVIDIA Open GPU Kernel Modules Version
570.133.07
Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.
- [ ] I confirm that this does not happen with the proprietary driver package.
Operating System and Version
Arch Linux
Kernel Release
6.14.2-arch1-1
Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.
- [x] I am running on a stable kernel release.
Hardware: GPU
NVIDIA GeForce RTX 2080
Describe the bug
When running the Jay wayland compositor, the screen freezes after a few seconds/minutes of use.
To Reproduce
Run the Jay wayland compositor. Wait a few minutes (move the mouse around/etc. until it freezes)
Bug Incidence
Always
nvidia-bug-report.log.gz
More Info
Long story short, this down_interruptible call being interrupted is the cause of the issue. Replacing it with while ((status = down_interruptible(&nvkms_lock)) == -EINTR); or just down(&nvkms_lock); (and removing the status variable) fully fixes the issue for me and makes Jay fully stable (I am writing this post in Jay on nvidia at the moment).
https://github.com/NVIDIA/open-gpu-kernel-modules/blob/41595798880394d9c6fa5e76b37ffc1bc428cfb2/kernel-open/nvidia-modeset/nvidia-modeset-linux.c#L1266
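For illustration, here is a minimal userspace sketch of the "retry while the acquire returns -EINTR" pattern. It is not the driver's code: fake_lock, fake_down_interruptible and down_retrying are hypothetical names, and the fake lock simply pretends to be interrupted a fixed number of times before it can be taken.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a kernel semaphore such as nvkms_lock. */
struct fake_lock {
    int pending_interrupts; /* how many acquire attempts will be "interrupted" */
    int held;               /* 1 once the lock has been taken */
    int attempts;           /* number of acquire attempts made so far */
};

/*
 * Hypothetical model of an interruptible acquire: returns -EINTR while
 * simulated interruptions remain, then takes the lock and returns 0.
 */
int fake_down_interruptible(struct fake_lock *lock)
{
    lock->attempts++;
    if (lock->pending_interrupts > 0) {
        lock->pending_interrupts--;
        return -EINTR;
    }
    lock->held = 1;
    return 0;
}

/*
 * Retry only while the acquire was interrupted. Any other status
 * (0 on success, or a different error) ends the loop and is returned.
 */
int down_retrying(struct fake_lock *lock)
{
    int status;

    assert(lock != NULL);
    do {
        status = fake_down_interruptible(lock);
    } while (status == -EINTR);

    return status;
}
```

Note that retrying unconditionally makes the wait effectively uninterruptible, which is why plain down(&nvkms_lock) behaves the same way for this purpose.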
The call stack:
- The DRM system calls nv_drm_atomic_commit, which calls nv_drm_atomic_apply_modeset_config
- which calls applyModeSetConfig
- which calls KmsFlip
- which calls nvkms_ioctl_from_kapi_try_pmlock(NVKMS_IOCTL_FLIP)
- which calls nvkms_ioctl_common
- which calls down_interruptible to obtain a lock on the global nvkms_lock
In reverse order:
- down_interruptible is interrupted, returning -EINTR
- which causes KmsFlip to fail (and print "NVKMS_IOCTL_FLIP ioctl failed"), returning false (instead of bubbling up -EINTR)
- which causes applyModeSetConfig to return false
- which makes nv_drm_atomic_apply_modeset_config return -EINVAL
- causing nv_drm_atomic_commit to fail (and print "Failed to apply atomic modeset. Error code: -22"). However, nv_drm_atomic_commit still returns 0 (success) to userland.
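As a rough model of how the error gets lost on the way up, here is a small userspace sketch. All model_* functions are hypothetical stand-ins that only mirror the return-value behaviour described in this report. They are not the driver's actual implementation, and the logged_error out-parameter stands in for the driver's log message.

```c
#include <errno.h>
#include <stdbool.h>

/* Models nvkms_ioctl_common: the lock acquire is interrupted or not. */
int model_nvkms_ioctl_common(bool interrupted)
{
    return interrupted ? -EINTR : 0;
}

/* Models KmsFlip: the specific errno is collapsed into a bool here. */
bool model_kms_flip(bool interrupted)
{
    return model_nvkms_ioctl_common(interrupted) == 0;
}

/* Models applyModeSetConfig: passes the bool through unchanged. */
bool model_apply_modeset_config(bool interrupted)
{
    return model_kms_flip(interrupted);
}

/* Models nv_drm_atomic_apply_modeset_config: false becomes -EINVAL. */
int model_atomic_apply_modeset_config(bool interrupted)
{
    return model_apply_modeset_config(interrupted) ? 0 : -EINVAL;
}

/*
 * Models nv_drm_atomic_commit: the failure is only recorded (the real
 * driver prints it), and 0 is returned to the caller regardless.
 */
int model_atomic_commit(bool interrupted, int *logged_error)
{
    *logged_error = model_atomic_apply_modeset_config(interrupted);
    return 0;
}
```

In this model an interrupted acquire ends up logged as -EINVAL (-22) while the caller still sees 0, which matches the messages and the silent failure described above.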
Some comments:
In nv_drm_atomic_commit, it states:
/*
* nv_drm_atomic_commit_internal() must not return failure after
* calling drm_atomic_helper_swap_state().
*/
(presumably, nv_drm_atomic_commit used to be called nv_drm_atomic_commit_internal). However, after calling drm_atomic_helper_swap_state, it calls nv_drm_atomic_apply_modeset_config, which eventually calls down_interruptible, which can obviously "fail" (be interrupted). My guess is that nvkms_ioctl_common was never intended to be called in such a strict error-handling context, hence the use of down_interruptible rather than down.
Additionally, the author of Jay said that this is likely an interaction with io_uring, with respect to signals and interruption.
I originally filed this issue with Jay. Some comments from the author of Jay and additional context can be found here: https://github.com/mahkoh/jay/issues/425
Thank you for the detailed bug report and analysis. This issue is being tracked internally via bug 5236368.
Thank you for looking into this!
Just an update: the author of Jay merged https://github.com/mahkoh/jay/pull/441 into the Jay project, which makes this issue significantly less likely to reproduce (an interrupt from io_uring is less likely to happen), although the underlying issue still needs to be fixed. So, to reproduce easily, use a version of Jay from before that commit. I believe 1.10.0 is the latest release before it. Hopefully you can get a repro using your own internal code and not through Jay, though!
This is marked as fixed in the new beta driver
https://www.nvidia.com/en-us/drivers/details/251355/
@khyperia could you revert the PR you linked and test if the issue is fixed?
After downpatching Jay to the commit before https://github.com/mahkoh/jay/pull/441, the issue is fixed when using the AUR nvidia-open-beta package, which is on 580.65.06. To confirm the test is valid, I checked that it is still broken with the nvidia-open package, which is on the old 575.64.05. So, yay!
Thanks!
Thanks for confirming!