open-gpu-kernel-modules icon indicating copy to clipboard operation
open-gpu-kernel-modules copied to clipboard

Atomic commit fails if IN_FENCE_FD is set

Open mahkoh opened this issue 1 year ago • 8 comments

NVIDIA Open GPU Kernel Modules Version

550.40.59

Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.

  • [ ] I confirm that this does not happen with the proprietary driver package.

Operating System and Version

n/a

Kernel Release

n/a

Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.

  • [ ] I am running on a stable kernel release.

Hardware: GPU

n/a

Describe the bug

When an atomic commit contains an IN_FENCE_FD, the nvidia driver rejects the commit with EPERM:

    if (plane_state->fence != NULL || nv_drm_plane_state->fd_user_ptr) {
        if (!nv_dev->supportsSyncpts) {
            return -1;
        }

-1 is EPERM.

First of all this is not a correct error code for this situation.

But more importantly, this makes it impossible to synchronize presentation with rendering from the compositor side. The workaround is to just not use IN_FENCE_FD if the nvidia driver is detected. Kinda absurd since nvidia was the vendor pushing for explicit sync in the first place.

If that is the expected workaround, then maybe the driver should just accept such commits instead of returning an error.


For whatever reason, one of my users reports that the commits are nevertheless applied by the nvidia driver even though the atomic commit fails. This might be a separate bug.

To Reproduce

n/a

Bug Incidence

Always

nvidia-bug-report.log.gz

n/a

More Info

I don't have nvidia hardware so I cannot fill out the remaining fields.

mahkoh avatar Apr 10 '24 09:04 mahkoh

@cubanismo is working on adding support for IN_FENCE_FD and OUT_FENCE_PTR to our nvidia-drm module, so these should be working properly in near-future driver release. By "properly" I mean leveraging the synchronization capabilities of the GPU's display hardware.

Having said that, I checked with him and we agreed that our current behavior of failing when IN_FENCE_FD is specified is wrong (and, yeah, returning -EPERM is extra wrong). Instead, we should be calling drm_atomic_helper_wait_for_fences to do a CPU wait on the fence. This is what the nouveau driver does, for example.

As you pointed out, there isn't really a good way for userspace to determine whether IN_FENCE_FD is supported, implying the intention was probably for it to always be supported. At worst using a CPU wait as described above.

If we had noticed this problem earlier we would most likely have made such a change as a short-term fix. However, at this point, since a better fix is right around the corner, there may not be any benefit to doing that.

erik-kz avatar Apr 12 '24 18:04 erik-kz

Thanks for the update.

PS: The issue I mentioned re "commit being applied anyway" turned out to be a bug on my side where I would render directly to the front buffer if too many commits failed in a row.

mahkoh avatar Apr 12 '24 18:04 mahkoh

It's nice to see this added in the 560.28.03 beta. Though testing this with cosmic-comp/smithay (https://github.com/Smithay/smithay/pull/1489) on a Intel+NVIDIA graphics laptop it seems to be running into an issue and logging "Failed to initialize semaphore for plane fence" to dmesg. I don't think the compositor is doing anything wrong here.

ids1024 avatar Jul 23 '24 22:07 ids1024

I can confirm that whilst this was added in the 560 drivers, it does not seem to be implemented fully correctly as ids1024 mentioned. I seem to be getting constant freezes and display lockups which is caused by an issue with the implementation.

fxzzi avatar Aug 18 '24 14:08 fxzzi

@ids1024, @Fxzzi, were you using more than one display? We identified a bug in the implementation in some situations on Wayland desktops with multiple displays. It should be resolved in subsequent 560-series drivers.

cubanismo avatar Aug 20 '24 01:08 cubanismo

Here is the relevant forum post for the fix refered to by @cubanismo.

4769163 - Screen freeze observed with directly scanned out windows on a multi-monitor configuration

https://forums.developer.nvidia.com/t/560-release-feedback-discussion/300830/210

Binary-Eater avatar Aug 20 '24 01:08 Binary-Eater

@ids1024, @Fxzzi, were you using more than one display? We identified a bug in the implementation in some situations on Wayland desktops with multiple displays. It should be resolved in subsequent 560-series drivers.

I am personally using 2 displays. Hopefully the issue you've resolved is the same as what I was experiencing.

fxzzi avatar Aug 20 '24 04:08 fxzzi

We released driver 560.35.03 yesterday, which contains the fix we are referring to for multiple displays.

https://www.nvidia.com/en-us/drivers/details/230918/

Binary-Eater avatar Aug 22 '24 16:08 Binary-Eater

Fix released as @Binary-Eater mentions above.

https://forums.developer.nvidia.com/t/560-release-feedback-discussion/300830/309 Internal bug 4769163

gauravjuvekar avatar Aug 24 '24 01:08 gauravjuvekar