
[550.54.14] PRIME render offload stalls X startup

Open vishwin opened this issue 1 year ago • 28 comments

hw.nvidiadrm.modeset=1 set in /boot/loader.conf, kernel modules loaded from /etc/rc.conf. X startup stalls with the screen off after libinput initialises the last pointing device. Machine is otherwise responsive and X is able to be zapped.

550.54.14 Xorg.log

550.54.14 dmesg

535.146.02 dmesg (Xorg.0.log now missing :pensive:)

vishwin avatar Feb 29 '24 16:02 vishwin

One last little detail: what does the display setup for this look like? Just the laptop screen, or is there an external monitor plugged in as well? When I try with an external monitor I hit the panic in #21, so I'm assuming you're not doing that.

Another thing to check would be that you have two cardN entries in /dev/dri/, but I'm assuming that's the case since it seems everything initializes correctly.
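A quick way to do that check, as a minimal sketch: the helper name `count_drm_cards` is made up for illustration, and it only assumes the nodes are named `cardN` as described above.

```shell
# Count DRM card nodes in a directory (default: /dev/dri).
# With both GPUs initialised, two cardN entries are expected.
count_drm_cards() {
    dir="${1:-/dev/dri}"
    n=0
    for node in "$dir"/card[0-9]*; do
        # When nothing matches, the glob stays unexpanded and -e fails.
        if [ -e "$node" ]; then
            n=$((n + 1))
        fi
    done
    echo "$n"
}
```

Running `count_drm_cards` with no argument inspects /dev/dri; anything other than 2 on a two-GPU laptop would suggest one of the DRM drivers did not attach.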

amshafer avatar Feb 29 '24 20:02 amshafer

Ah nvm, reproduced

amshafer avatar Feb 29 '24 21:02 amshafer

For whatever strange reason I can only reproduce this when I load nvidia-drm before amdgpu. Can you test and see if you see the same? Maybe by loading them manually just to verify, I don't know what order the rc.conf variable loads things in.

fwiw if I load amdgpu and then nvidia-drm it works fine.
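To confirm the order after the fact, something like the sketch below could work. It assumes `kldstat` prints the Id in the first column and the file name (e.g. `amdgpu.ko`) in the last, and that a lower Id means an earlier load since boot; `loaded_before` is a hypothetical helper, not part of any tool.

```shell
# Succeed if module $1 has a lower kldstat Id than module $2.
# Reads `kldstat` output on stdin; fails if either module is missing.
loaded_before() {
    awk -v a="$1.ko" -v b="$2.ko" '
        $NF == a { ida = $1 }
        $NF == b { idb = $1 }
        END { exit !(ida != "" && idb != "" && ida + 0 < idb + 0) }
    '
}

# Usage: kldstat | loaded_before amdgpu nvidia-drm && echo "amdgpu was loaded first"
```

For an Intel machine the first argument would be i915kms instead of amdgpu.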

amshafer avatar Feb 29 '24 21:02 amshafer

nvidia-drm has always been loaded after i915kms takes over the framebuffer from UEFI, as shown with the LinuxKPI I2C lines.

vishwin avatar Mar 01 '24 05:03 vishwin

One thing you can check while I keep looking at this is the contents of /usr/local/share/X11/xorg.conf.d/20-nvidia-drm-outputclass.conf and (if it exists) /usr/local/share/X11/xorg.conf.d/10-intel.conf:

root@:~ # cat /usr/local/share/X11/xorg.conf.d/20-nvidia-drm-outputclass.conf
Section "OutputClass"
    Identifier "nvidia"
    MatchDriver "nvidia-drm"
    Driver "nvidia"
    Option "PrimaryGPU" "yes"
    ModulePath "/usr/local/lib/nvidia/xorg"
    ModulePath "/usr/local/lib/xorg/modules"
EndSection
root@:~ # cat /usr/local/share/X11/xorg.conf.d/10-intel.conf                 
Section "OutputClass"
    Identifier "intel"
    MatchDriver "i915"
    Driver "modesetting"
    Option "PrimaryGPU" "yes"
EndSection

This is a working config for me on my Intel PRIME machine. I'm wondering if your setup switched when the .conf files were overwritten during the latest package update, setting the NVIDIA GPU as the primary. In that case you would see the black screen until you ran xrandr --auto. Note that if you do that right now or use an external monitor you'll still hit the panic I'm looking into.

You should be able to force Intel as the primary by ensuring Option "PrimaryGPU" "yes" is in the intel.conf, which you might have to create, as iirc it isn't installed by any package by default. Hopefully that helps.

amshafer avatar Mar 01 '24 16:03 amshafer

I have all of the above in xorg.conf.d/ except for Option "PrimaryGPU" "yes" under intel and specifying the nvidia module paths. Leaving them out worked in 535.146.02. Don't have access to the machine for another couple days so will update when I get back.

vishwin avatar Mar 01 '24 18:03 vishwin

Setting Option "PrimaryGPU" "yes" under intel allows X to continue bringing the displays/screens up, but this effectively becomes an Intel-only setup, as if the nvidia modules were never loaded. All rendering, GL providers, etc are done by intel via Mesa.

In 535.146.02, I never had to run any xrandr command for the nvidia (headless) to handle rendering whilst intel handled display. On this version, when trying to execute the recommended xrandr commands at any point, with nvidia as PrimaryGPU:

% xrandr --setprovideroutputsource modesetting NVIDIA-0
X Error of failed request:  BadValue (integer parameter out of range for operation)
  Major opcode of failed request:  140 (RANDR)
  Minor opcode of failed request:  35 (RRSetProviderOutputSource)
  Value in failed request:  0x217
  Serial number of failed request:  16
  Current serial number in output stream:  17
% xrandr --listproviders
Providers: number : 2
Provider 0: id: 0x217 cap: 0x0 crtcs: 0 outputs: 0 associated providers: 0 name:NVIDIA-0
Provider 1: id: 0x241 cap: 0xf, Source Output, Sink Output, Source Offload, Sink Offload crtcs: 3 outputs: 8 associated providers: 0 name:modesetting

Note that with intel as PrimaryGPU:

% xrandr --listproviders
Providers: number : 2
Provider 0: id: 0x49 cap: 0xf, Source Output, Sink Output, Source Offload, Sink Offload crtcs: 3 outputs: 8 associated providers: 0 name:modesetting
Provider 1: id: 0x2c7 cap: 0x0 crtcs: 0 outputs: 0 associated providers: 0 name:NVIDIA-G0
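For comparing the two cases at a glance, here is a small sketch that reduces the listings above to index and name. `list_provider_names` is a made-up helper; it only relies on the `Provider N: ... name:X` line shape shown in the output above.

```shell
# Print "index name" for each provider found in
# `xrandr --listproviders` output read from stdin.
list_provider_names() {
    awk '/^Provider [0-9]+:/ {
        idx = $2
        sub(/:$/, "", idx)        # "0:" -> "0"
        name = $0
        sub(/.*name:/, "", name)  # keep only the text after "name:"
        print idx, name
    }'
}

# Usage: xrandr --listproviders | list_provider_names
```

With nvidia as PrimaryGPU this prints `0 NVIDIA-0` and `1 modesetting`; with intel as PrimaryGPU it prints `0 modesetting` and `1 NVIDIA-G0`.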

vishwin avatar Mar 06 '24 01:03 vishwin

Does it work with NVIDIA as the primary GPU if you run xrandr --auto, though? That's the missing bit for me: until I do that, the laptop screen stays black. I don't know why that would suddenly be required again in 550; the logic for deciding this stuff in the X server can be wacky sometimes.

amshafer avatar Mar 06 '24 16:03 amshafer

xrandr --auto didn't do anything, so no.

vishwin avatar Mar 07 '24 16:03 vishwin

Okay, so that's different to what I've seen then. Out of curiosity, in PrimaryGPU intel mode, does running things on the NVIDIA GPU through the PRIME env variables work? i.e. something like:

$ __NV_PRIME_RENDER_OFFLOAD=1 __GLX_VENDOR_LIBRARY_NAME=nvidia glxinfo | grep vendor
server glx vendor string: NVIDIA Corporation
client glx vendor string: NVIDIA Corporation
OpenGL vendor string: NVIDIA Corporation

Sorry for all the requests, since I don't reproduce exactly what you're seeing I'm just trying to figure out what's working.

amshafer avatar Mar 07 '24 19:03 amshafer

glxinfo with those environment variables worked. But of course I don't want to keep passing them.
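Short of restoring the old behaviour, a per-command wrapper is one possible stopgap. This is only a sketch: `prime_run` is a hypothetical name, and it sets the same two variables from the glxinfo test for a single command only.

```shell
# Run one command with the PRIME render offload variables set,
# leaving the rest of the session untouched.
prime_run() {
    __NV_PRIME_RENDER_OFFLOAD=1 __GLX_VENDOR_LIBRARY_NAME=nvidia "$@"
}

# Usage: prime_run glxinfo | grep vendor
```

Defining it in a shell startup file makes offload opt-in per program rather than global.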

vishwin avatar Mar 08 '24 04:03 vishwin

There are issues with the prebuilt nvidia-drm pkg, is that what you are using? Or are you building from ports? If you're not building from ports can you give that a try?

related: https://reviews.freebsd.org/D44308

amshafer avatar Mar 13 '24 20:03 amshafer

all only ever built from ports

vishwin avatar Mar 14 '24 05:03 vishwin

fwiw adding __NV_PRIME_RENDER_OFFLOAD=1 __GLX_VENDOR_LIBRARY_NAME=nvidia to .xprofile as a test to forcibly mimic the old behaviour results in generally unusable rendered results. Even alacritty (GPU-accelerated terminal) results in a black (unrendered) window.

vishwin avatar Mar 14 '24 17:03 vishwin

D44308 allows X startup to continue and eventually return to the old behaviour from 535.146.02. However, rendering is a bit glitchy, occasionally showing the immediate previous frames, especially around the refresh rate such as watching high frame rate video or fast typing.

vishwin avatar Mar 15 '24 20:03 vishwin

Some progress is good. What desktop env/etc is this with? Also what drm-kmod version are you using?

amshafer avatar Mar 15 '24 22:03 amshafer

Latest -CURRENT, so latest drm-61-kmod due to the API change. Desktop is Cinnamon, which I've been needing to update for some time, especially recently as muffin has been sus.

vishwin avatar Mar 16 '24 00:03 vishwin

I still haven't been able to reproduce any of the misrendering issues which is odd. I'll have to give Cinnamon a try.

Can you include the conftest results from 535 and 550 if possible? Just to check that nothing obvious went wrong with the compatibility detection. Something like cat work/NVIDIA.../(nvidia for 535)/src/nvidia-drm/conftest/* should grab the function.h, type.h, etc that get generated during the build

amshafer avatar Mar 19 '24 14:03 amshafer

Finally back on the target machine; latest upstream Cinnamon (not in ports yet) still rendering glitchy with occasional falls off the bus. More pronounced with a multiple-screen setup. Let me see if I can get the conftest

vishwin avatar May 28 '24 03:05 vishwin

Wait, so with 535 everything works fine (including no glitching) but with 550 it falls off the bus? That's very odd; falling off the bus usually indicates some kind of power issue. I'd double check that 535 doesn't also fall off the bus, to confirm whether there's a regression in 550.

Not to prematurely blame Cinnamon, but it would be interesting to see if your glitchy rendering happens on xfce4 as well. If xfce also shows the glitching and it doesn't happen with 535 I'd take that as confirmation that something is wrong with nvidia-drm.

amshafer avatar May 28 '24 12:05 amshafer

535 does not suffer from glitchy rendering but also falls off the bus occasionally. However, the glitchiness isn't really noticeable on a single screen setup, like just the laptop display, but is certainly pronounced with multiple screens like my laptop display + external monitor.

The falling off the bus seems to trigger randomly, mostly on pure GTK programs, particularly simpler dialog box or settings-type stuff, as if it is struggling to render something that shouldn't need much effort to draw. Specifically, I've had it happen while scrolling through a settings dialog, while clicking a button that I can't release because the GPU falls off the bus right there, and also a couple of times while just rendering a PDF/image preview in the file manager. Could it have to do with compositing? I'm dubious about power issues as the GPU itself is headless and not exactly replaceable, and these have all happened whilst plugged in.

vishwin avatar May 28 '24 15:05 vishwin

Won't be able to properly test xfce until after returning from BSDCan and SELF mid-next month because the external monitor will not be available for those.

vishwin avatar May 28 '24 15:05 vishwin

> 535 does not suffer from glitchy rendering

Seems like I need to test with Cinnamon then. I don't think I've ever tried that before, although last time I looked into this issue it was with XFCE and I didn't see glitching there.

> but is certainly pronounced with multiple screens like my laptop display + external monitor.

What is the glitching like? Color corruption or tearing or something else? Normally I'd say something like this is an issue with the compositor but since it doesn't happen on 535 it sounds like something triggered by nvidia-drm.

The falling off the bus still seems unrelated, and like I said really is normally something to do with power. Even if it's plugged in I think it normally still goes through the battery which can go bad, but you might be able to disable the battery completely and then test if your laptop bios allows it.

amshafer avatar May 28 '24 16:05 amshafer

Glitchiness not so much tearing (which I always expect), but rather to the effect of laggy refresh rate and momentary displays of previous frames. Most pronounced when viewing a 60 fps video on a 60 Hz refresh rate display.

I no longer have an internal battery so I disconnected the external battery, we'll see what happens.

vishwin avatar May 28 '24 18:05 vishwin

Just experienced a falling off the bus without the battery.

vishwin avatar May 28 '24 23:05 vishwin

Any ACPI or other power messages in dmesg before it falls off the bus?
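One way to look, as a rough sketch: the default pattern below is an assumption about how the driver words the message in dmesg, and `context_before` is a made-up helper, so adjust both to whatever actually shows up in the log.

```shell
# Print up to N lines (default 20) of context before each line
# matching a pattern, reading the log from stdin.
# The default pattern is an assumed wording, not a confirmed one.
context_before() {
    pattern="${1:-fallen off the bus}"
    lines="${2:-20}"
    grep -i -B "$lines" -- "$pattern"
}

# Usage: dmesg | context_before | grep -iE 'acpi|power|battery'
```

An empty result from the usage line would mean nothing ACPI- or power-related was logged in the window right before the event.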

The laggy frames do sound interesting; that could conceivably be explained by nvidia-drm. Last time I tried reproducing with simple programs, so I'll try with a fullscreen video.

amshafer avatar May 30 '24 15:05 amshafer

> Any ACPI or other power messages in dmesg before it falls off the bus?

never

vishwin avatar May 30 '24 15:05 vishwin

Got 535.154.05 built and running. Here are conftest results between 535 and 555: https://people.freebsd.org/~vishwin/nvidia/

vishwin avatar Aug 01 '24 13:08 vishwin