open-gpu-kernel-modules

DRM causes core dump & NVRM dmesg ERROR

Open paorie opened this issue 2 years ago • 3 comments

NVIDIA Open GPU Kernel Modules Version

nvidia-open-dkms 530.41.03-3

Does this happen with the proprietary driver (of the same version) as well?

I cannot test this

Operating System and Version

Arch Linux

Kernel Release

6.2.10-zen

Hardware: GPU

NVIDIA GeForce RTX 3060 Laptop GPU

Describe the bug

Machine: Alienware M15R7

Launching a Wayland session causes a core dump. I've tried both Plasma and Hyprland with nvidia-drm as the backend. There is also a strange error message in dmesg, NVRM objClInitPcieChipset: *** Chipset Setup Function Error!, and one in journalctl, nvidia: module verification failed: signature and/or required key missing - tainting kernel. Judging by the boot log, the module seems to be loaded, but starting a session causes a core dump. If I disable the DRM backend, the Hyprland Wayland session starts with the nvidia driver. No luck with kwin, though.

dmesg:

[    0.000000] BIOS-e820: [mem 0x0000000060d11000-0x0000000061571fff] ACPI NVS
[    0.000000] reserve setup_data: [mem 0x0000000060d11000-0x0000000061571fff] ACPI NVS
[    0.136418] ACPI: PM: Registering ACPI NVS region [mem 0x60d11000-0x61571fff] (8785920 bytes)
[    0.255264] ACPI: \_SB_.PC00.CNVW.WRST: New power resource
[    2.757687] NVRM: loading NVIDIA UNIX Open Kernel Module for x86_64  530.41.03  Release Build  (archlinux-builder@archalien)  
[    2.783117] nvidia-modeset: Loading NVIDIA UNIX Open Kernel Mode Setting Driver for x86_64  530.41.03  Release Build  (archlinux-builder@archalien)  
[    2.870841] NVRM objClInitPcieChipset: *** Chipset Setup Function Error!
[    6.933352] input: HDA NVidia HDMI/DP,pcm=3 as /devices/pci0000:00/0000:00:01.0/0000:01:00.1/sound/card0/input13
[    6.933604] input: HDA NVidia HDMI/DP,pcm=7 as /devices/pci0000:00/0000:00:01.0/0000:01:00.1/sound/card0/input14
[    6.968959] input: HDA NVidia HDMI/DP,pcm=8 as /devices/pci0000:00/0000:00:01.0/0000:01:00.1/sound/card0/input15
[    6.968996] input: HDA NVidia HDMI/DP,pcm=9 as /devices/pci0000:00/0000:00:01.0/0000:01:00.1/sound/card0/input16
[    7.111300] iwlwifi 0000:00:14.3: CNVI_SCU_SEQ_DATA_DW9: 0x10
[   67.112494] Asynchronous wait on fence NVIDIA:nvidia.prime:0 timed out (hint:submit_notify [i915])

journalctl --grep "nvidia":

Apr 11 15:28:37 archalien kernel: Command line: initrd=\intel-ucode.img initrd=\initramfs-linux-zen.img root=PARTUUID=f2b1acd4-dfa1-4a46-9f19-31b6c2489e5d zswap.e>
Apr 11 15:28:37 archalien kernel: Kernel command line: initrd=\intel-ucode.img initrd=\initramfs-linux-zen.img root=PARTUUID=f2b1acd4-dfa1-4a46-9f19-31b6c2489e5d >
Apr 11 15:28:37 archalien kernel: nvidia: loading out-of-tree module taints kernel.
Apr 11 15:28:37 archalien kernel: nvidia: module verification failed: signature and/or required key missing - tainting kernel
Apr 11 15:28:37 archalien kernel: nvidia-nvlink: Nvlink Core is being initialized, major device number 235
Apr 11 15:28:37 archalien kernel: nvidia 0000:01:00.0: enabling device (0006 -> 0007)
Apr 11 15:28:37 archalien kernel: nvidia 0000:01:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=none
Apr 11 15:28:37 archalien kernel: NVRM: loading NVIDIA UNIX Open Kernel Module for x86_64  530.41.03  Release Build  (archlinux-builder@archalien)  
Apr 11 15:28:37 archalien kernel: nvidia-modeset: Loading NVIDIA UNIX Open Kernel Mode Setting Driver for x86_64  530.41.03  Release Build  (archlinux-builder@arc>
Apr 11 15:28:37 archalien kernel: nvidia-uvm: Loaded the UVM driver, major device number 511.
Apr 11 15:28:37 archalien kernel: [drm] [nvidia-drm] [GPU ID 0x00000100] Loading driver
Apr 11 15:28:37 archalien kernel: [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:01:00.0 on minor 1
Apr 11 15:28:37 archalien systemd[1]: Starting Load/Save Screen Backlight Brightness of backlight:nvidia_wmi_ec_backlight...
Apr 11 15:28:37 archalien systemd[1]: Finished Load/Save Screen Backlight Brightness of backlight:nvidia_wmi_ec_backlight.
Apr 11 15:28:37 archalien kernel: input: HDA NVidia HDMI/DP,pcm=3 as /devices/pci0000:00/0000:00:01.0/0000:01:00.1/sound/card0/input13
Apr 11 15:28:37 archalien kernel: input: HDA NVidia HDMI/DP,pcm=7 as /devices/pci0000:00/0000:00:01.0/0000:01:00.1/sound/card0/input14
Apr 11 15:28:37 archalien kernel: input: HDA NVidia HDMI/DP,pcm=8 as /devices/pci0000:00/0000:00:01.0/0000:01:00.1/sound/card0/input15
Apr 11 15:28:37 archalien kernel: input: HDA NVidia HDMI/DP,pcm=9 as /devices/pci0000:00/0000:00:01.0/0000:01:00.1/sound/card0/input16
Apr 11 15:28:38 archalien systemd[1]: Starting NVIDIA Persistence Daemon...
Apr 11 15:28:38 archalien systemd[1]: Starting nvidia-powerd service...
Apr 11 15:28:38 archalien /usr/bin/nvidia-powerd[894]: nvidia-powerd version:1.0(build 1)
Apr 11 15:28:38 archalien systemd[1]: Started NVIDIA Persistence Daemon.
Apr 11 15:28:38 archalien systemd[1]: nvidia-powerd.service: Main process exited, code=exited, status=1/FAILURE
Apr 11 15:28:38 archalien systemd[1]: nvidia-powerd.service: Failed with result 'exit-code'.
Apr 11 15:28:38 archalien systemd[1]: Failed to start nvidia-powerd service.
Apr 11 15:28:47 archalien systemd-coredump[2185]: [🡕] Process 2119 (Hyprland) of user 1000 dumped core.
                                                  
                                                  Stack trace of thread 2119:
                                                  #0  0x000055ddb5ac99e8 _ZN13CPluginSystem13getAllPluginsEv (Hyprland + 0x1939e8)
                                                  #1  0x000055ddb5a31aee _ZN13CrashReporter18createAndSaveCrashEi (Hyprland + 0xfbaee)

journalctl --grep "kwin":

Apr 11 12:45:51 archalien systemd[2214]: plasma-kwin_wayland.service: Consumed 1.174s CPU time.
Apr 11 12:46:17 archalien kwin_x11[4178]: kwin_xkbcommon: XKB: inet:323:58: unrecognized keysym "XF86EmojiPicker"
Apr 11 12:46:17 archalien kwin_x11[4178]: kwin_xkbcommon: XKB: inet:324:58: unrecognized keysym "XF86Dictate"
Apr 11 12:46:17 archalien kwin_x11[4178]: kwin_core: Parse error in tiles configuration for monitor "ada15eeb-9ed6-5738-a180-bd9fe2361632" : "illegal value" Cre>
Apr 11 12:46:17 archalien kwin_x11[4178]: kwin_core: Parse error in tiles configuration for monitor "0d3998b5-12fb-5e5d-9844-298a9a2f96a3" : "illegal value" Cre>
Apr 11 12:46:18 archalien kwin_x11[4178]: kwin_platform_x11_standalone: QOpenGLContext::globalShareContext() is required
Apr 11 12:46:18 archalien kwin_x11[4178]: kwin_scene_opengl: Creating the OpenGL rendering failed:  "Could not initialize rendering context"
Apr 11 12:51:06 archalien systemd[3951]: plasma-kwin_x11.service: Consumed 3.876s CPU time.
Apr 11 12:51:15 archalien kernel: kwin_wayland[9195]: segfault at 0 ip 00007fd0eb81556b sp 00007ffdb36d6ec0 error 4 in libnvidia-allocator.so.530.41.03[7fd0eb80>
Apr 11 12:51:16 archalien systemd-coredump[9350]: [🡕] Process 9195 (kwin_wayland) of user 1000 dumped core.
                                                  
                                                  Stack trace of thread 9195:
                                                  #0  0x00007fd0eb81556b n/a (nvidia-drm_gbm.so + 0x1556b)
                                                  #1  0x00007fd0eb815838 n/a (nvidia-drm_gbm.so + 0x15838)
                                                  #2  0x00007fd0f805ce59 n/a (libgbm.so.1 + 0x4e59)
                                                  #3  0x00007fd0f805eab1 gbm_create_device (libgbm.so.1 + 0x6ab1)
                                                  #4  0x00007fd0fb161e74 _ZN4KWin10DrmBackend6addGpuERK7QString (libkwin.so.5 + 0x361e74)
                                                  #5  0x00007fd0fb15ef1b _ZN4KWin10DrmBackend10initializeEv (libkwin.so.5 + 0x35ef1b)
                                                  #6  0x0000561b851f1315 n/a (kwin_wayland + 0x5a315)
                                                  #7  0x0000561b851e723c n/a (kwin_wayland + 0x5023c)
                                                  #8  0x00007fd0f863c790 n/a (libc.so.6 + 0x23790)
                                                  #9  0x00007fd0f863c84a __libc_start_main (libc.so.6 + 0x2384a)
                                                  #10 0x0000561b851e8e95 n/a (kwin_wayland + 0x51e95)
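
The mangled C++ frames in these traces are easier to read once demangled. A minimal sketch, assuming `c++filt` from binutils is installed (the `demangle` helper name is hypothetical, and it passes text through unchanged when `c++filt` is missing):

```shell
# Hypothetical helper: demangle C++ symbols read from stdin.
# Falls back to plain pass-through if c++filt is not installed.
demangle() {
    if command -v c++filt >/dev/null 2>&1; then
        c++filt
    else
        cat
    fi
}

# Two frames taken from the kwin_wayland trace above.
printf '%s\n' \
    '_ZN4KWin10DrmBackend6addGpuERK7QString' \
    '_ZN4KWin10DrmBackend10initializeEv' | demangle
```

With `c++filt` available, these two frames read as `KWin::DrmBackend::addGpu(QString const&)` and `KWin::DrmBackend::initialize()`. The same helper can sit at the end of a `journalctl` pipeline.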

journalctl --grep "hyprland"

❯ journalctl --grep "hyprland"
Apr 08 10:14:22 archalien sddm-helper[2192]: Starting Wayland user session: "/usr/share/sddm/scripts/wayland-session" "Hyprland"
Apr 08 10:14:23 archalien kernel: Hyprland[2208]: segfault at 10 ip 0000557959c00b28 sp 00007fff9c518740 error 4 in Hyprland[557959ad1000+15b000] likely on CPU 8 >
Apr 08 10:14:23 archalien systemd-coredump[2249]: [🡕] Process 2208 (Hyprland) of user 1000 dumped core.
                                                  
                                                  Stack trace of thread 2208:
                                                  #0  0x0000557959c00b28 _ZN13CPluginSystem13getAllPluginsEv (Hyprland + 0x18cb28)
                                                  #1  0x0000557959b6b13e _ZN13CrashReporter18createAndSaveCrashEi (Hyprland + 0xf713e)
                                                  #2  0x0000557959b07f3c _Z25handleUnrecoverableSignali (Hyprland + 0x93f3c)
                                                  #3  0x00007f9c6fb69f50 n/a (libc.so.6 + 0x38f50)
                                                  #4  0x00007f9c6fbb88ec n/a (libc.so.6 + 0x878ec)
                                                  #5  0x00007f9c6fb69ea8 raise (libc.so.6 + 0x38ea8)
                                                  #6  0x00007f9c6fb5353d abort (libc.so.6 + 0x2253d)
                                                  #7  0x00007f9c6fe9a833 _ZN9__gnu_cxx27__verbose_terminate_handlerEv (libstdc++.so.6 + 0x9a833)
                                                  #8  0x00007f9c6fea6d0c _ZN10__cxxabiv111__terminateEPFvvE (libstdc++.so.6 + 0xa6d0c)
                                                  #9  0x00007f9c6fea6d79 _ZSt9terminatev (libstdc++.so.6 + 0xa6d79)
                                                  #10 0x00007f9c6fea6fdd __cxa_throw (libstdc++.so.6 + 0xa6fdd)
                                                  #11 0x0000557959ad5a74 _ZN11CCompositor10initServerEv.cold (Hyprland + 0x61a74)
                                                  #12 0x0000557959afaa2b main (Hyprland + 0x86a2b)
                                                  #13 0x00007f9c6fb54790 n/a (libc.so.6 + 0x23790)
                                                  #14 0x00007f9c6fb5484a __libc_start_main (libc.so.6 + 0x2384a)
                                                  #15 0x0000557959b07e05 _start (Hyprland + 0x93e05)
                                                  ELF object binary architecture: AMD x86-64
Apr 08 10:14:24 archalien sddm-greeter[2278]: Reading from "/usr/local/share/wayland-sessions/hyprland.desktop"
Apr 08 10:14:24 archalien sddm-greeter[2278]: Reading from "/usr/share/wayland-sessions/hyprland.desktop"
Apr 08 10:14:50 archalien kernel: Hyprland[2970]: segfault at 10 ip 0000562e52fe1b28 sp 00007ffe87d89100 error 4 in Hyprland[562e52eb2000+15b000] likely on CPU 6 >
Apr 08 10:14:50 archalien systemd-coredump[2987]: [🡕] Process 2970 (Hyprland) of user 1000 dumped core.
                                                  
                                                  Stack trace of thread 2970:
                                                  #0  0x0000562e52fe1b28 _ZN13CPluginSystem13getAllPluginsEv (Hyprland + 0x18cb28)
                                                  #1  0x0000562e52f4c13e _ZN13CrashReporter18createAndSaveCrashEi (Hyprland + 0xf713e)

nvidia-smi -q:

==============NVSMI LOG==============

Timestamp                                 : Tue Apr 11 16:10:42 2023
Driver Version                            : 530.41.03
CUDA Version                              : 12.1

Attached GPUs                             : 1
GPU 00000000:01:00.0
    Product Name                          : NVIDIA GeForce RTX 3060 Laptop GPU
    Product Brand                         : GeForce
    Product Architecture                  : Ampere
    Display Mode                          : Disabled
    Display Active                        : Disabled
    Persistence Mode                      : Enabled
    MIG Mode
        Current                           : N/A
        Pending                           : N/A
    Accounting Mode                       : Disabled
    Accounting Mode Buffer Size           : 4000
    Driver Model
        Current                           : N/A
        Pending                           : N/A
    Serial Number                         : N/A
    GPU UUID                              : GPU-3c621dcd-20d4-109f-7874-ee23c382942e
    Minor Number                          : 0
    VBIOS Version                         : 94.06.29.00.35
    MultiGPU Board                        : No
    Board ID                              : 0x100
    Board Part Number                     : N/A
    GPU Part Number                       : 2560-775-A1
    FRU Part Number                       : N/A
    Module ID                             : 1
    Inforom Version
        Image Version                     : G001.0000.03.03
        OEM Object                        : 2.0
        ECC Object                        : N/A
        Power Management Object           : N/A
    GPU Operation Mode
        Current                           : N/A
        Pending                           : N/A
    GSP Firmware Version                  : 530.41.03
    GPU Virtualization Mode
        Virtualization Mode               : None
        Host VGPU Mode                    : N/A
    GPU Reset Status
        Reset Required                    : No
        Drain and Reset Recommended       : N/A
    IBMNPU
        Relaxed Ordering Mode             : N/A
    PCI
        Bus                               : 0x01
        Device                            : 0x00
        Domain                            : 0x0000
        Device Id                         : 0x256010DE
        Bus Id                            : 00000000:01:00.0
        Sub System Id                     : 0x0B541028
        GPU Link Info
            PCIe Generation
                Max                       : 4
                Current                   : 1
                Device Current            : 1
                Device Max                : 4
                Host Max                  : 4
            Link Width
                Max                       : 16x
                Current                   : 8x
        Bridge Chip
            Type                          : N/A
            Firmware                      : N/A
        Replays Since Reset               : 0
        Replay Number Rollovers           : 0
        Tx Throughput                     : 1000 KB/s
        Rx Throughput                     : 0 KB/s
        Atomic Caps Inbound               : N/A
        Atomic Caps Outbound              : N/A
    Fan Speed                             : N/A
    Performance State                     : P8
    Clocks Throttle Reasons
        Idle                              : Active
        Applications Clocks Setting       : Not Active
        SW Power Cap                      : Not Active
        HW Slowdown                       : Not Active
            HW Thermal Slowdown           : Not Active
            HW Power Brake Slowdown       : Not Active
        Sync Boost                        : Not Active
        SW Thermal Slowdown               : Not Active
        Display Clock Setting             : Not Active
    FB Memory Usage
        Total                             : 6144 MiB
        Reserved                          : 366 MiB
        Used                              : 195 MiB
        Free                              : 5582 MiB
    BAR1 Memory Usage
        Total                             : 8192 MiB
        Used                              : 8 MiB
        Free                              : 8184 MiB
    Compute Mode                          : Default
    Utilization
        Gpu                               : 0 %
        Memory                            : 4 %
        Encoder                           : 0 %
        Decoder                           : 0 %
    Encoder Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    FBC Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    ECC Mode
        Current                           : N/A
        Pending                           : N/A
    ECC Errors
        Volatile
            SRAM Correctable              : N/A
            SRAM Uncorrectable            : N/A
            DRAM Correctable              : N/A
            DRAM Uncorrectable            : N/A
        Aggregate
            SRAM Correctable              : N/A
            SRAM Uncorrectable            : N/A
            DRAM Correctable              : N/A
            DRAM Uncorrectable            : N/A
    Retired Pages
        Single Bit ECC                    : N/A
        Double Bit ECC                    : N/A
        Pending Page Blacklist            : N/A
    Remapped Rows                         : N/A
    Temperature
        GPU Current Temp                  : 52 C
        GPU Shutdown Temp                 : 105 C
        GPU Slowdown Temp                 : 102 C
        GPU Max Operating Temp            : 87 C
        GPU Target Temperature            : N/A
        Memory Current Temp               : N/A
        Memory Max Operating Temp         : N/A
    Power Readings
        Power Management                  : N/A
        Power Draw                        : 17.97 W
        Power Limit                       : N/A
        Default Power Limit               : N/A
        Enforced Power Limit              : N/A
        Min Power Limit                   : N/A
        Max Power Limit                   : N/A
    Clocks
        Graphics                          : 210 MHz
        SM                                : 210 MHz
        Memory                            : 405 MHz
        Video                             : 555 MHz
    Applications Clocks
        Graphics                          : N/A
        Memory                            : N/A
    Default Applications Clocks
        Graphics                          : N/A
        Memory                            : N/A
    Deferred Clocks
        Memory                            : N/A
    Max Clocks
        Graphics                          : 2100 MHz
        SM                                : 2100 MHz
        Memory                            : 7001 MHz
        Video                             : 1950 MHz
    Max Customer Boost Clocks
        Graphics                          : N/A
    Clock Policy
        Auto Boost                        : N/A
        Auto Boost Default                : N/A
    Voltage
        Graphics                          : 643.750 mV
    Fabric
        State                             : N/A
        Status                            : N/A

To Reproduce

Enable DRM and try to start a Wayland session. The errors in dmesg appear on every boot.
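
To confirm whether the DRM backend is actually enabled before starting the session, something like the following may help. It is only a sketch: it assumes the `nvidia_drm` module exposes its `modeset` parameter under `/sys/module/nvidia_drm/parameters/modeset` (commonly set through `nvidia-drm.modeset=1` on the kernel command line or `options nvidia-drm modeset=1` in a modprobe.d file), and the helper name is hypothetical.

```shell
# Hypothetical helper: report the nvidia-drm modeset state.
# The sysfs path is an assumption; pass a different file as $1 to override it.
nvidia_drm_modeset_state() {
    param_file="${1:-/sys/module/nvidia_drm/parameters/modeset}"
    if [ ! -r "$param_file" ]; then
        # Module not loaded, or the parameter is not exposed here.
        echo "not-loaded"
        return 1
    fi
    # Boolean module parameters read back as Y or N.
    case "$(cat "$param_file")" in
        Y) echo "enabled" ;;
        N) echo "disabled" ;;
        *) echo "unknown" ;;
    esac
}

nvidia_drm_modeset_state || true
```

Running it once with the backend enabled and once with it disabled shows which of the two states a given crash belongs to.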

Bug Incidence

Always

nvidia-bug-report.log.gz

nvidia-bug-report.log.gz

More Info

No response

paorie, Apr 11 '23 14:04

Use the 525.105.17 version!

gilvbp, May 09 '23 16:05

> Use the 525.105.17 version!

The same issue occurs in that version. Apparently it happens when you log in with a monitor refresh rate higher than 60Hz. After you log in, changing from 60Hz to a higher frequency is fine. The problem only occurs when you try to log in with a frequency higher than 60Hz.

I'm using a GTX 1070. I'd downgrade to version 525.89.02 as that version didn't cause me any issues, but I'm having trouble compiling it with the current kernel version 6.3.1.

Kaoticz, May 13 '23 19:05

@Kaoticz try the new driver, 525.116.04. I'm using it, and so far so good.

gilvbp, May 19 '23 15:05