open-gpu-kernel-modules

Xid 120: GSP load access page fault during driver init (575.51.02)

Open ptr1337 opened this issue 8 months ago • 0 comments

NVIDIA Open GPU Kernel Modules Version

575.51.02

Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.

  • [x] I confirm that this does not happen with the proprietary driver package.

Operating System and Version

CachyOS

Kernel Release

6.14.2

Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.

  • [x] I am running on a stable kernel release.

Hardware: GPU

RTX 5080 mobile Max Q

Describe the bug

Hi,

A user reports that the NVIDIA driver does not load when using the 575 driver. The kernel log shows the following:

Apr 19 13:16:41 Kyrios kernel: nvidia-nvlink: Nvlink Core is being initialized, major device number 240
Apr 19 13:16:41 Kyrios kernel: NVRM: loading NVIDIA UNIX Open Kernel Module for x86_64  575.51.02  Release Build  (root@Kyrios)  
Apr 19 13:16:41 Kyrios kernel: nvidia-modeset: Loading NVIDIA UNIX Open Kernel Mode Setting Driver for x86_64  575.51.02  Release Build  (root@Kyrios)  
Apr 19 13:16:41 Kyrios kernel: [drm] [nvidia-drm] [GPU ID 0x00000200] Loading driver
Apr 19 13:16:41 Kyrios kernel: NVRM: testIfDsmSubFunctionEnabled: GPS ACPI DSM called before _acpiDsmSupportedFuncCacheInit subfunction = 11.
Apr 19 13:16:41 Kyrios kernel: NVRM: kgspHealthCheck_TU102: ****************************** GSP-CrashCat Report *******************************
Apr 19 13:16:41 Kyrios kernel: NVRM: GPU at PCI:0000:02:00: GPU-751acba2-df95-76f6-5914-58abf0c3cba3
Apr 19 13:16:41 Kyrios kernel: NVRM: Xid (PCI:0000:02:00): 120, GSP task exception: load access page fault (cause:0xd) @ pc:0x140ca4a, partition:4#0, task:3
Apr 19 13:16:41 Kyrios kernel: NVRM:     Reported by libos partition:4#5 kernel v3.1 [0] @ ts:2
Apr 19 13:16:41 Kyrios kernel: NVRM:     RISC-V CSR State:
Apr 19 13:16:41 Kyrios kernel: NVRM:         sstatus:0x0000000200000020  sscratch:0xffffffffa30144d0     sie:0x0000000000000220  sip:0x0000000000000000
Apr 19 13:16:41 Kyrios kernel: NVRM:         sepc:0x000000000140ca4a     stval:0x0000000000000000  scause:0x000000000000000d
Apr 19 13:16:41 Kyrios kernel: NVRM:     RISC-V GPR State:
Apr 19 13:16:41 Kyrios kernel: NVRM:         ra:0x000000000140d0f6   sp:0x00000047f240f5b0   gp:0x0000000000000000   tp:0x0000000000000000
Apr 19 13:16:41 Kyrios kernel: NVRM:         a0:0x0000000000000000   a1:0x00000047eb220530   a2:0x0000000000000004   a3:0x00000047f2a41000
Apr 19 13:16:41 Kyrios kernel: NVRM:         a4:0x0000000000000000   a5:0x0000000000000000   a6:0x0000000000001010   a7:0x0000000000000004
Apr 19 13:16:41 Kyrios kernel: NVRM:         s0:0x00000047f240f740   s1:0x00000047eb4442d0   s2:0x0000000000000002   s3:0x00000000017d7c26
Apr 19 13:16:41 Kyrios kernel: NVRM:         s4:0x00000000040d36b0   s5:0x00000000001a8000   s6:0x00000047eb3805f0   s7:0x0000000000001500
Apr 19 13:16:41 Kyrios kernel: NVRM:         s8:0x00000000040d3bc8   s9:0x0000000000000000  s10:0x0000000000000000  s11:0x00000047eb37e5f0
Apr 19 13:16:41 Kyrios kernel: NVRM:         t0:0x0000000000000020   t1:0x0000000000000001   t2:0x0000000000000000   t3:0x0000000000000020
Apr 19 13:16:41 Kyrios kernel: NVRM:         t4:0x0000000000000000   t5:0x00000047f240f3c1   t6:0x0000000000000020
Apr 19 13:16:41 Kyrios kernel: NVRM:     Stack Trace:
Apr 19 13:16:41 Kyrios kernel: NVRM:         0x000000000140ca4a
Apr 19 13:16:41 Kyrios kernel: NVRM:         0x00000000017d7c26
Apr 19 13:16:41 Kyrios kernel: NVRM:         0x00000000017de386
Apr 19 13:16:41 Kyrios kernel: NVRM:         0x00000000017dfca8
Apr 19 13:16:41 Kyrios kernel: NVRM:         0x00000000017d66b2
Apr 19 13:16:41 Kyrios kernel: NVRM:         0x00000000014164f2
Apr 19 13:16:41 Kyrios kernel: NVRM:         0x0000000001a259ee
Apr 19 13:16:41 Kyrios kernel: NVRM:         0x0000000001a483f8
Apr 19 13:16:41 Kyrios kernel: NVRM:         0x0000000001b8486c
Apr 19 13:16:41 Kyrios kernel: NVRM:         0x0000000001a2a74e
Apr 19 13:16:41 Kyrios kernel: NVRM:     Local I/O Register State:
Apr 19 13:16:41 Kyrios kernel: NVRM:         0x01450800:0x00000000   0x01450900:0xbadf202b   0x01450a00:0x00000000   0x01450c00:0x00000000
Apr 19 13:16:41 Kyrios kernel: NVRM:         0x01454a00:0x810400d0   0x01454b00:0x010800d0   0x01454c00:0x00080000   0x01400200:0x00000040
Apr 19 13:16:41 Kyrios kernel: NVRM:     ------------[ end crash report ]------------
Apr 19 13:16:41 Kyrios kernel: NVRM: GPU0 GSP RPC buffer contains function 4128 (GSP_POST_NOCAT_RECORD) and data 0x0000000000000005 0x00000000017d7c26.
Apr 19 13:16:41 Kyrios kernel: NVRM: GPU0 RPC history (CPU -> GSP):
Apr 19 13:16:41 Kyrios kernel: NVRM:     entry function                   data0              data1              ts_start           ts_end             duration actively_polling
Apr 19 13:16:41 Kyrios kernel: NVRM:      0    73   SET_REGISTRY          0x0000000000000000 0x0000000000000000 0x000633274fe976ff 0x0000000000000000          y
Apr 19 13:16:41 Kyrios kernel: NVRM:     -1    72   GSP_SET_SYSTEM_INFO   0x0000000000000000 0x0000000000000000 0x000633274fe976fc 0x0000000000000000           
Apr 19 13:16:41 Kyrios kernel: NVRM: GPU0 RPC event history (CPU <- GSP):
Apr 19 13:16:41 Kyrios kernel: NVRM:     entry function                   data0              data1              ts_start           ts_end             duration during_incomplete_rpc
Apr 19 13:16:41 Kyrios kernel: NVRM:      0    4128 GSP_POST_NOCAT_RECORD 0x0000000000000005 0x00000000017d7c26 0x000633274ff1351b 0x000633274ff1351d      2us y
Apr 19 13:16:41 Kyrios kernel: NVRM: kgspRcAndNotifyAllChannels_IMPL: RC all user channels for critical error 120.
Apr 19 13:16:41 Kyrios kernel: NVRM: kgspHealthCheck_TU102: **********************************************************************************
Apr 19 13:16:41 Kyrios kernel: NVRM: nvCheckOkFailedNoLog: Check failed: Reset required [NV_ERR_RESET_REQUIRED] (0x00000062) returned from rpcRecvPoll(pGpu, pRpc, NV_VGPU_MSG_EVENT_GSP_INIT_DONE) @ kernel_gsp.c:4878
Apr 19 13:16:41 Kyrios kernel: NVRM: nvAssertOkFailedNoLog: Assertion failed: Reset required [NV_ERR_RESET_REQUIRED] (0x00000062) returned from kgspWaitForRmInitDone(pGpu, pKernelGsp) @ kernel_gsp_gh100.c:952
Apr 19 13:16:41 Kyrios kernel: NVRM: _kgspBootGspRm: unexpected WPR2 already up, cannot proceed with booting GSP
Apr 19 13:16:41 Kyrios kernel: NVRM: _kgspBootGspRm: (the GPU is likely in a bad state and may need to be reset)
Apr 19 13:16:41 Kyrios kernel: NVRM: RmInitAdapter: Cannot initialize GSP firmware RM
Apr 19 13:16:41 Kyrios kernel: NVRM: iovaspaceDestruct_IMPL: 1 left-over mappings in IOVAS 0x200
Apr 19 13:16:41 Kyrios kernel: NVRM: GPU 0000:02:00.0: RmInitAdapter failed! (0x62:0x40:1941)
Apr 19 13:16:41 Kyrios kernel: NVRM: GPU 0000:02:00.0: rm_init_adapter failed, device minor number 0
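
For triage, the key fields of the Xid line above can be pulled out mechanically. The following is a minimal sketch, not part of the driver or any NVIDIA tool: the regular expressions are inferred from this single log sample and may not match other driver versions, and `parse_xid` is a hypothetical helper name. The cause names are the standard RISC-V `scause` exception codes, under which 0xd is a load page fault.

```python
import re

# Layout of the "Xid" line as seen in the log above (assumption: inferred
# from this one sample; other driver versions may format it differently).
XID_RE = re.compile(
    r"NVRM: Xid \(PCI:(?P<pci>[0-9a-fA-F:.]+)\): (?P<xid>\d+), (?P<msg>.*)"
)
CAUSE_RE = re.compile(r"\(cause:(0x[0-9a-fA-F]+)\) @ pc:(0x[0-9a-fA-F]+)")

# Subset of the standard RISC-V scause exception codes.
RISCV_CAUSES = {
    0x1: "instruction access fault",
    0x5: "load access fault",
    0x7: "store/AMO access fault",
    0xC: "instruction page fault",
    0xD: "load page fault",
    0xF: "store/AMO page fault",
}


def parse_xid(line):
    """Return a dict of Xid fields for one log line, or None if no Xid."""
    m = XID_RE.search(line)
    if m is None:
        return None
    info = {"pci": m["pci"], "xid": int(m["xid"]), "message": m["msg"]}
    c = CAUSE_RE.search(m["msg"])
    if c:
        cause = int(c.group(1), 16)
        info["cause"] = cause
        info["cause_name"] = RISCV_CAUSES.get(cause, "unknown")
        info["pc"] = int(c.group(2), 16)
    return info
```

Fed the Xid line from this report, the helper yields Xid 120 on PCI 0000:02:00 with cause 0xd at pc 0x140ca4a, which matches `sepc` and the first stack trace entry. Since `stval` is 0 in the CSR dump, the faulting load may have targeted address 0, though that is only a guess from the register state.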

I also see that the log reports a version mismatch across several NVIDIA libraries:

Apr 19 12:34:42 Kyrios kernel: NVRM: API mismatch: the client 'nvidia-powerd' (pid 849)
                               NVRM: has the version 570.133.07, but this kernel module has
                               NVRM: the version 575.51.02.  Please make sure that this
                               NVRM: kernel module and all NVIDIA driver components
                               NVRM: have the same version.

I have verified with the user that all packages are correctly installed and have also checked the checksums, which seem to be fine. I'm not sure why nvidia-powerd is reporting the 570.133.07 driver; maybe it is due to the GSP crash above?
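
To see which clients are affected, the wrapped "API mismatch" message can be flattened and parsed. This is a rough sketch that assumes the message wording shown above; `find_api_mismatches` is a hypothetical helper name, not an existing tool.

```python
import re

# Wording of the mismatch message as it appears in the log above
# (assumption: other driver versions may phrase it differently).
MISMATCH_RE = re.compile(
    r"API mismatch: the client '(?P<client>[^']+)' \(pid (?P<pid>\d+)\) "
    r"has the version (?P<client_ver>[\d.]+), but this kernel module has "
    r"the version (?P<module_ver>[\d.]+)\."
)


def find_api_mismatches(log_text):
    """Return one dict per API mismatch message found in the log text."""
    # Drop the repeated "NVRM:" prefixes on continuation lines, then
    # collapse whitespace so each wrapped message becomes one sentence.
    flat = re.sub(r"\s+", " ", log_text.replace("NVRM:", " "))
    return [m.groupdict() for m in MISMATCH_RE.finditer(flat)]
```

Running it over the full kernel log would list every client that still reports 570.133.07. In the excerpts above, the mismatch is timestamped 12:34:42 and the GSP crash 13:16:41, so the two messages may come from different boots.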

To Reproduce

  1. Install Arch Linux
  2. Install the nvidia-beta driver from https://archive.cachyos.org/nvidia/575/
  3. Boot into the system and run "nvidia-smi"
  4. Check the kernel logs for the errors shown above
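
Steps 3 and 4 can be scripted roughly as follows. This is a sketch under assumptions: it expects a systemd system where `journalctl -k -b` prints the current boot's kernel log, the marker strings are copied from the failure log in this report, and `failure_lines` and `run_check` are hypothetical helper names.

```python
import shutil
import subprocess

# Substrings taken from the failure log in this report.
FAILURE_MARKERS = ("Xid", "RmInitAdapter failed", "rm_init_adapter failed")


def failure_lines(kernel_log):
    """Return the NVRM lines that contain one of the failure markers."""
    return [
        line
        for line in kernel_log.splitlines()
        if "NVRM:" in line and any(mark in line for mark in FAILURE_MARKERS)
    ]


def run_check():
    """Run nvidia-smi (step 3), then scan the kernel log (step 4)."""
    if shutil.which("nvidia-smi"):
        smi = subprocess.run(["nvidia-smi"], capture_output=True, text=True)
        print("nvidia-smi exit code:", smi.returncode)
    else:
        print("nvidia-smi not found in PATH")

    if shutil.which("journalctl") is None:
        print("journalctl not found in PATH")
        return []
    log = subprocess.run(
        ["journalctl", "-k", "-b", "--no-pager"],
        capture_output=True,
        text=True,
    )
    return failure_lines(log.stdout)
```

Calling `run_check()` after boot and printing the returned lines should surface the Xid 120 and `RmInitAdapter failed` entries if the issue reproduces.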

System Specs: https://www.lenovo.com/us/en/p/laptops/legion-laptops/legion-pro-series/legion-pro-7i-gen-10-16-inch-intel/len101g0039 (Core Ultra 9 275HX, RTX 5080 mobile)

Bug Incidence

Always

nvidia-bug-report.log.gz

nvidia-bug-report-gsp-crash-575.log.gz

More Info

No response

ptr1337, Apr 19 '25 20:04