Timeout waiting for RPC from GSP!
NVIDIA Open GPU Kernel Modules Version
525.85.05
Does this happen with the proprietary driver (of the same version) as well?
I cannot test this
Operating System and Version
Arch Linux
Kernel Release
Linux [HOSTNAME] 6.1.8-arch1-1 #1 SMP PREEMPT_DYNAMIC Tue, 24 Jan 2023 21:07:04 +0000 x86_64 GNU/Linux
Hardware: GPU
GPU 0: NVIDIA GeForce RTX 3050 Laptop GPU (UUID: GPU-071149ae-386e-0017-3b5b-7ea80801f725)
Describe the bug
When I run an OpenGL application such as Yamagi Quake II, at some point the whole system freezes and runs at roughly 1 FPS. I generally have to REISUB when this happens.
To Reproduce
- Open Yamagi Quake II
- Change workspace, open pavucontrol to select a new audio sink for the game, switch back
Bug Incidence
Always
nvidia-bug-report.log.gz
More Info
Related: #272
This looks a lot like nvbug 3806304.
I hit this on 525.85.12 with an A30.
This issue seems to exist on 525.60.13 with the A40 and A100 as well. Please fix ASAP!
bug-nv-smi.txt
dmesg.txt
Hi @aritger, is there any solution for this?
I think "Timeout waiting for RPC from GSP!" is a pretty generic symptom, with many possible causes. The reproduction steps that lead up to it will matter to help distinguish different bugs, as will the other NVRM dmesg spew around it.
I don't have specific reason to expect it is already fixed, but it may be worth testing the most recent 525.89.02 driver. 530.xx drivers will hopefully be released soon, and they will have a lot of changes relative to 525.xx, so that will also be worth testing.
Beyond that, if you see similar "Timeout waiting for RPC from GSP!" messages, it is worth attaching a complete nvidia-bug-report.log.gz, and describing the steps that led to it, so that we can compare instances of the symptom.
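For anyone else hitting this, a minimal sketch of how to collect that information, assuming the standard nvidia-bug-report.sh tool shipped with the driver is installed:
# Generate nvidia-bug-report.log.gz in the current directory (run as root).
sudo nvidia-bug-report.sh
# Pull out the GSP timeout lines plus surrounding NVRM context from the kernel
# log, to attach alongside the reproduction steps.
sudo dmesg --ctime | grep -B 5 -A 20 'Timeout waiting for RPC from GSP'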
Thanks @aritger, nvidia-bug-report.log.gz.
The problem occurs after the node has been running in the Kubernetes environment for a period of time, and nvidia-smi gets stuck for a while. The symptoms are similar to @jelmd's in https://github.com/NVIDIA/open-gpu-kernel-modules/issues/446#issuecomment-1445190598. FYI, I got some help from the nvidia-docker community (https://github.com/NVIDIA/nvidia-docker/issues/1648#issuecomment-1441139460), but I am not sure whether the root cause is the driver or NVIDIA-docker.
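For monitoring, a small sketch that treats a blocked nvidia-smi as a symptom of the wedged GSP; the 30-second budget is an arbitrary assumption:
# Fail fast instead of hanging when the GSP stops responding.
timeout 30 nvidia-smi --query-gpu=index,name,utilization.gpu --format=csv \
    || echo 'nvidia-smi did not return within 30s; check dmesg for Xid 119'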
FWIW: We do not use nvidia-docker or similar bloat, just plain LXC, and pass the devices through to the related zones (a.k.a. containers) as needed. So IMHO the nvidia-container-toolkit is not really related to the problem.
Happened again on another machine:
...
[ +0.000094] ? _nv011159rm+0x62/0x2e0 [nvidia]
[ +0.000090] ? _nv039897rm+0xdb/0x140 [nvidia]
[ +0.000073] ? _nv041022rm+0x2ce/0x3a0 [nvidia]
[ +0.000103] ? _nv015438rm+0x788/0x800 [nvidia]
[ +0.000064] ? _nv039416rm+0xac/0xe0 [nvidia]
[ +0.000092] ? _nv041024rm+0xac/0x140 [nvidia]
[ +0.000095] ? _nv041023rm+0x37a/0x4d0 [nvidia]
[ +0.000070] ? _nv039319rm+0xc9/0x150 [nvidia]
[ +0.000151] ? _nv039320rm+0x42/0x70 [nvidia]
[ +0.000180] ? _nv000552rm+0x49/0x60 [nvidia]
[ +0.000219] ? _nv000694rm+0x7fb/0xc80 [nvidia]
[ +0.000195] ? rm_ioctl+0x54/0xb0 [nvidia]
[ +0.000132] ? nvidia_ioctl+0x6e3/0x850 [nvidia]
[ +0.000003] ? get_max_files+0x20/0x20
[ +0.000134] ? nvidia_frontend_unlocked_ioctl+0x3b/0x50 [nvidia]
[ +0.000002] ? do_vfs_ioctl+0x407/0x670
[ +0.000003] ? __secure_computing+0xa4/0x110
[ +0.000002] ? ksys_ioctl+0x67/0x90
[ +0.000002] ? __x64_sys_ioctl+0x1a/0x20
[ +0.000002] ? do_syscall_64+0x57/0x190
[ +0.000002] ? entry_SYSCALL_64_after_hwframe+0x5c/0xc1
[ +6.010643] NVRM: Xid (PCI:0000:83:00): 119, pid=2710030, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 10 (FREE) (0xa55a0050 0x0).
...
Mon Feb 27 16:08:33 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.60.13 Driver Version: 525.60.13 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A40 On | 00000000:03:00.0 Off | 0 |
| 0% 24C P8 14W / 300W | 0MiB / 46068MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA A40 On | 00000000:04:00.0 Off | 0 |
| 0% 25C P8 13W / 300W | 0MiB / 46068MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 2 NVIDIA A40 On | 00000000:43:00.0 Off | 0 |
| 0% 40C P0 81W / 300W | 821MiB / 46068MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 3 NVIDIA A40 On | 00000000:44:00.0 Off | 0 |
| 0% 30C P8 14W / 300W | 0MiB / 46068MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 4 ERR! On | 00000000:83:00.0 Off | ERR! |
|ERR! ERR! ERR! ERR! / ERR! | 0MiB / 46068MiB | ERR! Default |
| | | ERR! |
+-------------------------------+----------------------+----------------------+
| 5 NVIDIA A40 On | 00000000:84:00.0 Off | 0 |
| 0% 23C P8 14W / 300W | 3MiB / 46068MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 6 NVIDIA A40 On | 00000000:C3:00.0 Off | 0 |
| 0% 22C P8 15W / 300W | 3MiB / 46068MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
| 7 NVIDIA A40 On | 00000000:C4:00.0 Off | 0 |
| 0% 21C P8 14W / 300W | 3MiB / 46068MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 2 N/A N/A 2111354 C ...eratornet/venv/bin/python 818MiB |
+-----------------------------------------------------------------------------+
@jelmd +1, I ran into this problem again. Hi @aritger @Joshua-Ashton, maybe this is a driver issue, please take a look.
Fri Mar 3 14:41:55 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12 Driver Version: 525.85.12 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A30 Off | 00000000:01:00.0 Off | 0 |
| N/A 26C P0 29W / 165W | 0MiB / 24576MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA A30 Off | 00000000:22:00.0 Off | 0 |
| N/A 40C P0 92W / 165W | 16096MiB / 24576MiB | 88% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 2 NVIDIA A30 Off | 00000000:41:00.0 Off | 0 |
| N/A 42C P0 141W / 165W | 14992MiB / 24576MiB | 33% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 3 ERR! Off | 00000000:61:00.0 Off | ERR! |
|ERR! ERR! ERR! ERR! / ERR! | 23025MiB / 24576MiB | ERR! Default |
| | | ERR! |
+-------------------------------+----------------------+----------------------+
| 4 NVIDIA A30 Off | 00000000:81:00.0 Off | 0 |
| N/A 26C P0 26W / 165W | 0MiB / 24576MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 5 NVIDIA A30 Off | 00000000:A1:00.0 Off | 0 |
| N/A 26C P0 28W / 165W | 0MiB / 24576MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 6 NVIDIA A30 Off | 00000000:C1:00.0 Off | 0 |
| N/A 25C P0 29W / 165W | 0MiB / 24576MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 7 NVIDIA A30 Off | 00000000:E1:00.0 Off | 0 |
| N/A 24C P0 25W / 165W | 0MiB / 24576MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
[Fri Mar 3 04:23:22 2023] NVRM: GPU at PCI:0000:61:00: GPU-e59ce3f9-af53-a0dd-1d2c-8beaa74aa635
[Fri Mar 3 04:23:22 2023] NVRM: GPU Board Serial Number: 1322621149782
[Fri Mar 3 04:23:22 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1344368, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 10 (FREE) (0xa55a00a0 0x0).
[Fri Mar 3 04:23:22 2023] CPU: 72 PID: 1344368 Comm: nvidia-smi Tainted: P OE 5.10.0-20-amd64 #1 Debian 5.10.158-2
[Fri Mar 3 04:23:22 2023] Hardware name: Inspur NF5468A5/YZMB-02382-101, BIOS 4.02.12 01/28/2022
[Fri Mar 3 04:23:22 2023] Call Trace:
[Fri Mar 3 04:23:22 2023] dump_stack+0x6b/0x83
[Fri Mar 3 04:23:22 2023] _nv011231rm+0x39d/0x470 [nvidia]
[Fri Mar 3 04:23:22 2023] ? _nv011168rm+0x62/0x2e0 [nvidia]
[Fri Mar 3 04:23:22 2023] ? _nv040022rm+0xdb/0x140 [nvidia]
[Fri Mar 3 04:23:22 2023] ? _nv041148rm+0x2ce/0x3a0 [nvidia]
[Fri Mar 3 04:23:22 2023] ? _nv015451rm+0x788/0x800 [nvidia]
[Fri Mar 3 04:23:22 2023] ? _nv039541rm+0xac/0xe0 [nvidia]
[Fri Mar 3 04:23:22 2023] ? _nv041150rm+0xac/0x140 [nvidia]
[Fri Mar 3 04:23:22 2023] ? _nv041149rm+0x37a/0x4d0 [nvidia]
[Fri Mar 3 04:23:22 2023] ? _nv039443rm+0xc9/0x150 [nvidia]
[Fri Mar 3 04:23:22 2023] ? _nv039444rm+0x42/0x70 [nvidia]
[Fri Mar 3 04:23:22 2023] ? _nv000554rm+0x49/0x60 [nvidia]
[Fri Mar 3 04:23:22 2023] ? _nv000694rm+0x7fb/0xc80 [nvidia]
[Fri Mar 3 04:23:22 2023] ? rm_ioctl+0x54/0xb0 [nvidia]
[Fri Mar 3 04:23:22 2023] ? nvidia_ioctl+0x6cd/0x830 [nvidia]
[Fri Mar 3 04:23:22 2023] ? nvidia_frontend_unlocked_ioctl+0x37/0x50 [nvidia]
[Fri Mar 3 04:23:22 2023] ? __x64_sys_ioctl+0x8b/0xc0
[Fri Mar 3 04:23:22 2023] ? do_syscall_64+0x33/0x80
[Fri Mar 3 04:23:22 2023] ? entry_SYSCALL_64_after_hwframe+0x61/0xc6
[Fri Mar 3 04:24:07 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1344368, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 10 (FREE) (0xa55a0030 0x0).
[Fri Mar 3 04:24:52 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1344368, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 10 (FREE) (0xa55a0020 0x0).
[Fri Mar 3 04:25:38 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20801348 0x410).
[Fri Mar 3 04:26:23 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 103 (GSP_RM_ALLOC) (0x0 0x6c).
[Fri Mar 3 04:27:08 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 103 (GSP_RM_ALLOC) (0x80 0x38).
[Fri Mar 3 04:27:53 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 103 (GSP_RM_ALLOC) (0x2080 0x4).
[Fri Mar 3 04:28:38 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20800110 0x84).
[Fri Mar 3 04:29:23 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080014b 0x5).
[Fri Mar 3 04:30:08 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20800157 0x0).
[Fri Mar 3 04:30:53 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080014b 0x5).
[Fri Mar 3 04:31:38 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080014b 0x5).
[Fri Mar 3 04:32:23 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080014b 0x5).
[Fri Mar 3 04:33:08 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080014b 0x5).
[Fri Mar 3 04:33:53 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080852e 0x208).
[Fri Mar 3 04:34:38 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20808513 0x598).
[Fri Mar 3 04:35:23 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080a068 0x4).
[Fri Mar 3 04:36:03 2023] INFO: task nvidia-smi:1346229 blocked for more than 120 seconds.
[Fri Mar 3 04:36:03 2023] Tainted: P OE 5.10.0-20-amd64 #1 Debian 5.10.158-2
[Fri Mar 3 04:36:03 2023] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Fri Mar 3 04:36:03 2023] task:nvidia-smi state:D stack: 0 pid:1346229 ppid:1346228 flags:0x00000000
[Fri Mar 3 04:36:03 2023] Call Trace:
[Fri Mar 3 04:36:03 2023] __schedule+0x282/0x880
[Fri Mar 3 04:36:03 2023] ? rwsem_spin_on_owner+0x74/0xd0
[Fri Mar 3 04:36:03 2023] schedule+0x46/0xb0
[Fri Mar 3 04:36:03 2023] rwsem_down_write_slowpath+0x246/0x4d0
[Fri Mar 3 04:36:03 2023] os_acquire_rwlock_write+0x31/0x40 [nvidia]
[Fri Mar 3 04:36:03 2023] _nv038505rm+0xc/0x30 [nvidia]
[Fri Mar 3 04:36:03 2023] ? _nv039453rm+0x18d/0x1d0 [nvidia]
[Fri Mar 3 04:36:03 2023] ? _nv041182rm+0x45/0xd0 [nvidia]
[Fri Mar 3 04:36:03 2023] ? _nv041127rm+0x142/0x2b0 [nvidia]
[Fri Mar 3 04:36:03 2023] ? _nv039415rm+0x15a/0x2e0 [nvidia]
[Fri Mar 3 04:36:03 2023] ? _nv039416rm+0x5b/0x90 [nvidia]
[Fri Mar 3 04:36:03 2023] ? _nv039416rm+0x31/0x90 [nvidia]
[Fri Mar 3 04:36:03 2023] ? _nv000559rm+0x5a/0x70 [nvidia]
[Fri Mar 3 04:36:03 2023] ? _nv000559rm+0x33/0x70 [nvidia]
[Fri Mar 3 04:36:03 2023] ? _nv000694rm+0x94a/0xc80 [nvidia]
[Fri Mar 3 04:36:03 2023] ? rm_ioctl+0x54/0xb0 [nvidia]
[Fri Mar 3 04:36:03 2023] ? nvidia_ioctl+0x6cd/0x830 [nvidia]
[Fri Mar 3 04:36:03 2023] ? nvidia_frontend_unlocked_ioctl+0x37/0x50 [nvidia]
[Fri Mar 3 04:36:03 2023] ? __x64_sys_ioctl+0x8b/0xc0
[Fri Mar 3 04:36:03 2023] ? do_syscall_64+0x33/0x80
[Fri Mar 3 04:36:03 2023] ? entry_SYSCALL_64_after_hwframe+0x61/0xc6
[Fri Mar 3 04:36:08 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080a618 0x181c).
[Fri Mar 3 04:36:53 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080a612 0xd98).
[Fri Mar 3 04:37:38 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20809009 0x8).
[Fri Mar 3 04:38:23 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20800a4c 0x4).
[Fri Mar 3 04:39:08 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20800a4c 0x4).
[Fri Mar 3 04:39:53 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20800a4c 0x4).
[Fri Mar 3 04:40:38 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 103 (GSP_RM_ALLOC) (0x208f 0x0).
[Fri Mar 3 04:41:23 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x208f1105 0x8).
[Fri Mar 3 04:42:08 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 10 (FREE) (0xa55a00a0 0x0).
[Fri Mar 3 04:42:53 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 10 (FREE) (0xa55a0030 0x0).
[Fri Mar 3 04:43:38 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 10 (FREE) (0xa55a0020 0x0).
[Fri Mar 3 04:44:24 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20801348 0x410).
[Fri Mar 3 04:45:09 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 103 (GSP_RM_ALLOC) (0x0 0x6c).
[Fri Mar 3 04:45:54 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 103 (GSP_RM_ALLOC) (0x80 0x38).
[Fri Mar 3 04:46:39 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 103 (GSP_RM_ALLOC) (0x2080 0x4).
[Fri Mar 3 04:47:24 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20800110 0x84).
[Fri Mar 3 04:48:09 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080014b 0x5).
[Fri Mar 3 04:48:54 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20800157 0x0).
[Fri Mar 3 04:49:39 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080014b 0x5).
[Fri Mar 3 04:50:24 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080014b 0x5).
[Fri Mar 3 04:51:09 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080014b 0x5).
[Fri Mar 3 04:51:54 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080014b 0x5).
[Fri Mar 3 04:52:39 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080852e 0x208).
[Fri Mar 3 04:53:24 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20808513 0x598).
[Fri Mar 3 04:54:09 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080a068 0x4).
[Fri Mar 3 04:54:54 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080a618 0x181c).
[Fri Mar 3 04:55:39 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080a612 0xd98).
[Fri Mar 3 04:56:24 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20809009 0x8).
[Fri Mar 3 04:57:09 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20800a4c 0x4).
[Fri Mar 3 04:57:54 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20800a4c 0x4).
[Fri Mar 3 04:58:39 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20800a4c 0x4).
[Fri Mar 3 04:59:24 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 103 (GSP_RM_ALLOC) (0x208f 0x0).
[Fri Mar 3 05:00:09 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x208f1105 0x8).
[Fri Mar 3 05:00:54 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 10 (FREE) (0xa55a00a0 0x0).
[Fri Mar 3 05:01:39 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 10 (FREE) (0xa55a0030 0x0).
[Fri Mar 3 05:02:24 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 10 (FREE) (0xa55a0020 0x0).
[Fri Mar 3 05:03:11 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1385412, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20801348 0x410).
[Fri Mar 3 05:03:56 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1385412, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 103 (GSP_RM_ALLOC) (0x0 0x6c).
[Fri Mar 3 05:04:41 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1385412, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 103 (GSP_RM_ALLOC) (0x80 0x38).
[Fri Mar 3 05:05:26 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1385412, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 103 (GSP_RM_ALLOC) (0x2080 0x4).
[Fri Mar 3 05:06:11 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1385412, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20800110 0x84).
[Fri Mar 3 05:06:56 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1385412, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080014b 0x5).
[Fri Mar 3 05:07:41 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1385412, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20800157 0x0).
Also happening here on an A100-PCIE-40GB using driver 530.30.02 and CUDA 12.1.
Hi @lpla, what is your use case environment? Kubernetes?
There is no special environment. It triggered the bug several times on both the 525 and 530 drivers. It is a machine-learning inference command-line tool written in PyTorch.
Have you tried the 520.* driver? Does it work?
FWIW: Most of our users use PyTorch as well. Perhaps it tortures GPUs too hard ;-)
We also use PyTorch on the GPUs, but the 470 driver we used before was more stable.
Yepp. np with 470, too.
Have you tried the 520.* driver? Does it work?
That's my next test. In fact, that's exactly the version I was using before upgrading from Ubuntu 20.04 with kernel 5.15 and driver 520 to Ubuntu 22.04 with kernel 5.19 and driver 525 last month. It was working perfectly with that previous setup.
Same thing on another machine. FWIW: I removed /usr/lib/firmware/nvidia/525.60.13 - perhaps this fixes the problem.
UPDATE
They have confirmed this bug (Xid 119). They said the GSP feature was introduced in version 510, but the problem has not been fixed yet. They only offered the method of disabling it mentioned below, or suggested downgrading to a version below 510 (e.g. 470), which is more stable.
Hi @jelmd @lpla, as an NVIDIA customer we communicated with the NVIDIA support team today, and based on the nvidia-bug-report.log.gz they advised us to disable GSP-RM.
- To disable GSP-RM:
sudo su -c 'echo options nvidia NVreg_EnableGpuFirmware=0 > /etc/modprobe.d/nvidia-gsp.conf'
- Rebuild the initramfs:
# if ubuntu
sudo update-initramfs -u
# if centos
dracut -f
- Reboot
- Check whether it worked. If the output shows EnableGpuFirmware: 0, then GSP-RM is disabled (a consolidated sketch of all these steps follows below):
cat /proc/driver/nvidia/params | grep EnableGpuFirmware
Since our problem node still has tasks running, I haven't tried it yet; I will try this method tonight or tomorrow morning, just for reference. :)
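A consolidated sketch of the steps above as one script, assuming a Debian/Ubuntu or RHEL/CentOS host; adjust the initramfs tooling for other distributions:
# Write the module option that disables GSP-RM (takes effect after the next reboot).
echo 'options nvidia NVreg_EnableGpuFirmware=0' | sudo tee /etc/modprobe.d/nvidia-gsp.conf
# Rebuild the initramfs so the early-loaded module picks up the option.
if command -v update-initramfs >/dev/null 2>&1; then
    sudo update-initramfs -u      # Debian/Ubuntu
elif command -v dracut >/dev/null 2>&1; then
    sudo dracut -f                # RHEL/CentOS
fi
# After rebooting, EnableGpuFirmware: 0 here confirms GSP-RM is disabled.
echo 'Reboot, then run: grep EnableGpuFirmware /proc/driver/nvidia/params'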
I'm also seeing XID 119s on some 510 drivers. Have not tried 525 or 520.
Driver 525.60.13 on an A40 with GSP disabled, but nvidia-bug-report still shows GSP timeout errors.
Hi @stephenroller @liming5619, maybe it's better to downgrade the driver version. On one hand, the GSP feature has been included by NVIDIA since 510 but has not been fixed yet. On the other hand, 470 is an LTS version and has been running stably in our production environment for a long time. I have already downgraded the driver on the problematic node to 470.82.01 to match our other production nodes, just for your reference. :)
So far disabling GSP seems to have mitigated the issue, but maybe I've just been lucky since then. Will report back if I see counter-evidence.
Yepp, removing /usr/lib/firmware/nvidia/5xx.* seems to fix the problem, too (did not use NVreg_EnableGpuFirmware=0).
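A quick way to confirm whether GSP-RM is actually out of the picture after either workaround (module parameter or removing the firmware files); the per-GPU firmware field is only reported by newer drivers, so treat its exact name as an assumption:
# 0 here means the module was loaded with GSP-RM disabled.
grep EnableGpuFirmware /proc/driver/nvidia/params
# Newer drivers also report the GSP firmware in use per GPU; N/A or an empty
# value indicates GSP-RM is not active.
nvidia-smi -q | grep -i 'GSP Firmware'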
We disabled it on 2 hosts with 8x A100 GPUs each. If this workaround works, I will also give feedback.
Feedback: after a week, I can say all servers with the A100 boards have been running stably since we disabled GSP. No GPU crashes anymore.
@fighterhit thank you for sharing the workaround with us.
I have a similar issue: after disabling GSP, it took more than 5 minutes for the CUDA availability check to return "True".
# cat /etc/modprobe.d/nvidia-gsp.conf
options nvidia NVreg_EnableGpuFirmware=0
# cat /proc/driver/nvidia/params | grep EnableGpuFirmware
EnableGpuFirmware: 0
EnableGpuFirmwareLogs: 2
The strange thing is, I'm booting up VMs from images that have the GPU driver pre-installed; on a host with 4 cards, 2 out of the 4 end up with a similar issue.
Please suggest a fix, as it's hampering our prod environments. Please let me know if there are any additional commands or log outputs I should provide.
We also have a few requirements based on CUDA 11.8, so we cannot roll back to driver 470.
@mdrasheek it could be that the driver config inside the VMs is overriding your changes to GSP. As it is, disabling GSP should resolve the problem. Check in your logs which xid error you are having.
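A quick check for that inside the VM, assuming the guest image ships its own modprobe configuration:
# Look for any guest-side option that sets NVreg_EnableGpuFirmware.
grep -R NVreg_EnableGpuFirmware /etc/modprobe.d/ /usr/lib/modprobe.d/ 2>/dev/null
# Confirm what the loaded driver actually picked up (0 = GSP-RM disabled).
grep EnableGpuFirmware /proc/driver/nvidia/params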
Before disabling GSP, the error was the same as in this post:
NVRM: Xid (PCI:0000:01:01): 119, pid=8019, name=python3, Timeout waiting for RPC from GSP0! Expected function 103 (GSP_RM_ALLOC) (0x0 0x6c).
But after disabling it, I couldn't find any related logs; it just takes a long time for CUDA to report "True".
Is there some way to enable tracing or increase the log level to find the cause of this delay?
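One way to correlate the delay with driver messages, assuming the "True" output comes from a PyTorch torch.cuda.is_available() check; adjust the Python one-liner to whatever your workload actually runs:
# Watch the kernel log for NVRM/Xid messages while timing the CUDA check.
sudo dmesg --follow | grep --line-buffered -E 'NVRM|Xid' &
DMESG_PID=$!
time python3 -c "import torch; print(torch.cuda.is_available())"
kill "$DMESG_PID"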