
Timeout waiting for RPC from GSP!

ghost opened this issue 2 years ago • 73 comments

NVIDIA Open GPU Kernel Modules Version

525.85.05

Does this happen with the proprietary driver (of the same version) as well?

I cannot test this

Operating System and Version

Arch Linux

Kernel Release

Linux [HOSTNAME] 6.1.8-arch1-1 #1 SMP PREEMPT_DYNAMIC Tue, 24 Jan 2023 21:07:04 +0000 x86_64 GNU/Linux

Hardware: GPU

GPU 0: NVIDIA GeForce RTX 3050 Laptop GPU (UUID: GPU-071149ae-386e-0017-3b5b-7ea80801f725)

Describe the bug

When I open an OpenGL application, like Yamagi Quake II, at a certain point the whole system freezes and runs at about 1 FPS. I generally have to REISUB when this happens.

To Reproduce

  1. Open Yamagi Quake II
  2. Change workspace, open pavucontrol to select a new audio sink for the game, switch back

Bug Incidence

Always

nvidia-bug-report.log.gz

More Info

Related: #272

ghost avatar Jan 26 '23 14:01 ghost

This looks a lot like nvbug 3806304.

ttabi avatar Jan 26 '23 19:01 ttabi

I ran into this on 525.85.12 with an A30.

fighterhit avatar Feb 23 '23 02:02 fighterhit

This issue seems to exist on 525.60.13 with A40 and A100 as well. Please fix ASAP! bug-nv-smi.txt dmesg.txt

jelmd avatar Feb 25 '23 19:02 jelmd

Hi @aritger, is there any solution for this?

fighterhit avatar Feb 27 '23 02:02 fighterhit

I think "Timeout waiting for RPC from GSP!" is a pretty generic symptom, with many possible causes. The reproduction steps that lead up to it will matter to help distinguish different bugs, as will the other NVRM dmesg spew around it.

I don't have a specific reason to expect it is already fixed, but it may be worth testing the most recent 525.89.02 driver. The 530.xx drivers will hopefully be released soon, and they will have a lot of changes relative to 525.xx, so they will also be worth testing.

Beyond that, if you see similar "Timeout waiting for RPC from GSP!" messages, it is worth attaching a complete nvidia-bug-report.log.gz, and describing the steps that led to it, so that we can compare instances of the symptom.
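For reference, a minimal sketch of how that information can be collected (the report script ships with the driver package; the grep pattern below is only a suggestion):

# generates nvidia-bug-report.log.gz in the current directory
sudo nvidia-bug-report.sh
# capture the NVRM/Xid context around the timeout
sudo dmesg | grep -E 'NVRM|Xid' > nvrm-dmesg.txt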

aritger avatar Feb 27 '23 05:02 aritger

Thanks @aritger, nvidia-bug-report.log.gz attached.

The problem occurs after running in our Kubernetes environment for a period of time, and nvidia-smi gets stuck for a while. The error symptoms are similar to @jelmd's in https://github.com/NVIDIA/open-gpu-kernel-modules/issues/446#issuecomment-1445190598. FYI, I got some help from the nvidia-docker community (https://github.com/NVIDIA/nvidia-docker/issues/1648#issuecomment-1441139460), but I am not sure whether the root cause is the driver or nvidia-docker.

fighterhit avatar Feb 27 '23 06:02 fighterhit

FWIW: We do not use nvidia-docker or similar bloat, just plain LXC, and pass the devices through to the related zones (a.k.a. containers) as needed. So IMHO the nvidia-container-toolkit is not really related to the problem.

jelmd avatar Feb 27 '23 09:02 jelmd

Happened again on another machine:

...
[  +0.000094]  ? _nv011159rm+0x62/0x2e0 [nvidia]
[  +0.000090]  ? _nv039897rm+0xdb/0x140 [nvidia]
[  +0.000073]  ? _nv041022rm+0x2ce/0x3a0 [nvidia]
[  +0.000103]  ? _nv015438rm+0x788/0x800 [nvidia]
[  +0.000064]  ? _nv039416rm+0xac/0xe0 [nvidia]
[  +0.000092]  ? _nv041024rm+0xac/0x140 [nvidia]
[  +0.000095]  ? _nv041023rm+0x37a/0x4d0 [nvidia]
[  +0.000070]  ? _nv039319rm+0xc9/0x150 [nvidia]
[  +0.000151]  ? _nv039320rm+0x42/0x70 [nvidia]
[  +0.000180]  ? _nv000552rm+0x49/0x60 [nvidia]
[  +0.000219]  ? _nv000694rm+0x7fb/0xc80 [nvidia]
[  +0.000195]  ? rm_ioctl+0x54/0xb0 [nvidia]
[  +0.000132]  ? nvidia_ioctl+0x6e3/0x850 [nvidia]
[  +0.000003]  ? get_max_files+0x20/0x20
[  +0.000134]  ? nvidia_frontend_unlocked_ioctl+0x3b/0x50 [nvidia]
[  +0.000002]  ? do_vfs_ioctl+0x407/0x670
[  +0.000003]  ? __secure_computing+0xa4/0x110
[  +0.000002]  ? ksys_ioctl+0x67/0x90
[  +0.000002]  ? __x64_sys_ioctl+0x1a/0x20
[  +0.000002]  ? do_syscall_64+0x57/0x190
[  +0.000002]  ? entry_SYSCALL_64_after_hwframe+0x5c/0xc1
[  +6.010643] NVRM: Xid (PCI:0000:83:00): 119, pid=2710030, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 10 (FREE) (0xa55a0050 0x0).
...
Mon Feb 27 16:08:33 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.60.13    Driver Version: 525.60.13    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A40          On   | 00000000:03:00.0 Off |                    0 |
|  0%   24C    P8    14W / 300W |      0MiB / 46068MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A40          On   | 00000000:04:00.0 Off |                    0 |
|  0%   25C    P8    13W / 300W |      0MiB / 46068MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A40          On   | 00000000:43:00.0 Off |                    0 |
|  0%   40C    P0    81W / 300W |    821MiB / 46068MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A40          On   | 00000000:44:00.0 Off |                    0 |
|  0%   30C    P8    14W / 300W |      0MiB / 46068MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   4  ERR!                On   | 00000000:83:00.0 Off |                 ERR! |
|ERR!  ERR! ERR!    ERR! / ERR! |      0MiB / 46068MiB |    ERR!      Default |
|                               |                      |                 ERR! |
+-------------------------------+----------------------+----------------------+
|   5  NVIDIA A40          On   | 00000000:84:00.0 Off |                    0 |
|  0%   23C    P8    14W / 300W |      3MiB / 46068MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   6  NVIDIA A40          On   | 00000000:C3:00.0 Off |                    0 |
|  0%   22C    P8    15W / 300W |      3MiB / 46068MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   7  NVIDIA A40          On   | 00000000:C4:00.0 Off |                    0 |
|  0%   21C    P8    14W / 300W |      3MiB / 46068MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    2   N/A  N/A   2111354      C   ...eratornet/venv/bin/python      818MiB |
+-----------------------------------------------------------------------------+

jelmd avatar Feb 27 '23 15:02 jelmd

@jelmd +1, I ran into this problem again. Hi @aritger @Joshua-Ashton, maybe this is a driver issue; please take a look.

Fri Mar  3 14:41:55 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A30          Off  | 00000000:01:00.0 Off |                    0 |
| N/A   26C    P0    29W / 165W |      0MiB / 24576MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A30          Off  | 00000000:22:00.0 Off |                    0 |
| N/A   40C    P0    92W / 165W |  16096MiB / 24576MiB |     88%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A30          Off  | 00000000:41:00.0 Off |                    0 |
| N/A   42C    P0   141W / 165W |  14992MiB / 24576MiB |     33%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  ERR!                Off  | 00000000:61:00.0 Off |                 ERR! |
|ERR!  ERR! ERR!    ERR! / ERR! |  23025MiB / 24576MiB |    ERR!      Default |
|                               |                      |                 ERR! |
+-------------------------------+----------------------+----------------------+
|   4  NVIDIA A30          Off  | 00000000:81:00.0 Off |                    0 |
| N/A   26C    P0    26W / 165W |      0MiB / 24576MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   5  NVIDIA A30          Off  | 00000000:A1:00.0 Off |                    0 |
| N/A   26C    P0    28W / 165W |      0MiB / 24576MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   6  NVIDIA A30          Off  | 00000000:C1:00.0 Off |                    0 |
| N/A   25C    P0    29W / 165W |      0MiB / 24576MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   7  NVIDIA A30          Off  | 00000000:E1:00.0 Off |                    0 |
| N/A   24C    P0    25W / 165W |      0MiB / 24576MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
                                                                               
[Fri Mar  3 04:23:22 2023] NVRM: GPU at PCI:0000:61:00: GPU-e59ce3f9-af53-a0dd-1d2c-8beaa74aa635
[Fri Mar  3 04:23:22 2023] NVRM: GPU Board Serial Number: 1322621149782
[Fri Mar  3 04:23:22 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1344368, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 10 (FREE) (0xa55a00a0 0x0).
[Fri Mar  3 04:23:22 2023] CPU: 72 PID: 1344368 Comm: nvidia-smi Tainted: P           OE     5.10.0-20-amd64 #1 Debian 5.10.158-2
[Fri Mar  3 04:23:22 2023] Hardware name: Inspur NF5468A5/YZMB-02382-101, BIOS 4.02.12 01/28/2022
[Fri Mar  3 04:23:22 2023] Call Trace:
[Fri Mar  3 04:23:22 2023]  dump_stack+0x6b/0x83
[Fri Mar  3 04:23:22 2023]  _nv011231rm+0x39d/0x470 [nvidia]
[Fri Mar  3 04:23:22 2023]  ? _nv011168rm+0x62/0x2e0 [nvidia]
[Fri Mar  3 04:23:22 2023]  ? _nv040022rm+0xdb/0x140 [nvidia]
[Fri Mar  3 04:23:22 2023]  ? _nv041148rm+0x2ce/0x3a0 [nvidia]
[Fri Mar  3 04:23:22 2023]  ? _nv015451rm+0x788/0x800 [nvidia]
[Fri Mar  3 04:23:22 2023]  ? _nv039541rm+0xac/0xe0 [nvidia]
[Fri Mar  3 04:23:22 2023]  ? _nv041150rm+0xac/0x140 [nvidia]
[Fri Mar  3 04:23:22 2023]  ? _nv041149rm+0x37a/0x4d0 [nvidia]
[Fri Mar  3 04:23:22 2023]  ? _nv039443rm+0xc9/0x150 [nvidia]
[Fri Mar  3 04:23:22 2023]  ? _nv039444rm+0x42/0x70 [nvidia]
[Fri Mar  3 04:23:22 2023]  ? _nv000554rm+0x49/0x60 [nvidia]
[Fri Mar  3 04:23:22 2023]  ? _nv000694rm+0x7fb/0xc80 [nvidia]
[Fri Mar  3 04:23:22 2023]  ? rm_ioctl+0x54/0xb0 [nvidia]
[Fri Mar  3 04:23:22 2023]  ? nvidia_ioctl+0x6cd/0x830 [nvidia]
[Fri Mar  3 04:23:22 2023]  ? nvidia_frontend_unlocked_ioctl+0x37/0x50 [nvidia]
[Fri Mar  3 04:23:22 2023]  ? __x64_sys_ioctl+0x8b/0xc0
[Fri Mar  3 04:23:22 2023]  ? do_syscall_64+0x33/0x80
[Fri Mar  3 04:23:22 2023]  ? entry_SYSCALL_64_after_hwframe+0x61/0xc6
[Fri Mar  3 04:24:07 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1344368, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 10 (FREE) (0xa55a0030 0x0).
[Fri Mar  3 04:24:52 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1344368, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 10 (FREE) (0xa55a0020 0x0).
[Fri Mar  3 04:25:38 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20801348 0x410).
[Fri Mar  3 04:26:23 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 103 (GSP_RM_ALLOC) (0x0 0x6c).
[Fri Mar  3 04:27:08 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 103 (GSP_RM_ALLOC) (0x80 0x38).
[Fri Mar  3 04:27:53 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 103 (GSP_RM_ALLOC) (0x2080 0x4).
[Fri Mar  3 04:28:38 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20800110 0x84).
[Fri Mar  3 04:29:23 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080014b 0x5).
[Fri Mar  3 04:30:08 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20800157 0x0).
[Fri Mar  3 04:30:53 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080014b 0x5).
[Fri Mar  3 04:31:38 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080014b 0x5).
[Fri Mar  3 04:32:23 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080014b 0x5).
[Fri Mar  3 04:33:08 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080014b 0x5).
[Fri Mar  3 04:33:53 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080852e 0x208).
[Fri Mar  3 04:34:38 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20808513 0x598).
[Fri Mar  3 04:35:23 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080a068 0x4).
[Fri Mar  3 04:36:03 2023] INFO: task nvidia-smi:1346229 blocked for more than 120 seconds.
[Fri Mar  3 04:36:03 2023]       Tainted: P           OE     5.10.0-20-amd64 #1 Debian 5.10.158-2
[Fri Mar  3 04:36:03 2023] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Fri Mar  3 04:36:03 2023] task:nvidia-smi      state:D stack:    0 pid:1346229 ppid:1346228 flags:0x00000000
[Fri Mar  3 04:36:03 2023] Call Trace:
[Fri Mar  3 04:36:03 2023]  __schedule+0x282/0x880
[Fri Mar  3 04:36:03 2023]  ? rwsem_spin_on_owner+0x74/0xd0
[Fri Mar  3 04:36:03 2023]  schedule+0x46/0xb0
[Fri Mar  3 04:36:03 2023]  rwsem_down_write_slowpath+0x246/0x4d0
[Fri Mar  3 04:36:03 2023]  os_acquire_rwlock_write+0x31/0x40 [nvidia]
[Fri Mar  3 04:36:03 2023]  _nv038505rm+0xc/0x30 [nvidia]
[Fri Mar  3 04:36:03 2023]  ? _nv039453rm+0x18d/0x1d0 [nvidia]
[Fri Mar  3 04:36:03 2023]  ? _nv041182rm+0x45/0xd0 [nvidia]
[Fri Mar  3 04:36:03 2023]  ? _nv041127rm+0x142/0x2b0 [nvidia]
[Fri Mar  3 04:36:03 2023]  ? _nv039415rm+0x15a/0x2e0 [nvidia]
[Fri Mar  3 04:36:03 2023]  ? _nv039416rm+0x5b/0x90 [nvidia]
[Fri Mar  3 04:36:03 2023]  ? _nv039416rm+0x31/0x90 [nvidia]
[Fri Mar  3 04:36:03 2023]  ? _nv000559rm+0x5a/0x70 [nvidia]
[Fri Mar  3 04:36:03 2023]  ? _nv000559rm+0x33/0x70 [nvidia]
[Fri Mar  3 04:36:03 2023]  ? _nv000694rm+0x94a/0xc80 [nvidia]
[Fri Mar  3 04:36:03 2023]  ? rm_ioctl+0x54/0xb0 [nvidia]
[Fri Mar  3 04:36:03 2023]  ? nvidia_ioctl+0x6cd/0x830 [nvidia]
[Fri Mar  3 04:36:03 2023]  ? nvidia_frontend_unlocked_ioctl+0x37/0x50 [nvidia]
[Fri Mar  3 04:36:03 2023]  ? __x64_sys_ioctl+0x8b/0xc0
[Fri Mar  3 04:36:03 2023]  ? do_syscall_64+0x33/0x80
[Fri Mar  3 04:36:03 2023]  ? entry_SYSCALL_64_after_hwframe+0x61/0xc6
[Fri Mar  3 04:36:08 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080a618 0x181c).
[Fri Mar  3 04:36:53 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080a612 0xd98).
[Fri Mar  3 04:37:38 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20809009 0x8).
[Fri Mar  3 04:38:23 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20800a4c 0x4).
[Fri Mar  3 04:39:08 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20800a4c 0x4).
[Fri Mar  3 04:39:53 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20800a4c 0x4).
[Fri Mar  3 04:40:38 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 103 (GSP_RM_ALLOC) (0x208f 0x0).
[Fri Mar  3 04:41:23 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x208f1105 0x8).
[Fri Mar  3 04:42:08 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 10 (FREE) (0xa55a00a0 0x0).
[Fri Mar  3 04:42:53 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 10 (FREE) (0xa55a0030 0x0).
[Fri Mar  3 04:43:38 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1346198, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 10 (FREE) (0xa55a0020 0x0).
[Fri Mar  3 04:44:24 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20801348 0x410).
[Fri Mar  3 04:45:09 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 103 (GSP_RM_ALLOC) (0x0 0x6c).
[Fri Mar  3 04:45:54 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 103 (GSP_RM_ALLOC) (0x80 0x38).
[Fri Mar  3 04:46:39 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 103 (GSP_RM_ALLOC) (0x2080 0x4).
[Fri Mar  3 04:47:24 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20800110 0x84).
[Fri Mar  3 04:48:09 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080014b 0x5).
[Fri Mar  3 04:48:54 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20800157 0x0).
[Fri Mar  3 04:49:39 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080014b 0x5).
[Fri Mar  3 04:50:24 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080014b 0x5).
[Fri Mar  3 04:51:09 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080014b 0x5).
[Fri Mar  3 04:51:54 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080014b 0x5).
[Fri Mar  3 04:52:39 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080852e 0x208).
[Fri Mar  3 04:53:24 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20808513 0x598).
[Fri Mar  3 04:54:09 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080a068 0x4).
[Fri Mar  3 04:54:54 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080a618 0x181c).
[Fri Mar  3 04:55:39 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080a612 0xd98).
[Fri Mar  3 04:56:24 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20809009 0x8).
[Fri Mar  3 04:57:09 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20800a4c 0x4).
[Fri Mar  3 04:57:54 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20800a4c 0x4).
[Fri Mar  3 04:58:39 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20800a4c 0x4).
[Fri Mar  3 04:59:24 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 103 (GSP_RM_ALLOC) (0x208f 0x0).
[Fri Mar  3 05:00:09 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x208f1105 0x8).
[Fri Mar  3 05:00:54 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 10 (FREE) (0xa55a00a0 0x0).
[Fri Mar  3 05:01:39 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 10 (FREE) (0xa55a0030 0x0).
[Fri Mar  3 05:02:24 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1365232, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 10 (FREE) (0xa55a0020 0x0).
[Fri Mar  3 05:03:11 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1385412, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20801348 0x410).
[Fri Mar  3 05:03:56 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1385412, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 103 (GSP_RM_ALLOC) (0x0 0x6c).
[Fri Mar  3 05:04:41 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1385412, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 103 (GSP_RM_ALLOC) (0x80 0x38).
[Fri Mar  3 05:05:26 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1385412, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 103 (GSP_RM_ALLOC) (0x2080 0x4).
[Fri Mar  3 05:06:11 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1385412, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20800110 0x84).
[Fri Mar  3 05:06:56 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1385412, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x2080014b 0x5).
[Fri Mar  3 05:07:41 2023] NVRM: Xid (PCI:0000:61:00): 119, pid=1385412, name=nvidia-smi, Timeout waiting for RPC from GSP! Expected function 76 (GSP_RM_CONTROL) (0x20800157 0x0).
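
(Side note, just a sketch: when a single board is wedged like the ERR! GPUs in the outputs above, wrapping nvidia-smi in timeout and querying the GPUs one at a time avoids hanging the whole shell; adjust the index range to your host.)

# query each GPU individually so one wedged board does not block the rest
for i in $(seq 0 7); do
  timeout 10 nvidia-smi -i "$i" --query-gpu=index,name,pci.bus_id,temperature.gpu --format=csv,noheader \
    || echo "GPU $i: query timed out"
done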

fighterhit avatar Mar 03 '23 06:03 fighterhit

Also happening here on a A100-PCIE-40GB using driver 530.30.02 and CUDA 12.1.

lpla avatar Mar 03 '23 12:03 lpla

Hi @lpla, what's your use case environment? Is it in Kubernetes?

fighterhit avatar Mar 03 '23 13:03 fighterhit

There is no special environment. It triggered the bug several times on both the 525 and 530 drivers. It is a machine learning inference command-line tool written in PyTorch.

lpla avatar Mar 03 '23 13:03 lpla

There is no special environment. It triggered the bug several times on both the 525 and 530 drivers. It is a machine learning inference command-line tool written in PyTorch.

Have you tried the 520.* driver? Does it work?

fighterhit avatar Mar 03 '23 13:03 fighterhit

FWIW: Most of our users use PyTorch as well. Perhaps it tortures GPUs too hard ;-)

jelmd avatar Mar 03 '23 13:03 jelmd

We also use PyTorch on the GPUs, but the 470 driver we used before was more stable.

fighterhit avatar Mar 03 '23 13:03 fighterhit

Yepp, no problems with 470 here either.

jelmd avatar Mar 03 '23 13:03 jelmd

There is no special environment. It triggered the bug several times on both the 525 and 530 drivers. It is a machine learning inference command-line tool written in PyTorch.

Have you tried the 520.* driver? Does it work?

That's my next test. In fact, that's exactly the version I was using before upgrading from Ubuntu 20.04 with kernel 5.15 and driver 520 to Ubuntu 22.04 with kernel 5.19 and driver 525 last month. It was working perfectly with that previous setup.

lpla avatar Mar 03 '23 13:03 lpla

Same thing on another machine. FWIW: I removed /usr/lib/firmware/nvidia/525.60.13 - perhaps this fixes the problem.

jelmd avatar Mar 07 '23 08:03 jelmd

UPDATE

NVIDIA support has confirmed this Xid 119 bug. They said that the GSP feature has been enabled since version 510, but the bug has not been fixed yet. They only offered the workaround of disabling GSP described below, or suggested that we downgrade to a version < 510 (e.g. 470), which is more stable.


Hi @jelmd @lpla, as an NVIDIA customer we communicated with the NVIDIA support team today, and based on the nvidia-bug-report.log.gz they advised us to disable GSP-RM.

  1. Disable GSP-RM:
sudo su -c 'echo options nvidia NVreg_EnableGpuFirmware=0 > /etc/modprobe.d/nvidia-gsp.conf'
  2. Regenerate the initramfs:
# if ubuntu
sudo update-initramfs -u

# if centos
dracut -f
  3. Reboot.
  4. Check whether it worked. If EnableGpuFirmware: 0, then GSP-RM is disabled.
cat /proc/driver/nvidia/params | grep EnableGpuFirmware

Since our problem node still has tasks running, I haven't tried it yet; I will try this method tonight or tomorrow morning. Just for reference. :)
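
(In addition to the params check above, a quick sanity check, as a sketch and assuming your nvidia-smi build exposes the field: nvidia-smi -q reports the GSP firmware state per GPU and shows N/A when GSP-RM is disabled.)

# expect "GSP Firmware Version : N/A" on every GPU once GSP-RM is off
nvidia-smi -q | grep -i "GSP Firmware"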

fighterhit avatar Mar 07 '23 08:03 fighterhit

I'm also seeing XID 119s on some 510 drivers. Have not tried 525 or 520.

stephenroller avatar Mar 08 '23 18:03 stephenroller

Driver 525.60.13 with an A40 and GSP disabled, but nvidia-bug-report still shows GSP timeout errors.

liming5619 avatar Mar 09 '23 10:03 liming5619

Hi @stephenroller @liming5619, maybe it's better to downgrade the driver version. On one hand, GSP was introduced by NVIDIA in 510 and the bug has not been fixed yet; on the other hand, 470 is an LTS branch and has been running stably in our production environment for a long time. I have already downgraded the driver on the problematic node to 470.82.01 to match our other production nodes, just for your reference. :)

fighterhit avatar Mar 09 '23 11:03 fighterhit

So far disabling GSP seems to have mitigated the issue, but maybe I've just been lucky since then. Will report back if I see counter-evidence.

stephenroller avatar Mar 10 '23 06:03 stephenroller

Yepp, removing /usr/lib/firmware/nvidia/5xx.* seems to fix the problem, too (did not use NVreg_EnableGpuFirmware=0).
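
(For anyone trying the same route, a quick check, sketched: list the firmware directory that matches the loaded driver version; the GSP blobs are the gsp*.bin files, and exact names vary by driver branch.)

# show the GSP firmware blobs installed for the running driver
ls -l /usr/lib/firmware/nvidia/$(modinfo -F version nvidia)/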

jelmd avatar Mar 10 '23 19:03 jelmd

UPDATE

NVIDIA support has confirmed this Xid 119 bug. They said that the GSP feature has been enabled since version 510, but the bug has not been fixed yet. They only offered the workaround of disabling GSP described below, or suggested that we downgrade to a version < 510 (e.g. 470), which is more stable.

Hi @jelmd @lpla, as an NVIDIA customer we communicated with the NVIDIA support team today, and based on the nvidia-bug-report.log.gz they advised us to disable GSP-RM.

1. Disable GSP-RM:
sudo su -c 'echo options nvidia NVreg_EnableGpuFirmware=0 > /etc/modprobe.d/nvidia-gsp.conf'
2. Regenerate the initramfs:
# if ubuntu
sudo update-initramfs -u

# if centos
dracut -f
3. Reboot.

4. Check whether it worked. If `EnableGpuFirmware: 0` then `GSP-RM` is disabled.
cat /proc/driver/nvidia/params | grep EnableGpuFirmware

Since our problem node still has tasks running, I haven't tried it yet; I will try this method tonight or tomorrow morning. Just for reference. :)

We disabled it on 2 hosts with 8x A100 GPUs each. If this workaround works, I will also give feedback.

edzoe avatar Mar 20 '23 15:03 edzoe

Feedback: after a week, I can say that all servers with A100 boards have been running stably since we disabled GSP. No GPU crashes anymore.

@fighterhit thank you for sharing the workaround with us.

edzoe avatar Mar 27 '23 11:03 edzoe

I have a similar issue: after disabling GSP, it takes more than 5 minutes to print "True".

# cat /etc/modprobe.d/nvidia-gsp.conf
options nvidia NVreg_EnableGpuFirmware=0
# cat /proc/driver/nvidia/params | grep EnableGpuFirmware
EnableGpuFirmware: 0
EnableGpuFirmwareLogs: 2

The strange thing is, I'm booting up VMs from images with the GPU driver pre-installed; on a host with 4 cards, 2 out of the 4 end up with a similar issue.

Please suggest a fix, as this is hampering our prod environments. Please let me know if there are any additional commands or log output I should provide.

We also have a few requirements based on CUDA 11.8, so we cannot roll back to driver 470.
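
(Assuming the "True" above is the output of torch.cuda.is_available(), the initialization delay can be isolated and compared against a bare device query like this:)

# time only the CUDA initialization inside the VM
time python3 -c "import torch; print(torch.cuda.is_available())"
# compare with how long a plain device listing takes
time nvidia-smi -L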

mdrasheek avatar Apr 12 '23 15:04 mdrasheek

@mdrasheek it could be that the driver config inside the VMs is overriding your changes to GSP. As it is, disabling GSP should resolve the problem. Check in your logs which xid error you are having.

yskan avatar Apr 12 '23 15:04 yskan

@mdrasheek it could be that the driver config inside the VMs is overriding your changes to GSP. As it is, disabling GSP should resolve the problem. Check in your logs which xid error you are having.

Before disabling GSP, the error was the same as in this post:

NVRM: Xid (PCI:0000:01:01): 119, pid=8019, name=python3, Timeout waiting for RPC from GSP0! Expected function 103 (GSP_RM_ALLOC) (0x0 0x6c).

But after disabling it, I couldn't find any relevant logs; it just takes a lot of time before CUDA reports "True".

mdrasheek avatar Apr 12 '23 15:04 mdrasheek

@mdrasheek it could be that the driver config inside the VMs is overriding your changes to GSP. As it is, disabling GSP should resolve the problem. Check in your logs which xid error you are having.

Before disabling GSP, the error was the same as in this post:

NVRM: Xid (PCI:0000:01:01): 119, pid=8019, name=python3, Timeout waiting for RPC from GSP0! Expected function 103 (GSP_RM_ALLOC) (0x0 0x6c).

But after disabling it, I couldn't find any relevant logs; it just takes a lot of time before CUDA reports "True".

Is there a way to enable tracing, or to increase the log level, to find the cause of this delay?

mdrasheek avatar Apr 13 '23 06:04 mdrasheek