nvidia-smi hangs indefinitely after ~66 days 12 hours uptime with driver 570.133.20 OpenRM on B200 and kernel 6.6.0
NVIDIA Open GPU Kernel Modules Version
[root@A11-R42-I61-42-5504045 ~]# cat /proc/driver/nvidia/params
ResmanDebugLevel: 4294967295
RmLogonRC: 1
ModifyDeviceFiles: 1
DeviceFileUID: 0
DeviceFileGID: 0
DeviceFileMode: 438
InitializeSystemMemoryAllocations: 1
UsePageAttributeTable: 4294967295
EnableMSI: 1
EnablePCIeGen3: 0
MemoryPoolSize: 0
KMallocHeapMaxSize: 0
VMallocHeapMaxSize: 0
IgnoreMMIOCheck: 0
EnableStreamMemOPs: 0
EnableUserNUMAManagement: 1
NvLinkDisable: 0
RmProfilingAdminOnly: 1
PreserveVideoMemoryAllocations: 0
EnableS0ixPowerManagement: 0
S0ixPowerManagementVideoMemoryThreshold: 256
DynamicPowerManagement: 3
DynamicPowerManagementVideoMemoryThreshold: 200
RegisterPCIDriver: 1
EnablePCIERelaxedOrderingMode: 0
EnableResizableBar: 0
EnableGpuFirmware: 18
EnableGpuFirmwareLogs: 2
RmNvlinkBandwidthLinkCount: 0
EnableDbgBreakpoint: 0
OpenRmEnableUnsupportedGpus: 1
DmaRemapPeerMmio: 1
ImexChannelCount: 2048
CreateImexChannel0: 0
GrdmaPciTopoCheckOverride: 0
RegistryDwords: ""
RegistryDwordsPerDevice: ""
RmMsg: ""
GpuBlacklist: ""
TemporaryFilePath: ""
ExcludedGpus: ""
Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.
- [ ] I confirm that this does not happen with the proprietary driver package.
Operating System and Version
[root@A11-R42-I61-42-5504045 ~]# cat /etc/VesselOS-release
VesselOS release 2.0 (LTS-SP2)
[root@A11-R42-I61-42-5504045 ~]#
Kernel Release
[root@A11-R42-I61-42-5504045 ~]# uname -a
Linux A11-R42-I61-42-5504045. 6.6.0-100. SMP Fri Aug 22 10:50:04 CST 2025 x86_64 x86_64 x86_64 GNU/Linux
[root@A11-R42-I61-42-5504045 ~]# uname -r
6.6.0-100
Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.
- [ ] I am running on a stable kernel release.
Hardware: GPU
B200
Describe the bug
nvidia-smi hangs indefinitely after ~66 days 12 hours uptime with driver 570.133.20 OpenRM on B200
[root@A11-R42-I61-42-5504045 ~]# dmesg -T | grep -i nvrm | head -n 10
[Sat Nov 22 05:08:50 2025] NVRM: knvlinkUpdatePostRxDetectLinkMask_IMPL: Failed to update Rx Detect Link mask!
[Sat Nov 22 05:08:50 2025] NVRM: knvlinkDiscoverPostRxDetLinks_GH100: Getting peer1's postRxDetLinkMask failed!
[Sat Nov 22 05:08:54 2025] NVRM: knvlinkUpdatePostRxDetectLinkMask_IMPL: Failed to update Rx Detect Link mask!
[Sat Nov 22 05:08:54 2025] NVRM: knvlinkDiscoverPostRxDetLinks_GH100: Getting peer1's postRxDetLinkMask failed!
[Sat Nov 22 05:08:58 2025] NVRM: knvlinkUpdatePostRxDetectLinkMask_IMPL: Failed to update Rx Detect Link mask!
[Sat Nov 22 05:08:58 2025] NVRM: knvlinkDiscoverPostRxDetLinks_GH100: Getting peer1's postRxDetLinkMask failed!
[Sat Nov 22 05:09:02 2025] NVRM: knvlinkUpdatePostRxDetectLinkMask_IMPL: Failed to update Rx Detect Link mask!
[Sat Nov 22 05:09:02 2025] NVRM: knvlinkDiscoverPostRxDetLinks_GH100: Getting peer0's postRxDetLinkMask failed!
[Sat Nov 22 05:09:06 2025] NVRM: knvlinkUpdatePostRxDetectLinkMask_IMPL: Failed to update Rx Detect Link mask!
[Sat Nov 22 05:09:06 2025] NVRM: knvlinkDiscoverPostRxDetLinks_GH100: Getting peer1's postRxDetLinkMask failed!
[root@A11-R42-I61-42-5504045 ~]#
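The excerpt above repeats the same pair of NVRM messages roughly every four seconds. A minimal sketch for summarizing that kind of output from a full dmesg capture, assuming the dmesg -T line format shown above and an English locale for the timestamps; the helper name summarize_nvrm is hypothetical and not part of any NVIDIA tool:

```python
import re
from collections import Counter
from datetime import datetime

# Matches lines such as:
# [Sat Nov 22 05:08:50 2025] NVRM: someFunction_IMPL: message text
LINE_RE = re.compile(r"\[(?P<ts>[^\]]+)\]\s+NVRM:\s+(?P<func>\w+):\s+(?P<msg>.*)")


def summarize_nvrm(lines):
    """Count NVRM messages per reporting function and collect their timestamps."""
    counts = Counter()
    stamps = []
    for line in lines:
        m = LINE_RE.search(line)
        if not m:
            continue  # not an NVRM line in the expected format
        counts[m.group("func")] += 1
        stamps.append(datetime.strptime(m.group("ts"), "%a %b %d %H:%M:%S %Y"))
    return counts, stamps


# First four lines of the excerpt in this report.
SAMPLE = [
    "[Sat Nov 22 05:08:50 2025] NVRM: knvlinkUpdatePostRxDetectLinkMask_IMPL: Failed to update Rx Detect Link mask!",
    "[Sat Nov 22 05:08:50 2025] NVRM: knvlinkDiscoverPostRxDetLinks_GH100: Getting peer1's postRxDetLinkMask failed!",
    "[Sat Nov 22 05:08:54 2025] NVRM: knvlinkUpdatePostRxDetectLinkMask_IMPL: Failed to update Rx Detect Link mask!",
    "[Sat Nov 22 05:08:54 2025] NVRM: knvlinkDiscoverPostRxDetLinks_GH100: Getting peer1's postRxDetLinkMask failed!",
]

counts, stamps = summarize_nvrm(SAMPLE)
distinct = sorted(set(stamps))
# Seconds between consecutive distinct timestamps.
gaps = [(b - a).total_seconds() for a, b in zip(distinct, distinct[1:])]
print(dict(counts))
print(gaps)  # [4.0]
```

Feeding the whole dmesg output through the same function would show whether the cadence and the affected peers stay constant over time.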
[root@A11-R42-I61-42-5504045 ~]# uptime
22:50:02 up 67 days, 6:11, 2 users, load average: 17.40, 16.73, 18.67
[root@A11-R42-I61-42-5504045 ~]# last reboot
reboot system boot 6.6.0-100. Tue Sep 16 16:38 still running
reboot system boot 6.6.0-100 Tue Sep 9 17:02 - 16:34 (6+23:32)
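The timestamps above can be cross-checked with a little date arithmetic. This is only a sketch: the year 2025 for the boot time is an assumption (last reboot does not print it) taken from the dmesg timestamps, the "~66 days 12 hours" figure is approximate, and dmesg -T timestamps can drift from wall-clock time.

```python
from datetime import datetime, timedelta

# Boot time from `last reboot`; the year is an assumption, since that
# output omits it (taken from the 2025 dmesg timestamps).
boot = datetime(2025, 9, 16, 16, 38)

# Approximate uptime at which the hang was reported to start.
onset = boot + timedelta(days=66, hours=12)

# First NVRM error visible in the dmesg excerpt.
first_nvrm = datetime(2025, 11, 22, 5, 8, 50)

# Wall-clock time implied by the `uptime` output (up 67 days, 6:11).
uptime_taken = boot + timedelta(days=67, hours=6, minutes=11)

print(onset)               # 2025-11-22 04:38:00
print(first_nvrm - onset)  # 0:30:50
print(uptime_taken)        # 2025-11-22 22:49:00
```

Under these assumptions the first NVRM line in the excerpt falls about half an hour after the 66 days 12 hours mark, and the uptime output is consistent with the 16:38 boot time to within a minute.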
To Reproduce
nvidia-smi hangs indefinitely after ~66 days 12 hours uptime with driver 570.133.20 OpenRM on B200 and kernel 6.6.0
Bug Incidence
Once
nvidia-bug-report.log.gz
no
More Info
No response
Hey there. Thanks for the report, but unfortunately there's not a whole lot to go on there without the logs. Was nvidia-bug-report.sh unable to run and collect the logs, or are you unable to post them? Or at least the full dmesg. If it's a matter of not posting them publicly, feel free to email to [email protected]. Also feel free to redact individual info if you want and just leave a [redacted] marker.
Other than that, it would also help to know:
- How was nvidia-smi running? Was it called with the --loop argument and then suddenly hung, or was it constantly invoked on a timer and then suddenly stopped responding?
- Was anything else using the GPU at this time, and did it also hang?
- Was persistence mode active?
- Stack traces from nvidia-smi and any threads inside the NV kernel modules would be very helpful
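One possible way to gather the kernel-side stack traces requested above is to read /proc/<pid>/task/<tid>/stack for every thread of the hung process. The following is a minimal sketch, assuming a kernel built with stack-trace support and root privileges; the function name kernel_stacks and its proc_root parameter are hypothetical, introduced only for illustration.

```python
import os


def kernel_stacks(comm_name, proc_root="/proc"):
    """Return {(pid, tid): stack_text} for all threads of processes named comm_name."""
    stacks = {}
    try:
        entries = os.listdir(proc_root)
    except OSError:
        return stacks  # no procfs available at this path
    for entry in entries:
        if not entry.isdigit():
            continue  # skip non-process entries such as "meminfo"
        pid_dir = os.path.join(proc_root, entry)
        try:
            with open(os.path.join(pid_dir, "comm")) as f:
                if f.read().strip() != comm_name:
                    continue
            tids = os.listdir(os.path.join(pid_dir, "task"))
        except OSError:
            continue  # process exited meanwhile, or access was denied
        for tid in tids:
            try:
                with open(os.path.join(pid_dir, "task", tid, "stack")) as f:
                    stacks[(int(entry), int(tid))] = f.read()
            except OSError:
                continue  # reading kernel stacks normally requires root
    return stacks


if __name__ == "__main__":
    for (pid, tid), text in sorted(kernel_stacks("nvidia-smi").items()):
        print(f"--- pid {pid} tid {tid} ---")
        print(text)
```

Running this as root while nvidia-smi is stuck should show where each thread is blocked inside the kernel. Writing w to /proc/sysrq-trigger is another option: it dumps all blocked tasks to the kernel log, which also covers kernel threads that this sketch does not look for.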
The nvidia-bug-report.sh script will run very slowly.
Many programs on the host invoke nvidia-smi (for example, nvidia-smi topo -p2p r and nvidia-smi -L), and some other programs use the GPU as well. Yes, they are affected too: they run so slowly that they may not function properly.
nvidia-smi -pm 1 is already set.
GPU 00000000:E8:00.0
Product Name : NVIDIA B200
Product Brand : NVIDIA
Product Architecture : Blackwell
Display Mode : Disabled
Display Active : Disabled
Persistence Mode : Enabled
Addressing Mode : HMM
MIG Mode
Current : Disabled
Pending : Disabled
Accounting Mode : Disabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : 1650625080736
GPU UUID : GPU-3018485e-e26c-2048-4769-22a62105dea4
Minor Number : 7
VBIOS Version : 97.00.9A.00.0F
MultiGPU Board : No
Board ID : 0xe800
Board Part Number : 692-2G525-0220-000
GPU Part Number : 2901-886-A1
FRU Part Number : N/A
Platform Info
Chassis Serial Number : Unknown Error
Slot Number : Unknown Error
Tray Index : Unknown Error
Host ID : Unknown Error
Peer Type : Unknown Error
Module Id : Unknown Error
GPU Fabric GUID : Unknown Error
Inforom Version
Image Version : G525.0220.00.03
OEM Object : 2.1
ECC Object : 7.16
Power Management Object : N/A
Inforom BBX Object Flush
Latest Timestamp : 2025/11/26 04:05:40.817
Latest Duration : 57730 us
GPU Operation Mode
Current : N/A
Pending : N/A
GPU C2C Mode : Disabled
GPU Virtualization Mode
Virtualization Mode : None
Host VGPU Mode : N/A
vGPU Heterogeneous Mode : N/A
GPU Reset Status
Reset Required : Requested functionality has been deprecated
Drain and Reset Recommended : Requested functionality has been deprecated
GPU Recovery Action : None
GSP Firmware Version : 570.133.20
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0xE8
Device : 0x00
Domain : 0x0000
Base Classcode : 0x3
Sub Classcode : 0x2
Device Id : 0x290110DE
Bus Id : 00000000:E8:00.0
Sub System Id : 0x199910DE
GPU Link Info
PCIe Generation
Max : 5
Current : 5
Device Current : 5
Device Max : 5
Host Max : 5
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 0
Replay Number Rollovers : 0
Tx Throughput : 946 KB/s
Rx Throughput : 5249 KB/s
Atomic Caps Outbound : FETCHADD_32 FETCHADD_64 SWAP_32 SWAP_64 CAS_32 CAS_64
Atomic Caps Inbound : FETCHADD_32 FETCHADD_64 SWAP_32 SWAP_64 CAS_32 CAS_64
Fan Speed : N/A
Performance State : P0
Clocks Event Reasons
Idle : Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
Sparse Operation Mode : N/A
FB Memory Usage
Total : 183359 MiB
Reserved : 717 MiB
Used : 0 MiB
Free : 182643 MiB
BAR1 Memory Usage
Total : 262144 MiB
Used : 1 MiB
Free : 262143 MiB
Conf Compute Protected Memory Usage
Total : 0 MiB
Used : 0 MiB
Free : 0 MiB
Compute Mode : Default
Utilization
GPU : 0 %
Memory : 0 %
Encoder : 0 %
Decoder : 0 %
JPEG : 0 %
OFA : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
DRAM Encryption Mode
Current : N/A
Pending : N/A
ECC Mode
Current : Enabled
Pending : Enabled
ECC Errors
Volatile
SRAM Correctable : 0
SRAM Uncorrectable Parity : 0
SRAM Uncorrectable SEC-DED : 0
DRAM Correctable : 0
DRAM Uncorrectable : 0
Aggregate
SRAM Correctable : 0
SRAM Uncorrectable Parity : 0
SRAM Uncorrectable SEC-DED : 0
DRAM Correctable : 0
DRAM Uncorrectable : 0
SRAM Threshold Exceeded : No
Aggregate Uncorrectable SRAM Sources
SRAM L2 : 0
SRAM SM : 0
SRAM Microcontroller : 0
SRAM PCIE : 0
SRAM Other : 0
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending Page Blacklist : N/A
Remapped Rows
Correctable Error : 0
Uncorrectable Error : 0
Pending : No
Remapping Failure Occurred : No
Bank Remap Availability Histogram
Max : 3840 bank(s)
High : 0 bank(s)
Partial : 0 bank(s)
Low : 0 bank(s)
None : 0 bank(s)
Temperature
GPU Current Temp : 30 C
GPU T.Limit Temp : 58 C
GPU Shutdown T.Limit Temp : -5 C
GPU Slowdown T.Limit Temp : -3 C
GPU Max Operating T.Limit Temp : 0 C
GPU Target Temperature : N/A
Memory Current Temp : 30 C
Memory Max Operating T.Limit Temp : 0 C
GPU Power Readings
Average Power Draw : 141.02 W
Instantaneous Power Draw : 141.32 W
Current Power Limit : 1000.00 W
Requested Power Limit : 1000.00 W
Default Power Limit : 1000.00 W
Min Power Limit : 200.00 W
Max Power Limit : 1000.00 W
GPU Memory Power Readings
Average Power Draw : 19.53 W
Instantaneous Power Draw : N/A
Module Power Readings
Average Power Draw : N/A
Instantaneous Power Draw : N/A
Current Power Limit : N/A
Requested Power Limit : N/A
Default Power Limit : N/A
Min Power Limit : N/A
Max Power Limit : N/A
Power Smoothing : Insufficient Permissions
Workload Power Profiles
Requested Profiles : N/A
Enforced Profiles : N/A
Clocks
Graphics : 120 MHz
SM : 120 MHz
Memory : 3996 MHz
Video : 600 MHz
Applications Clocks
Graphics : 1965 MHz
Memory : 3996 MHz
Default Applications Clocks
Graphics : 1965 MHz
Memory : 3996 MHz
Deferred Clocks
Memory : N/A
Max Clocks
Graphics : 1965 MHz
SM : 1965 MHz
Memory : 3996 MHz
Video : 1965 MHz
Max Customer Boost Clocks
Graphics : 1965 MHz
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Voltage
Graphics : N/A
Fabric
State : Completed
Status : Success
CliqueId : 0
ClusterUUID : 00000000-0000-0000-0000-000000000000
Health
Bandwidth : N/A
Route Recovery in progress : N/A
Route Unhealthy : N/A
Access Timeout Recovery : False
Processes : None
Capabilities
EGM : disabled
[test1001@A11-R42-I81-2-5840034 6.1.81.2_2025-11-26]$
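Because commands such as nvidia-smi -L are reported to run very slowly or hang in this state, callers may want to bound how long they wait for them. A minimal sketch, assuming a fixed timeout is acceptable; the helper name probe and the 30-second default are hypothetical choices, not anything provided by the driver or by nvidia-smi:

```python
import subprocess
import time


def probe(cmd, timeout_s=30.0):
    """Run cmd and report whether it finished within timeout_s seconds.

    Returns (status, returncode, elapsed_seconds), where status is
    "exited" or "timeout".
    """
    start = time.monotonic()
    proc = subprocess.Popen(
        cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    try:
        rc = proc.wait(timeout=timeout_s)
        status = "exited"
    except subprocess.TimeoutExpired:
        # Do not wait() again after kill(): a process blocked inside the
        # kernel may not act on SIGKILL until the driver call returns,
        # and waiting for it would hang the caller as well.
        proc.kill()
        rc = None
        status = "timeout"
    return status, rc, time.monotonic() - start


# Example (requires nvidia-smi on PATH):
#   probe(["nvidia-smi", "-L"], timeout_s=30)
```

The sketch deliberately avoids subprocess.run(..., timeout=...), which waits for the child after killing it and could therefore block on a process stuck in uninterruptible sleep. The cost is that the abandoned child is not reaped until it eventually exits.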
@mtijanic, is this issue still open to work on?