open-gpu-kernel-modules

nvidia-smi hangs indefinitely after ~66 days 12 hours uptime with driver 570.133.20 OpenRM on B200 and kernel 6.6.0

Open zheng199512 opened this issue 1 month ago • 2 comments

NVIDIA Open GPU Kernel Modules Version

[root@A11-R42-I61-42-5504045 ~]# cat /proc/driver/nvidia/params
ResmanDebugLevel: 4294967295
RmLogonRC: 1
ModifyDeviceFiles: 1
DeviceFileUID: 0
DeviceFileGID: 0
DeviceFileMode: 438
InitializeSystemMemoryAllocations: 1
UsePageAttributeTable: 4294967295
EnableMSI: 1
EnablePCIeGen3: 0
MemoryPoolSize: 0
KMallocHeapMaxSize: 0
VMallocHeapMaxSize: 0
IgnoreMMIOCheck: 0
EnableStreamMemOPs: 0
EnableUserNUMAManagement: 1
NvLinkDisable: 0
RmProfilingAdminOnly: 1
PreserveVideoMemoryAllocations: 0
EnableS0ixPowerManagement: 0
S0ixPowerManagementVideoMemoryThreshold: 256
DynamicPowerManagement: 3
DynamicPowerManagementVideoMemoryThreshold: 200
RegisterPCIDriver: 1
EnablePCIERelaxedOrderingMode: 0
EnableResizableBar: 0
EnableGpuFirmware: 18
EnableGpuFirmwareLogs: 2
RmNvlinkBandwidthLinkCount: 0
EnableDbgBreakpoint: 0
OpenRmEnableUnsupportedGpus: 1
DmaRemapPeerMmio: 1
ImexChannelCount: 2048
CreateImexChannel0: 0
GrdmaPciTopoCheckOverride: 0
RegistryDwords: ""
RegistryDwordsPerDevice: ""
RmMsg: ""
GpuBlacklist: ""
TemporaryFilePath: ""
ExcludedGpus: ""

Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.

  • [ ] I confirm that this does not happen with the proprietary driver package.

Operating System and Version

[root@A11-R42-I61-42-5504045 ~]# cat /etc/VesselOS-release
VesselOS release 2.0 (LTS-SP2)
[root@A11-R42-I61-42-5504045 ~]#

Kernel Release

[root@A11-R42-I61-42-5504045 ~]# uname -a
Linux A11-R42-I61-42-5504045. 6.6.0-100. SMP Fri Aug 22 10:50:04 CST 2025 x86_64 x86_64 x86_64 GNU/Linux
[root@A11-R42-I61-42-5504045 ~]# uname -r
6.6.0-100

Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.

  • [ ] I am running on a stable kernel release.

Hardware: GPU

B200

Describe the bug

nvidia-smi hangs indefinitely after roughly 66 days 12 hours of uptime with the 570.133.20 open kernel driver (OpenRM) on a B200.

[root@A11-R42-I61-42-5504045 ~]# dmesg -T | grep -i nvrm | head -n 10
[Sat Nov 22 05:08:50 2025] NVRM: knvlinkUpdatePostRxDetectLinkMask_IMPL: Failed to update Rx Detect Link mask!
[Sat Nov 22 05:08:50 2025] NVRM: knvlinkDiscoverPostRxDetLinks_GH100: Getting peer1's postRxDetLinkMask failed!
[Sat Nov 22 05:08:54 2025] NVRM: knvlinkUpdatePostRxDetectLinkMask_IMPL: Failed to update Rx Detect Link mask!
[Sat Nov 22 05:08:54 2025] NVRM: knvlinkDiscoverPostRxDetLinks_GH100: Getting peer1's postRxDetLinkMask failed!
[Sat Nov 22 05:08:58 2025] NVRM: knvlinkUpdatePostRxDetectLinkMask_IMPL: Failed to update Rx Detect Link mask!
[Sat Nov 22 05:08:58 2025] NVRM: knvlinkDiscoverPostRxDetLinks_GH100: Getting peer1's postRxDetLinkMask failed!
[Sat Nov 22 05:09:02 2025] NVRM: knvlinkUpdatePostRxDetectLinkMask_IMPL: Failed to update Rx Detect Link mask!
[Sat Nov 22 05:09:02 2025] NVRM: knvlinkDiscoverPostRxDetLinks_GH100: Getting peer0's postRxDetLinkMask failed!
[Sat Nov 22 05:09:06 2025] NVRM: knvlinkUpdatePostRxDetectLinkMask_IMPL: Failed to update Rx Detect Link mask!
[Sat Nov 22 05:09:06 2025] NVRM: knvlinkDiscoverPostRxDetLinks_GH100: Getting peer1's postRxDetLinkMask failed!
[root@A11-R42-I61-42-5504045 ~]#

[root@A11-R42-I61-42-5504045 ~]# uptime
22:50:02 up 67 days, 6:11, 2 users, load average: 17.40, 16.73, 18.67
[root@A11-R42-I61-42-5504045 ~]# last reboot
reboot system boot 6.6.0-100. Tue Sep 16 16:38 still running
reboot system boot 6.6.0-100 Tue Sep 9 17:02 - 16:34 (6+23:32)
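
The "~66 days 12 hours" figure can be cross-checked against the two outputs above. The following is a minimal Python sketch, not part of the original report. It assumes the year is 2025 (`last reboot` does not print a year; 2025 is taken from the dmesg timestamps) and that the earliest NVRM line in the excerpt marks the onset of the problem, which the report does not confirm. The `dmesg -T` timestamps are themselves approximate.

```python
from datetime import datetime

# Boot time from `last reboot` (minute resolution; the year 2025 is an assumption).
boot = datetime(2025, 9, 16, 16, 38)
# Timestamp of the earliest NVRM line in the dmesg excerpt.
first_nvrm_error = datetime(2025, 11, 22, 5, 8, 50)


def uptime_at(event, boot_time):
    """Return (days, hours, minutes) of uptime at the moment of `event`."""
    delta = event - boot_time
    hours, rem = divmod(delta.seconds, 3600)
    return delta.days, hours, rem // 60


days, hours, minutes = uptime_at(first_nvrm_error, boot)
print(f"uptime at first logged NVRM error: {days} days, {hours} h {minutes} min")
# prints: uptime at first logged NVRM error: 66 days, 12 h 30 min
```

Under those assumptions the first logged error falls about 66 days 12.5 hours after boot, which matches the figure in the title.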

To Reproduce

Leave the system running. After roughly 66 days 12 hours of uptime, nvidia-smi hangs indefinitely (driver 570.133.20 OpenRM, B200, kernel 6.6.0).

Bug Incidence

Once

nvidia-bug-report.log.gz

no

More Info

No response

zheng199512, Nov 22 '25 14:11

Hey there. Thanks for the report, but unfortunately there's not a whole lot to go on without the logs. Was nvidia-bug-report.sh unable to run and collect the logs, or are you unable to post them? At the very least, the full dmesg would help. If it's a matter of not posting them publicly, feel free to email them to [email protected]. Also feel free to redact individual info if you want and just leave a [redacted] marker.

Other than that, it would also help to know:

  • How was nvidia-smi running? Was it called with the --loop argument and then suddenly hung, or was it constantly invoked on a timer and then suddenly stopped responding?
  • Was anything else using the GPU at this time, and did it also hang?
  • Was persistence mode active?
  • Stack traces from nvidia-smi and any threads inside the NV kernel modules would be very helpful
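
One possible way to gather the stack traces asked for in the last item is to read the kernel stack of every nvidia-smi process stuck in uninterruptible sleep (state D). This is a hedged Python sketch, not an NVIDIA-provided tool. The helper name `blocked_tasks` is hypothetical, reading `/proc/<pid>/stack` normally requires root and a kernel that exposes it, and only main threads are scanned (per-thread stacks live under `/proc/<pid>/task/<tid>/stack`).

```python
import os


def blocked_tasks(name="nvidia-smi", proc_root="/proc"):
    """Yield (pid, state, kernel_stack) for processes called `name` in state D."""
    if not os.path.isdir(proc_root):
        return
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        base = os.path.join(proc_root, entry)
        try:
            with open(os.path.join(base, "stat")) as f:
                stat = f.read()
        except OSError:
            continue  # process exited while we were scanning
        # stat format: "pid (comm) state ..."; comm may itself contain ")".
        comm = stat[stat.index("(") + 1 : stat.rindex(")")]
        state = stat[stat.rindex(")") + 2 :].split()[0]
        if comm != name or state != "D":
            continue
        try:
            with open(os.path.join(base, "stack")) as f:
                stack = f.read()
        except OSError as exc:
            stack = f"<unreadable: {exc}>\n"
        yield int(entry), state, stack


if __name__ == "__main__":
    for pid, state, stack in blocked_tasks():
        print(f"--- pid {pid} (state {state})")
        print(stack, end="")
```

Run as root while nvidia-smi is unresponsive and attach the output. Writing `w` to `/proc/sysrq-trigger`, where SysRq is enabled, makes the kernel dump all blocked tasks to dmesg, which would also cover threads inside the NVIDIA kernel modules.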

mtijanic, Nov 24 '25 09:11

The nvidia-bug-report.sh script runs very slowly.

Many programs invoke nvidia-smi (for example nvidia-smi topo -p2p r and nvidia-smi -L), and other programs use the GPU directly. Yes, they are affected as well: they run so slowly that they may not function properly.

Persistence mode is already enabled (nvidia-smi -pm 1).

dmesg.log

strace_nvidia_smi.log

GPU 00000000:E8:00.0
    Product Name                          : NVIDIA B200
    Product Brand                         : NVIDIA
    Product Architecture                  : Blackwell
    Display Mode                          : Disabled
    Display Active                        : Disabled
    Persistence Mode                      : Enabled
    Addressing Mode                       : HMM
    MIG Mode
        Current                           : Disabled
        Pending                           : Disabled
    Accounting Mode                       : Disabled
    Accounting Mode Buffer Size           : 4000
    Driver Model
        Current                           : N/A
        Pending                           : N/A
    Serial Number                         : 1650625080736
    GPU UUID                              : GPU-3018485e-e26c-2048-4769-22a62105dea4
    Minor Number                          : 7
    VBIOS Version                         : 97.00.9A.00.0F
    MultiGPU Board                        : No
    Board ID                              : 0xe800
    Board Part Number                     : 692-2G525-0220-000
    GPU Part Number                       : 2901-886-A1
    FRU Part Number                       : N/A
    Platform Info
        Chassis Serial Number             : Unknown Error
        Slot Number                       : Unknown Error
        Tray Index                        : Unknown Error
        Host ID                           : Unknown Error
        Peer Type                         : Unknown Error
        Module Id                         : Unknown Error
        GPU Fabric GUID                   : Unknown Error
    Inforom Version
        Image Version                     : G525.0220.00.03
        OEM Object                        : 2.1
        ECC Object                        : 7.16
        Power Management Object           : N/A
    Inforom BBX Object Flush
        Latest Timestamp                  : 2025/11/26 04:05:40.817
        Latest Duration                   : 57730 us
    GPU Operation Mode
        Current                           : N/A
        Pending                           : N/A
    GPU C2C Mode                          : Disabled
    GPU Virtualization Mode
        Virtualization Mode               : None
        Host VGPU Mode                    : N/A
        vGPU Heterogeneous Mode           : N/A
    GPU Reset Status
        Reset Required                    : Requested functionality has been deprecated
        Drain and Reset Recommended       : Requested functionality has been deprecated
    GPU Recovery Action                   : None
    GSP Firmware Version                  : 570.133.20
    IBMNPU
        Relaxed Ordering Mode             : N/A
    PCI
        Bus                               : 0xE8
        Device                            : 0x00
        Domain                            : 0x0000
        Base Classcode                    : 0x3
        Sub Classcode                     : 0x2
        Device Id                         : 0x290110DE
        Bus Id                            : 00000000:E8:00.0
        Sub System Id                     : 0x199910DE
        GPU Link Info
            PCIe Generation
                Max                       : 5
                Current                   : 5
                Device Current            : 5
                Device Max                : 5
                Host Max                  : 5
            Link Width
                Max                       : 16x
                Current                   : 16x
        Bridge Chip
            Type                          : N/A
            Firmware                      : N/A
        Replays Since Reset               : 0
        Replay Number Rollovers           : 0
        Tx Throughput                     : 946 KB/s
        Rx Throughput                     : 5249 KB/s
        Atomic Caps Outbound              : FETCHADD_32 FETCHADD_64 SWAP_32 SWAP_64 CAS_32 CAS_64
        Atomic Caps Inbound               : FETCHADD_32 FETCHADD_64 SWAP_32 SWAP_64 CAS_32 CAS_64
    Fan Speed                             : N/A
    Performance State                     : P0
    Clocks Event Reasons
        Idle                              : Active
        Applications Clocks Setting       : Not Active
        SW Power Cap                      : Not Active
        HW Slowdown                       : Not Active
            HW Thermal Slowdown           : Not Active
            HW Power Brake Slowdown       : Not Active
        Sync Boost                        : Not Active
        SW Thermal Slowdown               : Not Active
        Display Clock Setting             : Not Active
    Sparse Operation Mode                 : N/A
    FB Memory Usage
        Total                             : 183359 MiB
        Reserved                          : 717 MiB
        Used                              : 0 MiB
        Free                              : 182643 MiB
    BAR1 Memory Usage
        Total                             : 262144 MiB
        Used                              : 1 MiB
        Free                              : 262143 MiB
    Conf Compute Protected Memory Usage
        Total                             : 0 MiB
        Used                              : 0 MiB
        Free                              : 0 MiB
    Compute Mode                          : Default
    Utilization
        GPU                               : 0 %
        Memory                            : 0 %
        Encoder                           : 0 %
        Decoder                           : 0 %
        JPEG                              : 0 %
        OFA                               : 0 %
    Encoder Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    FBC Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    DRAM Encryption Mode
        Current                           : N/A
        Pending                           : N/A
    ECC Mode
        Current                           : Enabled
        Pending                           : Enabled
    ECC Errors
        Volatile
            SRAM Correctable              : 0
            SRAM Uncorrectable Parity     : 0
            SRAM Uncorrectable SEC-DED    : 0
            DRAM Correctable              : 0
            DRAM Uncorrectable            : 0
        Aggregate
            SRAM Correctable              : 0
            SRAM Uncorrectable Parity     : 0
            SRAM Uncorrectable SEC-DED    : 0
            DRAM Correctable              : 0
            DRAM Uncorrectable            : 0
            SRAM Threshold Exceeded       : No
        Aggregate Uncorrectable SRAM Sources
            SRAM L2                       : 0
            SRAM SM                       : 0
            SRAM Microcontroller          : 0
            SRAM PCIE                     : 0
            SRAM Other                    : 0
    Retired Pages
        Single Bit ECC                    : N/A
        Double Bit ECC                    : N/A
        Pending Page Blacklist            : N/A
    Remapped Rows
        Correctable Error                 : 0
        Uncorrectable Error               : 0
        Pending                           : No
        Remapping Failure Occurred        : No
        Bank Remap Availability Histogram
            Max                           : 3840 bank(s)
            High                          : 0 bank(s)
            Partial                       : 0 bank(s)
            Low                           : 0 bank(s)
            None                          : 0 bank(s)
    Temperature
        GPU Current Temp                  : 30 C
        GPU T.Limit Temp                  : 58 C
        GPU Shutdown T.Limit Temp         : -5 C
        GPU Slowdown T.Limit Temp         : -3 C
        GPU Max Operating T.Limit Temp    : 0 C
        GPU Target Temperature            : N/A
        Memory Current Temp               : 30 C
        Memory Max Operating T.Limit Temp : 0 C
    GPU Power Readings
        Average Power Draw                : 141.02 W
        Instantaneous Power Draw          : 141.32 W
        Current Power Limit               : 1000.00 W
        Requested Power Limit             : 1000.00 W
        Default Power Limit               : 1000.00 W
        Min Power Limit                   : 200.00 W
        Max Power Limit                   : 1000.00 W
    GPU Memory Power Readings
        Average Power Draw                : 19.53 W
        Instantaneous Power Draw          : N/A
    Module Power Readings
        Average Power Draw                : N/A
        Instantaneous Power Draw          : N/A
        Current Power Limit               : N/A
        Requested Power Limit             : N/A
        Default Power Limit               : N/A
        Min Power Limit                   : N/A
        Max Power Limit                   : N/A
    Power Smoothing                       : Insufficient Permissions
    Workload Power Profiles
        Requested Profiles                : N/A
        Enforced Profiles                 : N/A
    Clocks
        Graphics                          : 120 MHz
        SM                                : 120 MHz
        Memory                            : 3996 MHz
        Video                             : 600 MHz
    Applications Clocks
        Graphics                          : 1965 MHz
        Memory                            : 3996 MHz
    Default Applications Clocks
        Graphics                          : 1965 MHz
        Memory                            : 3996 MHz
    Deferred Clocks
        Memory                            : N/A
    Max Clocks
        Graphics                          : 1965 MHz
        SM                                : 1965 MHz
        Memory                            : 3996 MHz
        Video                             : 1965 MHz
    Max Customer Boost Clocks
        Graphics                          : 1965 MHz
    Clock Policy
        Auto Boost                        : N/A
        Auto Boost Default                : N/A
    Voltage
        Graphics                          : N/A
    Fabric
        State                             : Completed
        Status                            : Success
        CliqueId                          : 0
        ClusterUUID                       : 00000000-0000-0000-0000-000000000000
        Health
            Bandwidth                     : N/A
            Route Recovery in progress    : N/A
            Route Unhealthy               : N/A
            Access Timeout Recovery       : False
    Processes                             : None
    Capabilities
        EGM                               : disabled

[test1001@A11-R42-I81-2-5840034 6.1.81.2_2025-11-26]$

zheng199512, Nov 24 '25 13:11

@mtijanic, is this issue still open to work on?

Deadshot0x7, Dec 08 '25 13:12