open-gpu-kernel-modules

Performance mode stays P0 despite compute being 0% for long duration, causing high idle power usage

Open georgeboot opened this issue 7 months ago • 3 comments

NVIDIA Open GPU Kernel Modules Version

570.133.20

Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.

  • [x] I confirm that this does not happen with the proprietary driver package.

Operating System and Version

talos-linux

Kernel Release

6.12.27-talos

Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.

  • [x] I am running on a stable kernel release.

Hardware: GPU

A40

Describe the bug

The GPU, with no processes on it, idles at around 30 W, which is normal for an A40.

When we load data into GPU memory but do nothing with it (so a process is resident on the GPU without actually computing anything; think of loading a model into memory but not actively running inference), the idle power is around 110 W. This is because the performance state stays at P0:

==============NVSMI LOG==============

Timestamp                                 : Wed Jun  4 12:24:58 2025
Driver Version                            : 570.133.20
CUDA Version                              : 12.8

Attached GPUs                             : 1
GPU 00000000:41:00.0
    Product Name                          : NVIDIA A40
    Product Brand                         : NVIDIA
    Product Architecture                  : Ampere
    Display Mode                          : Enabled
    Display Active                        : Disabled
    Persistence Mode                      : Disabled
    Addressing Mode                       : HMM
    MIG Mode
        Current                           : N/A
        Pending                           : N/A
    Accounting Mode                       : Disabled
    Accounting Mode Buffer Size           : 4000
    Driver Model
        Current                           : N/A
        Pending                           : N/A
    Serial Number                         : 1324722026424
    GPU UUID                              : GPU-ce5033ab-f214-ba47-677b-3e03f81c241e
    Minor Number                          : 0
    VBIOS Version                         : 94.02.5C.00.0F
    MultiGPU Board                        : No
    Board ID                              : 0x4100
    Board Part Number                     : 900-2G133-0000-100
    GPU Part Number                       : 2235-895-A1
    FRU Part Number                       : N/A
    Platform Info
        Chassis Serial Number             : N/A
        Slot Number                       : N/A
        Tray Index                        : N/A
        Host ID                           : N/A
        Peer Type                         : N/A
        Module Id                         : 1
        GPU Fabric GUID                   : N/A
    Inforom Version
        Image Version                     : G133.0200.00.05
        OEM Object                        : 2.0
        ECC Object                        : 6.16
        Power Management Object           : N/A
    Inforom BBX Object Flush
        Latest Timestamp                  : N/A
        Latest Duration                   : N/A
    GPU Operation Mode
        Current                           : N/A
        Pending                           : N/A
    GPU C2C Mode                          : N/A
    GPU Virtualization Mode
        Virtualization Mode               : None
        Host VGPU Mode                    : N/A
        vGPU Heterogeneous Mode           : N/A
    GPU Reset Status
        Reset Required                    : Requested functionality has been deprecated
        Drain and Reset Recommended       : Requested functionality has been deprecated
    GPU Recovery Action                   : None
    GSP Firmware Version                  : 570.133.20
    IBMNPU
        Relaxed Ordering Mode             : N/A
    PCI
        Bus                               : 0x41
        Device                            : 0x00
        Domain                            : 0x0000
        Base Classcode                    : 0x3
        Sub Classcode                     : 0x2
        Device Id                         : 0x223510DE
        Bus Id                            : 00000000:41:00.0
        Sub System Id                     : 0x145A10DE
        GPU Link Info
            PCIe Generation
                Max                       : 4
                Current                   : 4
                Device Current            : 4
                Device Max                : 4
                Host Max                  : 5
            Link Width
                Max                       : 16x
                Current                   : 16x
        Bridge Chip
            Type                          : N/A
            Firmware                      : N/A
        Replays Since Reset               : 0
        Replay Number Rollovers           : 0
        Tx Throughput                     : 450 KB/s
        Rx Throughput                     : 400 KB/s
        Atomic Caps Outbound              : N/A
        Atomic Caps Inbound               : N/A
    Fan Speed                             : 0 %
    Performance State                     : P0
    Clocks Event Reasons
        Idle                              : Not Active
        Applications Clocks Setting       : Not Active
        SW Power Cap                      : Not Active
        HW Slowdown                       : Not Active
            HW Thermal Slowdown           : Not Active
            HW Power Brake Slowdown       : Not Active
        Sync Boost                        : Not Active
        SW Thermal Slowdown               : Not Active
        Display Clock Setting             : Not Active
    Sparse Operation Mode                 : N/A
    FB Memory Usage
        Total                             : 46068 MiB
        Reserved                          : 569 MiB
        Used                              : 20294 MiB
        Free                              : 25206 MiB
    BAR1 Memory Usage
        Total                             : 65536 MiB
        Used                              : 18 MiB
        Free                              : 65518 MiB
    Conf Compute Protected Memory Usage
        Total                             : 0 MiB
        Used                              : 0 MiB
        Free                              : 0 MiB
    Compute Mode                          : Default
    Utilization
        GPU                               : 0 %
        Memory                            : 0 %
        Encoder                           : 0 %
        Decoder                           : 0 %
        JPEG                              : 0 %
        OFA                               : 0 %
    Encoder Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    FBC Stats
        Active Sessions                   : 0
        Average FPS                       : 0
        Average Latency                   : 0
    DRAM Encryption Mode
        Current                           : N/A
        Pending                           : N/A
    ECC Mode
        Current                           : Enabled
        Pending                           : Enabled
    ECC Errors
        Volatile
            SRAM Correctable              : 0
            SRAM Uncorrectable Parity     : 0
            SRAM Uncorrectable SEC-DED    : 0
            DRAM Correctable              : 0
            DRAM Uncorrectable            : 0
        Aggregate
            SRAM Correctable              : 0
            SRAM Uncorrectable Parity     : 0
            SRAM Uncorrectable SEC-DED    : 0
            DRAM Correctable              : 0
            DRAM Uncorrectable            : 0
            SRAM Threshold Exceeded       : No
        Aggregate Uncorrectable SRAM Sources
            SRAM L2                       : 0
            SRAM SM                       : 0
            SRAM Microcontroller          : 0
            SRAM PCIE                     : 0
            SRAM Other                    : 0
    Retired Pages
        Single Bit ECC                    : N/A
        Double Bit ECC                    : N/A
        Pending Page Blacklist            : N/A
    Remapped Rows
        Correctable Error                 : 0
        Uncorrectable Error               : 0
        Pending                           : No
        Remapping Failure Occurred        : No
        Bank Remap Availability Histogram
            Max                           : 192 bank(s)
            High                          : 0 bank(s)
            Partial                       : 0 bank(s)
            Low                           : 0 bank(s)
            None                          : 0 bank(s)
    Temperature
        GPU Current Temp                  : 57 C
        GPU T.Limit Temp                  : N/A
        GPU Shutdown Temp                 : 98 C
        GPU Slowdown Temp                 : 95 C
        GPU Max Operating Temp            : 88 C
        GPU Target Temperature            : N/A
        Memory Current Temp               : N/A
        Memory Max Operating Temp         : N/A
    GPU Power Readings
        Average Power Draw                : 112.81 W
        Instantaneous Power Draw          : 112.76 W
        Current Power Limit               : 300.00 W
        Requested Power Limit             : 300.00 W
        Default Power Limit               : 300.00 W
        Min Power Limit                   : 100.00 W
        Max Power Limit                   : 300.00 W
    GPU Memory Power Readings 
        Average Power Draw                : N/A
        Instantaneous Power Draw          : N/A
    Module Power Readings
        Average Power Draw                : N/A
        Instantaneous Power Draw          : N/A
        Current Power Limit               : N/A
        Requested Power Limit             : N/A
        Default Power Limit               : N/A
        Min Power Limit                   : N/A
        Max Power Limit                   : N/A
    Power Smoothing                       : N/A
    Workload Power Profiles
        Requested Profiles                : N/A
        Enforced Profiles                 : N/A
    Clocks
        Graphics                          : 1740 MHz
        SM                                : 1740 MHz
        Memory                            : 7251 MHz
        Video                             : 1530 MHz
    Applications Clocks
        Graphics                          : 1740 MHz
        Memory                            : 7251 MHz
    Default Applications Clocks
        Graphics                          : 1740 MHz
        Memory                            : 7251 MHz
    Deferred Clocks
        Memory                            : N/A
    Max Clocks
        Graphics                          : 1740 MHz
        SM                                : 1740 MHz
        Memory                            : 7251 MHz
        Video                             : 1530 MHz
    Max Customer Boost Clocks
        Graphics                          : 1740 MHz
    Clock Policy
        Auto Boost                        : N/A
        Auto Boost Default                : N/A
    Voltage
        Graphics                          : N/A
    Fabric
        State                             : N/A
        Status                            : N/A
        CliqueId                          : N/A
        ClusterUUID                       : N/A
        Health
            Bandwidth                     : N/A
            Route Recovery in progress    : N/A
            Route Unhealthy               : N/A
            Access Timeout Recovery       : N/A
    Processes                             : None
    Capabilities
        EGM                               : disabled
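
The condition reported above (P0 with 0 % GPU utilization) can be checked mechanically from the text that `nvidia-smi -q` prints. The following is a minimal sketch, not an official tool: the helper names `parse_nvsmi_query`, `lookup` and `idle_but_p0` are hypothetical, and the parser assumes the indentation-based layout seen in the log above.

```python
import re

# A "key   : value" line, e.g. "Performance State   : P0".
KV = re.compile(r"^(.+?)\s+:\s+(.*)$")

# Short excerpt copied from the log above.
SAMPLE = """
GPU 00000000:41:00.0
    Performance State                     : P0
    Utilization
        GPU                               : 0 %
        Memory                            : 0 %
    GPU Power Readings
        Average Power Draw                : 112.81 W
"""


def parse_nvsmi_query(text):
    """Map (header, ..., key) paths to values, using indentation for nesting."""
    values = {}
    stack = []  # (indent, header name) for the currently open sections
    for raw in text.splitlines():
        line = raw.rstrip()
        stripped = line.strip()
        if not stripped or stripped.startswith("="):
            continue  # skip blank lines and the "====NVSMI LOG====" banner
        indent = len(line) - len(line.lstrip())
        # Close every section that is not a parent of this line.
        while stack and stack[-1][0] >= indent:
            stack.pop()
        match = KV.match(stripped)
        if match:
            path = tuple(name for _, name in stack) + (match.group(1),)
            values[path] = match.group(2)
        else:
            stack.append((indent, stripped))  # a section header
    return values


def lookup(values, *suffix):
    """Return the first value whose path ends with the given components."""
    for path, value in values.items():
        if path[-len(suffix):] == suffix:
            return value
    return None


def idle_but_p0(values):
    """True when the GPU reports P0 while GPU utilization is 0 %."""
    pstate = lookup(values, "Performance State")
    util = lookup(values, "Utilization", "GPU")
    return pstate == "P0" and util is not None and util.split()[0] == "0"


parsed = parse_nvsmi_query(SAMPLE)
print(idle_but_p0(parsed), lookup(parsed, "GPU Power Readings", "Average Power Draw"))
```

Looking values up by path suffix keeps nested keys that share a name apart, for example `GPU` under `Utilization` versus the top-level `GPU 00000000:41:00.0` header.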

To Reproduce

  1. Start a new process
  2. Load data into GPU memory
  3. Observe that the performance state stays at P0
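
Step 3 can be observed over time by polling `nvidia-smi` with its `--query-gpu` option. The sketch below is only an illustration: the function names and the `reader` parameter are hypothetical, and it assumes the CSV output holds one `pstate, power.draw, utilization.gpu` line per GPU with numeric power and utilization values (a GPU that reports `[N/A]` for power would need extra handling).

```python
import subprocess
import time

QUERY = "pstate,power.draw,utilization.gpu"


def run_nvidia_smi():
    """Run nvidia-smi once and return its CSV output (needs an NVIDIA driver)."""
    cmd = ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"]
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


def read_samples(csv_text):
    """Parse lines such as 'P0, 112.81, 0' into one dict per GPU."""
    samples = []
    for line in csv_text.strip().splitlines():
        pstate, power, util = [part.strip() for part in line.split(",")]
        samples.append(
            {"pstate": pstate, "power_w": float(power), "util_pct": int(util)}
        )
    return samples


def stuck_in_p0(samples, max_util_pct=0):
    """Keep the samples that sit in P0 although utilization is at or below the limit."""
    return [
        s for s in samples
        if s["pstate"] == "P0" and s["util_pct"] <= max_util_pct
    ]


def watch(reader=run_nvidia_smi, count=5, interval_s=60.0):
    """Poll `count` times and collect every sample that is idle but in P0."""
    flagged = []
    for i in range(count):
        flagged.extend(stuck_in_p0(read_samples(reader())))
        if i + 1 < count:
            time.sleep(interval_s)  # wait between polls, not after the last one
    return flagged
```

On a machine with the driver loaded, calling `watch()` after step 2 would be expected to return one entry per poll if the GPU never leaves P0. Passing a different `reader` lets the parsing be exercised without a GPU.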

Bug Incidence

Always

nvidia-bug-report.log.gz

I can't run that on Talos (to my knowledge), since the whole OS is read-only.

More Info

No response

georgeboot, Jun 04 '25 12:06

This is a known issue with NVIDIA drivers when CUDA processes are running:

  • https://forums.developer.nvidia.com/t/nvdec-forces-gpu-into-p2-cuda-state-much-higher-power-consumption-than-with-vdpau/55466
  • https://forums.developer.nvidia.com/t/remove-p2-forced-state-from-drivers/241998

thesword53, Jun 07 '25 17:06

Hmm, I see. The strange thing is that we previously used Ubuntu + proprietary drivers and did not have the issue. We recently switched to Talos Linux + OSS drivers, and now it's an issue. Our workloads (container deployments) are exactly the same (same container tag).

georgeboot, Jun 08 '25 07:06

@georgeboot

we previously used Ubuntu + proprietary drivers and did not have the issue. We recently switched to Talos Linux + OSS drivers, and now it's an issue

It could be that your older proprietary drivers didn't utilize the GSP, and thus this functionality was different in the older proprietary stack, whereas the OSS drivers (and newer proprietary drivers as well, afaik) always utilize the GSP.

Ristovski, Jul 05 '25 19:07