Performance state stays at P0 despite 0% compute utilization for long durations, causing high idle power usage
NVIDIA Open GPU Kernel Modules Version
570.133.20
Please confirm this issue does not happen with the proprietary driver (of the same version). This issue tracker is only for bugs specific to the open kernel driver.
- [x] I confirm that this does not happen with the proprietary driver package.
Operating System and Version
talos-linux
Kernel Release
6.12.27-talos
Please confirm you are running a stable release kernel (e.g. not a -rc). We do not accept bug reports for unreleased kernels.
- [x] I am running on a stable kernel release.
Hardware: GPU
A40
Describe the bug
The GPU, without any processes on it, idles at around 30 W, which is normal for an A40.
When we load data into GPU memory but do nothing with it (so a process is running on the GPU without actually computing anything; think of loading a model into memory without actively running inference), the idle power is around 110 W. This is because the performance state stays at P0:
==============NVSMI LOG==============
Timestamp : Wed Jun 4 12:24:58 2025
Driver Version : 570.133.20
CUDA Version : 12.8
Attached GPUs : 1
GPU 00000000:41:00.0
Product Name : NVIDIA A40
Product Brand : NVIDIA
Product Architecture : Ampere
Display Mode : Enabled
Display Active : Disabled
Persistence Mode : Disabled
Addressing Mode : HMM
MIG Mode
Current : N/A
Pending : N/A
Accounting Mode : Disabled
Accounting Mode Buffer Size : 4000
Driver Model
Current : N/A
Pending : N/A
Serial Number : 1324722026424
GPU UUID : GPU-ce5033ab-f214-ba47-677b-3e03f81c241e
Minor Number : 0
VBIOS Version : 94.02.5C.00.0F
MultiGPU Board : No
Board ID : 0x4100
Board Part Number : 900-2G133-0000-100
GPU Part Number : 2235-895-A1
FRU Part Number : N/A
Platform Info
Chassis Serial Number : N/A
Slot Number : N/A
Tray Index : N/A
Host ID : N/A
Peer Type : N/A
Module Id : 1
GPU Fabric GUID : N/A
Inforom Version
Image Version : G133.0200.00.05
OEM Object : 2.0
ECC Object : 6.16
Power Management Object : N/A
Inforom BBX Object Flush
Latest Timestamp : N/A
Latest Duration : N/A
GPU Operation Mode
Current : N/A
Pending : N/A
GPU C2C Mode : N/A
GPU Virtualization Mode
Virtualization Mode : None
Host VGPU Mode : N/A
vGPU Heterogeneous Mode : N/A
GPU Reset Status
Reset Required : Requested functionality has been deprecated
Drain and Reset Recommended : Requested functionality has been deprecated
GPU Recovery Action : None
GSP Firmware Version : 570.133.20
IBMNPU
Relaxed Ordering Mode : N/A
PCI
Bus : 0x41
Device : 0x00
Domain : 0x0000
Base Classcode : 0x3
Sub Classcode : 0x2
Device Id : 0x223510DE
Bus Id : 00000000:41:00.0
Sub System Id : 0x145A10DE
GPU Link Info
PCIe Generation
Max : 4
Current : 4
Device Current : 4
Device Max : 4
Host Max : 5
Link Width
Max : 16x
Current : 16x
Bridge Chip
Type : N/A
Firmware : N/A
Replays Since Reset : 0
Replay Number Rollovers : 0
Tx Throughput : 450 KB/s
Rx Throughput : 400 KB/s
Atomic Caps Outbound : N/A
Atomic Caps Inbound : N/A
Fan Speed : 0 %
Performance State : P0
Clocks Event Reasons
Idle : Not Active
Applications Clocks Setting : Not Active
SW Power Cap : Not Active
HW Slowdown : Not Active
HW Thermal Slowdown : Not Active
HW Power Brake Slowdown : Not Active
Sync Boost : Not Active
SW Thermal Slowdown : Not Active
Display Clock Setting : Not Active
Sparse Operation Mode : N/A
FB Memory Usage
Total : 46068 MiB
Reserved : 569 MiB
Used : 20294 MiB
Free : 25206 MiB
BAR1 Memory Usage
Total : 65536 MiB
Used : 18 MiB
Free : 65518 MiB
Conf Compute Protected Memory Usage
Total : 0 MiB
Used : 0 MiB
Free : 0 MiB
Compute Mode : Default
Utilization
GPU : 0 %
Memory : 0 %
Encoder : 0 %
Decoder : 0 %
JPEG : 0 %
OFA : 0 %
Encoder Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
FBC Stats
Active Sessions : 0
Average FPS : 0
Average Latency : 0
DRAM Encryption Mode
Current : N/A
Pending : N/A
ECC Mode
Current : Enabled
Pending : Enabled
ECC Errors
Volatile
SRAM Correctable : 0
SRAM Uncorrectable Parity : 0
SRAM Uncorrectable SEC-DED : 0
DRAM Correctable : 0
DRAM Uncorrectable : 0
Aggregate
SRAM Correctable : 0
SRAM Uncorrectable Parity : 0
SRAM Uncorrectable SEC-DED : 0
DRAM Correctable : 0
DRAM Uncorrectable : 0
SRAM Threshold Exceeded : No
Aggregate Uncorrectable SRAM Sources
SRAM L2 : 0
SRAM SM : 0
SRAM Microcontroller : 0
SRAM PCIE : 0
SRAM Other : 0
Retired Pages
Single Bit ECC : N/A
Double Bit ECC : N/A
Pending Page Blacklist : N/A
Remapped Rows
Correctable Error : 0
Uncorrectable Error : 0
Pending : No
Remapping Failure Occurred : No
Bank Remap Availability Histogram
Max : 192 bank(s)
High : 0 bank(s)
Partial : 0 bank(s)
Low : 0 bank(s)
None : 0 bank(s)
Temperature
GPU Current Temp : 57 C
GPU T.Limit Temp : N/A
GPU Shutdown Temp : 98 C
GPU Slowdown Temp : 95 C
GPU Max Operating Temp : 88 C
GPU Target Temperature : N/A
Memory Current Temp : N/A
Memory Max Operating Temp : N/A
GPU Power Readings
Average Power Draw : 112.81 W
Instantaneous Power Draw : 112.76 W
Current Power Limit : 300.00 W
Requested Power Limit : 300.00 W
Default Power Limit : 300.00 W
Min Power Limit : 100.00 W
Max Power Limit : 300.00 W
GPU Memory Power Readings
Average Power Draw : N/A
Instantaneous Power Draw : N/A
Module Power Readings
Average Power Draw : N/A
Instantaneous Power Draw : N/A
Current Power Limit : N/A
Requested Power Limit : N/A
Default Power Limit : N/A
Min Power Limit : N/A
Max Power Limit : N/A
Power Smoothing : N/A
Workload Power Profiles
Requested Profiles : N/A
Enforced Profiles : N/A
Clocks
Graphics : 1740 MHz
SM : 1740 MHz
Memory : 7251 MHz
Video : 1530 MHz
Applications Clocks
Graphics : 1740 MHz
Memory : 7251 MHz
Default Applications Clocks
Graphics : 1740 MHz
Memory : 7251 MHz
Deferred Clocks
Memory : N/A
Max Clocks
Graphics : 1740 MHz
SM : 1740 MHz
Memory : 7251 MHz
Video : 1530 MHz
Max Customer Boost Clocks
Graphics : 1740 MHz
Clock Policy
Auto Boost : N/A
Auto Boost Default : N/A
Voltage
Graphics : N/A
Fabric
State : N/A
Status : N/A
CliqueId : N/A
ClusterUUID : N/A
Health
Bandwidth : N/A
Route Recovery in progress : N/A
Route Unhealthy : N/A
Access Timeout Recovery : N/A
Processes : None
Capabilities
EGM : disabled
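The fields that matter for this report (performance state, GPU utilization, power draw) can be pulled out of the `nvidia-smi -q` text with a few lines of Python. This is only a sketch: it assumes the `Key : Value` layout shown in the log above and keeps the first occurrence of each key, and the names `summarize` and `idle_at_p0` are made up for illustration.

```python
def summarize(smi_text):
    """Collect the first occurrence of a few 'Key : Value' fields."""
    wanted = {"Performance State", "GPU", "Average Power Draw"}
    found = {}
    for line in smi_text.splitlines():
        # Split on the first colon only, so values such as bus IDs stay intact.
        key, sep, value = line.partition(":")
        key = key.strip()
        if sep and key in wanted and key not in found:
            found[key] = value.strip()
    return found


def idle_at_p0(found):
    """True when the GPU reports P0 while utilization is 0 %."""
    util = int(found["GPU"].split()[0])  # "0 %" -> 0
    return found["Performance State"] == "P0" and util == 0


# Excerpt of the log above, used as sample input.
SAMPLE = """\
Performance State : P0
Utilization
GPU : 0 %
Memory : 0 %
GPU Power Readings
Average Power Draw : 112.81 W
GPU Memory Power Readings
Average Power Draw : N/A
"""
```

Calling `idle_at_p0(summarize(SAMPLE))` on the excerpt flags the state described in this report. Keeping the first match matters because `Average Power Draw` appears again under the memory and module power sections, where it is `N/A`.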
To Reproduce
- Start a new process
- Load data into GPU memory
- Observe that the performance state stays at P0
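The last step can be watched over time by polling `nvidia-smi`. Below is a rough sketch, assuming `nvidia-smi` is on the `PATH` inside whatever container or host has GPU access, and that the `pstate`, `utilization.gpu` and `power.draw` query fields return numeric values on this board. The helper names (`parse_sample`, `stuck_in_p0`, `poll`) are hypothetical.

```python
import subprocess
import time

QUERY = "pstate,utilization.gpu,power.draw"


def parse_sample(csv_line):
    """Parse one 'P0, 0, 112.81' line into (pstate, util %, watts)."""
    pstate, util, power = [field.strip() for field in csv_line.split(",")]
    return pstate, float(util), float(power)


def stuck_in_p0(samples, max_util=0.0):
    """True if every sample is in P0 while utilization stays at or below max_util."""
    return bool(samples) and all(p == "P0" and u <= max_util for p, u, _ in samples)


def poll(count=30, interval=2.0):
    """Collect `count` samples from the first GPU, `interval` seconds apart."""
    samples = []
    for _ in range(count):
        out = subprocess.run(
            ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True,
        ).stdout
        samples.append(parse_sample(out.splitlines()[0]))  # first GPU only
        time.sleep(interval)
    return samples
```

Running `stuck_in_p0(poll())` after the data has been loaded should return `True` if the behaviour described here is present. The same check on the idle GPU with no process attached gives a baseline to compare against.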
Bug Incidence
Always
nvidia-bug-report.log.gz
Can't run that (to my knowledge) on talos since the whole OS is read-only
More Info
No response
This is a known issue with NVIDIA drivers when CUDA processes are running:
- https://forums.developer.nvidia.com/t/nvdec-forces-gpu-into-p2-cuda-state-much-higher-power-consumption-than-with-vdpau/55466
- https://forums.developer.nvidia.com/t/remove-p2-forced-state-from-drivers/241998
Hmm, I see. The strange thing is that we previously used Ubuntu + proprietary drivers and did not have the issue. We recently switched to Talos Linux + OSS drivers, and now it's an issue. Our workloads (container deployments) are exactly the same (same container tag).
@georgeboot
> we previously used Ubuntu + proprietary drivers and did not have the issue. We recently switched to Talos Linux + OSS drivers, and now it's an issue
It could be that your older proprietary drivers didn't utilize the GSP, and thus this functionality was different in the older proprietary stack, whereas the OSS drivers (and newer proprietary drivers as well, afaik) always utilize the GSP.
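One way to compare the two setups is to check whether each one reports a GSP firmware version in `nvidia-smi -q`. Here is a small sketch, assuming the `GSP Firmware Version` line reads `N/A` when the GSP is not in use (which is how I understand the driver README). The function name `gsp_firmware_version` is just for illustration.

```python
def gsp_firmware_version(smi_text):
    """Return the GSP firmware version from `nvidia-smi -q` text, or None."""
    for line in smi_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "GSP Firmware Version":
            value = value.strip()
            # Assumption: "N/A" means the driver is running without GSP firmware.
            return None if value == "N/A" else value
    return None  # field not present in this output at all
```

For the log in this issue the function returns `570.133.20`, so the GSP is active on the Talos setup. Running the same check on the old Ubuntu install would show whether that stack used it too.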