Auto Power-Limit Settings with Operator Policy

Open: fenar opened this issue 4 months ago • 1 comment

Describe the Ask: Many GPUs in the field are run at sustained high temperatures under PyTorch workloads and fail early. The GPU Operator should offer a power-limiting/regulating feature to prevent overheating and burn-out.

To Reproduce: Run 600 W air-cooled or poorly cooled GPUs next to each other for model training or fine-tuning and observe the accumulated temperatures climbing past 100 °C.

Requested feature: When this feature is enabled, the GPU Operator could monitor the operating temperature of each GPU and autonomously keep it within an allowed temperature range by using the power-limiting ability of "nvidia-smi -pl".
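
To make the request concrete, here is a minimal sketch of the kind of control step such a policy might run per GPU. It is not an existing GPU Operator feature. The function names and every threshold (`target_c`, `hysteresis_c`, `step_w`, `min_limit_w`, `max_limit_w`) are hypothetical placeholders. The only real command used is `nvidia-smi -i <index> -pl <watts>`, the same call shown in the transcript in this issue.

```python
import subprocess


def next_power_limit(temp_c, current_limit_w, *, target_c=80, hysteresis_c=5,
                     step_w=25, min_limit_w=300, max_limit_w=600):
    """Return the power limit (W) to apply next for one GPU.

    All defaults are illustrative assumptions, not vendor guidance.
    The usable min/max range differs per board and should be read from
    the device rather than hard-coded.
    """
    if temp_c > target_c:
        # Too hot: step the limit down, but never below the floor.
        return max(min_limit_w, current_limit_w - step_w)
    if temp_c < target_c - hysteresis_c:
        # Comfortably cool: give power back, but never above the ceiling.
        return min(max_limit_w, current_limit_w + step_w)
    # Inside the hysteresis band: leave the limit alone to avoid flapping.
    return current_limit_w


def apply_power_limit(gpu_index, watts):
    """Apply the limit with nvidia-smi (requires root on the node)."""
    subprocess.run(
        ["nvidia-smi", "-i", str(gpu_index), "-pl", str(int(watts))],
        check=True,
    )
```

With the readings from this issue, `next_power_limit(97, 600)` steps GPU 0 down to 575 W, and repeated calls would keep lowering the limit until the temperature falls back under the target. The hysteresis band is a design choice that keeps the limit from oscillating when the temperature hovers near the target.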

Environment (please provide the following information):

  • GPU Operator Version: 25.3.2
  • Container Runtime Version: Driver Version: 570.148.08, CUDA Version: 12.8
  • Kubernetes Distro and Version: RH OCP 4.19

Information to attach (optional if deemed irrelevant)

sh-5.1# nvidia-smi
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.148.08             Driver Version: 570.148.08     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4090 D      On  |   00000000:4F:00.0 Off |                  Off |
| 94%   97C    P0            583W /  600W |   47404MiB /  49140MiB |     98%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 4090 D      On  |   00000000:F5:00.0 Off |                  Off |
| 30%   36C    P8             18W /  400W |       1MiB /  49140MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A           38730      C   python                                47394MiB |
+-----------------------------------------------------------------------------------------+
sh-5.1# nvidia-smi -pl 400 -i 0
Power limit for GPU 00000000:4F:00.0 was set to 400.00 W from 600.00 W.
All done.
<--Cool Down Period->
sh-5.1# nvidia-smi
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.148.08             Driver Version: 570.148.08     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4090 D      On  |   00000000:4F:00.0 Off |                  Off |
| 66%   62C    P0            393W /  400W |   47404MiB /  49140MiB |     98%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 4090 D      On  |   00000000:F5:00.0 Off |                  Off |
| 30%   35C    P8             18W /  400W |       1MiB /  49140MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A           38730      C   python                                47394MiB |
+-----------------------------------------------------------------------------------------+
sh-5.1# 
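
The manual check above could be automated by reading the same values in machine-readable form. The following is a sketch only: it assumes the `nvidia-smi --query-gpu=... --format=csv,noheader,nounits` query interface, and the `GpuReading` type, the helper names, and the 90 °C threshold in the usage lines are hypothetical. The sample text reuses the numbers from the first `nvidia-smi` table in this issue, and its exact decimal formatting is assumed.

```python
import csv
import io
import subprocess
from dataclasses import dataclass

QUERY_FIELDS = "index,temperature.gpu,power.draw,power.limit"


@dataclass
class GpuReading:
    index: int
    temp_c: int
    power_draw_w: float
    power_limit_w: float


def parse_readings(csv_text):
    """Parse 'index, temp, draw, limit' rows into GpuReading objects.

    Fields that a board does not report (printed as '[N/A]') are not
    handled here and would raise ValueError.
    """
    readings = []
    for row in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if not row:
            continue  # skip blank lines
        idx, temp, draw, limit = (field.strip() for field in row)
        readings.append(GpuReading(int(idx), int(temp), float(draw), float(limit)))
    return readings


def read_gpus():
    """Query the local driver; needs nvidia-smi on PATH."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY_FIELDS}",
         "--format=csv,noheader,nounits"],
        check=True, capture_output=True, text=True,
    ).stdout
    return parse_readings(out)


# Values taken from the first table in this issue (formatting assumed).
sample = "0, 97, 583.00, 600.00\n1, 36, 18.00, 400.00\n"
readings = parse_readings(sample)
hot = [r.index for r in readings if r.temp_c > 90]  # hypothetical threshold
```

Here `hot` contains only GPU 0, which is the device that was capped to 400 W by hand in the transcript.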

fenar, Aug 19 '25 16:08