Auto Power-Limit Settings with Operator Policy
Describe the Ask
Many GPUs in the wild are being pushed hard by PyTorch training workloads under sustained heat stress, and they fail prematurely. The GPU Operator should offer a power-limiting/regulating feature to avoid overheating and hardware damage.
To Reproduce
Run 600 W, air-cooled or poorly cooled GPUs next to each other during model training/fine-tuning and observe their temperatures climbing past 100 °C.
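For example, one way to observe the condition while the training job runs is to poll per-GPU temperature and power figures with standard nvidia-smi query options (the 5-second interval here is just an example):

nvidia-smi --query-gpu=index,name,temperature.gpu,power.draw,power.limit --format=csv --loop=5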
Requested feature
When this feature is enabled, the GPU Operator should monitor the operating temperature of each GPU and keep it within an allowed temperature range by autonomously applying the power-limiting capability of "nvidia-smi -pl".
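A purely illustrative sketch of what such an operator-managed agent could do on each node (the 85 °C threshold, 50 W step, 30-second interval, and 300 W floor are made-up example values; the nvidia-smi query/set flags are the same ones used elsewhere in this report):

#!/usr/bin/env bash
# Hypothetical watchdog: if a GPU runs hotter than MAX_TEMP, lower its power
# limit in STEP-watt decrements, never going below MIN_LIMIT.
MAX_TEMP=85      # degrees C (example threshold)
STEP=50          # watts removed per iteration
MIN_LIMIT=300    # watts, lower bound for the power limit

while true; do
  nvidia-smi --query-gpu=index,temperature.gpu,power.limit \
             --format=csv,noheader,nounits |
  while IFS=', ' read -r idx temp limit; do
    if [ "${temp%.*}" -gt "$MAX_TEMP" ]; then
      new_limit=$(( ${limit%.*} - STEP ))
      [ "$new_limit" -lt "$MIN_LIMIT" ] && new_limit=$MIN_LIMIT
      nvidia-smi -i "$idx" -pl "$new_limit"
    fi
  done
  sleep 30
done

In practice the operator would presumably expose the threshold and limits as policy settings rather than hard-coding them, and could raise the limit again once a GPU has cooled down.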
Environment (please provide the following information):
- GPU Operator Version: 25.3.2
- Driver Version: 570.148.08
- CUDA Version: 12.8
- Kubernetes Distro and Version: RH OCP 4.19
Information to attach (optional if deemed irrelevant)
sh-5.1# nvidia-smi
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.148.08 Driver Version: 570.148.08 CUDA Version: 12.8 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 4090 D On | 00000000:4F:00.0 Off | Off |
| 94% 97C P0 583W / 600W | 47404MiB / 49140MiB | 98% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA GeForce RTX 4090 D On | 00000000:F5:00.0 Off | Off |
| 30% 36C P8 18W / 400W | 1MiB / 49140MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 38730 C python 47394MiB |
+-----------------------------------------------------------------------------------------+
sh-5.1# nvidia-smi -pl 400 -i 0
Power limit for GPU 00000000:4F:00.0 was set to 400.00 W from 600.00 W.
All done.
<-- Cool-down period -->
sh-5.1# nvidia-smi
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.148.08 Driver Version: 570.148.08 CUDA Version: 12.8 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 4090 D On | 00000000:4F:00.0 Off | Off |
| 66% 62C P0 393W / 400W | 47404MiB / 49140MiB | 98% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA GeForce RTX 4090 D On | 00000000:F5:00.0 Off | Off |
| 30% 35C P8 18W / 400W | 1MiB / 49140MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 38730 C python 47394MiB |
+-----------------------------------------------------------------------------------------+
sh-5.1#