dcgm-exporter
Extremely high GPU temperature reported by dcgm-exporter
What is the version?
3.3.5-3.4.0
What happened?
After upgrading from 3.1.3-3.1.2 to 3.3.5-3.4.0, the GPU temperature metric DCGM_FI_DEV_GPU_TEMP occasionally reports extremely large values, e.g. 345, 82505, 200000, 644245923. This never happened with version 3.1.3-3.1.2.
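To spot these outliers in a scrape of the exporter's metrics endpoint, a small filter along the following lines may help. This is only a sketch: the 150 °C ceiling, the function name `find_implausible_temps`, and the sample text are assumptions for illustration, not part of dcgm-exporter.

```python
import re

# Assumed plausibility ceiling in degrees Celsius (not taken from the report).
MAX_PLAUSIBLE_TEMP_C = 150.0

# Matches Prometheus text-format samples such as:
#   DCGM_FI_DEV_GPU_TEMP{gpu="0"} 45
SAMPLE_RE = re.compile(r'^DCGM_FI_DEV_GPU_TEMP(\{[^}]*\})?\s+(\S+)')


def find_implausible_temps(metrics_text, limit=MAX_PLAUSIBLE_TEMP_C):
    """Return (labels, value) pairs whose reported temperature exceeds `limit`."""
    outliers = []
    for line in metrics_text.splitlines():
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # skip comments (# HELP / # TYPE) and other metrics
        labels = match.group(1) or ""
        value = float(match.group(2))
        if value > limit:
            outliers.append((labels, value))
    return outliers


# Hypothetical scrape excerpt using two of the values from the report.
sample = """# TYPE DCGM_FI_DEV_GPU_TEMP gauge
DCGM_FI_DEV_GPU_TEMP{gpu="0"} 45
DCGM_FI_DEV_GPU_TEMP{gpu="1"} 82505
DCGM_FI_DEV_GPU_TEMP{gpu="2"} 644245923
"""
print(find_implausible_temps(sample))
```

Running the filter periodically against the scraped text gives a rough idea of how often the bad samples appear and on which GPUs.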
What did you expect to happen?
Correct GPU temperature is returned.
What is the GPU model?
A100, H100
What is the environment?
K8S v1.23, v1.24
How did you deploy the dcgm-exporter and what is the configuration?
Default settings from gpu-operator v23.9.2
How to reproduce the issue?
No response
Anything else we need to know?
No response
@age9990, thank you for reporting the bug. Can you run dcgm-exporter in debug mode and share the logs? Here is how you can do this:
dcgm-exporter --debug true --enable-dcgm-log true --dcgm-log-level DEBUG
@nvvfedorov I updated to 3.3.5-3.4.1 and enabled debug mode and DCGM logging; the issue still occurs.
time="2024-04-14T20:33:16Z" level=info msg="Appended entity i64 eg 1, eid 1, fieldId 150, ts 1713126796036793 , value1 4278190080, value2 0, cached 1, buffered 0 [/workspaces/dcgm-rel_dcgm_3_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:6262] [DcgmCacheManager::AppendEntityInt64]" dcgm_level=DEBUG
time="2024-04-14T20:33:16Z" level=info msg="Preparing to update watchInfo 0x7f9c34005fe0, eg 1, eid 1, fieldId 150 [/workspaces/dcgm-rel_dcgm_3_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:5277] [DcgmCacheManager::ActuallyUpdateAllFields]" dcgm_level=DEBUG
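For anyone reading these debug lines: field ID 150 should correspond to DCGM_FI_DEV_GPU_TEMP in DCGM's field list, and the raw value 4278190080 is 0xFF000000 in hexadecimal, which looks like a bit pattern rather than a temperature. A rough sketch for pulling those numbers out of an `AppendEntityInt64` line is shown below; the regular expression and the helper name `parse_append` are made up for illustration and only cover the message format seen above.

```python
import re

# Field ID for DCGM_FI_DEV_GPU_TEMP (assumed from DCGM's dcgm_fields.h).
GPU_TEMP_FIELD_ID = 150

# Covers only the "Appended entity i64 ..." message layout shown in the log above.
APPEND_RE = re.compile(r'eid (\d+), fieldId (\d+), ts (\d+)\s*, value1 (\d+)')


def parse_append(line):
    """Extract entity id, field id, timestamp (microseconds) and raw value."""
    match = APPEND_RE.search(line)
    if match is None:
        return None
    eid, field_id, ts_us, value = (int(group) for group in match.groups())
    return {"eid": eid, "field_id": field_id, "ts_us": ts_us, "value": value}


line = ('Appended entity i64 eg 1, eid 1, fieldId 150, ts 1713126796036793 , '
        'value1 4278190080, value2 0, cached 1, buffered 0')
record = parse_append(line)
print(record["field_id"] == GPU_TEMP_FIELD_ID)  # True
print(hex(record["value"]))                     # 0xff000000
```

The `ts` field is in microseconds since the Unix epoch: 1713126796 seconds is 2024-04-14T20:33:16Z, matching the log timestamp.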
I've encountered the same problem with dcgm-exporter version 3.3.3-3.3.0, so I upgraded to 3.3.5-3.4.1 and enabled the debug options mentioned above.
Before this, I set up my own temperature logger (a systemd service) using nvidia-smi: while true; do nvidia-smi -q -d temperature; done > /var/log/templog.log. Comparing its output with the values from dcgm-exporter, I found that nvidia-smi reported normal temperatures while dcgm-exporter reported different values. Once I have logs from dcgm-exporter, I will paste them here.
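A variant of that logger which writes one timestamped line per GPU can make the comparison against exporter samples easier. The sketch below is an assumption about how one might do it, not what was actually run: it uses nvidia-smi's CSV query mode, the helper names `parse_temps` and `log_temps` are hypothetical, and it assumes every GPU reports a numeric temperature (an "N/A" reading would raise an error).

```python
import subprocess
import time
from datetime import datetime, timezone

# nvidia-smi CSV query: one "index, temperature" row per GPU, no header or units.
QUERY = ["nvidia-smi", "--query-gpu=index,temperature.gpu",
         "--format=csv,noheader,nounits"]


def parse_temps(csv_text):
    """Turn 'index, temp' CSV rows into {gpu_index: temp_celsius}."""
    temps = {}
    for row in csv_text.strip().splitlines():
        index, temp = (cell.strip() for cell in row.split(","))
        temps[int(index)] = int(temp)
    return temps


def log_temps(path, interval_s=1.0):
    """Append one timestamped line per GPU to `path` until interrupted."""
    with open(path, "a") as log:
        while True:
            result = subprocess.run(QUERY, capture_output=True,
                                    text=True, check=True)
            now = datetime.now(timezone.utc).isoformat()
            for gpu, temp in sorted(parse_temps(result.stdout).items()):
                log.write(f"{now} gpu={gpu} temp_c={temp}\n")
            log.flush()
            time.sleep(interval_s)
```

Calling `log_temps("/var/log/templog.log")` on a GPU node would then produce lines that can be matched by time against the DCGM_FI_DEV_GPU_TEMP samples.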
Tested on K8s: v1.29.0
Tested 3.3.0-3.2.0 and 3.1.6-3.1.3: same problem. Rolled back to 3.1.3-3.1.2; I hope this issue can be fixed soon.
The defect is on the DCGM side (https://github.com/NVIDIA/dcgm/); we're waiting for a new DCGM release.
@nvvfedorov The new DCGM 3.3.6 release is out and fixes this issue. I hope a new version of DCGM Exporter will be released ASAP. Much appreciated.
@age9990, the new version of dcgm-exporter has been released: 3.3.6-3.4.2