Extremely high GPU temperature reported by dcgm-exporter

Open age9990 opened this issue 1 year ago • 5 comments

What is the version?

3.3.5-3.4.0

What happened?

After upgrading from 3.1.3-3.1.2 to 3.3.5-3.4.0, the GPU temperature metric DCGM_FI_DEV_GPU_TEMP occasionally reports extremely large values, e.g. 345, 82505, 200000, 644245923. This never happened with version 3.1.3-3.1.2.

What did you expect to happen?

The correct GPU temperature is returned.

What is the GPU model?

A100, H100

What is the environment?

K8S v1.23, v1.24

How did you deploy the dcgm-exporter and what is the configuration?

Default settings from gpu-operator v23.9.2

How to reproduce the issue?

No response

Anything else we need to know?

No response

age9990 avatar Apr 11 '24 06:04 age9990

@age9990, thank you for reporting the bug. Can you run dcgm-exporter in debug mode and share the logs? Here is how you can do this:

dcgm-exporter --debug true --enable-dcgm-log true --dcgm-log-level DEBUG

nvvfedorov avatar Apr 11 '24 14:04 nvvfedorov

@nvvfedorov I updated to 3.3.5-3.4.1 and enabled debug mode and DCGM logging. The issue still happens.

time="2024-04-14T20:33:16Z" level=info msg="Appended entity i64 eg 1, eid 1, fieldId 150, ts 1713126796036793 , value1 4278190080, value2 0, cached 1, buffered 0 [/workspaces/dcgm-rel_dcgm_3_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:6262] [DcgmCacheManager::AppendEntityInt64]" dcgm_level=DEBUG
time="2024-04-14T20:33:16Z" level=info msg="Preparing to update watchInfo 0x7f9c34005fe0, eg 1, eid 1, fieldId 150 [/workspaces/dcgm-rel_dcgm_3_3-postmerge/dcgmlib/src/DcgmCacheManager.cpp:5277] [DcgmCacheManager::ActuallyUpdateAllFields]" dcgm_level=DEBUG
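
For anyone digging through a large debug log, a small script along these lines might help find the bad samples. It is only a sketch: it assumes that fieldId 150 is DCGM_FI_DEV_GPU_TEMP and that anything outside 0 to 150 °C is implausible (both are assumptions, as is the helper name `find_suspect_temps`). The regex is based solely on the "Appended entity i64" line format shown above.

```python
import re

# Matches the "Appended entity i64" debug line format seen in this thread.
APPEND_RE = re.compile(
    r"Appended entity i64 eg (?P<eg>\d+), eid (?P<eid>\d+), "
    r"fieldId (?P<field>\d+), ts (?P<ts>\d+)\s*, value1 (?P<value>-?\d+)"
)

GPU_TEMP_FIELD_ID = 150     # assumption: field ID of DCGM_FI_DEV_GPU_TEMP
MAX_PLAUSIBLE_TEMP_C = 150  # assumption: arbitrary sanity threshold in Celsius


def find_suspect_temps(log_text):
    """Return (entity_id, timestamp, value) for implausible temperature samples."""
    suspects = []
    for m in APPEND_RE.finditer(log_text):
        if int(m["field"]) != GPU_TEMP_FIELD_ID:
            continue  # not a GPU temperature sample
        value = int(m["value"])
        if not 0 <= value <= MAX_PLAUSIBLE_TEMP_C:
            suspects.append((int(m["eid"]), int(m["ts"]), value))
    return suspects


# Sample taken from the log excerpt above (source path and level fields omitted).
sample = ('msg="Appended entity i64 eg 1, eid 1, fieldId 150, '
          'ts 1713126796036793 , value1 4278190080, value2 0, cached 1, buffered 0"')

for eid, ts, value in find_suspect_temps(sample):
    print(f"entity {eid} at ts {ts}: value {value} (hex {value:#x})")
```

Printing the value in hex shows that the sample above is 0xff000000, which looks more like a corrupted or uninitialized integer than a real temperature reading.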

age9990 avatar Apr 15 '24 01:04 age9990

I've encountered the same problem with dcgm-exporter version 3.3.3-3.3.0, so I upgraded to 3.3.5-3.4.1 and enabled the debug options you mentioned above...

Before this, I set up my own temperature logger (a systemd service) based on nvidia-smi: while true; do nvidia-smi -q -d temperature; done > /var/log/templog.log. When I compared its output with the values from dcgm-exporter, the temperature reported by nvidia-smi was normal, while dcgm-exporter reported different values. I will paste the dcgm-exporter log here once I have it.
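
The exporter side of this comparison could be checked with something like the sketch below, which scans text scraped from the exporter's metrics endpoint for DCGM_FI_DEV_GPU_TEMP samples above a threshold. The sample text, its label set, the 150 °C threshold, and the function name `implausible_temps` are all made up for illustration; only the metric name and the value 644245923 come from this thread.

```python
import re

# One DCGM_FI_DEV_GPU_TEMP sample line in Prometheus text format:
#   DCGM_FI_DEV_GPU_TEMP{label="...",...} <value>
METRIC_RE = re.compile(
    r'^DCGM_FI_DEV_GPU_TEMP\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)',
    re.MULTILINE,
)
GPU_LABEL_RE = re.compile(r'gpu="(?P<gpu>[^"]*)"')


def implausible_temps(metrics_text, max_temp_c=150.0):
    """Return (gpu_label, value) pairs whose temperature is outside 0..max_temp_c."""
    bad = []
    for m in METRIC_RE.finditer(metrics_text):
        value = float(m["value"])
        gpu = GPU_LABEL_RE.search(m["labels"])
        gpu_id = gpu["gpu"] if gpu else "unknown"
        if not 0.0 <= value <= max_temp_c:
            bad.append((gpu_id, value))
    return bad


# Hypothetical scrape output; real output carries more labels.
scrape = """\
# HELP DCGM_FI_DEV_GPU_TEMP GPU temperature (in C).
# TYPE DCGM_FI_DEV_GPU_TEMP gauge
DCGM_FI_DEV_GPU_TEMP{gpu="0",device="nvidia0"} 41
DCGM_FI_DEV_GPU_TEMP{gpu="1",device="nvidia1"} 644245923
"""

for gpu_id, value in implausible_temps(scrape):
    print(f"gpu {gpu_id}: suspicious temperature {value}")
```

Running a check like this periodically alongside the nvidia-smi log would make it easier to line up the timestamps of bad samples with the values nvidia-smi reported at the same moment.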

Tested on K8s: v1.29.0

jz543fm avatar Apr 23 '24 11:04 jz543fm

Tested 3.3.0-3.2.0 and 3.1.6-3.1.3: same problem. I rolled back to 3.1.3-3.1.2. I hope this issue can be fixed soon.

age9990 avatar Apr 24 '24 09:04 age9990

The defect is on the DCGM (https://github.com/NVIDIA/dcgm/) side. We're waiting for a new DCGM release.

nvvfedorov avatar Apr 24 '24 15:04 nvvfedorov

@nvvfedorov The new DCGM 3.3.6 release is out and fixes this issue. I hope a new version of DCGM Exporter will be released ASAP. Much appreciated.

age9990 avatar May 20 '24 08:05 age9990

@age9990, the new version of dcgm-exporter has been released: 3.3.6-3.4.2

nvvfedorov avatar May 20 '24 20:05 nvvfedorov