gpu-operator DCGM-Exporter msg="Could not retrieve ConfigMap ..."

I am running a cluster with a number of NVIDIA GPUs, and I monitor them with dcgm-exporter. However, the dcgm-exporter sometimes fails to provide metrics and produces the logs below.

time="2022-08-19T07:25:51Z" level=info msg="Starting dcgm-exporter"
time="2022-08-19T07:25:51Z" level=info msg="DCGM successfully initialized!"
time="2022-08-19T07:25:51Z" level=info msg="Not collecting DCP metrics: Error getting supported metrics: This request is serviced by a module of DCGM that is not currently loaded"
time="2022-08-19T07:26:21Z" level=info msg="Could not retrieve ConfigMap 'gpu-monitor:exporter-metrics-config-map': Get \"https://{ip}/api/v1/namespaces/gpu-monitor/configmaps/exporter-metrics-config-map\": dial tcp {ip}: i/o timeout, falling back to metric file /etc/dcgm-exporter/default-counters.csv"
time="2022-08-19T07:26:21Z" level=info msg="Kubernetes metrics collection enabled!"
time="2022-08-19T07:26:21Z" level=info msg="Pipeline starting"
time="2022-08-19T07:26:21Z" level=info msg="Starting webserver"

I would expect the Pod to restart if the exporter cannot retrieve the ConfigMap, but it does not. (Or at least the Pod should be marked as not ready.) I would appreciate it if you could look into this and give me feedback or fix the issue.

Other normal dcgm-exporters have the following logs.

time="2022-08-05T00:13:49Z" level=info msg="Starting dcgm-exporter"
time="2022-08-05T00:13:49Z" level=info msg="DCGM successfully initialized!"
time="2022-08-05T00:13:49Z" level=info msg="Not collecting DCP metrics: Error getting supported metrics: This request is serviced by a module of DCGM that is not currently loaded"
time="2022-08-05T00:13:49Z" level=warning msg="Skipping line 55 ('DCGM_FI_PROF_GR_ENGINE_ACTIVE'): DCP metrics not enabled"
time="2022-08-05T00:13:49Z" level=warning msg="Skipping line 58 ('DCGM_FI_PROF_PIPE_TENSOR_ACTIVE'): DCP metrics not enabled"
time="2022-08-05T00:13:49Z" level=warning msg="Skipping line 59 ('DCGM_FI_PROF_DRAM_ACTIVE'): DCP metrics not enabled"
time="2022-08-05T00:13:49Z" level=warning msg="Skipping line 63 ('DCGM_FI_PROF_PCIE_TX_BYTES'): DCP metrics not enabled"
time="2022-08-05T00:13:49Z" level=warning msg="Skipping line 64 ('DCGM_FI_PROF_PCIE_RX_BYTES'): DCP metrics not enabled"
time="2022-08-05T00:13:49Z" level=info msg="Kubernetes metrics collection enabled!"
time="2022-08-05T00:13:49Z" level=info msg="Pipeline starting"
time="2022-08-05T00:13:49Z" level=info msg="Starting webserver"

Aug 30 '22 01:08 devnjw

@devnjw What environment variables are you passing to dcgm-exporter? Are you trying to pass the ConfigMap name using the DCGM_EXPORTER_CONFIGMAP_DATA env? For custom metrics you can create a ConfigMap and deploy it as described here: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/getting-started.html#custom-metrics-config.

Sep 07 '22 01:09 shivamerla

@shivamerla Thank you for your reply, but I don't need a custom exporter. I am asking whether it would be more appropriate for the exporter to panic() when an error such as the one in the first log above occurs.

Sep 07 '22 16:09 devnjw

Got it. Yes, I will relay this to the DCGM exporter team. When it's configured to run with a custom ConfigMap and that ConfigMap is not found, the exporter should error out.

Sep 07 '22 18:09 shivamerla

@glowkey @dualvtable Please take a look at this.

Oct 20 '22 20:10 shivamerla

Tracking here: https://github.com/NVIDIA/dcgm-exporter/issues/111

Oct 21 '22 15:10 glowkey

@devnjw this should be fixed in newer versions of dcgm-exporter. Closing. Please re-open if you are still experiencing this issue.

Jan 31 '24 00:01 cdesiniotis
