DCGM Telegraf Issues v2.3.4-1
Hello, we're seeing an issue with dcgmd-telegraf.service on this release of DCGM with RHEL 8.4 EUS. The service starts but crashes right around the 1 minute mark with the following:
systemd[1]: Started DCGM Telegraf service.
python3[116631]: Traceback (most recent call last):
python3[116631]: File "/usr/local/dcgm/bindings/python3/dcgm_telegraf.py", line 63, in <module>
python3[116631]: main(DcgmTelegraf, TELEGRAF_NAME, DEFAULT_TELEGRAF_PORT, add_target_host=True)
python3[116631]: File "/usr/local/dcgm/bindings/python3/common/dcgm_client_main.py", line 81, in main
python3[116631]: dr.Process()
python3[116631]: File "/usr/local/dcgm/bindings/python3/DcgmReader.py", line 443, in Process
python3[116631]: self.dfvc = self.m_dcgmGroup.samples.GetAllSinceLastCall(self.dfvc, self.m_fieldGroup)
python3[116631]: File "/usr/local/dcgm/bindings/python3/DcgmGroup.py", line 162, in GetAllSinceLastCall
python3[116631]: if dfvc.values.len() == 0:
python3[116631]: AttributeError: 'dict' object has no attribute 'len'
systemd[1]: dcgmd-telegraf.service: Main process exited, code=exited, status=1/FAILURE
systemd[1]: dcgmd-telegraf.service: Failed with result 'exit-code'.
Using Prometheus isn't an option for us due to some operational security requirements, and all other monitoring of our facility uses Telegraf/InfluxDB/Grafana. We also aren't running K8s on these machines; they are in an HPC cluster. Any guidance on how to resolve this would be appreciated.
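
Reading the traceback, the failure appears to come from calling `.len()` as a method on a plain Python dict (`dfvc.values.len()`), which dicts do not provide. The snippet below is only a minimal reproduction, not the actual DCGM source: `FakeFieldValueCollection`, `is_empty_broken`, and `is_empty_fixed` are hypothetical names, and the only detail taken from the log is that `dfvc.values` is a dict. It assumes the intended check was the built-in `len()`.

```python
# Hypothetical stand-in for the object seen in the traceback; the only
# detail taken from the log is that `values` is a dict.
class FakeFieldValueCollection:
    def __init__(self):
        self.values = {}


def is_empty_broken(dfvc):
    # Mirrors the failing expression from the traceback:
    # dict objects have no .len() method, so this raises AttributeError.
    return dfvc.values.len() == 0


def is_empty_fixed(dfvc):
    # Assumed correction: use the built-in len() on the dict.
    return len(dfvc.values) == 0


dfvc = FakeFieldValueCollection()

error_message = None
try:
    is_empty_broken(dfvc)
except AttributeError as exc:
    error_message = str(exc)

print(error_message)
print(is_empty_fixed(dfvc))
```

If the same pattern is what line 162 of `/usr/local/dcgm/bindings/python3/DcgmGroup.py` contains, changing it to `len(dfvc.values) == 0` might work around the crash, but that is an untested assumption about the shipped file and would be overwritten by a package update.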