dcgm-exporter
pkg/dcgmexporter/gpu_collector.go: include an err_msg label in metric DCGM_FI_DEV_XID_ERRORS
The DCGM_FI_DEV_XID_ERRORS metric reports the XID error code as its value. This commit adds an err_msg label whose value is taken from this NVIDIA doc: https://docs.nvidia.com/deploy/xid-errors/#topic_4
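To illustrate the idea, here is a minimal sketch of a code-to-message lookup. It is not the code from this PR: the names `xidErrMsgs` and `xidErrMsg`, the `"Unknown Error"` fallback, and the handful of entries are assumptions made for the example, and the descriptions should be checked against the NVIDIA doc linked above.

```go
package main

import "fmt"

// xidErrMsgs is a hypothetical, partial lookup table from XID code to a
// short description. The entries are illustrative; the NVIDIA XID
// documentation is the authoritative source.
var xidErrMsgs = map[uint64]string{
	13: "Graphics Engine Exception",
	31: "GPU memory page fault",
	48: "Double Bit ECC Error",
	74: "NVLINK Error",
	79: "GPU has fallen off the bus",
}

// xidErrMsg returns the description for an XID code, or a fallback
// string (an assumption of this sketch) when the code is not in the table.
func xidErrMsg(code uint64) string {
	if msg, ok := xidErrMsgs[code]; ok {
		return msg
	}
	return "Unknown Error"
}

func main() {
	// Print the err_msg label value that would accompany each code.
	for _, code := range []uint64{31, 79, 9999} {
		fmt.Printf("xid=%d err_msg=%q\n", code, xidErrMsg(code))
	}
}
```

Keeping the table in one map makes it straightforward to extend when new XID codes are documented.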
@bom-d-van , Thank you for your contribution. Can you describe your use case to justify the change?
Hi @nvvfedorov , this is to make it easy to generate alarm messages using the metric.
For example, we could write a query like this to generate an alarm and use the err_msg in the template to make the error message easy to read.
max(DCGM_FI_DEV_XID_ERRORS{err_code=~"4|8|9|12|13|24|30|31|37|38|43|48|54|74|119|140|143"}) by (Hostname, DCGM_FI_DRIVER_VERSION, device, gpu, modelName, err_msg)
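As a rough sketch of how the labels returned by a query like this could be turned into a readable alarm message: Prometheus alert annotations use Go templating, so the example below uses plain `text/template` to show the idea outside Prometheus. The template wording, the `renderAlarm` helper, and the sample label values are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

// alarmTmpl is a hypothetical message template that reads the labels
// kept by the "by (...)" clause of the query.
var alarmTmpl = template.Must(template.New("alarm").Parse(
	"GPU {{.gpu}} on {{.Hostname}} reported XID {{.err_code}}: {{.err_msg}}"))

// renderAlarm fills the template from a label set and returns the message.
func renderAlarm(labels map[string]string) (string, error) {
	var sb strings.Builder
	if err := alarmTmpl.Execute(&sb, labels); err != nil {
		return "", err
	}
	return sb.String(), nil
}

func main() {
	// Sample label values, invented for this example.
	msg, err := renderAlarm(map[string]string{
		"Hostname": "node-1",
		"gpu":      "0",
		"err_code": "79",
		"err_msg":  "GPU has fallen off the bus",
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(msg)
}
```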
@bom-d-van , Please sign your commits and squash your commits into a single one. Then I will be ready to merge changes.
@nvvfedorov This should be done now. Could you take another look? Thanks.