
pkg/dcgmexporter/gpu_collector.go: include an err_msg label in metric DCGM_FI_DEV_XID_ERRORS

Open bom-d-van opened this issue 10 months ago • 4 comments

The DCGM_FI_DEV_XID_ERRORS metric reports the XID error code as its value. This commit adds an err_msg label whose value is taken from this NVIDIA doc: https://docs.nvidia.com/deploy/xid-errors/#topic_4
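
A minimal sketch of the idea, not the code in this pull request: a lookup from XID code to a short description that could fill the err_msg label. The names `xidMessages` and `xidErrMsg` are hypothetical, the table is deliberately partial with descriptions paraphrased from the NVIDIA doc linked above, and treating 0 as "no error" and falling back to "Unknown Error" are assumptions.

```go
package main

import "fmt"

// xidMessages is a hypothetical, partial lookup table. The descriptions are
// paraphrased from the NVIDIA XID document; a real table would cover more codes.
var xidMessages = map[int64]string{
	13: "Graphics Engine Exception",
	31: "GPU memory page fault",
	48: "Double Bit ECC Error",
	79: "GPU has fallen off the bus",
}

// xidErrMsg returns the text to place in the err_msg label for an XID code.
func xidErrMsg(code int64) string {
	if code == 0 {
		// Assumption: a metric value of 0 means no XID error was recorded.
		return ""
	}
	if msg, ok := xidMessages[code]; ok {
		return msg
	}
	// Assumption: unknown codes get a generic fallback instead of an empty label.
	return "Unknown Error"
}

func main() {
	for _, code := range []int64{0, 31, 999} {
		fmt.Printf("xid=%d err_msg=%q\n", code, xidErrMsg(code))
	}
}
```

Keeping the table in one map makes it easy to review against the NVIDIA doc when new XID codes are added.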

bom-d-van avatar Apr 08 '24 02:04 bom-d-van

@bom-d-van , Thank you for your contribution. Can you describe your use case to justify the change?

nvvfedorov avatar Apr 15 '24 14:04 nvvfedorov

Hi @nvvfedorov , this is to make it easy to generate alarm messages using the metric.

For example, we could write a query like this to generate an alarm and use the err_msg in the template to make the error message easy to read.

max(DCGM_FI_DEV_XID_ERRORS{err_code=~"4|8|9|12|13|24|30|31|37|38|43|48|54|74|119|140|143"}) by (Hostname, DCGM_FI_DRIVER_VERSION, device, gpu, modelName, err_msg)
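
As a rough illustration of the templating step, the following sketch renders a readable alarm line from the labels kept by the query above, using Go's standard `text/template` package. The template text, the `renderAlarm` helper, and the sample label values are all hypothetical; a real alerting setup would supply its own template and label set.

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// alarmTemplate is a hypothetical message format built from query labels.
const alarmTemplate = `GPU {{.gpu}} on {{.Hostname}} reported XID error: {{.err_msg}}`

// renderAlarm fills the template with the label values of one alerting series.
func renderAlarm(labels map[string]string) (string, error) {
	tmpl, err := template.New("alarm").Parse(alarmTemplate)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	if err := tmpl.Execute(&buf, labels); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	// Sample label values, made up for illustration.
	msg, err := renderAlarm(map[string]string{
		"Hostname": "node-1",
		"gpu":      "0",
		"err_msg":  "GPU memory page fault",
	})
	if err != nil {
		fmt.Println("render failed:", err)
		return
	}
	fmt.Println(msg)
}
```

Without the err_msg label, the same template could only show the numeric code, and the reader would have to look up its meaning separately.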

bom-d-van avatar Apr 16 '24 10:04 bom-d-van

@bom-d-van, please sign your commits and squash them into a single one. Then I will be ready to merge the changes.

nvvfedorov avatar Apr 29 '24 15:04 nvvfedorov

@nvvfedorov This should be done now. Could you take another look? Thanks.

bom-d-van avatar May 02 '24 03:05 bom-d-van