node_exporter icon indicating copy to clipboard operation
node_exporter copied to clipboard

Add hw_counters metrics for infiniband device.

Open dongjiang1989 opened this issue 2 years ago • 16 comments

Add hw_counters metrics for infiniband device.

ref: https://github.com/prometheus/procfs/pull/549

dongjiang1989 avatar Oct 17 '23 09:10 dongjiang1989

Beside these, maybe consider combining metrics that can be summed up (like multiple types of errors etc) into one with different labels. See https://prometheus.io/docs/practices/naming/

discordianfish avatar Oct 19 '23 09:10 discordianfish

Hi, how is progress on this? We would love to have these counters :)

achow avatar Jan 29 '24 22:01 achow

thanks @discordianfish. Please re-check

dongjiang1989 avatar Feb 02 '24 09:02 dongjiang1989

This would be really useful for us, any update?

jgrund avatar Feb 20 '24 16:02 jgrund

@SuperQ Please recheck it, thanks.

dongjiang1989 avatar Feb 21 '24 16:02 dongjiang1989

Interested in this PR too.

LouisDDN avatar Feb 23 '24 14:02 LouisDDN

Ping again

jgrund avatar Mar 14 '24 15:03 jgrund

The general idea also applies to the other metrics. We need to carefully think about the names since changing them in the future is a breaking change that will need to be enabled by adding a flag etc etc. So sorry if that feels pedentic but we need to make sure we get these names right.

Also, see my other comment here: https://github.com/prometheus/node_exporter/pull/2827#issuecomment-1770448902

PS: Sorry for not being responsive. I don't get paid to work on this so my availability is limited :-/

discordianfish avatar Mar 21 '24 11:03 discordianfish

The general idea also applies to the other metrics. We need to carefully think about the names since changing them in the future is a breaking change that will need to be enabled by adding a flag etc etc. So sorry if that feels pedentic but we need to make sure we get these names right.

Also, see my other comment here: #2827 (comment)

PS: Sorry for not being responsive. I don't get paid to work on this so my availability is limited :-/

Thank you for your review. I will fix these issues as soon as possible. 🙏

dongjiang1989 avatar Mar 21 '24 16:03 dongjiang1989

@discordianfish Please re-check, thanks

https://github.com/prometheus/node_exporter/pull/2827/commits/8b5605040ae8f57566211fd7e557dc12363f5965

dongjiang1989 avatar Mar 22 '24 01:03 dongjiang1989

Better, thanks! Still, can you consider combining some of these metrics using the same name and different labels?

discordianfish avatar Mar 22 '24 11:03 discordianfish

Better, thanks! Still, can you consider combining some of these metrics using the same name and different labels?

Thanks for your review. Do you have any suggestions to combine some of these indicators using the same name and different labels? @discordianfish

dongjiang1989 avatar Mar 26 '24 09:03 dongjiang1989

@dongjiang1989 That depends on the metrics, I don't really have the resources to go through all these and suggest something. See the linked best practices, particular this quote:

As a rule of thumb, either the sum() or the avg() over all dimensions of a given metric should be meaningful (though not necessarily useful). If it is not meaningful, split the data up into multiple metrics. For example, having the capacity of various queues in one metric is good, while mixing the capacity of a queue with the current number of elements in the queue is not.

One candidate might be the error metrics where you could have one and distinguish between them via some sort of error type label etc

discordianfish avatar Apr 10 '24 10:04 discordianfish

The hw_counters is still important metrics, Hope to continue updating!

xuchenhui-5 avatar Jun 04 '24 11:06 xuchenhui-5