docs: Add best practices for metrics
What this PR does / why we need it: This PR introduces a document on best practices for metrics. Hoping to get comments and agreement on these ideas. @rexagod @dgrisonnet @dashpole @logicalhan @CatherineF-dev
How does this change affect the cardinality of KSM: (increases, decreases or does not change cardinality)
Does not change cardinality (documentation only).
Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #
LGTM for Stability
I'm also suggesting that we move most labels into a single metric:
metric_with_lots_of_labels{uid="identifier", a1="a1", a2="a2", ..., an="an"} 1
instead of
metric_for_a1{uid="identifier", a1="a1"} 1
metric_for_a2{uid="identifier", a2="a2"} 1
...
metric_for_an{uid="identifier", an="an"} 1
unless there are good reasons why the latter is better for the TSDB, querying, etc.
I believe field-specific metrics may help (a) limit cardinality (ref)? In the original issue there was talk about including only tightly bounded labelsets in the `_info` metrics; however, that may lead to separate metrics that fall into the "gray area" and (b) be ambiguous to the user as to whether a field lives in an `_info` metric or not, since there is no feasible way to guess its inclusion or exclusion without knowing the field's possible values beforehand. Not to mention, (c) static values may become dynamic in the future and cause breaking changes when moved into their own metrics (referring to the metric stability spec for BETA metrics).
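For context, a label is only usable at query time if it is present on some metric you can join against, and the `_info` pattern is typically consumed via a `group_left` join. A minimal PromQL sketch (the usage metric and its `namespace`/`pod` labels are assumptions about the scrape setup; `kube_pod_info` is the standard kube-state-metrics info metric):

```promql
# Enrich per-pod CPU usage with the `node` label carried by kube_pod_info.
sum by (namespace, pod) (rate(container_cpu_usage_seconds_total[5m]))
  * on (namespace, pod) group_left (node)
    kube_pod_info
```

Fields left out of the `_info` metric cannot be pulled in this way, which is why the boundary matters to users.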
This is why you need to look at the lifecycle of the object: if it's not lifetime-scoped, it's dynamic. I'd rather have a few metric series with more labels than a bunch of separate metrics, to avoid bad querying UX where you need several `group_left` joins just to get to the result you want to see.
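To make the querying difference concrete, here is a rough PromQL sketch using the hypothetical metric names from above (assuming `uid` uniquely identifies the object across all of them):

```promql
# Single metric carrying all labels: one selector, no joins needed.
metric_with_lots_of_labels{a1="foo", a2="bar"}

# Per-field metrics: every additional field needs its own group_left join on uid.
  metric_for_a1{a1="foo"}
* on (uid) group_left (a2) metric_for_a2
* on (uid) group_left (a3) metric_for_a3
```

Each extra join is another place where a query can silently drop series when the label sets drift.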
/triage accepted
/assign @rexagod
@dgrisonnet @CatherineF-dev @rexagod I think this is in a good state now. :)
/assign @dgrisonnet
For visibility.
/assign
/lgtm
Talked this through with Damien offline, no blockers.
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: brancz, mrueg, rexagod, SuperQ
The full list of commands accepted by this bot can be found here.
The pull request process is described here
- ~~OWNERS~~ [mrueg,rexagod]
Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment