docs: Add best practices for metrics
What this PR does / why we need it: This PR introduces a document on best practices for metrics. Hoping to get comments and agreement on these ideas. @rexagod @dgrisonnet @dashpole @logicalhan @CatherineF-dev
How does this change affect the cardinality of KSM: (increases, decreases or does not change cardinality)
Does not change cardinality (documentation only).
Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #
LGTM for Stability
I'm also suggesting that we move most labels into a single metric:
metric_with_lots_of_labels{uid="identifier", a1="a1", a2="a2", ..., an="an"} 1
instead of
metric_for_a1{uid="identifier", a1="a1"} 1
metric_for_a2{uid="identifier", a2="a2"} 1
...
metric_for_an{uid="identifier", an="an"} 1
unless there are good reasons why the latter is better for the TSDB, querying, etc.
I believe field-specific metrics may help (a) limit cardinality (ref)? In the original issue there was talk about including only tightly bounded labelsets in the `_info` metrics; however, that may lead to separate metrics that fall into the "gray area" and (b) be ambiguous to the user as to whether a field lives in an `_info` metric or not, since there is no feasible way to guess its inclusion or exclusion without knowing the field's possible values beforehand. Not to mention, (c) static values may become dynamic in the future and cause breaking changes when moved into their own metrics (referring to the metric stability spec for BETA metrics).
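For context, a label is only usable at query time if it is present on some metric you can join against, and the `_info` pattern is typically consumed via a `group_left` join. A minimal PromQL sketch (the usage metric and its `namespace`/`pod` labels are assumptions about the scrape setup; `kube_pod_info` is the standard kube-state-metrics info metric):

```promql
# Enrich per-pod CPU usage with the `node` label carried by kube_pod_info.
sum by (namespace, pod) (rate(container_cpu_usage_seconds_total[5m]))
  * on (namespace, pod) group_left (node)
    kube_pod_info
```

Fields left out of the `_info` metric cannot be pulled in this way, which is why the boundary matters to users.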
This is why you need to look at the lifecycle of the object: if it's not lifetime-scoped, it's dynamic. I'd rather have a few metric series with more labels than a bunch of separate metrics, to avoid bad querying UX where you need several `group_left` joins just to get to the result you want to see.
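To make the querying difference concrete, here is a rough PromQL sketch using the hypothetical metric names from above (assuming `uid` uniquely identifies the object across all of them):

```promql
# Single metric carrying all labels: one selector, no joins needed.
metric_with_lots_of_labels{a1="foo", a2="bar"}

# Per-field metrics: every additional field needs its own group_left join on uid.
  metric_for_a1{a1="foo"}
* on (uid) group_left (a2) metric_for_a2
* on (uid) group_left (a3) metric_for_a3
```

Each extra join is another place where a query can silently drop series when the label sets drift.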
/triage accepted
/assign @rexagod
@dgrisonnet @CatherineF-dev @rexagod I think this is in a good state now. :)
/assign @dgrisonnet
For visibility.
/assign
/lgtm
Talked this through with Damien offline, no blockers.
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: brancz, mrueg, rexagod, SuperQ
The full list of commands accepted by this bot can be found here.
The pull request process is described here
- ~~OWNERS~~ [mrueg,rexagod]
Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment