
Define debug metrics group

Open · lambdanis opened this issue 1 year ago · 1 comment

Separate the metrics that monitor Tetragon health (used by operators) from the metrics that expose details useful for debugging (used mainly by Tetragon developers, and potentially high-cardinality). The idea is to disable the latter by default, reducing the default metrics cardinality and performance overhead.

See Tetragon metrics framework for more context.

  • [ ] Define debug metrics group (unconstrained). See how health metrics group is defined: https://github.com/cilium/tetragon/blob/main/pkg/metricsconfig/healthmetrics.go
  • [ ] Identify debug metrics within the health group and move them into the debug group. This would probably include:
    • metrics documented as "for internal use only"
    • metrics with unconstrained cardinality, e.g. those with a "kprobe" label
    • any other metrics intended for Tetragon developers rather than operators
  • [ ] Move debug metrics to a separate endpoint (breaking change)
  • [ ] Disable debug metrics by default (breaking change)
  • [ ] Adjust how metrics docs are generated
  • [ ] Remove the "For internal use only" annotation from the metrics help texts. Membership in the debug group is what indicates that a metric is considered "internal".

After this is done, the health metrics group should be marked as constrained.
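A rough sketch of the intended end state, using only the Go standard library. It is not Tetragon's actual metrics API: the `Group` and `Config` types, the `EnableDebugMetrics` option, the `/debug/metrics` path and the `example_health_metric` name are all hypothetical placeholders. The only name taken from this issue is `tetragon_bpf_missed_events_total`.

```go
package main

import (
	"fmt"
	"sort"
)

// Group is a hypothetical stand-in for a metrics group: a named set of
// metrics plus a flag saying whether its cardinality is bounded.
type Group struct {
	Name        string
	Constrained bool
	Metrics     []string
}

// Config is a hypothetical option set. The zero value leaves debug
// metrics disabled, which models "disabled by default".
type Config struct {
	EnableDebugMetrics bool
}

// Endpoint paths are assumptions made for this sketch.
const (
	healthPath = "/metrics"
	debugPath  = "/debug/metrics"
)

// healthGroup holds operator-facing metrics and is marked constrained.
func healthGroup() Group {
	return Group{
		Name:        "health",
		Constrained: true,
		Metrics:     []string{"example_health_metric"}, // placeholder name
	}
}

// debugGroup holds developer-facing metrics and is left unconstrained.
func debugGroup() Group {
	return Group{
		Name:        "debug",
		Constrained: false,
		Metrics:     []string{"tetragon_bpf_missed_events_total"},
	}
}

// Endpoints maps each served path to the metric names exposed there.
// The debug path is present only when explicitly enabled.
func Endpoints(cfg Config) map[string][]string {
	out := map[string][]string{healthPath: healthGroup().Metrics}
	if cfg.EnableDebugMetrics {
		out[debugPath] = debugGroup().Metrics
	}
	return out
}

func main() {
	for _, cfg := range []Config{{}, {EnableDebugMetrics: true}} {
		eps := Endpoints(cfg)
		paths := make([]string, 0, len(eps))
		for p := range eps {
			paths = append(paths, p)
		}
		sort.Strings(paths) // stable order for printing
		fmt.Printf("debug enabled: %v\n", cfg.EnableDebugMetrics)
		for _, p := range paths {
			fmt.Printf("  %s -> %v\n", p, eps[p])
		}
	}
}
```

With the default `Config{}`, only the health path is returned. Both breaking changes in the checklist (separate endpoint, off by default) would show up to operators as the debug metrics disappearing from the existing endpoint.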

lambdanis avatar Aug 08 '24 00:08 lambdanis

Identified debug metrics

(not a complete list)

  • tetragon_bpf_missed_events_total

ghost avatar Aug 24 '24 10:08 ghost