Fix monitoring BPF maps

Open lambdanis opened this issue 8 months ago • 3 comments

Tetragon exposes metrics about BPF maps, defined in the mapmetrics package. There are a few issues with them:

  • [ ] Map capacity is exposed as the total label on the tetragon_map_in_use_gauge metric (and for eventcache it's set to "0"). This is against OpenMetrics conventions and makes it difficult (or impossible) to calculate map utilization. Map capacity should be exposed as a separate metric. Alternatively, Tetragon could calculate the map utilization itself, but exposing the map capacity seems better.
  • [ ] The tetragon_map_in_use_gauge metric name is not very intuitive IMO and the gauge suffix is unnecessary.
  • [ ] processLru and eventcache are not BPF maps; they're user-space caches, but info about them is exposed via mapmetrics too. This is confusing; they should be exposed as separate metrics.
  • [ ] The tetragon_map_errors_total metric seems to have an incorrect help text.
  • [ ] The tetragon_map_drops_total metric is incorrect. We get this from this callback. However, the callback is not called only on evictions; it is also called when we remove elements normally, as we do here. Evictions are already counted by the tetragon_errors_total{type="process_cache_evicted"} metric (defined here), so tetragon_map_drops_total can simply be removed.
  • [ ] Metrics are not exposed for all maps: https://github.com/cilium/tetragon/issues/1775
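
To illustrate the first item, here is a minimal sketch of what exposing capacity as its own metric could look like. It uses only the Go standard library and writes samples in the Prometheus text format by hand. The MapStats type, the Utilization and Render helpers, and the metric names tetragon_map_entries and tetragon_map_capacity are all hypothetical; they are not taken from Tetragon's mapmetrics package.

```go
package main

import (
	"fmt"
	"strings"
)

// MapStats is a hypothetical snapshot of one BPF map (not a Tetragon type).
type MapStats struct {
	Name     string
	InUse    uint64 // number of entries currently stored
	Capacity uint64 // maximum number of entries the map can hold
}

// Utilization returns InUse/Capacity. The second return value is false
// when the capacity is unknown (zero), so callers never divide by zero.
func Utilization(s MapStats) (float64, bool) {
	if s.Capacity == 0 {
		return 0, false
	}
	return float64(s.InUse) / float64(s.Capacity), true
}

// Render emits two separate gauge samples per map instead of encoding the
// capacity as a label value. Metric names here are assumptions.
func Render(stats []MapStats) string {
	var b strings.Builder
	for _, s := range stats {
		fmt.Fprintf(&b, "tetragon_map_entries{map=%q} %d\n", s.Name, s.InUse)
		fmt.Fprintf(&b, "tetragon_map_capacity{map=%q} %d\n", s.Name, s.Capacity)
	}
	return b.String()
}
```

With capacity as a numeric sample rather than a label value, utilization can be derived on the query side by dividing the two series (for example tetragon_map_entries / tetragon_map_capacity under the assumed names), which is not possible when the capacity only exists as a string label.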

lambdanis, Nov 20 '23 17:11