
Fix: Always export server state metrics (is_banned, is_paused)

Open dog-64 opened this issue 1 month ago • 0 comments

Problem

The current implementation has a critical issue with gauge-type metrics like is_banned and is_paused:

Gauge metrics retain their last reported value indefinitely until explicitly updated. This means:

  • A server is banned (is_banned=1), and then information about it becomes unavailable, so the metric is no longer updated
  • The metric keeps reporting is_banned=1 until something explicitly sets it to 0
  • This creates a false positive in monitoring: the metrics show the server as banned even after it has been unbanned

This is especially problematic because state metrics should always reflect the current actual state, not a stale cached value.
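The failure mode described above can be reproduced with a toy model. The sketch below is not pgcat's code: `GaugeRegistry` and `export_conditionally` are hypothetical names, and the registry is a plain `HashMap` standing in for a real Prometheus gauge. It only illustrates that a gauge written conditionally keeps its last value once the condition stops holding.

```rust
use std::collections::HashMap;

/// Toy gauge registry (hypothetical): server label -> last value written.
/// Like a Prometheus gauge, a value stays until something overwrites it.
#[derive(Default)]
pub struct GaugeRegistry {
    values: HashMap<String, i64>,
}

impl GaugeRegistry {
    /// Overwrite the gauge value for one server.
    pub fn set(&mut self, server: &str, value: i64) {
        self.values.insert(server.to_string(), value);
    }

    /// Read back what a scrape would currently see for one server.
    pub fn get(&self, server: &str) -> Option<i64> {
        self.values.get(server).copied()
    }
}

/// Hypothetical "before" behaviour: is_banned is written only while
/// server info is available, so it goes stale when the info disappears.
pub fn export_conditionally(
    reg: &mut GaugeRegistry,
    server: &str,
    is_banned: bool,
    info_available: bool,
) {
    if info_available {
        reg.set(server, is_banned as i64);
    }
}
```

If `export_conditionally` is called once with `is_banned = true` while info is available, and then again with `is_banned = false` after the info has become unavailable, `get` still returns `Some(1)`: the second call never reaches `set`.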

Solution

This PR refactors push_server_stats() to separate state and activity metrics:

  • state_metrics (is_banned, is_paused) are now exported for every server on every metrics collection cycle

    • Ensures metrics always reflect current state
    • Guarantees a stale value is overwritten on the next collection cycle
  • activity_metrics (bytes_received, bytes_sent, etc.) remain conditional

    • Only exported when server_info is available
    • Reduces noise for inactive servers

Impact

  • Prometheus metrics now reflect each server's state as of the latest collection cycle
  • No false positives from stale gauge values
  • Monitoring/alerting based on is_banned and is_paused becomes reliable
  • No breaking changes to metric format or API

Testing

  • cargo check passes without errors
  • Metrics endpoint functionality unchanged

dog-64 · Nov 23 '25 10:11