pgcat
Fix: Always export server state metrics (is_banned, is_paused)
Problem
The current implementation has a critical issue with gauge-type metrics like is_banned and is_paused:
Gauge metrics retain their last reported value indefinitely until explicitly updated. This means:
- If a server was banned (is_banned=1) and then information about it becomes unavailable
- The metric will continue showing is_banned=1 forever until explicitly set to 0
- This creates a false positive in monitoring: the metrics show the server is banned even after it's been unbanned
This is especially problematic because state metrics should always reflect the current actual state, not a stale cached value.
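The stale-gauge behavior can be reproduced with a small model. This is only a sketch and not pgcat's code: GaugeRegistry, ServerInfo, and export_conditionally are hypothetical names, and a HashMap stands in for the real metrics registry.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a gauge registry: one value per server label.
#[derive(Default)]
struct GaugeRegistry {
    is_banned: HashMap<String, u64>,
}

/// Hypothetical per-server info; `None` models "information unavailable".
struct ServerInfo {
    banned: bool,
}

/// Conditional export: the gauge is only written when info is available,
/// so a cycle without info leaves the previously reported value in place.
fn export_conditionally(reg: &mut GaugeRegistry, server: &str, info: Option<&ServerInfo>) {
    if let Some(info) = info {
        reg.is_banned.insert(server.to_string(), u64::from(info.banned));
    }
}
```

If one cycle reports a banned server and the next cycle has no info for it, the stored is_banned value in this model is still 1, regardless of what happened to the server in between.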
Solution
This PR refactors push_server_stats() to separate state and activity metrics:
- state_metrics (is_banned, is_paused) are now exported for every server on every metrics collection cycle
  - Ensures metrics always reflect current state
  - Guarantees stale values are immediately updated
- activity_metrics (bytes_received, bytes_sent, etc.) remain conditional
  - Only exported when server_info is available
  - Reduces noise for inactive servers
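The shape of the split can be sketched as follows. This is an illustration under assumptions, not the actual diff: Metrics, ServerState, ServerInfo, and push_server_stats_sketch are hypothetical, and the real push_server_stats() signature and registry types may differ.

```rust
use std::collections::HashMap;

/// Hypothetical current state of a server, assumed known on every cycle.
struct ServerState {
    banned: bool,
    paused: bool,
}

/// Hypothetical activity counters, present only when server info is available.
struct ServerInfo {
    bytes_received: u64,
    bytes_sent: u64,
}

/// Hypothetical registry keyed by (server, metric name).
#[derive(Default)]
struct Metrics {
    values: HashMap<(String, &'static str), u64>,
}

impl Metrics {
    fn set(&mut self, server: &str, name: &'static str, value: u64) {
        self.values.insert((server.to_string(), name), value);
    }

    fn get(&self, server: &str, name: &'static str) -> Option<u64> {
        self.values.get(&(server.to_string(), name)).copied()
    }
}

/// State metrics: written unconditionally, so a stale 1 is overwritten with 0.
fn push_state_metrics(m: &mut Metrics, server: &str, state: &ServerState) {
    m.set(server, "is_banned", u64::from(state.banned));
    m.set(server, "is_paused", u64::from(state.paused));
}

/// Activity metrics: written only by callers that have server info.
fn push_activity_metrics(m: &mut Metrics, server: &str, info: &ServerInfo) {
    m.set(server, "bytes_received", info.bytes_received);
    m.set(server, "bytes_sent", info.bytes_sent);
}

/// One collection cycle: state for every server, activity only when available.
fn push_server_stats_sketch(
    m: &mut Metrics,
    servers: &[(String, ServerState, Option<ServerInfo>)],
) {
    for (name, state, info) in servers {
        push_state_metrics(m, name, state);
        if let Some(info) = info {
            push_activity_metrics(m, name, info);
        }
    }
}
```

In this model, a cycle where a server has no info still rewrites is_banned and is_paused, while the activity values keep whatever was last reported.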
Impact
- Prometheus metrics now correctly reflect real-time server state
- No false positives from stale gauge values
- Monitoring/alerting based on is_banned and is_paused becomes reliable
- No breaking changes to metric format or API
Testing
- cargo check passes without errors
- Metrics endpoint functionality unchanged