yugabyte-db
[DST] docdb metrics needed for better alerting
Jira Link: DB-2264
The following docdb metrics could help us alert better:
- Master load balancer (LB) count of remaining tasks: could be used to alert on a stuck master LB by checking for tasks != 0 and rate(tasks) == 0
- Count of under-replicated tablets: could be used to alert on permanent failures by checking for under_repl_tablets != 0 and rate(under_repl_tablets) >= 0
- Process start time for the master and tserver processes (in seconds since the epoch). This is required to properly implement unexpected process restart alerts. Currently, we get it from the health check, which causes false alerts on universe upgrade (because the health check runs after the universe operation has already completed).
- Count of used YSQL connections. Needed for the YSQL connection count alert.
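
The alert conditions in the list above can be sketched as simple predicates over a window of metric samples. This is only an illustration, not how Platform or any alerting backend implements them: the function names and the `(timestamp, value)` sample shape are hypothetical, and `rate` is approximated here as the average change between the first and last sample in the window.

```python
from typing import Sequence, Tuple

# Hypothetical sample shape: (unix_timestamp_seconds, metric_value).
Sample = Tuple[float, float]


def rate(samples: Sequence[Sample]) -> float:
    """Average per-second change between the first and last sample."""
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("window must span a positive time interval")
    return (v1 - v0) / (t1 - t0)


def master_lb_stuck(tasks: Sequence[Sample]) -> bool:
    """tasks != 0 and rate(tasks) == 0: work is pending but not progressing."""
    return tasks[-1][1] != 0 and rate(tasks) == 0


def permanently_under_replicated(under_repl_tablets: Sequence[Sample]) -> bool:
    """under_repl_tablets != 0 and rate(under_repl_tablets) >= 0:
    some tablets are under-replicated and the count is not shrinking."""
    return under_repl_tablets[-1][1] != 0 and rate(under_repl_tablets) >= 0


def process_restarted(previous_start_time: float, current_start_time: float) -> bool:
    """A larger start time (seconds since epoch) means the process restarted."""
    return current_start_time > previous_start_time
```

Comparing exported start times directly avoids depending on when the health check happens to run, which is the source of the false alerts described above.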
Lower priority for Platform alerts:
- Count of leaderless tablets and count of total tablets
- Max/avg tablet size, with the max/avg computed locally at each server instead of exporting per-tablet metrics
- Max/avg number of SST files, with the max/avg computed locally at each server instead of exporting per-tablet metrics
- Expand the master LB metrics to include: (a) count of imbalanced tablet leaders, (b) count of over-replicated tablets
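
To illustrate the "computed locally" idea for the max/avg items above: each server would reduce its per-tablet values to two numbers and export only those, so the number of exported series stays constant regardless of tablet count. The sketch below is a minimal illustration under that assumption; `aggregate_locally` and the returned key names are hypothetical and are not existing metric names.

```python
from typing import Dict, Iterable


def aggregate_locally(per_tablet_values: Iterable[float]) -> Dict[str, float]:
    """Reduce per-tablet values (e.g. tablet size or SST file count)
    to a max and an average, so only two values need to be exported."""
    values = list(per_tablet_values)
    if not values:
        # Assumption: a server hosting no tablets reports zeros.
        return {"max": 0.0, "avg": 0.0}
    return {
        "max": float(max(values)),
        "avg": sum(values) / len(values),
    }
```

For example, a server hosting tablets of sizes 10, 20, and 60 would export a max of 60 and an average of 30 instead of three per-tablet series.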