yugabyte-db [DST] docdb metrics needed for better alerting

Open iSignal opened this issue 3 years ago • 0 comments

Jira Link: DB-2264

The following docdb metrics could help us alert better:

Master LB count of tasks remaining: could be used to alert on a stuck master load balancer by checking for tasks != 0 and rate(tasks) == 0.
Count of under-replicated tablets: could be used to alert on permanent failures by checking for under_repl_tablets != 0 and rate(under_repl_tablets) >= 0, i.e. the count is non-zero and not decreasing.
Process start time for the master and tserver processes (in seconds since the epoch): required to properly implement unexpected process restart alerts. Currently we get it from the health check, which causes false alerts on universe upgrade because the health check runs only after the universe operation has already completed.
Used YSQL connections count: needed for the YSQL connections count alert.
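To make the alert conditions above concrete, here is a minimal Python sketch that evaluates them over a window of scraped samples. Everything in it is an assumption for illustration: the function names are hypothetical, and `rate` is a simplified endpoint-to-endpoint slope, not the semantics of any particular monitoring system.

```python
from typing import Sequence


def rate(samples: Sequence[float]) -> float:
    """Average change per step across the window (simplified: endpoints only)."""
    if len(samples) < 2:
        raise ValueError("need at least two samples to compute a rate")
    return (samples[-1] - samples[0]) / (len(samples) - 1)


def lb_stuck(tasks_remaining: Sequence[float]) -> bool:
    # Condition from the issue: tasks != 0 and rate(tasks) == 0.
    return tasks_remaining[-1] != 0 and rate(tasks_remaining) == 0


def permanent_under_replication(under_repl_tablets: Sequence[float]) -> bool:
    # Condition from the issue: under_repl_tablets != 0 and rate(...) >= 0.
    return under_repl_tablets[-1] != 0 and rate(under_repl_tablets) >= 0


def process_restarted(prev_start_time: float, curr_start_time: float) -> bool:
    # A later start time (seconds since the epoch) means the process restarted.
    # Deciding whether the restart was *unexpected* is left to the caller.
    return curr_start_time > prev_start_time
```

For example, `lb_stuck([4, 4, 4])` fires because work remains and nothing is progressing, while `lb_stuck([4, 3, 2])` does not. A real rule would likely look at every sample in the window rather than only the endpoints.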

Lower priority for Platform alerts:

Count of leaderless tablets, count of total tablets.
Max/avg tablet size, with the max/avg computed locally instead of exporting per-tablet metrics.
Max/avg number of SST files, with the max/avg computed locally at each server instead of exporting per-tablet metrics.
Expand the master LB metrics to include: (a) count of imbalanced tablet leaders, (b) count of over-replicated tablets.
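The idea of computing max/avg locally can be sketched as below: each server reduces its per-tablet values to two gauges before export. This is only an illustration; `summarize`, the metric name, and the tablet IDs are hypothetical and not taken from the docdb code.

```python
from statistics import mean
from typing import Dict, Mapping


def summarize(per_tablet: Mapping[str, float], name: str) -> Dict[str, float]:
    """Collapse per-tablet values into server-level max and avg gauges."""
    if not per_tablet:
        # Assumption: a server with no tablets reports zeros.
        return {f"{name}_max": 0.0, f"{name}_avg": 0.0}
    values = list(per_tablet.values())
    return {
        f"{name}_max": float(max(values)),
        f"{name}_avg": float(mean(values)),
    }


# Two gauges are exported regardless of how many tablets the server hosts.
gauges = summarize({"tablet-a": 10, "tablet-b": 30}, "tablet_size_bytes")
```

The number of exported series then stays constant per server instead of growing with the tablet count.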

Aug 24 '21 22:08 iSignal