Metrics for long-running statements and locks originating from maintenance jobs

Open sumerman opened this issue 2 years ago • 2 comments

Description

Adding two more metric types from #492

Merge requirements

Please take into account the following non-code changes that you may need to make with your PR:

  • [x] CHANGELOG entry for user-facing changes
  • [ ] Updated the relevant documentation

sumerman, Sep 21 '22 10:09

I was wondering today whether we could add a metric derived from pg_stat_activity, such as promscale_sql_database_pg_stat_activity{backend_type="", query="", pid="", application_name=""}, that takes one of two values: 1 if the entry exists in the database, 0 otherwise.

This metric would be very useful for debugging, since it catches long-running services in the database (the graph shows a line holding steady at 1 for several hours or days), which could be the cause of CPU load.

Given the design of this metric, the existing database metrics engine might not be sufficient, so it need not be part of this PR. We should open an issue for it, though.
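
To make the proposed 0/1 semantics concrete, here is a minimal sketch in Python (not Promscale's Go code). The function name `presence_gauge`, the `seen` set, and the idea of remembering entries between scrapes are assumptions for illustration only; the label order simply follows the metric written above.

```python
from typing import Dict, Iterable, Set, Tuple

# Assumed label order, following the proposed metric:
# (backend_type, query, pid, application_name)
Labels = Tuple[str, str, str, str]


def presence_gauge(seen: Set[Labels], current_rows: Iterable[Labels]) -> Dict[Labels, int]:
    """Return 1 for every entry currently in pg_stat_activity and 0 for
    entries observed on an earlier scrape that have since disappeared."""
    current = set(current_rows)
    samples = {labels: 1 for labels in current}
    for labels in seen - current:
        samples[labels] = 0
    # Remember current entries (in place) so a later scrape can report them as 0.
    seen |= current
    return samples
```

Note that every distinct (query, pid) pair becomes its own series here, which is one reason the existing metrics engine may not fit this design as-is.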

Harkishen-Singh, Sep 23 '22 11:09

> I was wondering today whether we could add a metric derived from pg_stat_activity, such as promscale_sql_database_pg_stat_activity{backend_type="", query="", pid="", application_name=""}, that takes one of two values: 1 if the entry exists in the database, 0 otherwise.
>
> This metric would be very useful for debugging, since it catches long-running services in the database (the graph shows a line holding steady at 1 for several hours or days), which could be the cause of CPU load.
>
> Given the design of this metric, the existing database metrics engine might not be sufficient, so it need not be part of this PR. We should open an issue for it, though.

Right now, "long-running" is pretty arbitrary: I set it to mean longer than 10 seconds because I wanted to narrow down what this metric shows. In a real, busy system, the majority of jobs could have long-running statements. Instead of creating a binary metric, I would suggest adding a "time of the longest-running statement" metric, so that operators are free to choose their own thresholds.
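
As a rough illustration of the "longest-running statement" idea, the Python sketch below (not Promscale's Go code) computes the age of the oldest statement from a list of start times. The function name and the SQL constant are hypothetical; the query relies only on the standard `query_start` and `state` columns of `pg_stat_activity` and is not executed here.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable, Optional

# Hypothetical query a collector could run instead (shown for reference only):
LONGEST_STATEMENT_SQL = """
SELECT coalesce(extract(epoch FROM max(now() - query_start)), 0)
FROM pg_stat_activity
WHERE state = 'active'
"""


def longest_running_seconds(query_starts: Iterable[Optional[datetime]], now: datetime) -> float:
    """Return the age in seconds of the oldest running statement, or 0.0 if none."""
    durations = [(now - start).total_seconds() for start in query_starts if start is not None]
    return max(durations, default=0.0)
```

With a gauge like this, a threshold such as 10 seconds becomes an alerting rule chosen by the operator rather than a constant baked into the metric.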

sumerman, Sep 23 '22 11:09