compute: metrics for autovacuum (per-db)
Problem
Currently we do not have metrics for autovacuum.
Summary of changes
Adds a `pg_autovacuum_stats` metric that tracks the autovacuum count, autoanalyze count, and last autovacuum time for all tables.
Related issue: https://github.com/neondatabase/cloud/issues/27296
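For context, a sql_exporter collector for such a metric could look roughly like the sketch below. This is not the actual implementation from the PR: the query shape, column choices, and the `LIMIT 10` cap (mentioned later in this thread) are assumptions for illustration.

```yaml
# Hypothetical sql_exporter collector sketch -- names and query are
# illustrative, not the PR's actual definition of pg_autovacuum_stats.
collector_name: autovacuum_stats
metrics:
  - metric_name: pg_autovacuum_stats
    type: gauge
    help: 'Autovacuum/autoanalyze activity per table'
    key_labels: [schemaname, relname]
    value_label: stat
    values: [autovacuum_count, autoanalyze_count]
    query: |
      SELECT schemaname, relname,
             autovacuum_count, autoanalyze_count
        FROM pg_stat_all_tables
       ORDER BY last_autovacuum DESC NULLS LAST
       LIMIT 10;
```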
8481 tests run: 7855 passed, 44 failed, 582 skipped (full report)
Failures on Postgres 17
- test_sql_exporter_metrics_smoke[autoscaling]: release-x86-64-without-lfc, release-x86-64-with-lfc, debug-x86-64-without-lfc, release-arm64-without-lfc, release-arm64-with-lfc
- test_sql_exporter_metrics_smoke[compute]: release-x86-64-without-lfc, release-x86-64-with-lfc, debug-x86-64-without-lfc, release-arm64-without-lfc, release-arm64-with-lfc
- test_sql_exporter_metrics_e2e[autoscaling]: release-x86-64-without-lfc, release-x86-64-with-lfc, debug-x86-64-without-lfc, release-arm64-without-lfc, release-arm64-with-lfc
- test_sql_exporter_metrics_e2e[compute]: release-x86-64-without-lfc, release-x86-64-with-lfc, debug-x86-64-without-lfc, release-arm64-without-lfc, release-arm64-with-lfc
Failures on Postgres 16
- test_sql_exporter_metrics_smoke[autoscaling]: release-x86-64-with-lfc, release-arm64-with-lfc
- test_sql_exporter_metrics_smoke[compute]: release-x86-64-with-lfc, release-arm64-with-lfc
- test_sql_exporter_metrics_e2e[autoscaling]: release-x86-64-with-lfc, release-arm64-with-lfc
- test_sql_exporter_metrics_e2e[compute]: release-x86-64-with-lfc, release-arm64-with-lfc
Failures on Postgres 15
- test_sql_exporter_metrics_e2e[compute]: release-x86-64-with-lfc, release-arm64-with-lfc
- test_sql_exporter_metrics_smoke[compute]: release-x86-64-with-lfc, release-arm64-with-lfc
- test_sql_exporter_metrics_smoke[autoscaling]: release-x86-64-with-lfc, release-arm64-with-lfc
- test_sql_exporter_metrics_e2e[autoscaling]: release-x86-64-with-lfc, release-arm64-with-lfc
Failures on Postgres 14
- test_sql_exporter_metrics_smoke[autoscaling]: release-x86-64-with-lfc, release-arm64-with-lfc
- test_sql_exporter_metrics_smoke[compute]: release-x86-64-with-lfc, release-arm64-with-lfc
- test_sql_exporter_metrics_e2e[autoscaling]: release-x86-64-with-lfc, release-arm64-with-lfc
- test_sql_exporter_metrics_e2e[compute]: release-x86-64-with-lfc, release-arm64-with-lfc
```shell
# Run all failed tests locally:
scripts/pytest -vv -n $(nproc) -k "test_sql_exporter_metrics_smoke[release-pg14-autoscaling] or test_sql_exporter_metrics_smoke[release-pg14-autoscaling] or test_sql_exporter_metrics_smoke[release-pg14-compute] or test_sql_exporter_metrics_smoke[release-pg14-compute] or test_sql_exporter_metrics_e2e[release-pg14-autoscaling] or test_sql_exporter_metrics_e2e[release-pg14-autoscaling] or test_sql_exporter_metrics_e2e[release-pg14-compute] or test_sql_exporter_metrics_e2e[release-pg14-compute] or test_sql_exporter_metrics_e2e[release-pg15-compute] or test_sql_exporter_metrics_e2e[release-pg15-compute] or test_sql_exporter_metrics_smoke[release-pg15-compute] or test_sql_exporter_metrics_smoke[release-pg15-compute] or test_sql_exporter_metrics_smoke[release-pg15-autoscaling] or test_sql_exporter_metrics_smoke[release-pg15-autoscaling] or test_sql_exporter_metrics_e2e[release-pg15-autoscaling] or test_sql_exporter_metrics_e2e[release-pg15-autoscaling] or test_sql_exporter_metrics_smoke[release-pg16-autoscaling] or test_sql_exporter_metrics_smoke[release-pg16-autoscaling] or test_sql_exporter_metrics_smoke[release-pg16-compute] or test_sql_exporter_metrics_smoke[release-pg16-compute] or test_sql_exporter_metrics_e2e[release-pg16-autoscaling] or test_sql_exporter_metrics_e2e[release-pg16-autoscaling] or test_sql_exporter_metrics_e2e[release-pg16-compute] or test_sql_exporter_metrics_e2e[release-pg16-compute] or test_sql_exporter_metrics_smoke[release-pg17-autoscaling] or test_sql_exporter_metrics_smoke[release-pg17-autoscaling] or test_sql_exporter_metrics_smoke[debug-pg17-autoscaling] or test_sql_exporter_metrics_smoke[release-pg17-autoscaling] or test_sql_exporter_metrics_smoke[release-pg17-autoscaling] or test_sql_exporter_metrics_smoke[release-pg17-compute] or test_sql_exporter_metrics_smoke[release-pg17-compute] or test_sql_exporter_metrics_smoke[debug-pg17-compute] or test_sql_exporter_metrics_smoke[release-pg17-compute] or test_sql_exporter_metrics_smoke[release-pg17-compute] or test_sql_exporter_metrics_e2e[release-pg17-autoscaling] or test_sql_exporter_metrics_e2e[release-pg17-autoscaling] or test_sql_exporter_metrics_e2e[debug-pg17-autoscaling] or test_sql_exporter_metrics_e2e[release-pg17-autoscaling] or test_sql_exporter_metrics_e2e[release-pg17-autoscaling] or test_sql_exporter_metrics_e2e[release-pg17-compute] or test_sql_exporter_metrics_e2e[release-pg17-compute] or test_sql_exporter_metrics_e2e[debug-pg17-compute] or test_sql_exporter_metrics_e2e[release-pg17-compute] or test_sql_exporter_metrics_e2e[release-pg17-compute]"
```
Flaky tests (2)
Postgres 17
- test_tx_abort_with_many_relations[v1]: release-arm64-with-lfc
Postgres 16
- test_backward_compatibility: release-arm64-with-lfc
Test coverage report is not available
8392c41546f26e3503442e688b358ad0d5e65209 at 2025-06-11T23:51:42.678Z :recycle:
@thesuhas do you have an issue that explains why we need this, and did you pass that by the SRE team? I'm asking, because this sounds like a metrics explosion hazard to me, and I don't think SRE will enjoy that.
@MMeent There's this comment: https://github.com/neondatabase/cloud/issues/27296#issuecomment-2792299391
I did not pass it by SRE, I will do that. I am limiting by 10 to prevent an explosion of tables per db, I can add that to the number of dbs as well. Do you think this will suffice?
No, because in practice autovacuum will cause the values returned by "10 most recently vacuumed tables" to churn over time, through most tables. It won't return more than 10 series per scrape, but over time it will go through all tables in its labels, and that's the primary issue here.
Additionally, "most recently vacuumed" has never been a good indicator of "should it have been vacuumed" or "how well are we doing with the rest of the tables".
Would this metric be a good indicator of how often autovacuum runs? (https://github.com/neondatabase/cloud/issues/27296#issuecomment-2792299391)
No, if you want to measure how often autovacuum runs, you're probably better off summing up the values of pg_stat_all_tables.autovacuum_count for all tables of all databases.
Forgive me if I'm wrong, but other than being an easier to read/interpret metric, how would this be better from an efficiency perspective?
> No, because in practice autovacuum will cause the values returned by "10 most recently vacuumed tables" to churn over time, through most tables. It won't return more than 10 series per scrape, but over time it will go through all tables in its labels, and that's the primary issue here.
Won't this happen in this case as well? You will still go through the entirety of pg_stat_all_tables and sum up the autovacuum_count column. Is this better from a query execution perspective and be more optimized?
> Won't this happen in this case as well? You will still go through the entirety of pg_stat_all_tables and sum up the autovacuum_count column.
Yes, but at least (when summed) it won't output a different set of labels (i.e., not {table_name + schema_name + database_name}, but just {}) every time the query is run and a new table was vacuumed. It'll just have the same empty set of labels, but with 1 (or 10, or 100, or whatever) added to the value, and that saves a lot in metrics storage. It's rather cheap to store 100 measurements of the same metric (= set of labels), but it's much more expensive to store 1 measurement of 100 metrics.
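To illustrate the cardinality point with a made-up sample (the metric names and values below are hypothetical): the per-table variant mints a new labelset every time a different table gets vacuumed, while the aggregated variant is always the same single series.

```text
# Per-table: each newly-vacuumed table adds a fresh series to metrics storage
pg_autovacuum_stats{schemaname="public",relname="orders",datname="db1"} 3
pg_autovacuum_stats{schemaname="public",relname="users",datname="db1"}  7

# Aggregated: one series with an empty label set; only the value changes
pg_autovacuum_count_total 42
```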
> Is this better from a query execution perspective and be more optimized?
Not necessarily, it's just better for the processes outside sql_exporter from a gathering, storage, and processing perspective.
> No, if you want to measure how often autovacuum runs, you're probably better off summing up the values of pg_stat_all_tables.autovacuum_count for all tables of all databases.

> Forgive me if I'm wrong, but other than being an easier to read/interpret metric, how would this be better from an efficiency perspective?
Because that removes the need for a separate metric labelset for every table, because you're aggregating all the data of the cluster (or, maybe, database) in just one value, rather than one per table.
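The aggregated query being suggested here could be sketched as follows. Note this is an assumption of what the query might look like, and since `pg_stat_all_tables` is database-local, sql_exporter would have to run it once per database and sum across databases externally (or via dblink-style fan-out).

```sql
-- Hypothetical sketch: one value per database, empty per-table labelset.
-- Counters only ever grow, so the summed value behaves like a counter too.
SELECT sum(autovacuum_count)  AS autovacuum_total,
       sum(autoanalyze_count) AS autoanalyze_total
  FROM pg_stat_all_tables;
```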
Flux Fleet PR which adds pg_autovacuum_stats to metrics that have to be collected: https://github.com/neondatabase/flux-fleet/pull/660
Closing for now, as connections to non-Postgres DBs via dblink still count as activity and prevent compute suspend.