neon compute: metrics for autovacuum (per-db)

Problem

Currently we do not have metrics for autovacuum.

Summary of changes

Adds a pg_autovacuum_stats metric that tracks the autovacuum count, auto analyze count and last autovacuum for all tables.

Related issue: https://github.com/neondatabase/cloud/issues/27296

Jun 10 '25 23:06 thesuhas

8481 tests run: 7855 passed, 44 failed, 582 skipped (full report)

Failures on Postgres 17

test_sql_exporter_metrics_smoke[autoscaling]: release-x86-64-without-lfc, release-x86-64-with-lfc, debug-x86-64-without-lfc, release-arm64-without-lfc, release-arm64-with-lfc
test_sql_exporter_metrics_smoke[compute]: release-x86-64-without-lfc, release-x86-64-with-lfc, debug-x86-64-without-lfc, release-arm64-without-lfc, release-arm64-with-lfc
test_sql_exporter_metrics_e2e[autoscaling]: release-x86-64-without-lfc, release-x86-64-with-lfc, debug-x86-64-without-lfc, release-arm64-without-lfc, release-arm64-with-lfc
test_sql_exporter_metrics_e2e[compute]: release-x86-64-without-lfc, release-x86-64-with-lfc, debug-x86-64-without-lfc, release-arm64-without-lfc, release-arm64-with-lfc

Failures on Postgres 16

test_sql_exporter_metrics_smoke[autoscaling]: release-x86-64-with-lfc, release-arm64-with-lfc
test_sql_exporter_metrics_smoke[compute]: release-x86-64-with-lfc, release-arm64-with-lfc
test_sql_exporter_metrics_e2e[autoscaling]: release-x86-64-with-lfc, release-arm64-with-lfc
test_sql_exporter_metrics_e2e[compute]: release-x86-64-with-lfc, release-arm64-with-lfc

Failures on Postgres 15

test_sql_exporter_metrics_e2e[compute]: release-x86-64-with-lfc, release-arm64-with-lfc
test_sql_exporter_metrics_smoke[compute]: release-x86-64-with-lfc, release-arm64-with-lfc
test_sql_exporter_metrics_smoke[autoscaling]: release-x86-64-with-lfc, release-arm64-with-lfc
test_sql_exporter_metrics_e2e[autoscaling]: release-x86-64-with-lfc, release-arm64-with-lfc

Failures on Postgres 14

test_sql_exporter_metrics_smoke[autoscaling]: release-x86-64-with-lfc, release-arm64-with-lfc
test_sql_exporter_metrics_smoke[compute]: release-x86-64-with-lfc, release-arm64-with-lfc
test_sql_exporter_metrics_e2e[autoscaling]: release-x86-64-with-lfc, release-arm64-with-lfc
test_sql_exporter_metrics_e2e[compute]: release-x86-64-with-lfc, release-arm64-with-lfc

# Run all failed tests locally:
scripts/pytest -vv -n $(nproc) -k "test_sql_exporter_metrics_smoke[release-pg14-autoscaling] or test_sql_exporter_metrics_smoke[release-pg14-autoscaling] or test_sql_exporter_metrics_smoke[release-pg14-compute] or test_sql_exporter_metrics_smoke[release-pg14-compute] or test_sql_exporter_metrics_e2e[release-pg14-autoscaling] or test_sql_exporter_metrics_e2e[release-pg14-autoscaling] or test_sql_exporter_metrics_e2e[release-pg14-compute] or test_sql_exporter_metrics_e2e[release-pg14-compute] or test_sql_exporter_metrics_e2e[release-pg15-compute] or test_sql_exporter_metrics_e2e[release-pg15-compute] or test_sql_exporter_metrics_smoke[release-pg15-compute] or test_sql_exporter_metrics_smoke[release-pg15-compute] or test_sql_exporter_metrics_smoke[release-pg15-autoscaling] or test_sql_exporter_metrics_smoke[release-pg15-autoscaling] or test_sql_exporter_metrics_e2e[release-pg15-autoscaling] or test_sql_exporter_metrics_e2e[release-pg15-autoscaling] or test_sql_exporter_metrics_smoke[release-pg16-autoscaling] or test_sql_exporter_metrics_smoke[release-pg16-autoscaling] or test_sql_exporter_metrics_smoke[release-pg16-compute] or test_sql_exporter_metrics_smoke[release-pg16-compute] or test_sql_exporter_metrics_e2e[release-pg16-autoscaling] or test_sql_exporter_metrics_e2e[release-pg16-autoscaling] or test_sql_exporter_metrics_e2e[release-pg16-compute] or test_sql_exporter_metrics_e2e[release-pg16-compute] or test_sql_exporter_metrics_smoke[release-pg17-autoscaling] or test_sql_exporter_metrics_smoke[release-pg17-autoscaling] or test_sql_exporter_metrics_smoke[debug-pg17-autoscaling] or test_sql_exporter_metrics_smoke[release-pg17-autoscaling] or test_sql_exporter_metrics_smoke[release-pg17-autoscaling] or test_sql_exporter_metrics_smoke[release-pg17-compute] or test_sql_exporter_metrics_smoke[release-pg17-compute] or test_sql_exporter_metrics_smoke[debug-pg17-compute] or test_sql_exporter_metrics_smoke[release-pg17-compute] or test_sql_exporter_metrics_smoke[release-pg17-compute] or test_sql_exporter_metrics_e2e[release-pg17-autoscaling] or test_sql_exporter_metrics_e2e[release-pg17-autoscaling] or test_sql_exporter_metrics_e2e[debug-pg17-autoscaling] or test_sql_exporter_metrics_e2e[release-pg17-autoscaling] or test_sql_exporter_metrics_e2e[release-pg17-autoscaling] or test_sql_exporter_metrics_e2e[release-pg17-compute] or test_sql_exporter_metrics_e2e[release-pg17-compute] or test_sql_exporter_metrics_e2e[debug-pg17-compute] or test_sql_exporter_metrics_e2e[release-pg17-compute] or test_sql_exporter_metrics_e2e[release-pg17-compute]"

Flaky tests (2)

Postgres 17

test_tx_abort_with_many_relations[v1]: release-arm64-with-lfc

Postgres 16

test_backward_compatibility: release-arm64-with-lfc

Test coverage report is not available

_{The comment gets automatically updated with the latest test results
8392c41546f26e3503442e688b358ad0d5e65209 at 2025-06-11T23:51:42.678Z :recycle:}

Jun 11 '25 01:06 github-actions[bot]

@thesuhas do you have an issue that explains why we need this, and did you pass that by the SRE team? I'm asking, because this sounds like a metrics explosion hazard to me, and I don't think SRE will enjoy that.

Jun 11 '25 09:06 MMeent

@thesuhas do you have an issue that explains why we need this, and did you pass that by the SRE team? I'm asking, because this sounds like a metrics explosion hazard to me, and I don't think SRE will enjoy that.

@MMeent There's this comment: https://github.com/neondatabase/cloud/issues/27296#issuecomment-2792299391

I did not pass it by SRE, I will do that. I am limiting by 10 to prevent an explosion of tables per db, I can add that to the number of dbs as well. Do you think this will suffice?

Jun 11 '25 15:06 thesuhas

Do you think this will suffice?

No, because in practice autovacuum will cause the values returned by "10 most recently vacuumed tables" to churn over time, through most tables. It won't return more than 10 series per scrape, but over time it will go through all tables in its labels, and that's the primary issue here.

Additionally, "most recently vacuumed" has never been a good indicator of "should it have been vacuumed" or "how well are we doing with the rest of the tables".

Jun 11 '25 15:06 MMeent

Additionally, "most recently vacuumed" has never been a good indicator of "should it have been vacuumed" or "how well are we doing with the rest of the tables".

Would this metric be a good indicator of how often autovacuum runs? (https://github.com/neondatabase/cloud/issues/27296#issuecomment-2792299391)

Jun 11 '25 17:06 thesuhas

No, if you want to measure how often autovacuum runs, you're probably better off summing up the values of pg_stat_all_tables.autovacuum_count for all tables of all databases.

Jun 11 '25 17:06 MMeent

No, if you want to measure how often autovacuum runs, you're probably better off summing up the values of pg_stat_all_tables.autovacuum_count for all tables of all databases.

Forgive me if I'm wrong, but other than being an easier to read/interpret metric, how would this be better from an efficiency perspective?

No, because in practice autovacuum will cause the values returned by "10 most recently vacuumed tables" to churn over time, through most tables. It won't return more than 10 series per scrape, but over time it will go through all tables in its labels, and that's the primary issue here.

Won't this happen in this case as well? You will still go through the entirety of pg_stat_all_tables and sum up the autovacuum_count column. Is this better from a query execution perspective and be more optimized?

Jun 11 '25 17:06 thesuhas

Won't this happen in this case as well? You will still go through the entirety of pg_stat_all_tables and sum up the autovacuum_count column.

Yes, but at least (when summed) it won't output a different set of labels (i.e., not {table_name + schema_name + database_name}, but just {}) every time the query is run and a new table was vacuumed. It'll just have the same empty set of labels, but with 1 (or 10, or 100, or whatever) added to the value, and that saves a lot in metrics storage. It's rather cheap to store 100 measurements of the same metric (= set of labels), but it's much more expensive to store 1 measurement of 100 metrics.

Is this better from a query execution perspective and be more optimized?

Not necessarily, it's just better for the processes outside sql_exporter from a gathering, storage, and processing perspective.

Jun 11 '25 17:06 MMeent

No, if you want to measure how often autovacuum runs, you're probably better off summing up the values of pg_stat_all_tables.autovacuum_count for all tables of all databases.

Forgive me if I'm wrong, but other than being an easier to read/interpret metric, how would this be better from an efficiency perspective?

Because that removes the need for a separate metric labelset for every table, because you're aggregating all the data of the cluster (or, maybe, database) in just one value, rather than one per table.

Jun 11 '25 17:06 MMeent

Flux Fleet PR which adds pg_autovacuum_stats to metrics that have to be collected: https://github.com/neondatabase/flux-fleet/pull/660

Jun 11 '25 21:06 thesuhas

Closing for now as connection to non-postgres DBs via dblink still counts as activity and prevents compute suspend.

Jun 23 '25 18:06 thesuhas