
postgres exporter errors - database is closed


Describe the issue:

I'm running a PostgresCluster with the PGO operator v5.5.0. There are errors in the exporter container, and metrics cannot be scraped by Prometheus: the target is down.

Error running query on database - database is closed
postgres_exporter_watcher: line 24: 146 Killed postgres_exporter "${postgres_exporter_flags[@]}"

Logs

ts=2024-09-02T12:10:26.848Z caller=namespace.go:236 level=info err="Error running query on database \"localhost:5432\": ccp_backrest_last_incr_backup sql: database is closed"
ts=2024-09-02T12:10:26.848Z caller=namespace.go:236 level=info err="Error running query on database \"localhost:5432\": ccp_settings_pending_restart sql: database is closed"
ts=2024-09-02T12:10:26.848Z caller=namespace.go:236 level=info err="Error running query on database \"localhost:5432\": ccp_database_size sql: database is closed"
ts=2024-09-02T12:10:26.847Z caller=namespace.go:236 level=info err="Error running query on database \"localhost:5432\": ccp_backrest_last_diff_backup sql: database is closed"
ts=2024-09-02T12:10:26.849Z caller=postgres_exporter.go:682 level=error err="queryNamespaceMappings returned 40 errors"
ts=2024-09-02T12:10:26.848Z caller=namespace.go:236 level=info err="Error running query on database \"localhost:5432\": ccp_pg_stat_statements_top_wal sql: database is closed"
ts=2024-09-02T12:10:26.931Z caller=namespace.go:236 level=info err="Error running query on database \"localhost:5432\": pg_stat_activity sql: database is closed"
ts=2024-09-02T12:10:26.847Z caller=namespace.go:236 level=info err="Error running query on database \"localhost:5432\": ccp_nodemx_cpustat sql: database is closed"
ts=2024-09-02T12:10:26.931Z caller=namespace.go:236 level=info err="Error running query on database \"localhost:5432\": ccp_replication_lag sql: database is closed"
ts=2024-09-02T12:10:26.931Z caller=namespace.go:236 level=info err="Error running query on database \"localhost:5432\": ccp_data_checksum_failure sql: database is closed"
ts=2024-09-02T12:10:26.931Z caller=namespace.go:236 level=info err="Error running query on database \"localhost:5432\": ccp_backrest_last_diff_backup sql: database is closed"
ts=2024-09-02T12:10:26.931Z caller=namespace.go:236 level=info err="Error running query on database \"localhost:5432\": pg_stat_archiver sql: database is closed"
ts=2024-09-02T12:10:26.931Z caller=namespace.go:236 level=info err="Error running query on database \"localhost:5432\": ccp_nodemx_disk_activity sql: database is closed"
ts=2024-09-02T12:10:26.931Z caller=namespace.go:236 level=info err="Error running query on database \"localhost:5432\": pg_stat_replication sql: database is closed"
ts=2024-09-02T12:10:26.931Z caller=namespace.go:236 level=info err="Error running query on database \"localhost:5432\": ccp_nodemx_process sql: database is closed"
ts=2024-09-02T12:10:26.931Z caller=namespace.go:236 level=info err="Error running query on database \"localhost:5432\": ccp_replication_slots sql: database is closed"
ts=2024-09-02T12:10:26.931Z caller=namespace.go:236 level=info err="Error running query on database \"localhost:5432\": ccp_backrest_last_incr_backup sql: database is closed"
ts=2024-09-02T12:10:26.931Z caller=namespace.go:236 level=info err="Error running query on database \"localhost:5432\": ccp_settings_pending_restart sql: database is closed"
ts=2024-09-02T12:10:26.931Z caller=namespace.go:236 level=info err="Error running query on database \"localhost:5432\": ccp_table_size sql: database is closed"
ts=2024-09-02T12:10:26.931Z caller=namespace.go:236 level=info err="Error running query on database \"localhost:5432\": ccp_database_size sql: database is closed"
ts=2024-09-02T12:10:26.931Z caller=namespace.go:236 level=info err="Error running query on database \"localhost:5432\": ccp_pg_stat_statements_top_wal sql: database is closed"
ts=2024-09-02T12:10:26.931Z caller=namespace.go:236 level=info err="Error running query on database \"localhost:5432\": ccp_connection_stats sql: database is closed"
ts=2024-09-02T12:10:26.931Z caller=namespace.go:236 level=info err="Error running query on database \"localhost:5432\": pg_stat_activity sql: database is closed"
ts=2024-09-02T12:10:26.931Z caller=namespace.go:236 level=info err="Error running query on database \"localhost:5432\": ccp_pg_stat_statements_top_total sql: database is closed"
ts=2024-09-02T12:10:26.931Z caller=namespace.go:236 level=info err="Error running query on database \"localhost:5432\": ccp_replication_lag sql: database is closed"
ts=2024-09-02T12:10:26.931Z caller=namespace.go:236 level=info err="Error running query on database \"localhost:5432\": ccp_backrest_last_full_backup sql: database is closed"
ts=2024-09-02T12:10:26.932Z caller=namespace.go:236 level=info err="Error running query on database \"localhost:5432\": ccp_postgresql_version sql: database is closed"
ts=2024-09-02T12:10:26.932Z caller=postgres_exporter.go:682 level=error err="queryNamespaceMappings returned 46 errors"
ts=2024-09-02T12:10:26.932Z caller=namespace.go:236 level=info err="Error running query on database \"localhost:5432\": ccp_data_checksum_failure sql: database is closed"
ts=2024-09-02T12:10:26.932Z caller=namespace.go:236 level=info err="Error running query on database \"localhost:5432\": ccp_backrest_last_diff_backup sql: database is closed"
ts=2024-09-02T12:10:26.932Z caller=postgres_exporter.go:682 level=error err="queryNamespaceMappings returned 37 errors"
ts=2024-09-02T12:10:27.845Z caller=postgres_exporter.go:682 level=error err="Error opening connection to database (postgresql://ccp_monitoring:PASSWORD_REMOVED@localhost:5432/postgres): pq: the database system is in recovery mode"
ts=2024-09-02T12:10:27.845Z caller=server.go:74 level=info msg="Established new database connection" fingerprint=localhost:5432
ts=2024-09-02T12:10:27.856Z caller=postgres_exporter.go:613 level=info msg="Semantic version changed" server=localhost:5432 from=0.0.0 to=14.10.0
postgres_exporter_watcher: line 24:   146 Killed                  postgres_exporter "${postgres_exporter_flags[@]}"
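
The "146 Killed" line from the watcher script means the postgres_exporter process was terminated externally, commonly by the kernel OOM killer. A minimal sketch for checking that on Kubernetes/OpenShift (NAMESPACE and POD are placeholders; "exporter" as the sidecar container name is an assumption):

# Look for an OOMKilled reason in the container's last state.
kubectl -n NAMESPACE describe pod POD | grep -A5 "Last State"

# Inspect the previous exporter container logs for crash context.
kubectl -n NAMESPACE logs POD -c exporter --previous | tail -50

# Compare the container's memory usage against its limits
# (requires metrics-server).
kubectl -n NAMESPACE top pod POD --containers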

If I try to call the metrics endpoint, I get the following output:

An error has occurred while serving metrics:

86 error(s) occurred:
* collected metric "pg_statio_user_tables_heap_blocks_read" { label:{name:"datname"  value:"postgres"}  label:{name:"relname"  value:"pgbackrest_info"}  label:{name:"schemaname"  value:"monitor"}  counter:{value:0}} was collected before with the same name and label values
* collected metric "pg_statio_user_tables_heap_blocks_hit" { label:{name:"datname"  value:"postgres"}  label:{name:"relname"  value:"pgbackrest_info"}  label:{name:"schemaname"  value:"monitor"}  counter:{value:0}} was collected before with the same name and label values

[...]

Describe the expected behavior:

The PGO exporter works correctly and Prometheus can scrape metrics.

Tell us about your environment:

  • OpenShift version: 4.14.33
  • pgMonitor version:
  • Container or non-container: container
  • Container name / image: registry.developers.crunchydata.com/crunchydata/postgres-operator:ubi8-5.5.0-0
  • PostgreSQL Version: 14
  • Exporter(s) in use (incl. version): https://registry.developers.crunchydata.com/crunchydata/crunchy-postgres-exporter:ubi8-0.15.0-0
  • Prometheus version: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:3c896664ee140743667e9860ca07824ec7a37b5a557a2c36a7f133a1c3f1733e

What could be the issue? Do you have any ideas?

Thanks.

cassanellicarlo commented on Sep 4, 2024

Not quite sure what the "database is closed" error is.

The "was collected before with the same name and label values" error is mostly self-explanatory: it's saying that this metric was already collected. It can be tricky to narrow down why, though. With table statistics like these, it's usually because the database name label is missing when collecting from multiple databases. But I do see the database name label there, so perhaps the same database name was added to the connection configuration more than once.
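
One way to narrow down the duplicates is to pull the raw metrics and list the series that appear more than once with an identical name and label set. A rough sketch (9187 is the postgres_exporter default port; NAMESPACE and POD are placeholders):

# Forward the exporter port locally.
kubectl -n NAMESPACE port-forward POD 9187:9187 &

# Every line printed by uniq -d is a series emitted twice with the
# same name and labels; its labels show which query/database pair is
# being collected more than once.
curl -s http://localhost:9187/metrics | grep '^pg_statio_user_tables' | sort | uniq -d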

@andrewlecuyer any ideas here how these issues crept up in the container?

keithf4 commented on Sep 4, 2024

@cassanellicarlo what is the current status of this Postgres Pod? Is it up-and-running, and ready?

Considering the following error:

Error opening connection to database (postgresql://ccp_monitoring:PASSWORD_REMOVED@localhost:5432/postgres): pq: the database system is in recovery mode"

It looks like the exporter is simply unable to connect to the database. What do the Postgres logs show for this PG instance? My hunch is that the issue is with Postgres itself.
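
A quick way to check that from the Postgres side (a sketch; "database" as the Postgres container name and the cluster label are assumptions based on PGO v5 defaults):

# true means the instance is still a replica or in crash recovery.
kubectl -n NAMESPACE exec POD -c database -- psql -c 'SELECT pg_is_in_recovery();'

# Confirm all cluster pods are ready.
kubectl -n NAMESPACE get pods -l postgres-operator.crunchydata.com/cluster=CLUSTER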

andrewlecuyer commented on Sep 4, 2024

@andrewlecuyer Postgres is running. The instance is 5/5 Ready and Running.

These are the logs:

2024-09-04 14:54:17,796 INFO: no action. I am (pg-instance-0), the leader with the lock
2024-09-04 14:54:27,787 INFO: no action. I am (pg-instance-0), the leader with the lock
2024-09-04 14:54:37,798 INFO: no action. I am (pg-instance-0), the leader with the lock
2024-09-04 14:54:47,790 INFO: no action. I am (pg-instance-0), the leader with the lock
2024-09-04 14:54:57,822 INFO: no action. I am (pg-instance-0), the leader with the lock
2024-09-04 14:55:07,789 INFO: no action. I am (pg-instance-0), the leader with the lock
2024-09-04 14:55:17,791 INFO: no action. I am (pg-instance-0), the leader with the lock
2024-09-04 14:55:27,805 INFO: no action. I am (pg-instance-0), the leader with the lock

Previous logs:

2024-09-02 12:45:37,640 INFO: Lock owner: ; I am pg-instance-0
2024-09-02 12:45:37,641 INFO: starting as a secondary
2024-09-02 12:45:38.526 UTC [2029083] LOG:  pgaudit extension initialized
2024-09-02 12:45:38,541 INFO: postmaster pid=2029083
/tmp/postgres:5432 - no response
2024-09-02 12:45:38.639 UTC [2029083] LOG:  redirecting log output to logging collector process
2024-09-02 12:45:38.639 UTC [2029083] HINT:  Future log output will appear in directory "log".
/tmp/postgres:5432 - rejecting connections
/tmp/postgres:5432 - rejecting connections
/tmp/postgres:5432 - rejecting connections
/tmp/postgres:5432 - rejecting connections
/tmp/postgres:5432 - accepting connections
2024-09-02 12:45:42,671 INFO: establishing a new patroni connection to the postgres cluster
2024-09-02 12:45:42,721 INFO: My wal position exceeds maximum replication lag
2024-09-02 12:45:42,814 INFO: following a different leader because i am not the healthiest node
2024-09-02 12:45:52,671 INFO: My wal position exceeds maximum replication lag
2024-09-02 12:45:52,735 INFO: following a different leader because i am not the healthiest node
2024-09-02 12:46:02,672 INFO: My wal position exceeds maximum replication lag
2024-09-02 12:46:02,721 INFO: following a different leader because i am not the healthiest node
2024-09-02 12:46:13,912 INFO: Got response from <omitted> <omitted>: {"state": "starting", "role": "replica", "cluster_unlocked": true, "dcs_last_seen": 1725281167, "database_system_identifier": "<omitted>", "patroni": {"version": "3.1.1", "scope": "---}}
2024-09-02 12:46:13,912 WARNING: Request failed to pg-instance-0: GET <omitted> ('xlog')
2024-09-02 12:46:14,252 INFO: Got response from <omitted> https://<omitted>:8008/patroni: {"state": "starting", "role": "replica", "cluster_unlocked": true, "dcs_last_seen": 1725281167, "database_system_identifier": "<omitted>", "patroni": {"version": "3.1.1", "scope": "<omitted>"}}
2024-09-02 12:46:14,252 WARNING: <omitted>: GET https://<omitted>:8008/patroni ('xlog')
2024-09-02 12:46:14,329 INFO: promoted self to leader by acquiring session lock
server promoting
2024-09-02 12:46:15,410 INFO: no action. I am (pg-instance-0), the leader with the lock

It seems like the exporter got stuck while the database was unavailable during the failover (2024-09-02) and then couldn't recover (postgres_exporter_watcher: line 24: 146 Killed).
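
If the exporter stays wedged after a failover, one way to confirm and recover (a sketch; pg_up is a standard postgres_exporter metric, and pkill being available in the container is an assumption):

# pg_up 1 means the exporter's database connection is healthy again;
# pg_up 0 means it is still failing to connect.
kubectl -n NAMESPACE port-forward POD 9187:9187 &
curl -s http://localhost:9187/metrics | grep -m1 '^pg_up'

# If it stays at 0, killing the exporter process lets the watcher
# script spawn a fresh one with a new database connection.
kubectl -n NAMESPACE exec POD -c exporter -- pkill postgres_exporter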

cassanellicarlo commented on Sep 4, 2024

Just checking in to see if you're still having this issue.

keithf4 commented on Jan 8, 2025

@keithf4 I checked and we still have issues:

ts=2025-01-09T12:12:48.787Z caller=namespace.go:236 level=info err="Error running query on database \"localhost:5432\": ccp_pg_settings_checksum pq: cannot open relation \"pg_settings_hash\""
ts=2025-01-09T12:12:48.822Z caller=namespace.go:236 level=info err="Error running query on database \"localhost:5432\": ccp_pg_hba_checksum pq: cannot open relation \"pg_hba_hash\""
ts=2025-01-09T12:12:48.884Z caller=postgres_exporter.go:682 level=error err="queryNamespaceMappings returned 2 errors"
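
The "cannot open relation" errors suggest the pgMonitor setup objects in the monitor schema are missing from the database the exporter queries. A quick check (a sketch; the relation names come straight from the error messages, and the exporter connects to the postgres database per the connection string above):

# pg_settings_hash and pg_hba_hash should be listed if the pgMonitor
# setup SQL was applied to this database.
kubectl -n NAMESPACE exec POD -c database -- psql -d postgres -c '\dt monitor.*'

If those tables are absent, re-applying the pgMonitor setup SQL for this Postgres major version should recreate them.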

cassanellicarlo commented on Jan 9, 2025

This looks like something still isn't getting set up properly. @andrewlecuyer

keithf4 commented on Jan 9, 2025

Apologies for delays in getting back to you on this. Are you still having these issues?

keithf4 commented on Feb 28, 2025

Closing this issue for now. Please feel free to re-open at any time if you still need assistance.

keithf4 commented on May 8, 2025