postgres-operator
postgres-operator copied to clipboard
Exporter Logging Errors for nodemx disk metrics and ccp_pg_stat_statements_reset
Please ensure you do the following when reporting a bug:
- [ ] Provide a concise description of what the bug is.
- [ ] Provide information about your environment.
- [ ] Provide clear steps to reproduce the bug.
- [ ] Attach applicable logs. Please do not attach screenshots showing logs unless you are unable to copy and paste the log data.
- [ ] Ensure any code / output examples are properly formatted for legibility.
Note that some logs needed to troubleshoot may be found in the /pgdata/<CLUSTERNAME>/pg_log
directory on your Postgres instance.
An incomplete bug report can lead to delays in resolving the issue or the closing of a ticket, so please be as detailed as possible.
If you are looking for general support, please view the support page for where you can ask questions.
Thanks for reporting the issue, we're looking forward to helping you!
Overview
Upon upgrading PGO to 4.7.4 and upgrading a pg cluster. The metrics container for the cluster began logging errors:
"Error running query on database "127.0.0.1:5432": ccp_pg_stat_statements_reset pq: function monitor.pg_stat_statements_reset_info(integer) does not exist"
"Error running query on database "127.0.0.1:5432": ccp_nodemx_data_disk pq: query-specified return tuple and function return type are not compatible"
Environment
Please provide the following details:
- Platform: Kubernetes
- Platform Version: 4.7.4
- PGO Image Tag: postgres-operator:centos8-4.7.4
- Postgres Version: 13.5
- Storage: rook-ceph
Steps to Reproduce
REPRO
Provide steps to get to the error condition: Have existing Prometheus Deployment Deploy PGO 4.6.2 Do NOT Deploy pgmonitor Create a cluster with metrics Ensure metrics are reaching the existing Prometheus instance should not be generating psql_scrape_errors yet Upgrade PGO to 4.7.4 Upgrade the pg cluster
EXPECTED
No scrape errors were expected. All metrics are going into non-pgmonitor Prometheus.
ACTUAL
Metrics are still reaching the non-pgmonitor Prometheus, but the exporter is now failing to retrieve some metrics with the following errors:
"Error running query on database "127.0.0.1:5432": ccp_pg_stat_statements_reset pq: function monitor.pg_stat_statements_reset_info(integer) does not exist"
"Error running query on database "127.0.0.1:5432": ccp_nodemx_data_disk pq: query-specified return tuple and function return type are not compatible"
Logs
See above
Did you run ALTER EXTENSION pgnodemx UPDATE;
as a superuser in each of your applicable databases?
Did you run
ALTER EXTENSION pgnodemx UPDATE;
as a superuser in each of your applicable databases?
Thanks for the suggestion. I tried that but got the following error:
ERROR: extension "pgnodemx" does not exist
I listed the extensions:
\dx+
Objects in extension "pg_stat_statements"
Object description
---------------------------------------------------
function pg_stat_statements(boolean)
function pg_stat_statements_reset(oid,oid,bigint)
view pg_stat_statements
(3 rows)
Objects in extension "pgaudit"
Object description
---------------------------------------
event trigger pgaudit_ddl_command_end
event trigger pgaudit_sql_drop
function pgaudit_ddl_command_end()
function pgaudit_sql_drop()
(4 rows)
Objects in extension "plpgsql"
Object description
-------------------------------------------
function plpgsql_call_handler()
function plpgsql_inline_handler(internal)
function plpgsql_validator(oid)
language plpgsql
@kyle-rf I have been unable to reproduce the behavior you are seeing. Specifically, when I follow the steps you provided (i.e. to upgrade from v4.6.2 to v4.7.4), I see the pgnodemx
extension both before and after the upgrade.
And since the upgrade process does not touch the database itself, the extension should definitely be there both before and after the upgrade (assuming metrics was properly enabled/deployed prior to the upgrade). So just to clarify - per the output you provided above, are you seeing the pgnodemx
before the PGO and PG cluster upgrade to 4.7.4, but not after?
Additionally, I did want to suggest another approach. Specifically, if you remove metrics from a cluster:
$ pgo update cluster mycluster1 --disable-metrics
Adding or removing a metrics collection sidecar can cause downtime.
WARNING: Are you sure? (yes/no): yes
updated pgcluster mycluster1
...and then re-enable metrics, e.g.:
$ pgo update cluster mycluster1 --enable-metrics
Adding or removing a metrics collection sidecar can cause downtime.
WARNING: Are you sure? (yes/no): yes
updated pgcluster mycluster1
This should resolve the error you are seeing by ensuring all of the latest pgMonitor resources are created and/or updated (with the second step specifically ensuring metrics collection [and therefore the pgnodemx
extension] is properly configured and enabled for the cluster).
@andrewlecuyer I gave that a try, but exporter is still logging the error msg="Error running query on database \"127.0.0.1:5432\": ccp_nodemx_data_disk pq: query-specified return tuple and function return type are not compatible"
I have not seen the other error since though, so halfway there.
Context
- Deployed 4.x PGO and a postgres instance
- Upgraded PGO to 5.0.x then 5.1
- The new PostgresCluster CR is now using the latest exporter:
registry.developers.crunchydata.com/crunchydata/crunchy-postgres-exporter:ubi8-5.1.0-0
Problem
time="2022-05-13T12:19:42Z" level=info msg="Error running query on database \"localhost:5432\": ccp_pg_stat_statements_top_total pq: column s.total_exec_time does not exist" source="postgres_exporter.go:1356"
time="2022-05-13T12:19:42Z" level=info msg="Error running query on database \"localhost:5432\": ccp_pg_stat_statements_top_mean pq: column s.mean_exec_time does not exist" source="postgres_exporter.go:1356"
time="2022-05-13T12:19:42Z" level=info msg="Error running query on database \"localhost:5432\": ccp_pg_stat_statements_total pq: column s.total_exec_time does not exist" source="postgres_exporter.go:1356"
time="2022-05-13T12:19:42Z" level=info msg="Error running query on database \"localhost:5432\": ccp_nodemx_disk_activity pq: query-specified return tuple and function return type are not compatible" source="postgres_exporter.go:1356"
time="2022-05-13T12:19:42Z" level=info msg="Error running query on database \"localhost:5432\": ccp_pg_stat_statements_top_max pq: column s.max_exec_time does not exist" source="postgres_exporter.go:1356"
time="2022-05-13T12:19:42Z" level=info msg="Error running query on database \"localhost:5432\": ccp_nodemx_data_disk pq: query-specified return tuple and function return type are not compatible" source="postgres_exporter.go:1356"
time="2022-05-13T12:19:42Z" level=error msg="queryNamespaceMappings returned 6 errors" source="postgres_exporter.go:1474"
Investigation and workaround
- Did some research into the source code of the operator and these repos
- Dropped pgnodemx extension and eventually just recreated it like the operator would: https://github.com/CrunchyData/postgres-operator/blob/fb5e4f0a5dac1e301106e61e7dd6ff252a0fdd4d/internal/pgmonitor/postgres.go#L131
- Exporter now publishes ccp_nodemx_disk* correctly, but still has some errors:
time="2022-05-13T12:31:12Z" level=info msg="Error running query on database \"localhost:5432\": ccp_pg_stat_statements_top_mean pq: column s.mean_exec_time does not exist" source="postgres_exporter.go:1356"
time="2022-05-13T12:31:12Z" level=info msg="Error running query on database \"localhost:5432\": ccp_pg_stat_statements_top_total pq: column s.total_exec_time does not exist" source="postgres_exporter.go:1356"
time="2022-05-13T12:31:12Z" level=info msg="Error running query on database \"localhost:5432\": ccp_pg_stat_statements_total pq: column s.total_exec_time does not exist" source="postgres_exporter.go:1356"
time="2022-05-13T12:31:12Z" level=info msg="Error running query on database \"localhost:5432\": ccp_pg_stat_statements_top_max pq: column s.max_exec_time does not exist" source="postgres_exporter.go:1356"
time="2022-05-13T12:31:12Z" level=error msg="queryNamespaceMappings returned 4 errors" source="postgres_exporter.go:1474"
Later I found this thread and noticed ALTER EXTENSION pgnodemx UPDATE;
But isn't the operator supposed to do this?
The https://github.com/CrunchyData/postgres-operator/pull/3400 should resolve these issues and is planned to be available in the next release, but not currently planned to be backpatched to v4. As noted, the manual fix should be to update the extensions used by the exporter, namely pgnodemx
and pg_stat_statements
.
@benjaminjb Sorry for writing in a closed issue, but we are seeing similar errors in our operator based openshift installation of crunchy. Operator version 5.2.0. - postgresoperator.v5.2.0.
time="2022-12-08T09:23:56Z" level=info msg="Established new database connection to \"localhost:5432\"." source="postgres_exporter.go:878"
time="2022-12-08T09:23:57Z" level=info msg="Semantic Version Changed on \"localhost:5432\": 0.0.0 -> 14.5.0" source="postgres_exporter.go:1405"
time="2022-12-08T09:24:01Z" level=info msg="Error running query on database \"localhost:5432\": ccp_nodemx_disk_activity pq: query-specified return tuple and function return type are not compatible" source="postgres_exporter.go:1356"
time="2022-12-08T09:24:02Z" level=error msg="queryNamespaceMappings returned 1 errors" source="postgres_exporter.go:1474"
time="2022-12-08T09:24:02Z" level=info msg="Starting Server: :9187" source="postgres_exporter.go:1672"
time="2022-12-08T09:24:03Z" level=info msg="Error running query on database \"localhost:5432\": ccp_nodemx_disk_activity pq: query-specified return tuple and function return type are not compatible" source="postgres_exporter.go:1356"
time="2022-12-08T09:24:03Z" level=error msg="queryNamespaceMappings returned 1 errors" source="postgres_exporter.go:1474"
Should this be fixed in our operator version? I have tried to find out in which operator version the fix is included, but unfortunally i was not able to match the repo tags to an operator version. Maybe you can give me a hint how to match the repo tags to an operator version.
The ALTER EXTENSION pgnodemx UPDATE;
is able to fix the issue, but as mentioned before, this should be done by the operator.
Hey @dzabel this won't be fixed in 5.2.0, but should be fixed in 5.3.0, which is coming soon -- but I don't know how soon. I'll drop a note here as soon as it is, just so you know.
Hey, just wanted to drop a line and say that 5.3.0 is out for our open source users -- hope that helps!