postgres_exporter icon indicating copy to clipboard operation
postgres_exporter copied to clipboard

--collector.stat_wal_receiver does not run if replication is setup without replication slot

Open teleschu opened this issue 5 months ago • 3 comments

What did you do? Setup replication without a replication slot.

What did you expect to see? pg_stat_walreceiver metrics (with an empty slot_name)

What did you see instead? Under which circumstances? postgres_exporter 1e [daemon.info] time=2025-09-15T16:57:19.350Z level=DEBUG source=pg_stat_walreceiver.go:164 msg="Skipping wal receiver stats because slotname host is null" collector=stat_wal_receiver

Environment Rocky 9

  • System information: Linux 5.14.0-570.23.1.el9_6.x86_64 x86_64

  • postgres_exporter version: postgres_exporter, version 0.17.1 (branch: HEAD, revision: 1e574cf4fd2a75a8a707d424eafcaa0b88cb7af4) build user: root@b0ba132eddb0 build date: 20250226-13:53:15 go version: go1.23.6 platform: linux/amd64 tags: unknown

  • postgres_exporter flags:

--config.file=/etc/postgres_exporter/postgres_exporter.yml
--log.level=debug
--collector.long_running_transactions
--collector.postmaster
--collector.stat_activity_autovacuum
--collector.stat_checkpointer
--collector.stat_statements
--collector.statio_user_indexes
--collector.stat_wal_receiver
  • PostgreSQL version: PostgreSQL 17.5 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 11.5.0 20240719 (Red Hat 11.5.0-5), 64-bit

  • Logs:

Sep 15 19:15:19 rd-r9-ds-5-u2 postgres_exporter 1e [daemon.info] time=2025-09-15T17:15:19.330Z level=DEBUG source=namespace.go:192 msg="Querying namespace" namespace=pg_stat_wal_receiver_schu
SSep 15 19:15:19 rd-r9-ds-5-u2 postgres_exporter 1e [daemon.info] time=2025-09-15T17:15:19.355Z level=DEBUG source=pg_stat_walreceiver.go:164 msg="Skipping wal receiver stats because slotname host is null" collector=stat_wal_receiver
Sep 15 19:15:19 rd-r9-ds-5-u2 postgres_exporter 1e [daemon.info] time=2025-09-15T17:15:19.355Z level=DEBUG source=collector.go:205 msg="collector succeeded" name=stat_wal_receiver duration_seconds=0.023864803

teleschu avatar Sep 15 '25 17:09 teleschu

The related function is in file: collector/pg_stat_walreceiver.go func: Update()

            if !slotName.Valid {
                    c.log.Debug("Skipping wal receiver stats because slotname host is null")
                    continue
            }

teleschu avatar Sep 15 '25 17:09 teleschu

I'm hitting a similar error:

Skipping wal receiver stats because upstream host is null

.. getting tripped up a few lines earlier, here: https://github.com/prometheus-community/postgres_exporter/blob/105c422dfb3222940ced0b00023225866a86ebee/collector/pg_stat_walreceiver.go#L158-L161

.. instead of where you get tripped up, here: https://github.com/prometheus-community/postgres_exporter/blob/105c422dfb3222940ced0b00023225866a86ebee/collector/pg_stat_walreceiver.go#L163-L166

I suppose these may be related.


Digging a bit more into my issue, it seems like pgStatWalReceiverQueryTemplate may be to blame.

Over here, I think we're in the hasFlushedLSN branch: https://github.com/prometheus-community/postgres_exporter/blob/105c422dfb3222940ced0b00023225866a86ebee/collector/pg_stat_walreceiver.go#L131-L135

.. and that injecting "(flushed_lsn - '0/0') % (2^52)::bigint as flushed_lsn,\n" into the query likely makes it invalid somehow.

My replication host is definitely set.

I cleaned up pgStatWalReceiverQueryTemplate (replacing %s with an empty string and have de-escaping %% to % and ''''' to ''), ending up with:

SELECT
    trim(both '' from substring(conninfo from 'host=([^ ]*)')) as upstream_host,
    slot_name,
    status,
    (receive_start_lsn- '0/0') % (2^52)::bigint as receive_start_lsn,
    (flushed_lsn - '0/0') % (2^52)::bigint as flushed_lsn,
receive_start_tli,
    received_tli,
    extract(epoch from last_msg_send_time) as last_msg_send_time,
    extract(epoch from last_msg_receipt_time) as last_msg_receipt_time,
    (latest_end_lsn - '0/0') % (2^52)::bigint as latest_end_lsn,
    extract(epoch from latest_end_time) as latest_end_time,
    substring(slot_name from 'repmgr_slot_([0-9]*)') as upstream_node
FROM pg_catalog.pg_stat_wal_receiver;

Running this manually against my Postgres instance, I get:

 upstream_host |         slot_name          |  status   | receive_start_lsn | flushed_lsn | receive_start_tli | received_tli | last_msg_send_time | last_msg_receipt_time | latest_end_lsn |  latest_end_time  | upstream_node 
---------------+----------------------------+-----------+-------------------+-------------+-------------------+--------------+--------------------+-----------------------+----------------+-------------------+---------------
 10.20.30.40   | something_something_here   | streaming |         369098752 |   369099080 |                 1 |            1 |  1758730951.776120 |     1758730951.763306 |      369099080 | 1758728155.319933 | 
(1 row)

.. which demonstrates that upstream_host is OK.

spantaleev avatar Sep 24 '25 16:09 spantaleev

TLDR:

  • I was wrong about there being a bug in SQL query construction from pgStatWalReceiverQueryTemplate. It's OK.
  • my issue is not related to the issue discussed here. I was running this Postgres Exporter with an underprivileged user, so it was rightfully returning empty column data for upstream_host (and all other columns) and couldn't proceed. I think reporting this error ("Skipping wal receiver stats because upstream host is null") could have been done better though (as discussed below).

I was thinking there's a bug in this Postgres exporter, but.. it looks like there isn't one.

It turns out that the user that you monitor with needs to:

  • either be a superuser
  • or be privileged enough (having the pg_monitor GRANT)

... so that it could yield a non-empty result for this query (coming from pgStatWalReceiverQueryTemplate):

SELECT
    trim(both '' from substring(conninfo from 'host=([^ ]*)')) as upstream_host,
    slot_name,
    status,
    (receive_start_lsn- '0/0') % (2^52)::bigint as receive_start_lsn,
    (flushed_lsn - '0/0') % (2^52)::bigint as flushed_lsn,
receive_start_tli,
    received_tli,
    extract(epoch from last_msg_send_time) as last_msg_send_time,
    extract(epoch from last_msg_receipt_time) as last_msg_receipt_time,
    (latest_end_lsn - '0/0') % (2^52)::bigint as latest_end_lsn,
    extract(epoch from latest_end_time) as latest_end_time,
    substring(slot_name from 'repmgr_slot_([0-9]*)') as upstream_node
FROM pg_catalog.pg_stat_wal_receiver;

If the user is underprivileged, the query would still return 1 row, but all columns would be empty.

I think the current documentation (is there any for this collector?) could be improved to mention this requirement.

I also think that "all columns being empty" is a sign that the user is underprivileged. It would probably be better to:

  • show a more user-friendly error, saying that a superuser or a user with a pg_monitor GRANT is required
  • show this as an error at a level higher than debug. Right now, it's failing rather silently. Perhaps the collector (while initializing) could check its grants somehow or infer that the pg_monitor permission is missing from the empty result set and fail with a very visible error message

It looks like the issue reported by you is valid - the collector fails if reservation slots are not used (slot_name is empty).

Given that it didn't fail a few lines earlier (when checking if upstream_host is empty) gives me the impression that you are running Postgres exporter with a privileged-enough user (unlike me), so your issue is a different one than mine.

spantaleev avatar Sep 25 '25 10:09 spantaleev