check_postgres icon indicating copy to clipboard operation
check_postgres copied to clipboard

Explanation of time_delay= in hot_standby_delay

Open majales opened this issue 8 years ago • 6 comments

Hi, I have just setup check_postgres plugin in icinga and I came to output which I don't understand

check_postgres.pl --action hot_standby_delay  --dbuser=icinga,icinga  --dbhost=master,slave -w 5 -c 10
POSTGRES_HOT_STANDBY_DELAY OK: DB "postgres" (host:10.33.0.23) 0 and 200 seconds | time=0.06s replay_delay=0;1;1 receive-delay=0;1;1  time_delay=200;

What actually means that "time_delay=200" and "0 and 200 seconds" ? It changes (grows) with time.

I have checked on master that replication should be running in sync mode

postgres=# select usesysid, usename, application_name, client_addr, state, sent_location,write_location,sync_state from pg_stat_replication;
 usesysid |  usename  | application_name | client_addr |   state   | sent_location | write_location | sync_state
----------+-----------+------------------+-------------+-----------+---------------+----------------+------------
    16384 | repl_user | replica1         | 10.22.0.33  | streaming | 0/23F94C20    | 0/23F94C20     | sync

this is how my recovery.conf on slave looks like:

standby_mode = 'on'
primary_conninfo = 'user=repl_user password=verysecretpass host=10.22.0.13 port=5432 sslmode=prefer sslcompression=1 krbsrvname=postgres application_name=replica1'
primary_slot_name = 'standby1'

Please could someone explain me what it means?

majales avatar Aug 18 '17 09:08 majales

I saw you've asked the same question on https://www.claudiokuenzler.com/blog/723/how-to-monitor-postgresql-replication :)

I went through the source code and time_delay is $time_delta:

 if ($version >= 9.1) {
        $db->{perf} .= sprintf ' %s=%s;%s;%s',
            perfname(msg('hs-time-delay')), $time_delta, $wtime, $ctime;
 }

time_delta is read from xlog positions (line 5110 and below):

## Get xlog positions
my ($moffset, $s_rec_offset, $s_rep_offset, $time_delta);

## On slave
$SQL = q{SELECT pg_last_xlog_receive_location() AS receive, pg_last_xlog_replay_location() AS replay}
if ($version >= 9.1) {
    $SQL .= q{, COALESCE(ROUND(EXTRACT(epoch FROM now() - pg_last_xact_replay_timestamp())),0) AS seconds};
}
my $info = run_command($SQL, { dbnumber => $slave, regex => qr/\// });
my $saved_db;
for $db (@{$info->{db}}) {
    my $receive = $db->{slurp}[0]{receive};
    my $replay = $db->{slurp}[0]{replay};
$time_delta = $db->{slurp}[0]{seconds};

So if you run the SQL command you see these "seconds":

postgres=# SELECT pg_last_xlog_receive_location() AS receive, pg_last_xlog_replay_location() AS replay , COALESCE(ROUND(EXTRACT(epoch FROM now() - pg_last_xact_replay_timestamp())),0) AS seconds;
  receive   |   replay   | seconds 
------------+------------+---------
 2/266055D8 | 2/266055D8 |  151763
(1 row)

So seconds is defined by the function pg_last_xact_replay_timestamp(), which is, according to the documentation (https://www.postgresql.org/docs/9.4/static/functions-admin.html:

Get time stamp of last transaction replayed during recovery. This is the time at which the commit or abort WAL record for that transaction was generated on the primary. If no transactions have been replayed during recovery, this function returns NULL. Otherwise, if recovery is still in progress this will increase monotonically. If recovery has completed then this value will remain static at the value of the last transaction applied during that recovery. When the server has been started normally without recovery the function returns NULL.

Napsty avatar Aug 19 '17 11:08 Napsty

So if I understand it correctly, when this time_delay value is increasing then hot_standby postgresql server is not in sync with master server. So should it trigger WARNING and CRITICAL for this check? Because I had more the time_delay 10000 seconds and check was stil in state OK. When I restarted postgresql on slave server it become 0 with some receive-delay in a meantime but nothing has triggered WARNING or CRITICAL.

majales avatar Aug 19 '17 13:08 majales

So if I understand it correctly, when this time_delay value is increasing then hot_standby postgresql server is not in sync with master server.

No, the value is the "time passed since last sync operation". It doesn't mean that the servers are out of sync then.

Practical example. I'm getting the pg_last_xact_replay_timestamp() on the slave server:

postgres=# SELECT pg_last_xact_replay_timestamp();
 pg_last_xact_replay_timestamp 
-------------------------------
 2017-08-17 19:05:17.074283+02
(1 row)

So the last sync operation happened on August 17th 2017 at 19:05. Let me check the databases on the slave:

postgres=# \l
                                  List of databases
   Name    |  Owner   | Encoding |   Collate   |    Ctype    |   Access privileges   
-----------+----------+----------+-------------+-------------+-----------------------
 postgres  | postgres | UTF8     | en_US.UTF-8 | en_US.UTF-8 | 
 mydb      | myuser    | UTF8     | en_US.UTF-8 | en_US.UTF-8 | 
 template0 | postgres | UTF8     | en_US.UTF-8 | en_US.UTF-8 | =c/postgres          +
           |          |          |             |             | postgres=CTc/postgres
 template1 | postgres | UTF8     | en_US.UTF-8 | en_US.UTF-8 | =c/postgres          +
           |          |          |             |             | postgres=CTc/postgres
(4 rows)

On the master I create a new database:

postgres=# CREATE DATABASE claudiotest;
CREATE DATABASE

And on the slave I check if this new database was added:

postgres=# \l
                                  List of databases
   Name    |  Owner   | Encoding |   Collate   |    Ctype    |   Access privileges   
-----------+----------+----------+-------------+-------------+-----------------------
 claudiotest | postgres | UTF8     | en_US.UTF-8 | en_US.UTF-8 | 
 postgres  | postgres | UTF8     | en_US.UTF-8 | en_US.UTF-8 | 
 mydb      | myuser    | UTF8     | en_US.UTF-8 | en_US.UTF-8 | 
 template0 | postgres | UTF8     | en_US.UTF-8 | en_US.UTF-8 | =c/postgres          +
           |          |          |             |             | postgres=CTc/postgres
 template1 | postgres | UTF8     | en_US.UTF-8 | en_US.UTF-8 | =c/postgres          +
           |          |          |             |             | postgres=CTc/postgres
(5 rows)

Yes, it was. And now I verify the timestamp again:

postgres=# SELECT pg_last_xact_replay_timestamp();
 pg_last_xact_replay_timestamp 
-------------------------------
 2017-08-21 08:06:59.280348+02
(1 row)

The timestamp has changed, because a sync operation happened.

Now when the check_postgres.pl plugin is launched again, it will show the seconds passed since that last sync operation:

nagios@slave:~$ date; /usr/lib/nagios/plugins/check_postgres.pl -H localhost,master -u monitoring -db mydb --action hot_standby_delay --warning 60 --critical 600
Mon Aug 21 08:14:54 CEST 2017
POSTGRES_HOT_STANDBY_DELAY OK: DB "mydb" (host:localhost) 0 and 475 seconds | time=0.05s replay_delay=0;60;600  receive-delay=0;60;600 time_delay=475; 

Napsty avatar Aug 21 '17 06:08 Napsty

@Napsty thank you for clarifying this. I was looking for a way to monitor if the slave is behind the master. MySQL for example has a seconds_behind_master option. This way I can setup warning/critical thresholds if the slave is N minutes behind the master (out of sync). I was hoping this would be possible with PostgreSQL and the hot_standby_delay monitoring function.

tomwijnroks avatar Apr 25 '18 16:04 tomwijnroks

@cognegic: Please don't post spam.

df7cb avatar May 04 '18 11:05 df7cb

I got a bit lost. Why the time passed from the last operation, should be passed to the monitoring check? It should be a a metric number and not a trigger for an alert.

maxadamo avatar Oct 22 '19 16:10 maxadamo