check_postgres

slony_status does not check all slaves of a cluster

Open klaus3000 opened this issue 11 years ago • 0 comments

Hi! If I understand the script correctly, the slony_status action checks only one slave of a cluster: whichever row the query returns first. Because the query has no ORDER BY, that row is effectively random. Here is a scenario where one slave is behind:

    SELECT ROUND(EXTRACT(epoch FROM st_lag_time)) AS lagtime, st_origin, st_received,
           current_database() AS cd,
           COALESCE(n1.no_comment, '') AS com1,
           COALESCE(n2.no_comment, '') AS com2
    FROM _regdnscluster.sl_status
    JOIN _regdnscluster.sl_node n1 ON (n1.no_id=st_origin)
    JOIN _regdnscluster.sl_node n2 ON (n2.no_id=st_received);

     lagtime | st_origin | st_received |   cd   |    com1     |       com2
    ---------+-----------+-------------+--------+-------------+------------------
          67 |         1 |           3 | regdns | Master Node | regdev-tst2 node
        1792 |         1 |           2 | regdns | Master Node | regdev-tst1 node
    (2 rows)

I would expect the script to report an error (CRITICAL, since a lag of 1792 seconds exceeds the 600-second critical threshold), because one of the nodes is behind. Instead it reports:

    ./check_postgres.pl --action=slony_status --schema=_regdnscluster --dbname=regdns --warning=300 --critical=600
    POSTGRES_SLONY_STATUS OK: DB "regdns" schema:_regdnscluster Slony lag time: 68 (68 seconds) | time=0.08s 'regdns._regdnscluster Node 1(Master Node) -> Node 3(regdev-tst2 node)'=68;300;600

In my opinion, it should either check all slaves, or at least the slave with the highest lag. Here is a proposed fix (ORDER BY lagtime DESC, so the most-lagged slave is returned first):

    --- check_postgres.pl.orig 2013-12-09 09:49:57.000000000 +0000
    +++ check_postgres.pl 2013-12-09 09:50:40.000000000 +0000
    @@ -7418,7 +7418,8 @@
     COALESCE(n2.no_comment, '') AS com2
     FROM SCHEMA.sl_status
     JOIN SCHEMA.sl_node n1 ON (n1.no_id=st_origin)
    -JOIN SCHEMA.sl_node n2 ON (n2.no_id=st_received)};
    +JOIN SCHEMA.sl_node n2 ON (n2.no_id=st_received)
    +ORDER BY lagtime DESC};

     my $maxlagtime = -1;
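
For comparison, the other option (checking all slaves rather than relying on row order) could look roughly like the sketch below. It is written in Python purely to illustrate the logic and is not code from check_postgres.pl; the names `LagRow`, `classify`, and `worst_status` are hypothetical, and the `>=` threshold comparison is an assumption about how the warning and critical values should be applied.

```python
from typing import List, NamedTuple, Optional, Tuple


class LagRow(NamedTuple):
    """One row of the sl_status query above (hypothetical container)."""
    lagtime: int       # seconds of lag, as in the "lagtime" column
    st_origin: int     # origin node id
    st_received: int   # receiving (slave) node id
    com2: str          # comment of the receiving node


def classify(lag: int, warning: int, critical: int) -> str:
    # Assumption: a lag at or above a threshold triggers that level.
    if lag >= critical:
        return "CRITICAL"
    if lag >= warning:
        return "WARNING"
    return "OK"


def worst_status(
    rows: List[LagRow], warning: int, critical: int
) -> Tuple[str, Optional[LagRow]]:
    """Look at every row and report the status of the most-lagged slave."""
    if not rows:
        return "UNKNOWN", None
    worst = max(rows, key=lambda r: r.lagtime)
    return classify(worst.lagtime, warning, critical), worst


# The two rows from the scenario above, with --warning=300 --critical=600.
rows = [
    LagRow(67, 1, 3, "regdev-tst2 node"),
    LagRow(1792, 1, 2, "regdev-tst1 node"),
]
status, node = worst_status(rows, warning=300, critical=600)
print(status, node.st_received, node.lagtime)  # CRITICAL 2 1792
```

Taking the maximum over all rows gives the same result as the ORDER BY patch for the overall status, but does not depend on which row the database happens to return first.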

Regards,
Klaus

klaus3000 avatar Dec 09 '13 09:12 klaus3000