postgres_exporter icon indicating copy to clipboard operation
postgres_exporter copied to clipboard

Have 0 lag if database is not updated

Open angaz opened this issue 3 years ago • 5 comments

I found the lag metric increasing over night when there are no commits in the database. Setting the lag to 0 when the pg_last_wal_receive_lsn and pg_last_wal_replay_lsn values are equal seems to have solved that problem.

angaz avatar Jun 14 '21 15:06 angaz

We should probably fix these queries such that they expose the metrics as timestamps, rather than delta gauges. This would make it possible to calculate things correctly in PromQL.

SuperQ avatar Jul 28 '22 08:07 SuperQ

Actually, I think this is incorrect. What happens if the replica is behind because it's disconnected or replication is broken in some way. Wouldn't this produce a false negative, as in, 0 when it could be lagged?

SuperQ avatar Jul 28 '22 08:07 SuperQ

Actually, I think this is incorrect. What happens if the replica is behind because it's disconnected or replication is broken in some way. Wouldn't this produce a false negative, as in, 0 when it could be lagged?

What do you recommend we change it to? Is it even possible to detect such situations?

angaz avatar Jul 28 '22 18:07 angaz

Sorry, I'm no expert in Postgres. I mostly just handle the Go code here.

However, I am a MySQL expert.

In MySQL Land, we use a replication heartbeat system. looks like they support Postgres now.

https://www.percona.com/doc/percona-toolkit/LATEST/pt-heartbeat.html

SuperQ avatar Jul 28 '22 18:07 SuperQ

Sorry, I'm no expert in Postgres. I mostly just handle the Go code here.

However, I am a MySQL expert.

In MySQL Land, we use a replication heartbeat system. looks like they support Postgres now.

https://www.percona.com/doc/percona-toolkit/LATEST/pt-heartbeat.html

Ok. Cool. These are the only values available to us for knowing about replication lag. https://www.postgresql.org/docs/current/functions-admin.html#FUNCTIONS-RECOVERY-INFO-TABLE

In this case, we are looking at a timestamp value, and it cannot be in the future, so the lag time in seconds is not possible to be negative, and the *_lsn values are a special Postgres thing. It doesn't have a specific meaning that we can get any information from to improve the value of the metrics.

So I don't think there is anything that we can improve with this metric.

angaz avatar Jul 28 '22 19:07 angaz