pgcopydb `replay_lag` from pg_stat_replication seem to incorrect with pgcopydb

This is what I get when I query the pg_stat_replication.

pid              | 457376
usesysid         | 16418
usename          | tsdbadmin
application_name | pgcopydb
client_addr      | xxxxxx
client_hostname  |
client_port      | 41728
backend_start    | 2023-10-30 07:24:19.846248+00
backend_xmin     |
state            | streaming
sent_lsn         | 16E/3BDE09E8
write_lsn        | 16E/39DDBE30
flush_lsn        | 16E/33FB6258
replay_lsn       | 16E/331DE690
write_lag        | 00:00:00.719659
flush_lag        | 00:00:10.871797
replay_lag       | 00:00:12.904724
sync_priority    | 0
sync_state       | async
reply_time       | 2023-10-30 07:43:41.489434+00

It shows that the replay_lag is negligible, however when I query the row count in source and target, but there is a huge difference in terms of # of rows.

source:

select (select count(*) from readings) + (select count(*) from diagnostics) + (select count(*) from tags);
-[ RECORD 1 ]------
?column? | 65529218

target

----------
 35831854
(1 row)

I'm wondering, is the replay_lag relative to write_lag or is it relative to pg_current_wal_lsn()? When I calculate the lag using pg_current_wal_lsn, the difference is huge.

> select pg_size_pretty(pg_current_wal_lsn() - flush_lsn) from pgcopydb.sentinel;
-[ RECORD 1 ]--+------
pg_size_pretty | 13 GB

This warrants the next question, what is the reliable method to monitor the replication lag when using pgcopydb?

Oct 30 '23 08:10 arajkumar

Now that #519 is in, is that issue still the same?

I believe pg_stat_replication should be the way to monitor replay lag, because the pgcopydb.sentinel table only sees LSN from a single database and not the whole system. Those 13 GB of lag could be relevant to other database changes, not to the one that pgcopydb is processing.

It might be that with your fix this issue doesn't happen anymore, or not in the same number of cases?

Oct 30 '23 13:10 dimitri

@dimitri, #519 only partially solves the problem.

Those 13 GB of lag could be relevant to other database changes, not to the one that pgcopydb is processing.

Actually, I had only one database in my test setup. Correct me if I'm wrong, even when we have multiple database, walsender will skip other databases.

I think the problem is pg_stat_replication seems to represent the walsender which is responsible for that particular replication slot. When the walsender couldn't keep up with the rate of change, pg_stat_replication seems to give incorrect picture.

Oct 30 '23 13:10 arajkumar

pgcopydb pgcopydb copied to clipboard

`replay_lag` from pg_stat_replication seem to incorrect with pgcopydb

source:

target

pgcopydb
pgcopydb copied to clipboard