neon
Improve walreceiver logic
- It looks like the `etcd_info.timeline.commit_lsn > Some(self.local_timeline.get_last_record_lsn())` check filtered out all safekeepers in some strange cases. I removed this filter, which should probably fix #2237.
- `walreceiver_connection` now reports its status, including `commit_lsn`. This allows keeping the safekeeper connection even when etcd is down.
- The safekeeper connection now fails if the pageserver doesn't receive safekeeper messages for some time. Usually a safekeeper sends messages at least once per second.
- The `LaggingWal` check now uses `commit_lsn` directly from the safekeeper. This fixes the issue with frequent reconnects when compute generates WAL really fast.
- `NoWalTimeout` is rewritten to trigger only when we know about new WAL and the connected safekeeper doesn't stream any WAL. This allows setting a small `lagging_wal_timeout`, because it will trigger only when we observe that the connected safekeeper is stuck.
TODO: fix tests
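
For reviewers, the three reconnect conditions described above can be sketched roughly as follows. This is a simplified illustration, not the code in this PR: `ConnectionState`, `Thresholds`, `reconnect_reason`, the `NoKeepAlives` variant, and the `keepalive_timeout` and `max_lsn_wal_lag` fields are hypothetical names, and `Lsn` is reduced to a plain `u64`. Only `LaggingWal`, `NoWalTimeout`, `commit_lsn`, and `lagging_wal_timeout` come from the description.

```rust
use std::time::Duration;

/// Log sequence number, simplified to a plain integer for this sketch.
type Lsn = u64;

/// Hypothetical snapshot of what the connection has reported so far.
#[derive(Debug, Clone, Copy)]
struct ConnectionState {
    /// commit_lsn most recently reported by the connected safekeeper.
    connected_commit_lsn: Lsn,
    /// Time since any message arrived from the connected safekeeper.
    since_last_message: Duration,
    /// Time since the connected safekeeper last streamed new WAL.
    since_last_wal: Duration,
}

/// Hypothetical tuning knobs; only `lagging_wal_timeout` is named in the description.
#[derive(Debug, Clone, Copy)]
struct Thresholds {
    keepalive_timeout: Duration,
    max_lsn_wal_lag: u64,
    lagging_wal_timeout: Duration,
}

#[derive(Debug, PartialEq, Eq)]
enum ReconnectReason {
    /// No safekeeper messages at all for too long (assumed variant name).
    NoKeepAlives,
    /// Another safekeeper is far ahead of the connected one.
    LaggingWal,
    /// New WAL is known to exist, but the connected safekeeper is not streaming it.
    NoWalTimeout,
}

/// Decide whether to drop the current safekeeper connection.
/// `best_known_commit_lsn` is the highest commit_lsn known from any source.
fn reconnect_reason(
    state: &ConnectionState,
    last_record_lsn: Lsn,
    best_known_commit_lsn: Lsn,
    t: &Thresholds,
) -> Option<ReconnectReason> {
    // Safekeepers usually send something about once per second, so silence is a failure.
    if state.since_last_message > t.keepalive_timeout {
        return Some(ReconnectReason::NoKeepAlives);
    }
    // Compare against the commit_lsn the safekeeper itself reported, not the
    // pageserver's last_record_lsn, so a fast compute does not cause flapping.
    if best_known_commit_lsn.saturating_sub(state.connected_commit_lsn) > t.max_lsn_wal_lag {
        return Some(ReconnectReason::LaggingWal);
    }
    // Fire only when new WAL is known to exist and the connection is not delivering it.
    let new_wal_exists = best_known_commit_lsn > last_record_lsn;
    if new_wal_exists && state.since_last_wal > t.lagging_wal_timeout {
        return Some(ReconnectReason::NoWalTimeout);
    }
    None
}
```

In this sketch an idle timeline (no new WAL anywhere) never triggers `NoWalTimeout`, which is why a small `lagging_wal_timeout` is safe to configure.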