
Improve walreceiver logic

Open petuhovskiy opened this issue 1 year ago • 0 comments

  • It looks like the check etcd_info.timeline.commit_lsn > Some(self.local_timeline.get_last_record_lsn()) filtered out all safekeepers in some strange cases. I removed this filter, which should probably fix #2237.
  • walreceiver_connection now reports its status, including commit_lsn. This allows keeping the safekeeper connection even when etcd is down.
  • The safekeeper connection now fails if the pageserver doesn't receive any safekeeper messages for some time. Usually a safekeeper sends messages at least once per second.
  • The LaggingWal check now uses commit_lsn directly from the safekeeper. This fixes the issue with frequent reconnects when compute generates WAL really fast.
  • NoWalTimeout is rewritten to trigger only when we know about new WAL and the connected safekeeper doesn't stream any WAL. This allows setting a small lagging_wal_timeout, because it will trigger only when we observe that the connected safekeeper is stuck.
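
A rough sketch of how the three reconnect conditions above might fit together. This is not the code from this PR: the types ConnectionStatus, Candidate, Config and ReconnectReason, the function should_reconnect, and the fields keepalive_timeout and max_lsn_wal_lag are hypothetical names chosen for illustration, and Lsn is simplified to a plain u64. Only commit_lsn and lagging_wal_timeout are taken from the description.

```rust
use std::time::{Duration, Instant};

/// Log sequence number, modeled as a plain u64 for this sketch.
type Lsn = u64;

/// Hypothetical status reported by the active safekeeper connection.
#[derive(Debug, Clone, Copy)]
struct ConnectionStatus {
    /// Last time any message arrived from the connected safekeeper.
    last_message_at: Instant,
    /// Last time the connection delivered new WAL.
    last_wal_at: Instant,
    /// commit_lsn most recently reported by the connected safekeeper.
    commit_lsn: Lsn,
}

/// Hypothetical view of the best alternative safekeeper.
#[derive(Debug, Clone, Copy)]
struct Candidate {
    commit_lsn: Lsn,
}

/// Hypothetical configuration; only lagging_wal_timeout is named in the description.
struct Config {
    keepalive_timeout: Duration,
    lagging_wal_timeout: Duration,
    max_lsn_wal_lag: u64,
}

#[derive(Debug, PartialEq, Eq)]
enum ReconnectReason {
    /// The connected safekeeper sent nothing at all for too long.
    NoMessages,
    /// Another safekeeper is far ahead of the connected one.
    LaggingWal,
    /// New WAL is known to exist, but the connected safekeeper streams none.
    NoWalTimeout,
}

/// Decide whether to drop the current connection, and why.
fn should_reconnect(
    now: Instant,
    status: &ConnectionStatus,
    best_other: Option<&Candidate>,
    cfg: &Config,
) -> Option<ReconnectReason> {
    // 1. Fail the connection if the safekeeper has gone silent.
    if now.saturating_duration_since(status.last_message_at) > cfg.keepalive_timeout {
        return Some(ReconnectReason::NoMessages);
    }

    // Without information about other safekeepers (e.g. etcd is down),
    // keep the current connection as long as it is alive.
    let other = best_other?;

    // 2. LaggingWal: compare against the commit_lsn reported by the
    //    connected safekeeper itself.
    if other.commit_lsn > status.commit_lsn.saturating_add(cfg.max_lsn_wal_lag) {
        return Some(ReconnectReason::LaggingWal);
    }

    // 3. NoWalTimeout: only when new WAL is known to exist elsewhere and
    //    the connected safekeeper has not streamed any WAL for a while.
    let knows_new_wal = other.commit_lsn > status.commit_lsn;
    let idle = now.saturating_duration_since(status.last_wal_at);
    if knows_new_wal && idle > cfg.lagging_wal_timeout {
        return Some(ReconnectReason::NoWalTimeout);
    }

    None
}
```

In this sketch, a connection that keeps receiving messages is kept when no other candidate is known, which mirrors the "keep the connection when etcd is down" behavior. The NoWalTimeout branch cannot fire on an idle timeline, since it requires some other safekeeper to report a higher commit_lsn first.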

TODO: fix tests

petuhovskiy, Aug 11 '22 12:08