
[DocDB] WAL retention should not be based on a timeout; it should be based on an OpId that is needed by various entities in the system.

rthallamko3 opened this issue on Sep 22 '22 · 1 comment

Jira Link: DB-3619

Description

WAL retention should not be based on a timeout; it should be based on the OpIds needed by the various entities in the system: the Raft quorum, xCluster, index backfill, CDC, etc.

Problem: Currently, we retain WAL records for log_min_seconds_to_retain (default 900s), but this is just an artificial number. The primary underlying requirement is to ensure a leader can catch up its followers for up to follower_unavailable_considered_failed_sec (default 900s), which we explicitly keep in sync with WAL retention. We also need the same retention on all followers, since any of them can become the leader. In theory, though, we only need enough WAL entries to catch up any fallen-behind tablet peer. In Raft terms, we only need to retain entries with OpIds higher than the "all replicated op id", so that no matter who becomes the leader, it has all the entries necessary for any peer, however far behind that peer is. And when a new peer gets added and goes through RBS, it will get the same WALs as the leader, so this is still safe.

Moving away from time-based retention should help reduce the amount of data on disk:

- In steady state, we retain a minimum of 2 segments. We probably only need 1!
- Under load, we can grow to a high number of segments, e.g. in the recent #proj-tablet-splitting test, we had ~70 segments in the 15m retention window.
- For index backfill, we will aggressively be loading data into the index table, leading to a high number of WAL segments as well.
- It would also help with RBS in general (less data to ship).
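To make the "all replicated op id" idea concrete, here is a minimal sketch (not YugabyteDB's actual API; `OpId` and `wal_gc_bound` are hypothetical names) of computing the point below which WAL entries are safe to garbage-collect: the minimum of the OpId replicated to every Raft peer and the OpIds still needed by other consumers such as xCluster or CDC.

```python
from dataclasses import dataclass


@dataclass(frozen=True, order=True)
class OpId:
    """A Raft operation id, ordered by (term, index)."""
    term: int
    index: int


def wal_gc_bound(all_replicated: OpId, consumer_op_ids) -> OpId:
    """Return the OpId below which WAL entries may be deleted.

    all_replicated: highest OpId known to be replicated to every Raft peer;
    entries below it are not needed to catch up any existing peer.
    consumer_op_ids: OpIds still needed by other entities (xCluster, CDC,
    index backfill, ...). The GC bound is the minimum of all of these.
    """
    return min([all_replicated, *consumer_op_ids])
```

For example, even if every peer has replicated up to OpId(3, 40), a CDC stream still reading from OpId(2, 100) holds the GC bound back to OpId(2, 100).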

Proposal: We should move away from time-based WAL retention to need-based WAL retention: any entity that needs the WAL to be retained explicitly sets its requirement, and that intention is itself replicated (so it is not lost on leader changes, etc.).
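A minimal sketch of what "explicitly sets it" could look like, assuming a per-tablet registry keyed by entity name (all names here are illustrative, and a real implementation would replicate this state through Raft so it survives leader changes, as the proposal requires):

```python
class WalRetentionRegistry:
    """Hypothetical need-based retention registry.

    Each entity (xCluster, CDC, index backfill, ...) registers the lowest
    WAL op index it still needs; GC may only delete entries below the
    minimum over all registered needs.
    """

    def __init__(self):
        self._needs = {}  # entity name -> lowest op index it still needs

    def set_need(self, entity: str, op_index: int) -> None:
        self._needs[entity] = op_index

    def clear_need(self, entity: str) -> None:
        self._needs.pop(entity, None)

    def gc_bound(self, all_replicated_index: int) -> int:
        # Never GC past what some registered entity still needs, and never
        # past entries not yet replicated to every Raft peer.
        return min([all_replicated_index, *self._needs.values()])
```

With no registered needs, retention advances with the all-replicated index; once an entity registers, it pins retention until it clears or advances its need.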

Also, retryable_request_timeout_secs should be used to identify the oldest active request whose WAL entries need to be retained, rather than being translated into a strict time duration for which the WAL must be held. In a system that is making healthy progress, with no active client requests older than, say, 10 seconds, WAL retention should move forward: the node should hold only 10 seconds of log, not retryable_request_timeout_secs worth of WAL.
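A sketch of that cutoff calculation (function and parameter names are hypothetical): the retention cutoff tracks the oldest still-active request, and requests older than the timeout no longer pin the WAL because they have expired.

```python
import time


def retention_cutoff(active_request_start_times,
                     retryable_request_timeout_secs,
                     now=None):
    """Return the earliest wall-clock time whose WAL must still be held.

    Rather than always holding retryable_request_timeout_secs worth of
    WAL, hold only back to the oldest request that is still active.
    """
    now = time.time() if now is None else now
    window_start = now - retryable_request_timeout_secs
    # Requests that started before the timeout window have expired and
    # no longer require WAL retention.
    still_active = [t for t in active_request_start_times
                    if t >= window_start]
    # With no active requests, nothing pins the WAL and retention can
    # advance all the way to "now".
    return min(still_active, default=now)
```

For example, with a 900s timeout but the oldest active request only 10s old, the cutoff is 10s back, so only 10 seconds of log are held on that node.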

rthallamko3 · Sep 22 '22 17:09

One note: we already have retention by OpId for xCluster, but that currently only allows extending retention beyond the default minimum time/segments.

bmatican · Sep 22 '22 18:09