Proposal to change the semantics of the last stable offset to always mean committed offset rather than "raft last visible index"
In the write caching design we made the assumption that the last stable offset is always equal to or less than the committed offset: when transactions are in use we always wait for the commands to be applied to the STMs, and commands above the committed offset are never applied.
The archival/tiered storage correctness assumption builds on that: because we upper-bound the data uploaded to tiered storage at the last stable offset, uploads should never see suffix truncations caused by data loss. Tiered storage has no concept of suffix truncation, so if one happened it would lead to violated correctness properties and diverging logs/undefined behavior.
https://github.com/redpanda-data/redpanda/blob/2823c0919ffcc13e0357632958c866bcb6fb42f3/src/v/cluster/partition.cc#L1248-L1254
Now, after unrelated auditing of the code base, I have discovered that we didn't account for what happens when transactions are disabled, i.e. when rm_stm is not present on a partition. In that case the Last Stable Offset is defined to be equivalent to the High Watermark.
That might cause problems when write caching is in use. One possible fix is to bound archival uploads to the minimum of the last stable offset and the raft committed offset. All other potential users of LSO should also keep this in mind.
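To make this concrete, here is a simplified sketch of the current selection logic (the names and shape are assumed for illustration; the real code is in partition.cc linked above):

```cpp
#include <cstdint>
#include <optional>

using offset = int64_t;

// Stand-in for the transactions state machine.
struct rm_stm {
    // Derived from applied commands, which are always at or below the
    // raft committed offset, so lso <= committed holds here.
    offset lso;
};

struct partition {
    std::optional<rm_stm> tx_stm; // absent when transactions are disabled
    offset high_watermark;        // can run ahead of committed with write caching
    offset raft_committed_offset;

    offset last_stable_offset() const {
        if (tx_stm) {
            return tx_stm->lso;
        }
        // Fallback: with no rm_stm, LSO is defined as the high watermark.
        // Under write caching HWM may exceed the committed offset, so an
        // archival upper bound taken from this LSO can include data that
        // raft may later suffix-truncate.
        return high_watermark;
    }
};
```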
What if we instead redefine what LSO means when write caching is in use, regardless of whether rm_stm exists? We can make it always return the minimum of the traditional LSO (below undecided transactions) and the raft committed offset.
This would keep the partition interface lean without introducing another offset at this layer. It might come as a surprise that LSO doesn't match HWM when no transactions are in use, but it is already like this today when write caching is in use, or when rm_stm lags behind applying commands (ref https://github.com/redpanda-data/redpanda/blob/88ac775f9f7954330732024abfa6e9ed5c9c11fd/src/v/cluster/rm_stm.cc#L1293).
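A rough sketch of the proposed semantics (assumed, simplified signature; not the actual partition interface):

```cpp
#include <algorithm>
#include <cstdint>

using offset = int64_t;

// traditional_lso: the offset below undecided transactions when rm_stm
// exists, otherwise the high watermark.
offset last_stable_offset(offset traditional_lso,
                          offset raft_committed_offset,
                          bool write_caching_enabled) {
    if (write_caching_enabled) {
        // Never advertise data that raft could still suffix-truncate.
        return std::min(traditional_lso, raft_committed_offset);
    }
    return traditional_lso;
}
```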
JIRA Link: CORE-2766
cc @bharathv
When you say write caching, I assume you are using it as an umbrella term for acks=0/1 and actual write caching with acks=all?
> That might cause problems when write caching is in use. One possible fix is to bound archival uploads to the minimum of the last stable offset and the raft committed offset. All other potential users of LSO should also keep this in mind.
TIL, that seems sketchy... I don't fully understand that code, but what happens when there is a truncation today during write caching? I think always using the raft committed offset (like the STMs do) seems less controversial to me. Even with rm_stm, it still returns the high watermark equivalent if there is no acks=all data; doesn't that pose any problems?
> What if we instead redefine what LSO means when write caching is in use, regardless of whether rm_stm exists? We can make it always return the minimum of the traditional LSO (below undecided transactions) and the raft committed offset.
I think that slows down consumers of acks=0/1 data: they would have to wait for data to be raft committed before they can consume it. I remember changing this at some point and a bunch of acks=0/1 tests failed.
@Lazin, please read this thread. tl;dr: I believe we need to define a "safe_to_upload" offset method in cloud storage which is the minimum of the LSO and the raft committed offset. Nothing guarantees that LSO <= raft committed offset, but it seems that cloud storage correctness (wrongly) assumes it does.
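Something along these lines (safe_to_upload is just the proposed name; the signature and placement are assumptions):

```cpp
#include <algorithm>
#include <cstdint>

using offset = int64_t;

// Upload bound for the archival layer: stay below open transactions
// (last stable offset) and below anything raft could still truncate
// (raft committed offset).
offset safe_to_upload(offset last_stable_offset, offset raft_committed_offset) {
    return std::min(last_stable_offset, raft_committed_offset);
}
```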
The assumption was correct when it was introduced; because of that, this might not be the only place where we make it.