rangefeed: catch-up scan assertion may misfire on out-of-window intent
Describe the problem
The rangefeed catch-up scan has a bug. If we hit this code path (as we frequently do):
https://github.com/cockroachdb/cockroach/blob/480a3f07f01e158ddda7433f69806af036179fba/pkg/kv/kvserver/rangefeed/catchup_scan.go#L261-L267
Then the iterator may be positioned on an intent that is outside of the time window we're catching up on. This can happen if the current version is the last version for the key, and the next key is an intent belonging to a long-running (perhaps abandoned) transaction:
/a @ 100 <-- iterator (initially)
/b @ meta[ts=50] <-- iterator (after NextIgnoringTime)
/b @ 50
/c @ 200 <-- iterator (after Next)
However, the next iteration of the loop which encounters that intent implicitly assumes that the intent is in the time window, because it goes to the next key expecting that to be the version referenced by the intent:
https://github.com/cockroachdb/cockroach/blob/480a3f07f01e158ddda7433f69806af036179fba/pkg/kv/kvserver/rangefeed/catchup_scan.go#L161-L186
In reality though, after the first call to i.Next(), we'll be on some version that is in the window (since Next() will only ever return such keys), and possibly on an unrelated key anyway; in the example above we'll try to match b @ meta[ts=50] up with c @ 200, which is obviously bogus.
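The mismatch can be modeled with a toy iterator. This is only a sketch under assumptions, not the code in catchup_scan.go: the `version`, `toyIter`, and `checkIntent` names are hypothetical, timestamps are plain integers, the window is assumed to be (start, end] with start=75 and end=300, and the check compares keys as well as timestamps for readability.

```go
package main

import "fmt"

// version is a hypothetical, simplified MVCC entry. For an intent metadata
// record, meta is true and ts is the timestamp of its provisional value.
type version struct {
	key  string
	ts   int64
	meta bool
}

// toyIter is a toy time-bound iterator over a sorted slice of entries.
// The window is assumed to be (start, end].
type toyIter struct {
	data       []version
	pos        int
	start, end int64
}

func (it *toyIter) inWindow(v version) bool {
	return v.ts > it.start && v.ts <= it.end
}

// NextIgnoringTime advances by exactly one entry, inside the window or not.
func (it *toyIter) NextIgnoringTime() { it.pos++ }

// Next advances to the next entry whose timestamp lies inside the window.
func (it *toyIter) Next() {
	it.pos++
	for it.pos < len(it.data) && !it.inWindow(it.data[it.pos]) {
		it.pos++
	}
}

func (it *toyIter) Valid() bool  { return it.pos < len(it.data) }
func (it *toyIter) Cur() version { return it.data[it.pos] }

// checkIntent mimics the implicit assumption described above: on intent
// metadata, step with Next and expect to land on the provisional value.
func checkIntent(it *toyIter) error {
	m := it.Cur()
	it.Next()
	if !it.Valid() {
		return fmt.Errorf("expected provisional value for %s @ %d, found end of data", m.key, m.ts)
	}
	if got := it.Cur(); got.key != m.key || got.ts != m.ts {
		return fmt.Errorf("expected provisional value for %s @ %d, found %s @ %d",
			m.key, m.ts, got.key, got.ts)
	}
	return nil
}

func main() {
	it := &toyIter{
		data: []version{
			{"/a", 100, false},
			{"/b", 50, true}, // intent metadata, outside the window
			{"/b", 50, false}, // provisional value, outside the window
			{"/c", 200, false},
		},
		start: 75,
		end:   300,
	}
	it.NextIgnoringTime() // lands on /b @ meta[ts=50]
	fmt.Println(checkIntent(it))
}
```

In this model, Next() steps over the out-of-window provisional value /b @ 50 and lands on /c @ 200, so the check reports a mismatch even though the data is consistent.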
To Reproduce
Set up the above constellation in a unit test and watch it fail. I have not done this.
Expected behavior
When the catch-up scan sees an intent, it needs to verify whether the intent's timestamp is in the time window. If it is not, the intent should be skipped outright.
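One way to picture that check is a small guard that decides how to handle an intent based on its timestamp. This is a hedged sketch, not the actual fix: `intentAction` and `classifyIntent` are hypothetical names, timestamps are plain integers, and the window is assumed to be (start, end].

```go
package main

import "fmt"

// intentAction is a hypothetical description of what the scan should do
// when it is positioned on intent metadata.
type intentAction int

const (
	// skipIntentOnly: the intent is outside the window, so skip the
	// metadata without expecting a provisional value to follow.
	skipIntentOnly intentAction = iota
	// skipIntentAndProvisional: the intent is inside the window, so the
	// next visible entry should be its provisional value; verify and skip it.
	skipIntentAndProvisional
)

// classifyIntent checks the intent timestamp against the assumed
// (start, end] window before any assertion about the next entry is made.
func classifyIntent(intentTS, start, end int64) intentAction {
	if intentTS > start && intentTS <= end {
		return skipIntentAndProvisional
	}
	return skipIntentOnly
}

func main() {
	// With the window (75, 300], the intent at ts=50 from the example
	// is out of window, while an intent at ts=150 would be in window.
	fmt.Println(classifyIntent(50, 75, 300) == skipIntentOnly)
	fmt.Println(classifyIntent(150, 75, 300) == skipIntentAndProvisional)
}
```

The point of the design is ordering: the window check happens before the scan steps to the next entry, so the provisional-value assertion only runs when that value is guaranteed to be visible to a time-bound Next().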
Additional data / screenshots
Environment:
I believe all 22.1.x versions are affected, since TBI support for with-diff rangefeeds was introduced in https://github.com/cockroachdb/cockroach/pull/80673 in v22.1.0-beta.5 (according to backboard).
Additional context
Seen in https://github.com/cockroachlabs/support/issues/1752, where it caused a rangefeed (and thus changefeed job) to fail.
cc @cockroachdb/replication
Zendesk ticket #13619 has been linked to this issue.