
rangefeed: catch-up scan assertion may misfire on out-of-window intent

Open tbg opened this issue 1 year ago • 2 comments

Describe the problem

The rangefeed catch-up scan has a bug. If we hit this code path (as we frequently do):

https://github.com/cockroachdb/cockroach/blob/480a3f07f01e158ddda7433f69806af036179fba/pkg/kv/kvserver/rangefeed/catchup_scan.go#L261-L267

Then the iterator may be positioned on an intent that is outside of the time window we're catching up on. This can happen if the current version is the last version for the key, and the next key has an intent belonging to a long-running (perhaps abandoned) transaction:

/a @ 100 <-- iterator (initially)
/b @ meta[ts=50] <-- iterator (after NextIgnoringTime)
/b @ 50
/c @ 200 <-- iterator (after Next)

However, the next iteration of the loop, which encounters that intent, implicitly assumes that the intent is in the time window, because it advances to the next key expecting that to be the version referenced by the intent:

https://github.com/cockroachdb/cockroach/blob/480a3f07f01e158ddda7433f69806af036179fba/pkg/kv/kvserver/rangefeed/catchup_scan.go#L161-L186

In reality though, after the first call to i.Next(), we'll be on some version that is in the window (since Next() will only ever return such keys), and possibly on an unrelated key altogether. In the example above, we'll try to match /b @ meta[ts=50] up with /c @ 200, which is obviously bogus.
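To make the mismatch concrete, here is a minimal toy model in Go. It is not the real catch-up scan or the real iterator: `entry`, `timeBoundIter`, `scenario`, and `naivePair` are hypothetical names, and the window (75, 250] is an assumed one chosen so that /a @ 100 and /c @ 200 are inside it and /b @ 50 is not. It only mimics the two behaviors described above: a `Next` that surfaces in-window entries only, and a `NextIgnoringTime` that surfaces whatever comes next.

```go
package main

// entry is a hypothetical stand-in for one thing the iterator can land on.
type entry struct {
	key      string
	ts       int  // version timestamp, or the intent's timestamp on a meta record
	isIntent bool // true for an intent meta record
}

// timeBoundIter is a toy time-bound iterator over a sorted slice.
// The window is assumed to be (start, end].
type timeBoundIter struct {
	data       []entry
	pos        int
	start, end int
}

func (it *timeBoundIter) inWindow(e entry) bool { return e.ts > it.start && e.ts <= it.end }
func (it *timeBoundIter) Valid() bool           { return it.pos < len(it.data) }
func (it *timeBoundIter) Cur() entry            { return it.data[it.pos] }

// NextIgnoringTime steps to the very next entry, in or out of the window.
func (it *timeBoundIter) NextIgnoringTime() { it.pos++ }

// Next steps to the next entry that lies inside the window.
func (it *timeBoundIter) Next() {
	it.pos++
	for it.pos < len(it.data) && !it.inWindow(it.data[it.pos]) {
		it.pos++
	}
}

// scenario builds the constellation from the example, positioned on /a @ 100.
func scenario() *timeBoundIter {
	return &timeBoundIter{
		data: []entry{
			{key: "/a", ts: 100},
			{key: "/b", ts: 50, isIntent: true},
			{key: "/b", ts: 50},
			{key: "/c", ts: 200},
		},
		start: 75,
		end:   250,
	}
}

// naivePair models the faulty assumption: positioned on an intent, it calls
// Next and treats whatever it lands on as the intent's provisional value.
func naivePair(it *timeBoundIter) (intent, value entry, ok bool) {
	intent = it.Cur()
	it.Next()
	if !it.Valid() {
		return intent, entry{}, false
	}
	return intent, it.Cur(), true
}
```

Calling `it := scenario(); it.NextIgnoringTime(); naivePair(it)` in this model pairs the /b intent with /c @ 200, because `Next` steps over the out-of-window /b @ 50.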

To Reproduce

Set up the above constellation in a unit test and watch it fail. I have not done this.

Expected behavior

When the catch-up scan sees an intent, it needs to check whether the intent is in the window. If it is not, the intent should be skipped outright.
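A rough sketch of that check, again as a standalone Go toy rather than a patch against the real code. `record`, `inWindow`, and `handleIntent` are hypothetical names, the (start, end] bound inclusivity is an assumption, and the in-window branch (pair the intent with the next in-window entry and error on a mismatch) is only my reading of the behavior described above.

```go
package main

import "fmt"

// record is a hypothetical stand-in for one entry the scan can observe.
type record struct {
	key      string
	ts       int  // version timestamp, or the intent's timestamp on a meta record
	isIntent bool // true for an intent meta record
}

// inWindow reports whether ts lies in the catch-up window, assumed (start, end].
func inWindow(ts, start, end int) bool { return ts > start && ts <= end }

// handleIntent is called with the scan positioned on an intent. next must
// behave like a time-bound Next, i.e. only ever return in-window records.
// It reports whether a record was consumed from next.
func handleIntent(intent record, start, end int, next func() (record, bool)) (consumed bool, err error) {
	if !inWindow(intent.ts, start, end) {
		// The provisional value shares the intent's timestamp, so next would
		// never return it. Skip the intent without trying to pair it up.
		return false, nil
	}
	val, ok := next()
	if !ok || val.isIntent || val.key != intent.key || val.ts != intent.ts {
		return ok, fmt.Errorf("expected provisional value for intent %s@%d", intent.key, intent.ts)
	}
	return true, nil
}
```

With the window check in front, the out-of-window /b intent from the example never reaches the pairing step, so the bogus comparison against /c @ 200 cannot happen.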

Additional data / screenshots

Environment:

I believe all 22.1.x versions are affected, since TBI support for with-diff rangefeeds was introduced in https://github.com/cockroachdb/cockroach/pull/80673 in v22.1.0-beta.5 (according to backboard).

Additional context

Seen in https://github.com/cockroachlabs/support/issues/1752, where it caused a rangefeed (and thus changefeed job) to fail.

tbg avatar Aug 10 '22 13:08 tbg

cc @cockroachdb/replication

blathers-crl[bot] avatar Aug 10 '22 13:08 blathers-crl[bot]

Zendesk ticket #13619 has been linked to this issue.

RoachietheSupportRoach avatar Aug 11 '22 09:08 RoachietheSupportRoach