vitess
vitess copied to clipboard
Bug Report: VDiff2 topo keyspace locks can be released too early
Overview of the Issue
If the context passed in to the Lock function was canceled / timed out / had exceeded its deadline after the lock was acquired, Vitess silently stopped sending keepAlive requests to etcd, and the lock would be lost after reaching the lease ttl.
We encountered this while using VDiff2. When a vdiff workflow was stopped / resumed, we'd sometimes end with the source and target streams using different snapshots, or vreplication streams ending up in weird states. The VDiff code was using a keyspace lock to prevent concurrent modification of vreplication tables, but somehow we saw that concurrent modification was happening anyway.
Digging deeper, we noticed that stopping a vdiff workflow is implemented using a cancelable context, and that cancelation would eventually propagate down to the context passed to Lock. If the context was canceled after acquiring the lock, the lock was released before the call to Unlock, and the call to Unlock failed with:
E1121 15:00:14.185298 33058 table_differ.go:105] UnlockKeyspace issues_pull_requests_ks failed: <nil>
Due to a separate bug in the VDiff code, we can't see the unlock failure message here (only <nil>), but I'm pretty sure the unlocking failed because the lock was no longer held.
Reproduction Steps
N/A
Binary Version
v15.0.0
Operating System and Environment details
N/A
Log Fragments
No response
I'm going to make this about VDiff2 for now. It's a general issue for VReplication workflows, and we can start by addressing VDiff2 specifically, e.g. potentially with a new shard/workflow lock.
@arthurschreiber I'd like to try and better understand what you saw. For VDiff2 specifically, we have a mutex (snapshotMu) which should prevent this kind of concurrent modification of the shared vreplication workflow record on each target primary: https://github.com/vitessio/vitess/blob/main/go/vt/vttablet/tabletmanager/vdiff/table_differ.go#L82-L141
Were you also running other VReplication commands during this time, e.g. SwitchTraffic? That could explain it, and that's where a new topo lock might make sense to coordinate across the packages.