vitess
vitess copied to clipboard
Etcd2Topo: Use node's ModRevision consistently for in-memory topo.Version value
Description
Etcd has several version related fields/values in each key's header: Revision
, ModRevision
, and Version
. These are related but distinct things:
- https://github.com/etcd-io/etcd/issues/6518
- https://etcd.io/docs/v3.5/learning/data_model/
To summarize:
-
Revision
: a logical clock used to track changes to the etcd keyspace (a transaction / snapshot ID) -
Version
: a logical clock used to track the changes to an individual key since it was last created (it is not monotonic) -
ModRevision
: the etcd keyspace revision number for the key at that point in time (the logical version of the key)
The Revision
is what allows for consistent reads and writes of multiple keys (in a transaction). It allows you, for example, to do a consistent read / scan of N keys at that logical point-in-time.
In Vitess we also have awareness of each value and read/use these values in memory in the etcd2topo
topo server implementation. There's also a generic and opaque topo.Version
type — which can be set to either of the two lower level node/key specific values: ModRevision
or Version
— that we use for strict serializability.
In https://github.com/vitessio/vitess/pull/15701 I aligned the Vitess and etcd2topo
lower level version related fields, variable names, and comments for watches: https://github.com/vitessio/vitess/pull/15701/files/f1fed6dd787feea26a4f4e13f660e8eaa91e64bb#diff-3f055ca1a4446966abba5a840e13c5c6a11e73121143a66b06070a356425d676
In later unrelated investigations, however, I realized that the use of Version: EtcdVersion(initial.Kvs[0].ModRevision)
in Watch()
was intentional and consistent with how we're using ModRevision
for the node's topo.Version
value in file.go
. The reason being that we use this to enforce strict serializability when a topo.Version
value is provided — failing the write if the topo.Version
value provided is not equal to the current ModRevision
of the key — and we return the new topo.Version
value based on the new ModRevision
value (technically we return the Revision
value from the response header, but that would also be the ModRevision
value for the key we updated in the transaction) on successful update: https://github.com/vitessio/vitess/blob/f27d287e0873d139ad2f71588068256e28a16f6a/go/vt/topo/etcd2topo/file.go#L48-L74
We use the ModRevision
value there as etcd's transaction response includes the new ModRevision
from the commit
(again, technically it's the Revision
value from the response header, but that would also be the ModRevision
value for the key we updated in the transaction) but NOT the new node/key Version
. This makes sense because a transaction can update any arbitrary number of keys. So the only logical version that can be provided for that transaction is the keyspace Revision
, which would be the ModRevision
value for each of the keys updated in the transaction. This is the logical point in time for the state of the data in etcd — equivalent to a GTID position/set in MySQL.
That is used e.g. with Shard
records/keys and the ShardInfo
in memory structure: https://github.com/vitessio/vitess/blob/f27d287e0873d139ad2f71588068256e28a16f6a/go/vt/topo/shard.go#L165-L222
One way for us to think about this: an etcd watch is equivalent to a (filtered) binlog stream with MySQL. We don’t know the version of a table or row, we only know their state at a logical point in time — the GTID position. The revision in etcd is that logical point in time / position in the stream of events/changes.
So with all of this in mind, we should also be using ModRevision
for the topo.Version
value in Watch()
— and beyond that, we should be using it for all topo.Version
values within the etcd2topo
implementation. Please see the discussion from here on for further reasons why.
So in the end, this PR does just that: it uses the etcd ModRevision
value for the logical topo.Version
value of keys across the etcd2topo
implementation and adds a test to check and enforce that for Watch()
.
Related Issue(s)
- Follow-up to: https://github.com/vitessio/vitess/pull/15701
Checklist
- [x] "Backport to:" labels have been added if this change should be back-ported to release branches
- [x] If this change is to be back-ported to previous releases, a justification is included in the PR description
- [x] Tests were added or are not required
- [x] Did the new or modified tests pass consistently locally and on CI?
- [x] Documentation was added or is not required
Review Checklist
Hello reviewers! :wave: Please follow this checklist when reviewing this Pull Request.
General
- [ ] Ensure that the Pull Request has a descriptive title.
- [ ] Ensure there is a link to an issue (except for internal cleanup and flaky test fixes), new features should have an RFC that documents use cases and test cases.
Tests
- [ ] Bug fixes should have at least one unit or end-to-end test, enhancement and new features should have a sufficient number of tests.
Documentation
- [ ] Apply the
release notes (needs details)
label if users need to know about this change. - [ ] New features should be documented.
- [ ] There should be some code comments as to why things are implemented the way they are.
- [ ] There should be a comment at the top of each new or modified test to explain what the test does.
New flags
- [ ] Is this flag really necessary?
- [ ] Flag names must be clear and intuitive, use dashes (
-
), and have a clear help text.
If a workflow is added or modified:
- [ ] Each item in
Jobs
should be named in order to mark it asrequired
. - [ ] If the workflow needs to be marked as
required
, the maintainer team must be notified.
Backward compatibility
- [ ] Protobuf changes should be wire-compatible.
- [ ] Changes to
_vt
tables and RPCs need to be backward compatible. - [ ] RPC changes should be compatible with vitess-operator
- [ ] If a flag is removed, then it should also be removed from vitess-operator and arewefastyet, if used there.
- [ ]
vtctl
command output order should be stable andawk
-able.
To clarify, the use of Version
is not causing a functional issue in Vitess, right? This change is only to be consistent?
Since we are watching a specific key, it shouldn't matter if we use Version
or ModRevision
as both will increment anytime a key is updated.
To clarify, the use of
Version
is not causing a functional issue in Vitess, right? This change is only to be consistent?Since we are watching a specific key, it shouldn't matter if we use
Version
orModRevision
as both will increment anytime a key is updated.
Quote from the description: this PR is really a 1 line revert from the noted PR.
So the way I'm reading this is that there was an unnecessary change in a previous PR which may actually cause incorrect behavior, and this PR is fixing that.
Codecov Report
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 68.42%. Comparing base (
f118ba2
) to head (0ea964b
). Report is 66 commits behind head on main.
Additional details and impacted files
@@ Coverage Diff @@
## main #15847 +/- ##
==========================================
+ Coverage 68.40% 68.42% +0.01%
==========================================
Files 1556 1559 +3
Lines 195121 196537 +1416
==========================================
+ Hits 133479 134482 +1003
- Misses 61642 62055 +413
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.
To clarify, the use of
Version
is not causing a functional issue in Vitess, right? This change is only to be consistent?Since we are watching a specific key, it shouldn't matter if we use
Version
orModRevision
as both will increment anytime a key is updated.
Version
and ModRevision
will both be increasing, yes. So you can use one or the other as long as you're consistent. They just aren't the same logical or literal value (as Revision
is incremented each time the etcd keyspace is updated). So it matters which one you return, as the caller can then pass that back in on subsequent topo calls, as we then check that this value is EQUAL TO the key's current ModRevision (if a topo.Value
was passed in, e.g. to Update()
).
AFAIK, the logical version within watches only comes in to play when we need to resume/restart the watch: https://github.com/vitessio/vitess/blob/7603abc53459065f4420c92a0ae29f5ccb1cad10/go/vt/topo/etcd2topo/watch.go#L79-L110
As you can see there, it's already using the revision value (etcd keyspace version) for both. BUT, in looking at that I noticed that in a relatively new usage of watchData{}
from https://github.com/vitessio/vitess/pull/10906 it was using Version
rather than ModRevision
. I can now understand WHY that was done at the time, but I've now updated that to use ModRevision
just as the normal (non-recursive) Watch
does for the same reasons as noted in the issue description.
I checked out your branch. I seem to see Version
and ModRevision
still used inconsistently. What am I missing?
Since no test is failing, I am assuming we don't really use the version value anywhere. Looking at a couple of watchers (Shard and SrvVSchema, we only look at the new contents of the key and not the version). Otherwise I would have expected some unit test to have failed with the changes in this PR or the previous one.
In etcd2topo/watch.go
.Watch()
_the initial value uses ModRevision
https://github.com/vitessio/vitess/blob/d49f984e9f0c61d9289f6b76bba081e44d860d03/go/vt/topo/etcd2topo/watch.go#L52-L58
For new values it sends Version
https://github.com/vitessio/vitess/blob/d49f984e9f0c61d9289f6b76bba081e44d860d03/go/vt/topo/etcd2topo/watch.go#L136-L140
In WatchRecursive()
we first use Version
for the initial value and then ModRevision
for new values.
initial: https://github.com/vitessio/vitess/blob/d49f984e9f0c61d9289f6b76bba081e44d860d03/go/vt/topo/etcd2topo/watch.go#L176-L182
watcher: https://github.com/vitessio/vitess/blob/4054eb7ccfed299d4bdc664422ff3d75b725f490/go/vt/topo/etcd2topo/watch.go#L257-L264
OK, I just saw after my comment above that you have a new commit where everything is set as ModRevision
. I guess this should be good. It will still be useful to have a unit test that shows that we get newer revisions uniformly in the watcher, since we think it is worthwhile to modify this behaviour.
OK, I just saw after my comment above that you have a new commit where everything is set as
ModRevision
. I guess this should be good.
Yes, I noticed this too in answering your previous question and I updated things to be consistent (also updated the description for further clarification).
It will still be useful to have a unit test that shows that we get newer revisions uniformly in the watcher, since we think it is worthwhile to modify this behaviour.
Sure, I can add one.
Thinking about this some more: the watchers are watching a specific key, right? So shouldn't we be passing Version
which is key-specific, rather than the global ModRevision
. For example, in a unit test, if the SrvVSchema is changed thrice, we can expect the watcher to stream us versions 1, 2 and 3, but we have no idea from the ModRevision
whether we have received all the updates.
When will using Version
everywhere be wrong?
Thinking about this some more: the watchers are watching a specific key, right? So shouldn't we be passing
Version
which is key-specific, rather than the globalModRevision
. For example, in a unit test, if the SrvVSchema is changed thrice, we can expect the watcher to stream us versions 1, 2 and 3, but we have no idea from theModRevision
whether we have received all the updates.
ModRevision
is a metadata property of the key. That value is monotonically increasing. Version
goes back to 0 and then 1 if the key is deleted and recreated. The watchers have never guaranteed that you don't miss an intermediate value AFAIK. They can't, as etcd will do compactions and a user can request it as well. This effectively prunes the history. So in various scenarios (including etcd failover or vitess process restart/failover) you will have to restart your watch and you may have missed intermediate values as they are no longer available and you start with the current/latest version.
https://github.com/vitessio/vitess/blob/96c43e40b86dd03b8a462507ebd82fbc5c927f8e/go/vt/topo/etcd2topo/watch.go#L63-L66
https://github.com/etcd-io/etcd/blob/35361955578322a55841c1ed962c53e28658d951/client/v3/watch.go#L53-L82
It’s worth noting why Watch actually exists. It’s not because you want to process every change of a key (it doesn’t offer those guarantees) — it’s that you want to get the new version of a key when it’s changed. In other words, you always want your in-memory representation to be up-to-date / have the latest values from the database. https://etcd.io/docs/v3.5/tutorials/how-to-watch-keys/ It only guarantees that you (eventually) get notifications about future changes to the key (with "future" being based on the revision). That’s it. If you’re not already caught up, you can miss intermediate changes.
And it's important to note that the etcd client library only takes revisions on Get
,Watch
,Update
,List
etc as there is no clientv3.WithVersion
:
s.cli.Watch(watchCtx, nodePath, clientv3.WithRev(rev))
For example, get: https://github.com/etcd-io/etcd/blob/35361955578322a55841c1ed962c53e28658d951/client/v3/kv.go#L41-L49
You're saying that you want to watch this key from this database snapshot. And this is the only way that you can coordinate across keys, whether you're using transactions, multi-key watches (as we do with WatchRecursive
), prefix scans, etc.
It's also worth noting that I haven't yet found places where Etcd itself, in the product or in the documentation, uses the key's Version
. For example: https://etcd.io/docs/v3.5/dev-guide/interacting_v3/#watch-historical-changes-of-keys
So perhaps the key's Version
was more of an etcdv2 thing — as keyspace revisions were added in storage v3 — that is being phased out. I'm not sure. It also makes sense to do so given the support for (prefix) scans and transactions in v3 so that you have uniformity in the API whether you're interacting with a single key or multiple ones.
When will using
Version
everywhere be wrong?
Wrong is not the correct word to use here IMO. We have to define the semantics and guarantees — only then can we know the best option. I've made every attempt to explain why ModRevision
has to be used in the places where we care most about the topo.Version
value (see updated issue description). So we know that we can't use it everywhere. If you only mean everywhere in the Watch then I get the inclination. The downside I see there is that if you read the latest version of a key from the watcher, then update it, you won't be able to save that back in the topo if you provide the topo.Version
value you have. So you'd be forced to do a blind write.
So I take this in total to mean that the key's Version
value is handy for humans but what really matters is its ModRevision
value — as that's what is used for consistent reads and writes and thus what you pass when you want to read, watch, update the key etc — really any time you're interfacing with the database on a request now.
After further investigation (see my previous comment: https://github.com/vitessio/vitess/pull/15847#issuecomment-2096935831) I think that we should use Revision/ModRevision
everywhere as the topo.Version
value because that seems to be the only thing that etcdv3 cares about. In this PR, we're already doing that with one exception:
❯ git grep -E 'Version|Revision' | grep etcd2topo | grep -v "Apache License, Version 2.0"
go/vt/topo/etcd2topo/election.go: clientv3.WithSort(clientv3.SortByModRevision, clientv3.SortAscend),
go/vt/topo/etcd2topo/election.go: clientv3.WithSort(clientv3.SortByModRevision, clientv3.SortAscend),
go/vt/topo/etcd2topo/election.go: watcher := mp.s.cli.Watch(ctx, electionPath, clientv3.WithPrefix(), clientv3.WithRev(initial.Header.Revision))
go/vt/topo/etcd2topo/election.go: clientv3.WithSort(clientv3.SortByModRevision, clientv3.SortAscend),
go/vt/topo/etcd2topo/file.go:func (s *Server) Create(ctx context.Context, filePath string, contents []byte) (topo.Version, error) {
go/vt/topo/etcd2topo/file.go: If(clientv3.Compare(clientv3.Version(nodePath), "=", 0)).
go/vt/topo/etcd2topo/file.go: return EtcdVersion(txnresp.Header.Revision), nil
go/vt/topo/etcd2topo/file.go:func (s *Server) Update(ctx context.Context, filePath string, contents []byte, version topo.Version) (topo.Version, error) {
go/vt/topo/etcd2topo/file.go: If(clientv3.Compare(clientv3.ModRevision(nodePath), "=", int64(version.(EtcdVersion)))).
go/vt/topo/etcd2topo/file.go: return nil, topo.NewError(topo.BadVersion, nodePath)
go/vt/topo/etcd2topo/file.go: return EtcdVersion(txnresp.Header.Revision), nil
go/vt/topo/etcd2topo/file.go: return EtcdVersion(resp.Header.Revision), nil
go/vt/topo/etcd2topo/file.go:func (s *Server) Get(ctx context.Context, filePath string) ([]byte, topo.Version, error) {
go/vt/topo/etcd2topo/file.go: return resp.Kvs[0].Value, EtcdVersion(resp.Kvs[0].ModRevision), nil
go/vt/topo/etcd2topo/file.go: results[n].Version = EtcdVersion(pairs[n].ModRevision)
go/vt/topo/etcd2topo/file.go:func (s *Server) Delete(ctx context.Context, filePath string, version topo.Version) error {
go/vt/topo/etcd2topo/file.go: If(clientv3.Compare(clientv3.ModRevision(nodePath), "=", int64(version.(EtcdVersion)))).
go/vt/topo/etcd2topo/file.go: return topo.NewError(topo.BadVersion, nodePath)
go/vt/topo/etcd2topo/lock.go: If(clientv3.Compare(clientv3.Version(newKey), "=", 0)).
go/vt/topo/etcd2topo/lock.go: return newKey, txnresp.Header.Revision, nil
go/vt/topo/etcd2topo/server.go: MinVersion: tls.VersionTLS12,
go/vt/topo/etcd2topo/version.go:// EtcdVersion is etcd's idea of a version.
go/vt/topo/etcd2topo/version.go:// It implements topo.Version.
go/vt/topo/etcd2topo/version.go:type EtcdVersion int64
go/vt/topo/etcd2topo/version.go:// String is part of the topo.Version interface.
go/vt/topo/etcd2topo/version.go:func (v EtcdVersion) String() string {
go/vt/topo/etcd2topo/watch.go: // ModRevision is used for the topo.Version value as we get the new Revision value back
go/vt/topo/etcd2topo/watch.go: Version: EtcdVersion(initial.Kvs[0].ModRevision),
go/vt/topo/etcd2topo/watch.go: watcher := s.cli.Watch(watchCtx, nodePath, clientv3.WithRev(initial.Header.Revision))
go/vt/topo/etcd2topo/watch.go: var rev = initial.Header.Revision
go/vt/topo/etcd2topo/watch.go: rev = wresp.Header.GetRevision()
go/vt/topo/etcd2topo/watch.go: Version: EtcdVersion(ev.Kv.ModRevision),
go/vt/topo/etcd2topo/watch.go: wd.Version = EtcdVersion(initial.Kvs[0].ModRevision)
go/vt/topo/etcd2topo/watch.go: watcher := s.cli.Watch(watchCtx, nodePath, clientv3.WithRev(initial.Header.Revision), clientv3.WithPrefix())
go/vt/topo/etcd2topo/watch.go: var rev = initial.Header.Revision
go/vt/topo/etcd2topo/watch.go: rev = wresp.Header.GetRevision()
That one exception is when explicitly doing create
operations on the Vitess side:
- Keys: https://github.com/vitessio/vitess/blob/96c43e40b86dd03b8a462507ebd82fbc5c927f8e/go/vt/topo/etcd2topo/file.go#L29-L38
- Locks: https://github.com/vitessio/vitess/blob/96c43e40b86dd03b8a462507ebd82fbc5c927f8e/go/vt/topo/etcd2topo/lock.go#L50-L63
The usage there makes sense to me because it's a special case when creating a key — we want to enforce that it doesn't already exist because at lower layers / within etcd it's a put
operation whether you're creating a key or updating it.
So I'm going to add a test to confirm and enforce that we are using ModRevision
in watches and that we observe the relevant semantics. This way we are at least consistent, testing the current behavior, and preventing (unintentional) changes going forward.
@rohit-nayak-ps Let's also look at an example since you seem to suspect that Version
would give us better properties/semantics in determining order / sequence of operations and a key's relative logical version (is this key value newer than another one):
# It doesn't yet exist so its Version and ModRevision are both 0
❯ etcdctl get -w json /testval
{"header":{"cluster_id":14841639068965178418,"member_id":10276657743932975437,"revision":51,"raft_term":2}}
❯ etcdctl put /testval "foo"
OK
❯ etcdctl get -w json /testval
{"header":{"cluster_id":14841639068965178418,"member_id":10276657743932975437,"revision":52,"raft_term":2},"kvs":[{"key":"L3Rlc3R2YWw=","create_revision":52,"mod_revision":52,"version":1,"value":"Zm9v"}],"count":1}
❯ etcdctl put /testval "foo2"
OK
❯ etcdctl get -w json /testval
{"header":{"cluster_id":14841639068965178418,"member_id":10276657743932975437,"revision":53,"raft_term":2},"kvs":[{"key":"L3Rlc3R2YWw=","create_revision":52,"mod_revision":53,"version":2,"value":"Zm9vMg=="}],"count":1}
❯ etcdctl put /testval "foo3"
OK
❯ etcdctl get -w json /testval
{"header":{"cluster_id":14841639068965178418,"member_id":10276657743932975437,"revision":54,"raft_term":2},"kvs":[{"key":"L3Rlc3R2YWw=","create_revision":52,"mod_revision":54,"version":3,"value":"Zm9vMw=="}],"count":1}
❯ etcdctl put -w json /testval "fooX"
{"header":{"cluster_id":14841639068965178418,"member_id":10276657743932975437,"revision":55,"raft_term":2}}
❯ etcdctl put -w json /testval "fooX"
{"header":{"cluster_id":14841639068965178418,"member_id":10276657743932975437,"revision":56,"raft_term":2}}
❯ etcdctl put -w json /testval "fooX"
{"header":{"cluster_id":14841639068965178418,"member_id":10276657743932975437,"revision":57,"raft_term":2}}
❯ etcdctl put -w json /testval "fooX"
{"header":{"cluster_id":14841639068965178418,"member_id":10276657743932975437,"revision":58,"raft_term":2}}
❯ etcdctl get -w json /testval
{"header":{"cluster_id":14841639068965178418,"member_id":10276657743932975437,"revision":58,"raft_term":2},"kvs":[{"key":"L3Rlc3R2YWw=","create_revision":52,"mod_revision":58,"version":7,"value":"Zm9vWA=="}],"count":1}
❯ etcdctl del /testval
1
❯ etcdctl put -w json /testval "foo"
{"header":{"cluster_id":14841639068965178418,"member_id":10276657743932975437,"revision":60,"raft_term":2}}
❯ etcdctl get -w json /testval
{"header":{"cluster_id":14841639068965178418,"member_id":10276657743932975437,"revision":60,"raft_term":2},"kvs":[{"key":"L3Rlc3R2YWw=","create_revision":60,"mod_revision":60,"version":1,"value":"Zm9v"}],"count":1}
As you can see, the Revision
value is monotonic. So we can always know that one logical version of a key is newer than another one when using ModRevision
because the Revision
value is a global monotonically increasing value giving us a global total order and the ModRevision
value gives us a total order for the key — including when deletes are mixed in. Version
, however, is NOT monotonic — it resets to 0 when it's deleted. So if we used that to try and determine logical versions and whether we're getting things in total order and determine relative order from that — is this version of the key logically newer than another — then a logically older version of the key could easily be seen as the most recent one. If you look at the above output, the Version goes from 0-7, then back to 1. So if we used the Version
we'd see logical time going backwards and we'd think an old version of the node/key (let's say version 7) was actually newer than the latest (version 1).
Hopefully this, in combination with my previous comments, helps make it clear why using the Version
value gives us worse properties overall IMO and we shouldn't choose to use it.
Let's also look at an example since you seem to suspect that Version would give us better properties/semantics in determining order / sequence of operations and a key's relative logical version (is this key value newer than another one):
I do understand the differences between Version and ModRevision and that if a key is deleted we restart the versioning.
My only context to ask about Version vs ModRevision is related to the watcher, since that is the code path this PR affects. My understanding in this context is:
If we don't care about the version of the watched object (has it been changed X times) and/or if it is a newly created key then it doesn't matter which one we use.
If it does matter, do we delete the watched key ever? If not, it doesn't matter which one we use.
If so, does it matter to us that a key was deleted? If not, it doesn't matter which one we use.
If so, we do want the Version
(or both the CreateRevision
and ModRevision
) since it will tell us that we have a new key.
So in terms of watching specific keys, I believe the Version
gives us more information than ModRevision
. Or possibly sending all the version numbers, Version
, CreateRevision
and ModRevision
is most flexible.
Just to round off the discussion on this PR, from my side, I am fine with whatever the others decide we want to do now for the watch versioning since we don't seem to depend on the versions in this context anyway.
We can always change it later if any specific functionality is needed that depends on this decision.
@rohit-nayak-ps and @deepthi thank you for the reviews and discussion! I've now thoroughly documented the reason for this PR and the reasoning for the changes made. I've also now added a unit test to confirm and enforce the new expected behavior.
I'm considering this work done unless I get any further review comments. Thanks again!