etcd
etcd copied to clipboard
Possible aborted reads with process crashes
What happened?
With etcd 3.5.3 and jetcd 4.1.74.Final, I suspect that the effects of failed transactions can be visible to later readers and writers when processes are killed with kill -9
. For example, take this pair of operations from one Jepsen run, in a five-node cluster running on Debian 11 LXC nodes:
:anomalies {:G1a ({:op {:type :ok,
:f :txn,
:value [[:append 4325 21]
[:r
4325
[5 7 10 11 14 21]]
[:r
4325
[5 7 10 11 14 21]]],
:time 224493456205,
:process 143,
:index 224833},
:mop [:r 4325 [5 7 10 11 14 21]],
:writer {:type :fail,
:f :txn,
:value [[:append 4325 14]
[:r 4325 nil]
[:r 4324 nil]
[:append 4324 12]],
:time 226412418904,
:process 145,
:error :didn't-succeed,
:index 227733},
:element 14}
The :writer
transaction began by appending (via CaS) the element 14 to a list stored in key 4325. That transaction returned a TxnResponse to the client rather than throwing, but response.isSucceeded()
returned false
--we assume that since it did not succeed, it must have aborted. However, the :op
transaction performed a read of key 4325 and observed element 14! If the writer did in fact abort, this op would constitute an aborted read.
Here's the full logs from this test run.
What did you expect to happen?
I expect that if a transaction does not succeed, and also does not throw an exception (as jetcd typically does for indefinite errors), it would not be visible to later readers.
How can we reproduce it (as minimally and precisely as possible)?
Clone https://github.com/jepsen-io/etcd at a1bf380a1c09d62bf6bf2e7b97bd02a35902ed36, and run:
lein run test-all -w append --concurrency 2n --time-limit 1000 --rate 1000 --test-count 5 --nemesis kill
This is basically the same test suite we used for the etcd testing back in late 2019--I've just updated it to the latest jetcd, bumped dependencies, and pointed it at 3.5.3 instead.
Anything else we need to know?
No response
Etcd version (please run commands below)
$ etcd --version
etcd Version: 3.5.3
Git SHA: 0452feec7
Go Version: go1.16.15
Go OS/Arch: linux/amd64
$ etcdctl version
etcdctl version: 3.5.3
API version: 3.5
Etcd configuration (command line flags or environment variables)
Etcd debug information (please run commands blow, feel free to obfuscate the IP address or FQDN in the output)
root@n1:/opt/etcd# ./etcdctl member list -w table --endpoints=n1:2379
+------------------+---------+------+-----------------------------+-----------------------------+------------+
| ID | STATUS | NAME | PEER ADDRS | CLIENT ADDRS | IS LEARNER |
+------------------+---------+------+-----------------------------+-----------------------------+------------+
| 1153c9690d2b2284 | started | n3 | http://192.168.122.103:2380 | http://192.168.122.103:2379 | false |
| 4824313a421b2502 | started | n5 | http://192.168.122.105:2380 | http://192.168.122.105:2379 | false |
| 4d6e27d122507e9c | started | n4 | http://192.168.122.104:2380 | http://192.168.122.104:2379 | false |
| a1ffd5acd6a88a6a | started | n2 | http://192.168.122.102:2380 | http://192.168.122.102:2379 | false |
| afa39e55dee6dc2e | started | n1 | http://192.168.122.101:2380 | http://192.168.122.101:2379 | false |
+------------------+---------+------+-----------------------------+-----------------------------+------------+
root@n1:/opt/etcd# ./etcdctl --endpoints=http://192.168.122.105:2379,http://192.168.122.103:2379,http://192.168.122.104:2379,http://192.168.122.102:2379,http://192.168.122.101:2379 endpoint status -w table
+-----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+-----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| http://192.168.122.105:2379 | 4824313a421b2502 | 3.5.3 | 4.2 MB | true | false | 20 | 60226 | 60226 | |
| http://192.168.122.103:2379 | 1153c9690d2b2284 | 3.5.3 | 4.2 MB | false | false | 20 | 60226 | 60226 | |
| http://192.168.122.104:2379 | 4d6e27d122507e9c | 3.5.3 | 4.2 MB | false | false | 20 | 60228 | 60228 | |
| http://192.168.122.102:2379 | a1ffd5acd6a88a6a | 3.5.3 | 4.2 MB | false | false | 20 | 60228 | 60228 | |
| http://192.168.122.101:2379 | afa39e55dee6dc2e | 3.5.3 | 4.2 MB | false | false | 20 | 60228 | 60228 | |
+-----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
Relevant log output
Full logs are in the attached zip file.
From brief read this looks like bug in jetcd and not etcd. Please file it to jetcd repo or provide explanation what is the error in Etcd.
Could you tell me a bit more about why you suspect jetcd is to blame here, rather than etcd itself?
Here's a protocol-level view of the same kind of anomaly. We make a request to conditionally set key 693 to a new value ending in ... 29 iff its revision is still 7191:
Four milliseconds later, the server responded with an HTTP2 200 OK response, with grpc-status: 0
, containing a mostly-blank Txn,Response message--notably without a succeeded
field. I'm not sure, but since I can't find any examples of responses which have succeeded = 0
, I suspect that the absence of the succeeded field entirely might be how etcd's protocol encodes a transaction guard failure.
Nevertheless, this write of 29 appeared in later reads, which suggests this is an aborted read.
:f :txn,
:value [[:r 697 [4 5 11 12 13 16 17 18 20 22 24 25 27 28 30]]
[:r 693 [1 2 3 4 6 8 11 12 13 15 16 18 20 21 22 23 25 27 29]]
[:r 698 [5 6 7 4 19 20 24 30]]
[:append 697 9]],
:time 28156744830,
:process 8,
:index 36577},
To confirm this, here's a transaction which definitely did fail--again, I believe, because its precondition guard failed to match. Again, HTTP 200, GRPC status 0, a Txn,response message without a succeeded
field:
Compare this to what happens during a successful request: we get back a Txn,Response which does have a succeeded
field, as well as kvs
structures containing the results of that transaction:
Also compare this to how the protocol represents an exceptional response--for instance, here's a malformed request which caused etcd to return an HTTP 200 with gprc-status: 3
(INVALID_ARGUMENT) and a grpc-message
of etcdserver: duplicate key given in txn request
:
Here's a zip file with tcpdump captures of the client-server traffic on all nodes, if you'd like to see for yourself. The write causing an aborted read is in n5/tcpdump.pcap.
response.isSucceeded() returned false--we assume that since it did not succeed, it must have aborted
It isn't correct. It only means whether the compare
is evaluated to true
or false
. See rpc.pb.go#L1371-L1372
Begging your pardon, but are you saying that a transaction whose guard clause evaluates to false
should go on to evaluate the transaction's success
block? That seems to be at odds with both the transaction documentation and etcd's behavior in, well, pretty much every other response in these tests...
Two comments/points:
- The TxnResponse.Succeeded only means the result of
compare
(TheIf
statement); - When the
compare
is evaluated tofalse
, theFailure
block is processed. Otherwise, theSuccess
block is processed.
OK, so... to repeat, we have multiple cases where a transaction's compare failed, but etcd apparently executed the success block anyway--or if it didn't "really" execute, other transactions are able to observe state which could only have arisen from executing the success block. You're saying this is not a bug?
If TxnResponse
has the header (cluster_id, member_id, revision and raft_term), then it must contain the field succeeded
, because its type is bool
and it always has a value no matter a program explicitly assigns a value to it or not.
where a transaction's compare failed, but etcd apparently executed the success block anyway
I still am still not convinced. Could you show me the evidence? How did you know transaction's compare failed
? Did you get the conclusion because you did not see the field succeeded
?
If TxnResponse has the header (cluster_id, member_id, revision and raft_term), then it must contain the field succeeded, because its type is bool and it always has a value no matter a program explicitly assigns a value to it or not.
Hmm. At the JVM level, in the client, this value is false
. That's because it's not present in the protocol-buffer message, and the protocol buffer spec states that the default value for bools is false
.
As for why there's no succeeded
field in the protobuf-encoded messages... Ah, I've got some speculation there, but I don't think it's actually on-topic. The core problem remains: etcd returns responses without a succeeded
field set to true
even when they apparently did succeed.
Could you show me the evidence?
I've told you what the Java client is doing, provided full logs of test runs, given you a reproducible automated test suite, shown you Wireshark disassembly of the request and response, and provided pcap files for you to confirm yourself. I'm honestly at a loss: what other kind of evidence are you looking for?
OK, based on the proto3 doc, if a scalar message field is set to its default, the value will not be serialized on the wire
. It could be the reason why you did not see the field "succeeded" in the TxnResponse
.
We have both integration and e2e test cases for transaction, and I have never seen such issue so far. I suspect it's client side issue, including jetcd or the test case.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.
Closing as a jetcd issue. Feel free to reopen if you find evidence that issue is on etcd server side.