etcd Possible aborted reads with process crashes

What happened?

With etcd 3.5.3 and jetcd 4.1.74.Final, I suspect that the effects of failed transactions can be visible to later readers and writers when processes are killed with kill -9. For example, take this pair of operations from one Jepsen run, in a five-node cluster running on Debian 11 LXC nodes:

            :anomalies {:G1a ({:op {:type :ok,
                                    :f :txn,
                                    :value [[:append 4325 21]
                                            [:r
                                             4325
                                             [5 7 10 11 14 21]]
                                            [:r
                                             4325
                                             [5 7 10 11 14 21]]],
                                    :time 224493456205,
                                    :process 143,
                                    :index 224833},
                               :mop [:r 4325 [5 7 10 11 14 21]],
                               :writer {:type :fail,
                                        :f :txn,
                                        :value [[:append 4325 14]
                                                [:r 4325 nil]
                                                [:r 4324 nil]
                                                [:append 4324 12]],
                                        :time 226412418904,
                                        :process 145,
                                        :error :didn't-succeed,
                                        :index 227733},
                               :element 14}

The :writer transaction began by appending (via CaS) the element 14 to a list stored in key 4325. That transaction returned a TxnResponse to the client rather than throwing, but response.isSucceeded() returned false--we assume that since it did not succeed, it must have aborted. However, the :op transaction performed a read of key 4325 and observed element 14! If the writer did in fact abort, this op would constitute an aborted read.

Here's the full logs from this test run.

What did you expect to happen?

I expect that if a transaction does not succeed, and also does not throw an exception (as jetcd typically does for indefinite errors), it would not be visible to later readers.

How can we reproduce it (as minimally and precisely as possible)?

Clone https://github.com/jepsen-io/etcd at a1bf380a1c09d62bf6bf2e7b97bd02a35902ed36, and run:

lein run test-all -w append --concurrency 2n --time-limit 1000 --rate 1000 --test-count 5 --nemesis kill

This is basically the same test suite we used for the etcd testing back in late 2019--I've just updated it to the latest jetcd, bumped dependencies, and pointed it at 3.5.3 instead.

Anything else we need to know?

No response

Etcd version (please run commands below)

$ etcd --version
etcd Version: 3.5.3
Git SHA: 0452feec7
Go Version: go1.16.15
Go OS/Arch: linux/amd64

$ etcdctl version
etcdctl version: 3.5.3
API version: 3.5

Etcd configuration (command line flags or environment variables)

Each node is started without a config file, using CLI flags like ``` etcd --enable-v2 --log-outputs stderr --logger zap --name n1 --listen-peer-urls http://192.168.122.101:2380 --listen-client-urls http://192.168.122.101:2379 --advertise-client-urls http://192.168.122.101:2379 --initial-cluster-state new --initial-advertise-peer-urls http://192.168.122.101:2380 --initial-cluster ```

Etcd debug information (please run commands blow, feel free to obfuscate the IP address or FQDN in the output)

root@n1:/opt/etcd# ./etcdctl member list -w table --endpoints=n1:2379
+------------------+---------+------+-----------------------------+-----------------------------+------------+
|        ID        | STATUS  | NAME |         PEER ADDRS          |        CLIENT ADDRS         | IS LEARNER |
+------------------+---------+------+-----------------------------+-----------------------------+------------+
| 1153c9690d2b2284 | started |   n3 | http://192.168.122.103:2380 | http://192.168.122.103:2379 |      false |
| 4824313a421b2502 | started |   n5 | http://192.168.122.105:2380 | http://192.168.122.105:2379 |      false |
| 4d6e27d122507e9c | started |   n4 | http://192.168.122.104:2380 | http://192.168.122.104:2379 |      false |
| a1ffd5acd6a88a6a | started |   n2 | http://192.168.122.102:2380 | http://192.168.122.102:2379 |      false |
| afa39e55dee6dc2e | started |   n1 | http://192.168.122.101:2380 | http://192.168.122.101:2379 |      false |
+------------------+---------+------+-----------------------------+-----------------------------+------------+

root@n1:/opt/etcd# ./etcdctl --endpoints=http://192.168.122.105:2379,http://192.168.122.103:2379,http://192.168.122.104:2379,http://192.168.122.102:2379,http://192.168.122.101:2379 endpoint status -w table
+-----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|          ENDPOINT           |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+-----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| http://192.168.122.105:2379 | 4824313a421b2502 |   3.5.3 |  4.2 MB |      true |      false |        20 |      60226 |              60226 |        |
| http://192.168.122.103:2379 | 1153c9690d2b2284 |   3.5.3 |  4.2 MB |     false |      false |        20 |      60226 |              60226 |        |
| http://192.168.122.104:2379 | 4d6e27d122507e9c |   3.5.3 |  4.2 MB |     false |      false |        20 |      60228 |              60228 |        |
| http://192.168.122.102:2379 | a1ffd5acd6a88a6a |   3.5.3 |  4.2 MB |     false |      false |        20 |      60228 |              60228 |        |
| http://192.168.122.101:2379 | afa39e55dee6dc2e |   3.5.3 |  4.2 MB |     false |      false |        20 |      60228 |              60228 |        |
+-----------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+

Relevant log output

Full logs are in the attached zip file.

Jun 06 '22 21:06 aphyr

From brief read this looks like bug in jetcd and not etcd. Please file it to jetcd repo or provide explanation what is the error in Etcd.

Jun 07 '22 09:06 serathius

Could you tell me a bit more about why you suspect jetcd is to blame here, rather than etcd itself?

Here's a protocol-level view of the same kind of anomaly. We make a request to conditionally set key 693 to a new value ending in ... 29 iff its revision is still 7191:

Screenshot from 2022-06-07 09-51-12

Four milliseconds later, the server responded with an HTTP2 200 OK response, with grpc-status: 0, containing a mostly-blank Txn,Response message--notably without a succeeded field. I'm not sure, but since I can't find any examples of responses which have succeeded = 0, I suspect that the absence of the succeeded field entirely might be how etcd's protocol encodes a transaction guard failure.

Screenshot from 2022-06-07 10-10-20

Nevertheless, this write of 29 appeared in later reads, which suggests this is an aborted read.

                                    :f :txn,
                                    :value [[:r 697 [4 5 11 12 13 16 17 18 20 22 24 25 27 28 30]]
                                            [:r 693 [1 2 3 4 6 8 11 12 13 15 16 18 20 21 22 23 25 27 29]]
                                            [:r 698 [5 6 7 4 19 20 24 30]]
                                            [:append 697 9]],
                                    :time 28156744830,
                                    :process 8,
                                    :index 36577},

To confirm this, here's a transaction which definitely did fail--again, I believe, because its precondition guard failed to match. Again, HTTP 200, GRPC status 0, a Txn,response message without a succeeded field:

Screenshot from 2022-06-07 10-40-40

Compare this to what happens during a successful request: we get back a Txn,Response which does have a succeeded field, as well as kvs structures containing the results of that transaction:

Screenshot from 2022-06-07 10-13-47

Also compare this to how the protocol represents an exceptional response--for instance, here's a malformed request which caused etcd to return an HTTP 200 with gprc-status: 3 (INVALID_ARGUMENT) and a grpc-message of etcdserver: duplicate key given in txn request:

Screenshot from 2022-06-07 10-24-56

Here's a zip file with tcpdump captures of the client-server traffic on all nodes, if you'd like to see for yourself. The write causing an aborted read is in n5/tcpdump.pcap.

20220607T093354.000-0400.zip

Jun 07 '22 14:06 aphyr

response.isSucceeded() returned false--we assume that since it did not succeed, it must have aborted

It isn't correct. It only means whether the compare is evaluated to true or false. See rpc.pb.go#L1371-L1372

Jun 12 '22 17:06 ahrtr

Begging your pardon, but are you saying that a transaction whose guard clause evaluates to false should go on to evaluate the transaction's success block? That seems to be at odds with both the transaction documentation and etcd's behavior in, well, pretty much every other response in these tests...

Jun 13 '22 12:06 aphyr

Two comments/points:

The TxnResponse.Succeeded only means the result of compare (The If statement);
When the compare is evaluated to false, the Failure block is processed. Otherwise, the Success block is processed.

Jun 13 '22 18:06 ahrtr

OK, so... to repeat, we have multiple cases where a transaction's compare failed, but etcd apparently executed the success block anyway--or if it didn't "really" execute, other transactions are able to observe state which could only have arisen from executing the success block. You're saying this is not a bug?

Jun 13 '22 18:06 aphyr

If TxnResponse has the header (cluster_id, member_id, revision and raft_term), then it must contain the field succeeded, because its type is bool and it always has a value no matter a program explicitly assigns a value to it or not.

where a transaction's compare failed, but etcd apparently executed the success block anyway

I still am still not convinced. Could you show me the evidence? How did you know transaction's compare failed? Did you get the conclusion because you did not see the field succeeded ?

Jun 13 '22 20:06 ahrtr

If TxnResponse has the header (cluster_id, member_id, revision and raft_term), then it must contain the field succeeded, because its type is bool and it always has a value no matter a program explicitly assigns a value to it or not.

Hmm. At the JVM level, in the client, this value is false. That's because it's not present in the protocol-buffer message, and the protocol buffer spec states that the default value for bools is false.

As for why there's no succeeded field in the protobuf-encoded messages... Ah, I've got some speculation there, but I don't think it's actually on-topic. The core problem remains: etcd returns responses without a succeeded field set to true even when they apparently did succeed.

Could you show me the evidence?

I've told you what the Java client is doing, provided full logs of test runs, given you a reproducible automated test suite, shown you Wireshark disassembly of the request and response, and provided pcap files for you to confirm yourself. I'm honestly at a loss: what other kind of evidence are you looking for?

Jun 13 '22 21:06 aphyr

OK, based on the proto3 doc, if a scalar message field is set to its default, the value will not be serialized on the wire. It could be the reason why you did not see the field "succeeded" in the TxnResponse.

We have both integration and e2e test cases for transaction, and I have never seen such issue so far. I suspect it's client side issue, including jetcd or the test case.

Jun 14 '22 07:06 ahrtr

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

Sep 21 '22 02:09 stale[bot]

Closing as a jetcd issue. Feel free to reopen if you find evidence that issue is on etcd server side.

Sep 28 '22 08:09 serathius

etcd etcd copied to clipboard

Possible aborted reads with process crashes

What happened?

What did you expect to happen?

How can we reproduce it (as minimally and precisely as possible)?

Anything else we need to know?

Etcd version (please run commands below)

Etcd configuration (command line flags or environment variables)

Etcd debug information (please run commands blow, feel free to obfuscate the IP address or FQDN in the output)

Relevant log output

etcd
etcd copied to clipboard