
Stale read with loss of un-fsynced writes

Open aphyr opened this issue 3 years ago • 6 comments

What happened?

In this Jepsen test involving the loss of un-fsynced writes (via lazyfs) in a five-node cluster of etcd 3.5.3 nodes, we observed a stale read (a linearizability violation) on a single key despite not using serializable reads.

Screenshot from 2022-06-21 11-26-07

Numbering transactions top-to-bottom as T1, T2, T3, and T4: T3 used an etcd transaction, guarded by an equality comparison on revision, to append the value 16 to key 180. T4 observed T3's write: it read key 180 and saw it ended in [... 15 16]. However, T2, which began executing almost five seconds after T4 completed (denoted by the rt edges), read key 180 and saw a previous value, ending in [... 15].
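For context, T3's guarded append is the standard etcd read-modify-write pattern: read a key, then commit a transaction whose guard compares the key's revision against the one observed. A minimal Go sketch using the clientv3 API; the key name and value encoding here are illustrative, not the Jepsen client's actual implementation:

```go
package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"http://192.168.122.101:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// Read the current value and remember its mod revision.
	get, err := cli.Get(ctx, "180")
	if err != nil {
		log.Fatal(err)
	}
	var rev int64
	var val string
	if len(get.Kvs) > 0 {
		rev = get.Kvs[0].ModRevision
		val = string(get.Kvs[0].Value)
	}

	// Append 16 only if the key's revision is unchanged since the read;
	// otherwise the transaction fails and the append must be retried.
	resp, err := cli.Txn(ctx).
		If(clientv3.Compare(clientv3.ModRevision("180"), "=", rev)).
		Then(clientv3.OpPut("180", val+" 16")).
		Commit()
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("append succeeded: %v", resp.Succeeded)
}
```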

Again, this relies on lazyfs, which we're still sanding bugs out of. I'm not entirely confident in this finding, but I would like to register it in case y'all can reproduce it using your own tests. :-)

What did you expect to happen?

I expected reads to be linearizable, even with the loss of un-fsynced writes.
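For reference, etcd's Range RPC is linearizable by default; stale results are only permitted when the client opts into serializable reads. A minimal clientv3 sketch of the two modes (endpoint borrowed from the cluster configuration below):

```go
package main

import (
	"context"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints: []string{"http://192.168.122.101:2379"},
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()

	// Default: linearizable. The read is confirmed against the leader's
	// commit index before returning, so it must never observe stale data.
	if _, err := cli.Get(ctx, "180"); err != nil {
		log.Fatal(err)
	}

	// Opt-in: serializable. Served from the local member's state without a
	// quorum round-trip, so staleness is allowed. The test did NOT use this.
	if _, err := cli.Get(ctx, "180", clientv3.WithSerializable()); err != nil {
		log.Fatal(err)
	}
}
```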

How can we reproduce it (as minimally and precisely as possible)?

With https://github.com/jepsen-io/etcd at commit https://github.com/jepsen-io/etcd/commit/181656bb551bbc10cdc3d959866637574cdc9e17, run:

lein run test-all -w append --rate 1000 --concurrency 4n --lazyfs --nemesis kill --time-limit 300 --test-count 10 --nemesis-interval 20

Anything else we need to know?

No response

Etcd version (please run commands below)

$ etcd --version
etcd Version: 3.5.3
Git SHA: 0452feec7
Go Version: go1.16.15
Go OS/Arch: linux/amd64

$ etcdctl version
etcdctl version: 3.5.3
API version: 3.5

Etcd configuration (command line flags or environment variables)

etcd --enable-v2 --log-outputs stderr --logger zap --name n1 --listen-peer-urls http://192.168.122.101:2380 --listen-client-urls http://192.168.122.101:2379 --advertise-client-urls http://192.168.122.101:2379 --initial-cluster-state new --initial-advertise-peer-urls http://192.168.122.101:2380 --initial-cluster n1=http://192.168.122.101:2380,n2=http://192.168.122.102:2380,n3=http://192.168.122.103:2380,n4=http://192.168.122.104:2380,n5=http://192.168.122.105:2380 --snapshot-count 100 --experimental-initial-corrupt-check

Etcd debug information (please run commands below; feel free to obfuscate the IP address or FQDN in the output)


admin@ip-172-31-9-53:/opt/etcd$ ./etcdctl --endpoints http://172.31.9.53:2379  member list -w table
+------------------+---------+--------------------------------------------+---------------------------+---------------------------+------------+
|        ID        | STATUS  |                    NAME                    |        PEER ADDRS         |       CLIENT ADDRS        | IS LEARNER |
+------------------+---------+--------------------------------------------+---------------------------+---------------------------+------------+
| 6f0318b69646b10a | started |  ec2-54-234-45-183.compute-1.amazonaws.com | http://172.31.13.117:2380 | http://172.31.13.117:2379 |      false |
| 851bb95a115a0651 | started |   ec2-54-90-123-74.compute-1.amazonaws.com |   http://172.31.9.53:2380 |   http://172.31.9.53:2379 |      false |
| 8ce655fee64e8888 | started |    ec2-3-88-142-40.compute-1.amazonaws.com |  http://172.31.7.132:2380 |  http://172.31.7.132:2379 |      false |
| 9a228ee3e4899e1e | started | ec2-54-167-119-185.compute-1.amazonaws.com |   http://172.31.9.60:2380 |   http://172.31.9.60:2379 |      false |
| dfed32f696db96d2 | started |  ec2-54-204-125-79.compute-1.amazonaws.com | http://172.31.12.135:2380 | http://172.31.12.135:2379 |      false |
+------------------+---------+--------------------------------------------+---------------------------+---------------------------+------------+

admin@ip-172-31-9-53:/opt/etcd$ {"level":"warn","ts":"2022-06-13T14:48:00.119Z","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0004fa1c0/n1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
Failed to get the status of endpoint http://n4:2379 (context deadline exceeded)
+----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|    ENDPOINT    |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| http://n1:2379 | 851bb95a115a0651 |   3.5.3 |  164 kB |     false |      false |         2 |       2302 |               2299 |        |
| http://n2:2379 | 6f0318b69646b10a |   3.5.3 |  164 kB |     false |      false |         3 |       2306 |               2306 |        |
| http://n5:2379 | 8ce655fee64e8888 |   3.5.3 |  168 kB |     false |      false |         3 |       4966 |               4966 |        |
+----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
{"level":"warn","ts":"2022-06-13T14:47:38.147Z","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0002fee00/n1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
Failed to get the status of endpoint http://n4:2379 (context deadline exceeded)


Relevant log output

No response

aphyr avatar Jun 21 '22 15:06 aphyr

Based on my discussion with @endocrimes, this is also reproducible on v3.4.

serathius avatar Jul 12 '22 08:07 serathius

I have reproduced the loss of un-fsynced writes in my own testing using lazyfs, just running a single-node cluster with lazyfs underneath. Kill + "lazyfs::clear-cache" results in a large rollback of etcd state (tens of transactions).
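For anyone replicating this outside Jepsen, the fault sequence is roughly: SIGKILL etcd, then tell lazyfs to drop whatever was never fsynced, simulating a power loss. A hedged Go sketch; the control-FIFO path is hypothetical and depends on how lazyfs was configured:

```go
package main

import (
	"log"
	"os"
	"os/exec"
)

func main() {
	// Assumption: etcd's data dir sits on a lazyfs mount whose control
	// FIFO lives at this path. Adjust to your lazyfs configuration.
	const fifo = "/tmp/lazyfs/faults.fifo"

	// Kill etcd abruptly so it has no chance to flush on shutdown.
	if err := exec.Command("pkill", "-9", "etcd").Run(); err != nil {
		log.Fatalf("kill etcd: %v", err)
	}

	// Ask lazyfs to discard all cached, un-fsynced writes.
	f, err := os.OpenFile(fifo, os.O_WRONLY, 0)
	if err != nil {
		log.Fatalf("open fifo: %v", err)
	}
	defer f.Close()
	if _, err := f.WriteString("lazyfs::clear-cache\n"); err != nil {
		log.Fatalf("clear cache: %v", err)
	}
}
```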

@ramses @aphyr How much confidence do we have in the lazyfs implementation? Has lazyfs been used successfully to find similar issues in other databases?

serathius avatar Oct 18 '22 07:10 serathius

@serathius fwiw, the recent single-node data loss issues are a result of the same bug: we accept writes before they are actually flushed to disk.
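To make that invariant concrete: an entry must hit stable storage (fsync) before the client is acknowledged, or a crash can retract an acknowledged write. A minimal, illustrative Go sketch; this is not etcd's actual WAL code:

```go
package main

import (
	"log"
	"os"
)

// appendDurably writes an entry and fsyncs before returning, so the caller
// may only acknowledge the client after this function succeeds.
func appendDurably(path string, entry []byte) error {
	f, err := os.OpenFile(path, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o600)
	if err != nil {
		return err
	}
	defer f.Close()

	if _, err := f.Write(entry); err != nil {
		return err
	}
	// Without this Sync, a crash (or lazyfs clearing its cache) can drop
	// the entry even though the client already saw a successful response.
	return f.Sync()
}

func main() {
	if err := appendDurably("wal.log", []byte("entry\n")); err != nil {
		log.Fatal(err)
	}
}
```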

endocrimes avatar Oct 18 '22 11:10 endocrimes

@endocrimes I don't understand what you mean. Do you attribute this issue to https://github.com/etcd-io/etcd/issues/14370? In my tests I verified that the issue exists on the main branch and on the v3.5.5 release, which should have https://github.com/etcd-io/etcd/issues/14370 fixed.

serathius avatar Oct 18 '22 12:10 serathius

Please disregard my comment above; I found an issue in my setup. I'm still working on reproducing the issue.

serathius avatar Oct 18 '22 16:10 serathius