etcd
Stale read with loss of un-fsynced writes
What happened?
In this Jepsen test involving the loss of un-fsynced writes (via lazyfs) in a five-node cluster of etcd 3.5.3 nodes, we observed a stale read (a linearizability violation) on a single key despite not using serializable reads.

Numbering transactions top to bottom as T1, T2, T3, and T4: T3 used an etcd transaction, guarded by an equality comparison on revision, to append the value 16 to key 180. T4 observed T3's write: it read key 180 and saw a value ending in [... 15 16]. However, T2, which began executing almost five seconds after T4 completed (denoted by real-time ("rt") edges), read key 180 and saw a previous value, ending in [... 15].
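For readers unfamiliar with the pattern, T3's guarded append can be sketched as a tiny in-memory model of etcd's compare-and-set semantics. This is illustrative only; the types and names here are hypothetical, not the etcd client API.

```go
package main

import "fmt"

// kv models a single etcd key with a revision counter. It is a
// minimal in-memory sketch of the revision-guarded transaction the
// test uses, not the etcd client API.
type kv struct {
	value    []int
	revision int64
}

// guardedAppend mimics a Txn of the form
// If(revision == expectedRev) Then(append x); it reports whether
// the guard held and the append was applied.
func (s *kv) guardedAppend(expectedRev int64, x int) bool {
	if s.revision != expectedRev {
		return false // guard failed; transaction takes its Else branch
	}
	s.value = append(s.value, x)
	s.revision++ // every successful write bumps the revision
	return true
}

func main() {
	k := &kv{value: []int{15}, revision: 7}
	fmt.Println(k.guardedAppend(7, 16), k.value) // guard matches: append applied
	fmt.Println(k.guardedAppend(7, 17), k.value) // revision is now stale: refused
}
```

Because a successful write bumps the revision, a second transaction reusing the old revision fails its comparison, which is how the test serializes appends to a single key.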
Again, this relies on lazyfs, which we're still sanding bugs out of. I'm not entirely confident in this finding, but I'd like to register it in case y'all can reproduce it using your own tests. :-)
What did you expect to happen?
I expected reads to be linearizable, even with the loss of un-fsynced writes.
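The anomaly itself comes down to a real-time ordering check: a read that starts after another read has finished must not observe less of the key's history. A minimal sketch, with hypothetical types standing in for the Jepsen history (reads return the whole appended list; timestamps are illustrative):

```go
package main

import "fmt"

// op records one completed read of the key, with wall-clock
// start/end times. This is a simplified stand-in for a Jepsen
// history entry, not Jepsen's actual data model.
type op struct {
	start, end int   // real-time interval of the operation
	read       []int // observed value of the append-only list
}

// staleRead reports whether b observed strictly less of the
// append-only list than a did, despite b starting only after a
// finished -- the real-time ("rt") edge that makes such a pair a
// linearizability violation.
func staleRead(a, b op) bool {
	return b.start > a.end && len(b.read) < len(a.read)
}

func main() {
	t4 := op{start: 100, end: 101, read: []int{15, 16}} // saw T3's write
	t2 := op{start: 5900, end: 5901, read: []int{15}}   // ~5 s later, missing 16
	fmt.Println(staleRead(t4, t2))
}
```

With the values from the report (T4 sees [... 15 16], T2 starts almost five seconds later and sees only [... 15]), the check flags the pair.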
How can we reproduce it (as minimally and precisely as possible)?
With https://github.com/jepsen-io/etcd at commit https://github.com/jepsen-io/etcd/commit/181656bb551bbc10cdc3d959866637574cdc9e17, run:
lein run test-all -w append --rate 1000 --concurrency 4n --lazyfs --nemesis kill --time-limit 300 --test-count 10 --nemesis-interval 20
Anything else we need to know?
No response
Etcd version (please run commands below)
$ etcd --version
etcd Version: 3.5.3
Git SHA: 0452feec7
Go Version: go1.16.15
Go OS/Arch: linux/amd64
$ etcdctl version
etcdctl version: 3.5.3
API version: 3.5
Etcd configuration (command line flags or environment variables)
etcd --enable-v2 --log-outputs stderr --logger zap --name n1 --listen-peer-urls http://192.168.122.101:2380/ --listen-client-urls http://192.168.122.101:2379/ --advertise-client-urls http://192.168.122.101:2379/ --initial-cluster-state new --initial-advertise-peer-urls http://192.168.122.101:2380/ --initial-cluster n1=http://192.168.122.101:2380,n2=http://192.168.122.102:2380,n3=http://192.168.122.103:2380,n4=http://192.168.122.104:2380,n5=http://192.168.122.105:2380 --snapshot-count 100 --experimental-initial-corrupt-check
Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)
admin@ip-172-31-9-53:/opt/etcd$ ./etcdctl --endpoints http://172.31.9.53:2379 member list -w table
+------------------+---------+--------------------------------------------+---------------------------+---------------------------+------------+
| ID | STATUS | NAME | PEER ADDRS | CLIENT ADDRS | IS LEARNER |
+------------------+---------+--------------------------------------------+---------------------------+---------------------------+------------+
| 6f0318b69646b10a | started | ec2-54-234-45-183.compute-1.amazonaws.com | http://172.31.13.117:2380 | http://172.31.13.117:2379 | false |
| 851bb95a115a0651 | started | ec2-54-90-123-74.compute-1.amazonaws.com | http://172.31.9.53:2380 | http://172.31.9.53:2379 | false |
| 8ce655fee64e8888 | started | ec2-3-88-142-40.compute-1.amazonaws.com | http://172.31.7.132:2380 | http://172.31.7.132:2379 | false |
| 9a228ee3e4899e1e | started | ec2-54-167-119-185.compute-1.amazonaws.com | http://172.31.9.60:2380 | http://172.31.9.60:2379 | false |
| dfed32f696db96d2 | started | ec2-54-204-125-79.compute-1.amazonaws.com | http://172.31.12.135:2380 | http://172.31.12.135:2379 | false |
+------------------+---------+--------------------------------------------+---------------------------+---------------------------+------------+
admin@ip-172-31-9-53:/opt/etcd$ {"level":"warn","ts":"2022-06-13T14:48:00.119Z","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0004fa1c0/n1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
Failed to get the status of endpoint http://n4:2379 (context deadline exceeded)
+----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| ENDPOINT | ID | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| http://n1:2379 | 851bb95a115a0651 | 3.5.3 | 164 kB | false | false | 2 | 2302 | 2299 | |
| http://n2:2379 | 6f0318b69646b10a | 3.5.3 | 164 kB | false | false | 3 | 2306 | 2306 | |
| http://n5:2379 | 8ce655fee64e8888 | 3.5.3 | 168 kB | false | false | 3 | 4966 | 4966 | |
+----------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
{"level":"warn","ts":"2022-06-13T14:47:38.147Z","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0002fee00/n1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
Failed to get the status of endpoint http://n4:2379 (context deadline exceeded)
Relevant log output
No response
Based on my discussion with @endocrimes, this is also reproducible on v3.4.
I have reproduced the loss of un-fsynced writes in my own testing using lazyfs, just running a single-node cluster with lazyfs underneath. A kill plus "lazyfs::clear-cache" results in a large rollback of etcd state (tens of transactions).
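The rollback described above can be modeled with a toy write-ahead log that acknowledges writes before they reach stable storage. This is a sketch of the failure mode, under that assumption, not etcd's actual WAL code:

```go
package main

import "fmt"

// wal is a toy write-ahead log that (incorrectly) acknowledges
// writes before fsync -- an illustrative model of the failure mode,
// not etcd's actual WAL implementation.
type wal struct {
	durable  []int // entries that have survived an fsync
	buffered []int // acknowledged to the client but not yet flushed
}

// write appends an entry and acknowledges it immediately; nothing
// is durable until fsync runs.
func (w *wal) write(x int) { w.buffered = append(w.buffered, x) }

// fsync flushes all buffered entries to stable storage.
func (w *wal) fsync() {
	w.durable = append(w.durable, w.buffered...)
	w.buffered = nil
}

// crash models a process kill plus a lazyfs cache clear: every
// acknowledged-but-unflushed entry is lost, rolling the log back to
// its last durable state.
func (w *wal) crash() []int {
	w.buffered = nil
	return w.durable
}

func main() {
	w := &wal{}
	w.write(1)
	w.fsync()  // entry 1 is now durable
	w.write(2) // acknowledged...
	w.write(3) // ...but never flushed
	fmt.Println(w.crash())
}
```

Here entries 2 and 3 were acknowledged to clients yet vanish on crash, which is exactly the shape of the multi-transaction rollback observed in the single-node test.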
@ramses @aphyr How much confidence do we have in the lazyfs implementation? Has lazyfs been successfully used to find similar issues in other databases?
@serathius fwiw, the recent single-node data-loss issues are a result of the same bug: we accept writes before they are actually flushed to disk.
@endocrimes I don't understand what you mean. Do you attribute this issue to https://github.com/etcd-io/etcd/issues/14370? In my tests I verified that the issue exists on the main branch and on the v3.5.5 release, which should have https://github.com/etcd-io/etcd/issues/14370 fixed.
Please disregard my comment above; I found an issue in my setup. I'm still working on reproducing the original issue.