jetcd
Possible aborted reads with process crashes
Versions
- etcd: 3.5.3
- jetcd: 0.7.1
- java: openjdk version "17.0.3" 2022-04-19 (Debian)
Describe the bug
I've got a Jepsen test case for etcd which appears to show something like an aborted read in response to process crashes in five-node clusters running on Debian stable. A Jetcd client submits a transaction with a modified-revision compare clause on some number of keys, and a success branch which performs some writes. Occasionally etcd returns a TxnResponse which (per wireshark) does not contain a succeeded field. When we ask Jetcd for TxnResponse.isSucceeded, it returns false; a reasonable client would assume that this transaction did not execute its successful branch. However, the effects of the successful branch are visible to later reads. If this occurred in an SQL database (and we treated succeeded for a txn with only a success branch as meaning commit/abort), I'd be inclined to call this an aborted read.
I've filed this in detail on the main etcd repo, but they've repeatedly informed me this must be a client issue--either something in jetcd or in the Jepsen test itself. I'd be delighted to find out this is the case, but based on the wireshark disassembly, I can't see how this could be the client's fault: at the wire level, I can't find any way to distinguish these "not-succeeded-but-actually-succeeded" transactions from "not-succeeded-and-actually-not-succeeded" ones. I was hoping you might be able to share some insight!
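For context on the wire-level point, here is a minimal sketch of why a missing succeeded field and succeeded = false look the same to any client. It assumes etcd's TxnResponse is a proto3 message in which header is field 1 and succeeded is field 2 (an assumption taken from memory of etcd's rpc.proto, worth double-checking). In proto3 a bool that is false is normally not written to the wire at all, so a decoder falls back to the default. The class name, method names, and sample bytes below are hypothetical and are not jetcd code.

```java
// Hypothetical sketch: a tiny protobuf scanner that extracts `succeeded`.
// Assumption: TxnResponse { header = 1; bool succeeded = 2; responses = 3; }
class TxnWireSketch {
    static final int SUCCEEDED_FIELD = 2; // assumed field number

    /** Reads a base-128 varint at pos[0] and advances pos[0] past it. */
    static long readVarint(byte[] buf, int[] pos) {
        long result = 0;
        int shift = 0;
        while (true) {
            byte b = buf[pos[0]++];
            result |= (long) (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                return result;
            }
            shift += 7;
        }
    }

    /** Returns the decoded `succeeded` flag, or the proto3 default (false) if absent. */
    static boolean decodeSucceeded(byte[] msg) {
        boolean succeeded = false; // proto3 default when the field is not on the wire
        int[] pos = {0};
        while (pos[0] < msg.length) {
            long tag = readVarint(msg, pos);
            int field = (int) (tag >>> 3);
            int wireType = (int) (tag & 7);
            switch (wireType) {
                case 0: { // varint
                    long v = readVarint(msg, pos);
                    if (field == SUCCEEDED_FIELD) {
                        succeeded = (v != 0);
                    }
                    break;
                }
                case 1: pos[0] += 8; break; // 64-bit fixed, skipped
                case 2: { // length-delimited (header, responses), skipped
                    int len = (int) readVarint(msg, pos);
                    pos[0] += len;
                    break;
                }
                case 5: pos[0] += 4; break; // 32-bit fixed, skipped
                default:
                    throw new IllegalArgumentException("unsupported wire type " + wireType);
            }
        }
        return succeeded;
    }

    public static void main(String[] args) {
        // Made-up payloads: a 2-byte header submessage, then optionally field 2.
        byte[] withTrue = {0x0A, 0x02, 0x08, 0x01, 0x10, 0x01};
        byte[] fieldAbsent = {0x0A, 0x02, 0x08, 0x01};
        System.out.println(decodeSucceeded(withTrue));
        System.out.println(decodeSucceeded(fieldAbsent));
    }
}
```

Under these assumptions, a response with the field absent can only be read as succeeded = false, which is consistent with jetcd reporting false here and leaves nothing on the wire for a client to act on differently.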
To Reproduce
Clone https://github.com/jepsen-io/etcd at a1bf380a1c09d62bf6bf2e7b97bd02a35902ed36, and run:
lein run test-all -w append --concurrency 2n --time-limit 1000 --rate 1000 --test-count 5 --nemesis kill
Expected behavior
I expect that transactions which return TxnResponse.isSucceeded() = false would not, in fact, appear to execute their success branches.
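The expectation above can be phrased as a simple check over a history. The following is only an illustrative sketch, not the Jepsen checker: it assumes every transaction appends globally unique integer values, that each transaction received a definite response (timeouts and other indeterminate results are excluded), and that reads return the lists they observed. All class and method names are hypothetical.

```java
import java.util.Collection;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical model of an append history, used only to illustrate the anomaly.
class AbortedReadSketch {
    /** One transaction as the client saw it. */
    static final class Txn {
        final boolean reportedSucceeded;  // value of TxnResponse.isSucceeded()
        final Set<Integer> appended;      // values its success branch would write

        Txn(boolean reportedSucceeded, Set<Integer> appended) {
            this.reportedSucceeded = reportedSucceeded;
            this.appended = appended;
        }
    }

    /**
     * Returns values that some read observed even though the transaction
     * that would have written them reported succeeded = false.
     */
    static Set<Integer> abortedReads(List<Txn> txns, List<? extends Collection<Integer>> reads) {
        Set<Integer> fromFailedTxns = new TreeSet<>();
        for (Txn t : txns) {
            if (!t.reportedSucceeded) {
                fromFailedTxns.addAll(t.appended);
            }
        }
        Set<Integer> observed = new TreeSet<>();
        for (Collection<Integer> r : reads) {
            observed.addAll(r);
        }
        fromFailedTxns.retainAll(observed); // keep only values that were actually read
        return fromFailedTxns;
    }

    public static void main(String[] args) {
        List<Txn> txns = List.of(
            new Txn(true, Set.of(1)),
            new Txn(false, Set.of(2)),   // reported failure, but 2 shows up in a read
            new Txn(false, Set.of(3)));  // reported failure, never observed
        System.out.println(abortedReads(txns, List.of(List.of(1, 2))));
    }
}
```

Under the expected behavior this set is always empty; any value in it corresponds to the "not-succeeded-but-actually-succeeded" case described in the report.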
Additional context
Honestly, it's been a long time since I looked at the transaction code, so it is quite difficult for me to give any hints at this stage. I'd be happy to include any patch in case it is an issue in jetcd.