Possible aborted reads with process crashes

Open aphyr opened this issue 2 years ago • 2 comments

Versions

  • etcd: 3.5.3
  • jetcd: 0.7.1
  • java: openjdk version "17.0.3" 2022-04-19 (Debian)

Describe the bug

I have a Jepsen test case for etcd that appears to show something like an aborted read in response to process crashes in five-node clusters running on Debian stable. A jetcd client submits a transaction with a modified-revision compare clause on some number of keys, and a success branch which performs some writes. Occasionally etcd returns a TxnResponse which (per Wireshark) does not contain a succeeded field. When we ask jetcd for TxnResponse.isSucceeded, it returns false; a reasonable client would assume that this transaction did not execute its success branch. However, the effects of the success branch are visible to later reads. If this occurred in an SQL database (and we treated succeeded for a txn with only a success branch as meaning commit/abort), I'd be inclined to call this an aborted read.

I've filed this in detail on the main etcd repo, but they've repeatedly informed me this must be a client issue: either something in jetcd or in the Jepsen test itself. I'd be delighted to find out this is the case, but based on the Wireshark disassembly, I can't see how this could be the client's fault: at the wire level, I can't find any way to distinguish these "not-succeeded-but-actually-succeeded" transactions from "not-succeeded-and-actually-not-succeeded" ones. I was hoping you might be able to share some insight!
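For readers unfamiliar with why a missing succeeded field reads as false: proto3 encoders normally omit a bool that is false, so a decoder cannot tell "absent" from "explicitly false". The sketch below is not jetcd's decoder. It is a hand-rolled scan over top-level protobuf fields, assuming etcd's TxnResponse layout (header = 1, succeeded = 2, responses = 3); the class name and sample bytes are made up for illustration.

```java
class TxnWireSketch {
    // Assumed field number of `succeeded` in etcd's TxnResponse message.
    static final int SUCCEEDED_FIELD = 2;

    /** Scans the top-level fields of a serialized message and returns the
     *  value of the succeeded field, or false (the proto3 default) if absent. */
    static boolean succeeded(byte[] msg) {
        int[] pos = {0};
        boolean result = false;
        while (pos[0] < msg.length) {
            long tag = readVarint(msg, pos);
            int field = (int) (tag >>> 3);
            int wireType = (int) (tag & 7);
            switch (wireType) {
                case 0: { // varint
                    long v = readVarint(msg, pos);
                    if (field == SUCCEEDED_FIELD) result = (v != 0);
                    break;
                }
                case 1: pos[0] += 8; break; // fixed 64-bit, skipped
                case 2: { // length-delimited (nested message, bytes), skipped
                    int len = (int) readVarint(msg, pos);
                    pos[0] += len;
                    break;
                }
                case 5: pos[0] += 4; break; // fixed 32-bit, skipped
                default:
                    throw new IllegalArgumentException("unsupported wire type " + wireType);
            }
        }
        return result;
    }

    /** Reads one base-128 varint starting at pos[0] and advances pos[0]. */
    static long readVarint(byte[] buf, int[] pos) {
        long value = 0;
        int shift = 0;
        while (true) {
            byte b = buf[pos[0]++];
            value |= (long) (b & 0x7F) << shift;
            if ((b & 0x80) == 0) return value;
            shift += 7;
        }
    }

    public static void main(String[] args) {
        // Field 1 (header) as a 2-byte nested message, with no succeeded field.
        byte[] headerOnly = {0x0A, 0x02, 0x08, 0x01};
        // Same header followed by field 2 (succeeded) = true.
        byte[] withTrue = {0x0A, 0x02, 0x08, 0x01, 0x10, 0x01};
        System.out.println(succeeded(headerOnly)); // false
        System.out.println(succeeded(withTrue));   // true
    }
}
```

Under these assumptions, a response whose success branch really ran but which arrives without the succeeded field decodes exactly like a genuine failure, which matches what the capture shows.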

To Reproduce

Clone https://github.com/jepsen-io/etcd at a1bf380a1c09d62bf6bf2e7b97bd02a35902ed36, and run:

lein run test-all -w append --concurrency 2n --time-limit 1000 --rate 1000 --test-count 5 --nemesis kill

Expected behavior

I expect that transactions which return TxnResponse.isSucceeded() = false would not, in fact, appear to execute their success branches.
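Stated as a check over a history, the expectation looks roughly like the sketch below. This is not the Jepsen checker. It is a simplified illustration that assumes every written value is unique (so a value seen in a later read can be attributed to one transaction) and that each transaction has only a success branch. The Txn class and abortedReads method are hypothetical names.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class AbortedReadCheck {
    /** Hypothetical record of one transaction as the client saw it. */
    static final class Txn {
        final boolean succeeded;           // what TxnResponse.isSucceeded() reported
        final List<Integer> successWrites; // values the success branch would write

        Txn(boolean succeeded, List<Integer> successWrites) {
            this.succeeded = succeeded;
            this.successWrites = successWrites;
        }
    }

    /** Returns values that later reads observed even though the transaction
     *  that would have written them reported succeeded = false. */
    static List<Integer> abortedReads(List<Txn> txns, List<Integer> observed) {
        Set<Integer> seen = new HashSet<>(observed);
        List<Integer> leaked = new ArrayList<>();
        for (Txn t : txns) {
            if (t.succeeded) continue; // only reported failures are suspect
            for (Integer v : t.successWrites) {
                if (seen.contains(v)) leaked.add(v);
            }
        }
        return leaked;
    }

    public static void main(String[] args) {
        List<Txn> txns = List.of(
            new Txn(true, List.of(1, 2)),
            new Txn(false, List.of(3)),  // reported as not succeeded
            new Txn(false, List.of(4))); // reported as not succeeded
        List<Integer> laterReads = List.of(1, 2, 3);
        System.out.println(abortedReads(txns, laterReads)); // [3]
    }
}
```

An empty result is the expected behavior; any value in the result is a write from a transaction the client was told did not succeed.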

Additional context

aphyr avatar Jun 14 '22 13:06 aphyr

Honestly, it's been a long time since I looked at the transaction code, so it is quite difficult for me to give any hints at this stage. I'd be happy to include any patch in case it is an issue in jetcd.

lburgazzoli avatar Jun 14 '22 13:06 lburgazzoli

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions[bot] avatar Aug 14 '22 01:08 github-actions[bot]