(Partial) Jepsen test analysis

Open bsbds opened this issue 9 months ago • 2 comments

Tests

I partially completed Jepsen tests on Xline, based on https://github.com/jepsen-io/etcd.

I ran the following tests without a nemesis, so no faults were injected:

  • Register: Tests single registers, using Knossos for linearizability checking. This test contains read/write/cas operations.
  • Set: Uses a compare-and-set transaction to read a set of integers from a single key and append a value to that set.
  • Append: Tests append/read transactions over lists. To provide append transactions, the test first reads the current state, then performs a second transaction that carries out all the writes (and reads).
  • Wr: Tests transactional writes and reads to registers using Elle.

Result

  • Register: Failed once; I haven't investigated what happened yet.
  • Set: OK
  • Append: Mostly failed
  • Wr: Mostly failed

Anomalies

The most obvious anomalies I found are txn inconsistencies in append and wr. After some investigation, I found two basic types of anomalies.

  • [x] #470
G2-item #0
Let:
  T1 = {:index 469, :time 13951607366, :type :ok, :process 3, :f :txn, :value [[:w 7 30] [:r 9 nil] [:w 9 1]]}
  T2 = {:index 472, :time 13956078295, :type :ok, :process 5, :f :txn, :value [[:r 9 nil] [:w 9 3] [:r 8 8]]}

Then:
  - T1 < T2, because T1 read key 9 = nil, and T2 set it to 3, which came later in the version order.
  - However, T2 < T1, because T2 read key 9 = nil, and T1 set it to 1, which came later in the version order: a contradiction!
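
To make the contradiction concrete, here is a minimal sketch that replays T1 and T2 in both possible serial orders and checks whether the observed reads can be reproduced. This is not how Elle works (Elle infers a dependency graph); the `Mop` type, the function names, and the assumption that key 8 already held 8 are all illustrative.

```rust
use std::collections::HashMap;

/// One micro-operation of a txn (hypothetical model, not Jepsen's types).
/// `Read` carries the value the client actually observed.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Mop {
    Read(u64, Option<u64>),
    Write(u64, u64),
}

/// Replays `txn` against `state`, applying its writes, and returns true
/// only if every read would have observed the recorded value.
pub fn replay_matches(state: &mut HashMap<u64, u64>, txn: &[Mop]) -> bool {
    let mut ok = true;
    for mop in txn {
        match *mop {
            Mop::Read(k, observed) => ok &= state.get(&k).copied() == observed,
            Mop::Write(k, v) => {
                state.insert(k, v);
            }
        }
    }
    ok
}

/// Returns true if running `a` then `b`, or `b` then `a`, from `initial`
/// reproduces all recorded reads of both txns.
pub fn serial_order_exists(initial: &HashMap<u64, u64>, a: &[Mop], b: &[Mop]) -> bool {
    let try_order = |first: &[Mop], second: &[Mop]| {
        let mut state = initial.clone();
        replay_matches(&mut state, first) && replay_matches(&mut state, second)
    };
    try_order(a, b) || try_order(b, a)
}
```

With T1 = `[Write(7, 30), Read(9, None), Write(9, 1)]` and T2 = `[Read(9, None), Write(9, 3), Read(8, Some(8))]`, neither order works: whichever txn runs second must see the other's write to key 9, yet both read nil.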

The cause is in command_from_request_wrapper. When constructing the key ranges for the conflict check, we have the following code:

        RequestWrapper::TxnRequest(ref req) => req
            .compare
            .iter()
            .map(|cmp| KeyRange::new(cmp.key.as_slice(), cmp.range_end.as_slice()))
            .collect(),

The code only uses the compare keys for the conflict check; the keys of the child operations are not added, so conflicting commands may execute out of order. A fix would be to add all keys of the txn to the command.
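
A rough sketch of that fix, collecting the keys of the child operations alongside the compare keys. The types below are simplified stand-ins written for this example, not Xline's actual `KeyRange`, `Compare`, `RequestOp`, or `TxnRequest` definitions, and `txn_key_ranges` is a hypothetical helper name.

```rust
/// Simplified stand-in for a key range used in conflict checks.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct KeyRange {
    pub key: Vec<u8>,
    pub range_end: Vec<u8>,
}

impl KeyRange {
    pub fn new(key: &[u8], range_end: &[u8]) -> Self {
        Self { key: key.to_vec(), range_end: range_end.to_vec() }
    }
}

/// Simplified stand-in for a txn compare clause.
pub struct Compare {
    pub key: Vec<u8>,
    pub range_end: Vec<u8>,
}

/// Simplified stand-in for the operations a txn branch may contain.
pub enum RequestOp {
    Range { key: Vec<u8>, range_end: Vec<u8> },
    Put { key: Vec<u8> },
    DeleteRange { key: Vec<u8>, range_end: Vec<u8> },
    Txn(TxnRequest),
}

/// Simplified stand-in for a txn request.
pub struct TxnRequest {
    pub compare: Vec<Compare>,
    pub success: Vec<RequestOp>,
    pub failure: Vec<RequestOp>,
}

/// Collects the key ranges of the compares and of every child operation
/// in both branches, recursing into nested txns.
pub fn txn_key_ranges(req: &TxnRequest) -> Vec<KeyRange> {
    let mut ranges: Vec<KeyRange> = req
        .compare
        .iter()
        .map(|cmp| KeyRange::new(&cmp.key, &cmp.range_end))
        .collect();
    for op in req.success.iter().chain(req.failure.iter()) {
        match op {
            RequestOp::Range { key, range_end }
            | RequestOp::DeleteRange { key, range_end } => {
                ranges.push(KeyRange::new(key, range_end));
            }
            // A put touches a single key, so the range end is left empty.
            RequestOp::Put { key } => ranges.push(KeyRange::new(key, &[])),
            RequestOp::Txn(nested) => ranges.extend(txn_key_ranges(nested)),
        }
    }
    ranges
}
```

Both branches are included because the branch taken is not known until the compares are evaluated. This is conservative: it may report conflicts for keys in the branch that ends up not running.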

  • [ ] #471
            :anomalies {:internal ({:op #jepsen.history.Op{:index 43,
                                                           :time 12262130884,
                                                           :type :ok,
                                                           :process 29,
                                                           :f :txn,
                                                           :value [[:w
                                                                    1
                                                                    7]
                                                                   [:r
                                                                    0
                                                                    nil]
                                                                   [:r
                                                                    1
                                                                    4]
                                                                   [:r
                                                                    2
                                                                    6]]},
                                    :mop [:r 1 4],
                                    :expected 7}

The operations in a single txn should be executed sequentially. In the example above, the txn writes 7 to key 1 and then reads key 1 as 4, the value from before the txn. This happens because Xline does not check for conflicts inside a single txn: every command's result is based on the storage state before the txn is executed. This behaviour is inconsistent with etcd and needs further discussion. Maybe we could statically check the txn after the compare is completed.
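
The difference between the two behaviours can be sketched with a toy in-memory store. This is only an illustration under assumed names (`Op`, `execute_on_snapshot`, `execute_sequentially`); it does not reflect Xline's or etcd's actual storage code.

```rust
use std::collections::HashMap;

/// A txn operation over integer keys and values (hypothetical model).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Op {
    Write(u64, u64),
    Read(u64),
}

/// Serves every read from the state as it was before the txn started,
/// which is the behaviour described in this issue.
pub fn execute_on_snapshot(store: &mut HashMap<u64, u64>, ops: &[Op]) -> Vec<Option<u64>> {
    let snapshot = store.clone();
    let mut reads = Vec::new();
    for op in ops {
        match *op {
            Op::Write(k, v) => {
                store.insert(k, v);
            }
            Op::Read(k) => reads.push(snapshot.get(&k).copied()),
        }
    }
    reads
}

/// Applies operations one after another, so a read sees earlier writes
/// made by the same txn.
pub fn execute_sequentially(store: &mut HashMap<u64, u64>, ops: &[Op]) -> Vec<Option<u64>> {
    let mut reads = Vec::new();
    for op in ops {
        match *op {
            Op::Write(k, v) => {
                store.insert(k, v);
            }
            Op::Read(k) => reads.push(store.get(&k).copied()),
        }
    }
    reads
}
```

Starting from a store where key 1 holds 4 and key 2 holds 6, the ops `[Write(1, 7), Read(0), Read(1), Read(2)]` read key 1 as 4 under `execute_on_snapshot` (matching the reported anomaly) and as 7 under `execute_sequentially` (the value Elle expected).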

bsbds avatar Sep 28 '23 12:09 bsbds

The operations in a single txn should be executed sequentially. However, in Xline we do not check for conflicts inside a single txn; all command results are based on the storage state before the txn is executed.

This seems to be the cause of #468. Would PR https://github.com/xline-kv/Xline/pull/472 close #468? @bsbds

liangyuanpeng avatar Oct 08 '23 03:10 liangyuanpeng

Sorry for the late response. Indeed, that is the root cause. I'll fix it in another PR. Thanks for the test case!

bsbds avatar Nov 01 '23 11:11 bsbds