[Bug]: xline cluster will enter a frozen state after multiple crash and recoveries

Open iGxnon opened this issue 11 months ago • 0 comments

Description about the bug

After multiple crash-and-recovery cycles, the xline cluster enters a frozen state even though the configuration of the xline servers is normal.

In https://github.com/xline-kv/xline-operator/pull/16, xline-operator introduced a simple chaos validation. The specific logic is as follows:

monkeys() {
  size=$1
  iters=$2
  max_kill=$((size / 2))
  echo "monkeys: size=$size, iters=$iters, max_kill=$max_kill"
  for ((i = 0; i < iters; i++)); do
    case $(random 3) in
    0)
      echo "monkeys: put get"
      value=$(random 100)
      run_expect "put A $value" "OK"
      run_expect "get A" "A\n$value"
      ;;
    1)
      echo "monkeys: drop pods"
      # Before deleting the pod, execute "put get" to ensure that the cluster works properly.
      run_expect "put A 1" "OK"
      run_expect "get A" "A\n1"
      # Get the current number of active nodes.
      ready=$(kubectl get sts/$CLUSTER_NAME -o=jsonpath='{.status.readyReplicas}')
      # Number of pods to kill, chosen so that at least (size - max_kill) pods stay ready.
      killed=$((ready + max_kill - size))
      for ((y = 0; y < killed; y++)); do
        name=$CLUSTER_NAME-$(random "$size")
        kubectl delete pod/"$name" --force --grace-period=0 2>/dev/null
      done
      ;;
    2)
      echo "monkeys: wait for pods"
      kubectl wait --for=jsonpath='{.status.readyReplicas}'="$size" sts/$CLUSTER_NAME --timeout=10m
      ;;
    esac
  done
}
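
The kill count in the "drop pods" branch is easier to follow with concrete numbers. The following is a minimal sketch, assuming the 5-node cluster shown in this report; kill_budget is a hypothetical helper that mirrors the arithmetic in monkeys() and is not part of the original script.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not in the xline-operator script): how many pods the
# "drop pods" step deletes for a given cluster size and ready count.
kill_budget() {
  local size=$1 ready=$2
  local max_kill=$((size / 2))               # same formula as in monkeys()
  local killed=$((ready + max_kill - size))  # same formula as in monkeys()
  # A negative result makes the for loop in monkeys() run zero times;
  # that is modeled here by clamping to 0.
  ((killed < 0)) && killed=0
  echo "$killed"
}

for ready in 5 4 3 2; do
  echo "size=5 ready=$ready -> killed=$(kill_budget 5 "$ready")"
done
```

For size=5 this prints killed=2, 1, 0 and 0 for ready=5, 4, 3 and 2 respectively, so the step stops deleting pods once only 3 of the 5 are ready. Because the pod name is drawn at random from all indices, the same pod can be picked twice, and the number of distinct pods deleted may be lower than the budget.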

After several iterations, the cluster enters a frozen state. At that point, a GET request times out with the following output:

$ kubectl exec pod/tester -c etcdctl -- bash -c "ETCDCTL_API=3 etcdctl --endpoints='http://my-xline-cluster-2.my-xline-cluster.default.svc.cluster.local:2379' get A"
{"level":"warn","ts":"2023-07-27T10:24:08.495Z","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0x40001bc000/my-xline-cluster-2.my-xline-cluster.default.svc.cluster.local:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
Error: context deadline exceeded
command terminated with exit code 1

Meanwhile, kubectl reports every pod as healthy:

$ kubectl get pods
NAME                                 READY   STATUS    RESTARTS      AGE
my-xline-cluster-0                   1/1     Running   0             15m
my-xline-cluster-1                   1/1     Running   0             15m
my-xline-cluster-2                   1/1     Running   0             16m
my-xline-cluster-3                   1/1     Running   0             16m
my-xline-cluster-4                   1/1     Running   0             16m
my-xline-operator-6b5979899b-7m8q7   1/1     Running   4 (84m ago)   2d5h
my-xline-operator-6b5979899b-jqw9b   1/1     Running   4 (84m ago)   2d5h
my-xline-operator-6b5979899b-pmn9d   1/1     Running   4 (84m ago)   2d5h
tester                               1/1     Running   0             12m
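
Since the pod status alone does not show whether a member answers requests, it may help to probe each member directly. This is a rough sketch, not part of the original test: member_endpoint and probe_members are hypothetical helpers, the endpoint format is copied from the failing command above (namespace default, port 2379), and the key A and the 3s timeout are illustrative choices. It assumes the tester pod and its etcdctl container from this report.

```bash
#!/usr/bin/env bash
# Hypothetical helper: client URL of member <index> in <cluster>, following
# the endpoint format used in the failing etcdctl command.
member_endpoint() {
  local cluster=$1 index=$2
  echo "http://${cluster}-${index}.${cluster}.default.svc.cluster.local:2379"
}

# Hypothetical helper: issue a short read against every member and report
# whether it answered before the timeout.
probe_members() {
  local cluster=$1 size=$2 i ep
  for ((i = 0; i < size; i++)); do
    ep=$(member_endpoint "$cluster" "$i")
    if kubectl exec pod/tester -c etcdctl -- bash -c \
      "ETCDCTL_API=3 etcdctl --endpoints='$ep' --command-timeout=3s get A" \
      >/dev/null 2>&1; then
      echo "$ep: serving"
    else
      echo "$ep: not serving"
    fi
  done
}
```

Running probe_members my-xline-cluster 5 would show whether the timeout is specific to my-xline-cluster-2 or affects every member.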

Version

0.4.1 (Default)

Relevant log output

# Log lines seen across the cluster

2023-07-27T10:12:33.248384Z  WARN put:propose{cmd_id="my-xline-cluster-1-a90ed525-d5a4-4c91-9a93-8b104f2e3843"}:fast_round: curp::client: Propose error: key conflict error
2023-07-27T10:12:33.248392Z  WARN put:propose{cmd_id="my-xline-cluster-1-a90ed525-d5a4-4c91-9a93-8b104f2e3843"}:fast_round: curp::client: Propose error: key conflict error
2023-07-27T10:12:33.248398Z  WARN put:propose{cmd_id="my-xline-cluster-1-a90ed525-d5a4-4c91-9a93-8b104f2e3843"}:fast_round: curp::client: Propose error: key conflict error
2023-07-27T10:12:33.248913Z  WARN put:propose{cmd_id="my-xline-cluster-1-a90ed525-d5a4-4c91-9a93-8b104f2e3843"}:fast_round: curp::client: Propose error: key conflict error
2023-07-27T10:12:35.249666Z  WARN put:propose{cmd_id="my-xline-cluster-1-a90ed525-d5a4-4c91-9a93-8b104f2e3843"}:slow_round: curp::client: wait synced rpc error: rpc error: status: Cancelled, message: "Timeout expired", details: [], metadata: MetadataMap { headers: {} }
2023-07-27T10:12:37.360188Z  WARN put:propose{cmd_id="my-xline-cluster-1-a90ed525-d5a4-4c91-9a93-8b104f2e3843"}:slow_round: curp::client: wait synced rpc error: rpc error: status: Cancelled, message: "Timeout expired", details: [], metadata: MetadataMap { headers: {} }
2023-07-27T10:12:39.458791Z  WARN put:propose{cmd_id="my-xline-cluster-1-32cbc24d-f57c-409a-8028-e997565bd7e5"}:fast_round: curp::client: Propose error: key conflict error
2023-07-27T10:12:39.458803Z  WARN put:propose{cmd_id="my-xline-cluster-1-32cbc24d-f57c-409a-8028-e997565bd7e5"}:fast_round: curp::client: Propose error: key conflict error
2023-07-27T10:12:39.458814Z  WARN put:propose{cmd_id="my-xline-cluster-1-32cbc24d-f57c-409a-8028-e997565bd7e5"}:fast_round: curp::client: Propose error: key conflict error
2023-07-27T10:12:39.458972Z  WARN put:propose{cmd_id="my-xline-cluster-1-32cbc24d-f57c-409a-8028-e997565bd7e5"}:fast_round: curp::client: Propose error: key conflict error
2023-07-27T10:12:41.459056Z  WARN put:propose{cmd_id="my-xline-cluster-1-32cbc24d-f57c-409a-8028-e997565bd7e5"}:slow_round: curp::client: wait synced rpc error: rpc error: status: Cancelled, message: "Timeout expired", details: [], metadata: MetadataMap { headers: {} }
2023-07-27T10:12:43.566937Z  WARN put:propose{cmd_id="my-xline-cluster-1-32cbc24d-f57c-409a-8028-e997565bd7e5"}:slow_round: curp::client: wait synced rpc error: rpc error: status: Cancelled, message: "Timeout expired", details: [], metadata: MetadataMap { headers: {} }
2023-07-27T10:12:51.763590Z  WARN put:propose{cmd_id="my-xline-cluster-1-b32f7329-3898-44c8-b01a-7c7c3a12f699"}:fast_round: curp::client: Propose error: key conflict error
2023-07-27T10:12:51.763874Z  WARN put:propose{cmd_id="my-xline-cluster-1-b32f7329-3898-44c8-b01a-7c7c3a12f699"}:fast_round: curp::client: Propose error: key conflict error
2023-07-27T10:12:51.763938Z  WARN put:propose{cmd_id="my-xline-cluster-1-b32f7329-3898-44c8-b01a-7c7c3a12f699"}:fast_round: curp::client: Propose error: key conflict error
2023-07-27T10:12:51.764114Z  WARN put:propose{cmd_id="my-xline-cluster-1-b32f7329-3898-44c8-b01a-7c7c3a12f699"}:fast_round: curp::client: Propose error: key conflict error
2023-07-27T10:12:53.763917Z  WARN put:propose{cmd_id="my-xline-cluster-1-b32f7329-3898-44c8-b01a-7c7c3a12f699"}:slow_round: curp::client: wait synced rpc error: rpc error: status: Cancelled, message: "Timeout expired", details: [], metadata: MetadataMap { headers: {} }
2023-07-27T10:12:55.879934Z  WARN put:propose{cmd_id="my-xline-cluster-1-b32f7329-3898-44c8-b01a-7c7c3a12f699"}:slow_round: curp::client: wait synced rpc error: rpc error: status: Cancelled, message: "Timeout expired", details: [], metadata: MetadataMap { headers: {} }
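
The pattern in these warnings is easier to see when they are grouped by command. The following is a small sketch, assuming the log format shown above; summarize_warnings is a hypothetical helper that counts warnings per cmd_id and per round (fast_round or slow_round).

```bash
#!/usr/bin/env bash
# Hypothetical helper: read log lines on stdin and print
# "<count> <cmd_id> <round>" for each (cmd_id, round) pair.
summarize_warnings() {
  awk '
    match($0, /cmd_id="[^"]+"/) {
      # Skip the 8 characters of cmd_id=" and drop the closing quote.
      id = substr($0, RSTART + 8, RLENGTH - 9)
      if ($0 ~ /:fast_round:/)      round = "fast_round"
      else if ($0 ~ /:slow_round:/) round = "slow_round"
      else                          round = "other"
      count[id " " round]++
    }
    END { for (k in count) print count[k], k }
  ' | sort -k2,3
}
```

It can be fed from a saved log file or from kubectl logs, for example kubectl logs pod/my-xline-cluster-1 | summarize_warnings. On the excerpt above, each of the three cmd_id values has 4 fast_round warnings (key conflict) followed by 2 slow_round warnings (timeout).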

Code of Conduct

  • [X] I agree to follow this project's Code of Conduct

iGxnon, Jul 27 '23 10:07