Xline
[Bug]: xline cluster enters a frozen state after multiple crashes and recoveries
Description about the bug
After multiple crash recoveries, an xline cluster enters a frozen state even though the configuration of the xline servers is normal.
In https://github.com/xline-kv/xline-operator/pull/16, xline-operator introduced a simple chaos validation. The specific logic is as follows:
monkeys() {
    size=$1
    iters=$2
    max_kill=$((size / 2))
    echo "monkeys: size=$size, iters=$iters, max_kill=$max_kill"
    for ((i = 0; i < iters; i++)); do
        case $(random 3) in
        0)
            echo "monkeys: put get"
            value=$(random 100)
            run_expect "put A $value" "OK"
            run_expect "get A" "A\n$value"
            ;;
        1)
            echo "monkeys: drop pods"
            # Before deleting the pod, execute "put get" to ensure that the cluster works properly.
            run_expect "put A 1" "OK"
            run_expect "get A" "A\n1"
            # Get the current number of active nodes.
            ready=$(kubectl get sts/$CLUSTER_NAME -o=jsonpath='{.status.readyReplicas}')
            # Calculate the size to be killed
            killed=$((ready + max_kill - size))
            for ((y = 0; y < killed; y++)); do
                name=$CLUSTER_NAME-$(random "$size")
                kubectl delete pod/"$name" --force --grace-period=0 2>/dev/null
            done
            ;;
        2)
            echo "monkeys: wait for pods"
            kubectl wait --for=jsonpath='{.status.readyReplicas}'="$size" sts/$CLUSTER_NAME --timeout=10m
            ;;
        esac
    done
}
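The kill count in the "drop pods" branch follows from `killed = ready + max_kill - size`, which is meant to keep a majority of replicas alive at any time. The following is a minimal sketch of that arithmetic for a 5-node cluster. The function name `kill_count` is hypothetical and does not appear in the operator's script; the clamp to zero mirrors the fact that the original `for` loop simply does not run when the result is negative.

```shell
#!/usr/bin/env bash
# kill_count is a hypothetical helper that mirrors the arithmetic in monkeys().
# Arguments: cluster size, number of currently ready replicas.
kill_count() {
    local size="$1"
    local ready="$2"
    local max_kill=$((size / 2))
    local killed=$((ready + max_kill - size))
    # A negative result means too many pods are already down: kill nothing.
    if [ "$killed" -lt 0 ]; then
        killed=0
    fi
    echo "$killed"
}

# For a 5-node cluster, print the kill count for several ready counts.
for ready in 5 4 3 2; do
    echo "ready=$ready -> killed=$(kill_count 5 "$ready")"
done
```

Assuming `random N` returns an integer in `[0, N)`, the inner loop can pick the same pod name twice or pick a pod that is already down, so the number of distinct pods actually removed in one iteration may be smaller than `killed`.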
After multiple iterations, the cluster will enter a frozen state. At this point, executing a GET request will produce the following output:
$ kubectl exec pod/tester -c etcdctl -- bash -c "ETCDCTL_API=3 etcdctl --endpoints='http://my-xline-cluster-2.my-xline-cluster.default.svc.cluster.local:2379' get A"
{"level":"warn","ts":"2023-07-27T10:24:08.495Z","logger":"etcd-client","caller":"v3/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0x40001bc000/my-xline-cluster-2.my-xline-cluster.default.svc.cluster.local:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
Error: context deadline exceeded
command terminated with exit code 1
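The request above targets a single member (`my-xline-cluster-2`). To check whether the other members behave the same way, each endpoint can be probed in turn with a bounded client-side timeout. This is only a sketch: `endpoint_url` and `probe_all` are hypothetical helpers, and the URL pattern is assumed to hold for every member because it matches the endpoint used in the failing request.

```shell
#!/usr/bin/env bash
# endpoint_url and probe_all are hypothetical helpers, not part of xline-operator.
# Assumption: every member is reachable under the same StatefulSet DNS pattern
# as the endpoint used in the failing request.
endpoint_url() {
    local cluster="$1"
    local index="$2"
    echo "http://$cluster-$index.$cluster.default.svc.cluster.local:2379"
}

# Run "get A" against each member with a 5s etcdctl command timeout.
probe_all() {
    local cluster="$1"
    local size="$2"
    local i=0
    local url
    while [ "$i" -lt "$size" ]; do
        url=$(endpoint_url "$cluster" "$i")
        if kubectl exec pod/tester -c etcdctl -- bash -c \
            "ETCDCTL_API=3 etcdctl --endpoints='$url' --command-timeout=5s get A" \
            >/dev/null 2>&1; then
            echo "$url: ok"
        else
            echo "$url: failed or timed out"
        fi
        i=$((i + 1))
    done
}

# Usage (requires the tester pod from this report):
#   probe_all my-xline-cluster 5
```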
Meanwhile, kubectl reports every pod as healthy:
$ kubectl get pods
NAME                                 READY   STATUS    RESTARTS      AGE
my-xline-cluster-0                   1/1     Running   0             15m
my-xline-cluster-1                   1/1     Running   0             15m
my-xline-cluster-2                   1/1     Running   0             16m
my-xline-cluster-3                   1/1     Running   0             16m
my-xline-cluster-4                   1/1     Running   0             16m
my-xline-operator-6b5979899b-7m8q7   1/1     Running   4 (84m ago)   2d5h
my-xline-operator-6b5979899b-jqw9b   1/1     Running   4 (84m ago)   2d5h
my-xline-operator-6b5979899b-pmn9d   1/1     Running   4 (84m ago)   2d5h
tester                               1/1     Running   0             12m
Version
0.4.1 (Default)
Relevant log output
# Some common logs among the cluster
2023-07-27T10:12:33.248384Z WARN put:propose{cmd_id="my-xline-cluster-1-a90ed525-d5a4-4c91-9a93-8b104f2e3843"}:fast_round: curp::client: Propose error: key conflict error
2023-07-27T10:12:33.248392Z WARN put:propose{cmd_id="my-xline-cluster-1-a90ed525-d5a4-4c91-9a93-8b104f2e3843"}:fast_round: curp::client: Propose error: key conflict error
2023-07-27T10:12:33.248398Z WARN put:propose{cmd_id="my-xline-cluster-1-a90ed525-d5a4-4c91-9a93-8b104f2e3843"}:fast_round: curp::client: Propose error: key conflict error
2023-07-27T10:12:33.248913Z WARN put:propose{cmd_id="my-xline-cluster-1-a90ed525-d5a4-4c91-9a93-8b104f2e3843"}:fast_round: curp::client: Propose error: key conflict error
2023-07-27T10:12:35.249666Z WARN put:propose{cmd_id="my-xline-cluster-1-a90ed525-d5a4-4c91-9a93-8b104f2e3843"}:slow_round: curp::client: wait synced rpc error: rpc error: status: Cancelled, message: "Timeout expired", details: [], metadata: MetadataMap { headers: {} }
2023-07-27T10:12:37.360188Z WARN put:propose{cmd_id="my-xline-cluster-1-a90ed525-d5a4-4c91-9a93-8b104f2e3843"}:slow_round: curp::client: wait synced rpc error: rpc error: status: Cancelled, message: "Timeout expired", details: [], metadata: MetadataMap { headers: {} }
2023-07-27T10:12:39.458791Z WARN put:propose{cmd_id="my-xline-cluster-1-32cbc24d-f57c-409a-8028-e997565bd7e5"}:fast_round: curp::client: Propose error: key conflict error
2023-07-27T10:12:39.458803Z WARN put:propose{cmd_id="my-xline-cluster-1-32cbc24d-f57c-409a-8028-e997565bd7e5"}:fast_round: curp::client: Propose error: key conflict error
2023-07-27T10:12:39.458814Z WARN put:propose{cmd_id="my-xline-cluster-1-32cbc24d-f57c-409a-8028-e997565bd7e5"}:fast_round: curp::client: Propose error: key conflict error
2023-07-27T10:12:39.458972Z WARN put:propose{cmd_id="my-xline-cluster-1-32cbc24d-f57c-409a-8028-e997565bd7e5"}:fast_round: curp::client: Propose error: key conflict error
2023-07-27T10:12:41.459056Z WARN put:propose{cmd_id="my-xline-cluster-1-32cbc24d-f57c-409a-8028-e997565bd7e5"}:slow_round: curp::client: wait synced rpc error: rpc error: status: Cancelled, message: "Timeout expired", details: [], metadata: MetadataMap { headers: {} }
2023-07-27T10:12:43.566937Z WARN put:propose{cmd_id="my-xline-cluster-1-32cbc24d-f57c-409a-8028-e997565bd7e5"}:slow_round: curp::client: wait synced rpc error: rpc error: status: Cancelled, message: "Timeout expired", details: [], metadata: MetadataMap { headers: {} }
2023-07-27T10:12:51.763590Z WARN put:propose{cmd_id="my-xline-cluster-1-b32f7329-3898-44c8-b01a-7c7c3a12f699"}:fast_round: curp::client: Propose error: key conflict error
2023-07-27T10:12:51.763874Z WARN put:propose{cmd_id="my-xline-cluster-1-b32f7329-3898-44c8-b01a-7c7c3a12f699"}:fast_round: curp::client: Propose error: key conflict error
2023-07-27T10:12:51.763938Z WARN put:propose{cmd_id="my-xline-cluster-1-b32f7329-3898-44c8-b01a-7c7c3a12f699"}:fast_round: curp::client: Propose error: key conflict error
2023-07-27T10:12:51.764114Z WARN put:propose{cmd_id="my-xline-cluster-1-b32f7329-3898-44c8-b01a-7c7c3a12f699"}:fast_round: curp::client: Propose error: key conflict error
2023-07-27T10:12:53.763917Z WARN put:propose{cmd_id="my-xline-cluster-1-b32f7329-3898-44c8-b01a-7c7c3a12f699"}:slow_round: curp::client: wait synced rpc error: rpc error: status: Cancelled, message: "Timeout expired", details: [], metadata: MetadataMap { headers: {} }
2023-07-27T10:12:55.879934Z WARN put:propose{cmd_id="my-xline-cluster-1-b32f7329-3898-44c8-b01a-7c7c3a12f699"}:slow_round: curp::client: wait synced rpc error: rpc error: status: Cancelled, message: "Timeout expired", details: [], metadata: MetadataMap { headers: {} }
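The excerpt repeats one pattern per command: several `fast_round` key conflict errors followed by `slow_round` "wait synced" timeouts. A small sketch for counting these per `cmd_id` in a larger log is shown below. The function name `summarize_propose_errors` is hypothetical, and the parsing assumes the log line format matches the excerpt above.

```shell
#!/usr/bin/env bash
# summarize_propose_errors is a hypothetical helper. It reads log lines on
# stdin and prints, per cmd_id in order of first appearance, how many
# fast_round and slow_round warnings were logged.
summarize_propose_errors() {
    awk '
        match($0, /cmd_id="[^"]*"/) {
            # Strip the leading cmd_id=" (8 chars) and the closing quote.
            id = substr($0, RSTART + 8, RLENGTH - 9)
            if (!(id in seen)) { seen[id] = 1; order[++n] = id }
            if ($0 ~ /:fast_round:/) fast[id]++
            if ($0 ~ /:slow_round:/) slow[id]++
        }
        END {
            for (i = 1; i <= n; i++) {
                id = order[i]
                printf "%s fast_round=%d slow_round=%d\n", id, fast[id], slow[id]
            }
        }
    '
}

# Usage (the pod name is only an example):
#   kubectl logs pod/my-xline-cluster-1 | summarize_propose_errors
```

Applied to the excerpt above, this reports three command ids, each with four `fast_round` conflicts and two `slow_round` timeouts.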
Code of Conduct
- [X] I agree to follow this project's Code of Conduct