tidb-operator
tidb-operator copied to clipboard
TiKV keep connecting to pd with 4-DEADLINE_EXCEEDED error
Bug Report
What version of Kubernetes are you using?
Client Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.4", GitCommit:"b695d79d4f967c403a96986f1750a35eb75e75f1", GitTreeState:"clean", BuildDate:"2021-11-17T15:48:33Z", GoVersion:"go1.16.10", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.1", GitCommit:"5e58841cce77d4bc13713ad2b91fa0d961e69192", GitTreeState:"clean", BuildDate:"2021-05-21T23:01:33Z", GoVersion:"go1.16.4", Compiler:"gc", Platform:"linux/amd64"}
What version of TiDB Operator are you using?
TiDB Operator Version: version.Info{GitVersion:"v1.3.3", GitCommit:"8949b33a38396c417331859fa00c670d1a627388", GitTreeState:"clean", BuildDate:"2022-05-17T02:47:18Z", GoVersion:"go1.13.8", Compiler:"gc", Platform:"linux/amd64"}
What storage classes exist in the Kubernetes cluster and what are used for PD/TiKV pods?
NAME PROVISIONER RECLAIMPOLICY VOLUMEBINDINGMODE ALLOWVOLUMEEXPANSION AGE
standard (default) rancher.io/local-path Delete WaitForFirstConsumer false 95d
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
pd-chaos-pd-0 Bound pvc-3b2fa30e-7c21-4411-b893-299d40912613 10Gi RWO standard 21m
pd-chaos-pd-1 Bound pvc-b921799f-57d8-4998-8287-dcd0c10329cf 10Gi RWO standard 21m
pd-chaos-pd-2 Bound pvc-2e51d966-07b4-4ac5-afe2-22d58d77c0fc 10Gi RWO standard 21m
tikv-chaos-tikv-0 Bound pvc-347784a9-7455-4ae3-be88-dea53badaadb 100Gi RWO standard 20m
tikv-chaos-tikv-1 Bound pvc-279d9124-72a7-49cf-bf64-aee1ed8419e0 100Gi RWO standard 20m
tikv-chaos-tikv-2 Bound pvc-44af264d-6ec4-4a75-95c5-cb44e63bd666 100Gi RWO standard 20m
What's the status of the TiDB cluster pods?
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
chaos-discovery-7544f577cb-67g2t 1/1 Running 0 22m 10.244.0.73 kind-control-plane <none> <none>
chaos-pd-0 1/1 Running 1 9m45s 10.244.0.91 kind-control-plane <none> <none>
chaos-pd-1 1/1 Running 1 11m 10.244.0.90 kind-control-plane <none> <none>
chaos-pd-2 1/1 Running 1 12m 10.244.0.89 kind-control-plane <none> <none>
chaos-tidb-0 2/2 Running 0 20m 10.244.0.88 kind-control-plane <none> <none>
chaos-tidb-1 2/2 Running 0 20m 10.244.0.86 kind-control-plane <none> <none>
chaos-tidb-2 2/2 Running 0 20m 10.244.0.87 kind-control-plane <none> <none>
chaos-tikv-0 1/1 Running 0 21m 10.244.0.84 kind-control-plane <none> <none>
chaos-tikv-1 1/1 Running 0 21m 10.244.0.85 kind-control-plane <none> <none>
chaos-tikv-2 1/1 Running 0 9m5s 10.244.0.92 kind-control-plane <none> <none>
What did you do?
I try to deploy by this instruction for tidb v5.4.1: https://docs.pingcap.com/tidb-in-kubernetes/stable/deploy-tidb-operator
but TIKV can not started, the error on TiKV pod was:
[2022/06/08 09:19:49.098 +00:00] [INFO] [util.rs:544] ["connecting to PD endpoint"] [endpoints=http://chaos-pd:2379]
[2022/06/08 09:19:51.099 +00:00] [INFO] [util.rs:506] ["PD failed to respond"] [err="Grpc(RpcFailure(RpcStatus { code: 4-DEADLINE_EXCEEDED, message: \"Deadline Exceeded\", details: [] }))"] [endpoints=http://chaos-pd:2379]
[2022/06/08 09:19:51.400 +00:00] [INFO] [util.rs:544] ["connecting to PD endpoint"] [endpoints=http://chaos-pd:2379]
[2022/06/08 09:19:53.401 +00:00] [INFO] [util.rs:506] ["PD failed to respond"] [err="Grpc(RpcFailure(RpcStatus { code: 4-DEADLINE_EXCEEDED, message: \"Deadline Exceeded\", details: [] }))"] [endpoints=http://chaos-pd:2379]
[2022/06/08 09:19:53.401 +00:00] [WARN] [client.rs:153] ["validate PD endpoints failed"] [err="Other(\"[components/pd_client/src/util.rs:539]: PD cluster failed to respond\")"]
I have precheck all the prerequisites, destroy and retry to deploy but still got the same error. Then I deploy tidb v3.0.12 and it started successfully, i try to upgrade it to v5.4.1, but guess what? The same error occurs.
What did you expect to see? The cluster should be started successfully when following the instruction.
What did you see instead? Error on TiKV when trying to connect to PD server.
@weekface could you take a look on this
Hi,do you still have questions? @VoxT
@VoxT Could you please share your TC yaml?
I have this same error in a kind cluster as well. I tried v5.3.0/6.0.0/6.1.0 using the advanced cluster config yaml as well as the basic config yaml from the examples folder. No luck same issue:
advanced-tidb-tikv-0 starting tikv-server ...
advanced-tidb-tikv-0 /tikv-server --pd=http://advanced-tidb-pd:2379 --advertise-addr=advanced-tidb-tikv-0.advanced-tidb-tikv-peer.db.svc:20160 --addr=0.0.0.0:20160 --status-addr=0.0.0.0:20180 --advertise-status-addr=advanced-tidb-tikv-0.advanced-tidb-tikv-peer.db.svc:20180 --data-dir=/var/lib/tikv --capacity=0 --config=/etc/tikv/tikv.toml
advanced-tidb-tikv-0
advanced-tidb-tikv-0 deprecated configuration, log-level has been moved to log.level
advanced-tidb-tikv-0 override log.level with log-level, Debug
advanced-tidb-tikv-0 [2022/11/13 02:32:59.992 +00:00] [INFO] [lib.rs:79] ["Welcome to TiKV"]
advanced-tidb-tikv-0 [2022/11/13 02:32:59.992 +00:00] [INFO] [lib.rs:84] ["Release Version: 6.1.0"]
advanced-tidb-tikv-0 [2022/11/13 02:32:59.992 +00:00] [INFO] [lib.rs:84] ["Edition: Community"]
advanced-tidb-tikv-0 [2022/11/13 02:32:59.992 +00:00] [INFO] [lib.rs:84] ["Git Commit Hash: 080d086832ae5ce2495352dccaf8df5d40f30687"]
advanced-tidb-tikv-0 [2022/11/13 02:32:59.992 +00:00] [INFO] [lib.rs:84] ["Git Commit Branch: heads/refs/tags/v6.1.0"]
advanced-tidb-tikv-0 [2022/11/13 02:32:59.992 +00:00] [INFO] [lib.rs:84] ["UTC Build Time: Unknown (env var does not exist when building)"]
advanced-tidb-tikv-0 [2022/11/13 02:32:59.992 +00:00] [INFO] [lib.rs:84] ["Rust Version: rustc 1.60.0-nightly (1e12aef3f 2022-02-13)"]
advanced-tidb-tikv-0 [2022/11/13 02:32:59.992 +00:00] [INFO] [lib.rs:84] ["Enable Features: jemalloc mem-profiling portable sse test-engine-kv-rocksdb test-engine-raft-raft-engine cloud-aws cloud-gcp cloud-azure"]
advanced-tidb-tikv-0 [2022/11/13 02:32:59.992 +00:00] [INFO] [lib.rs:84] ["Profile: dist_release"]
advanced-tidb-tikv-0 [2022/11/13 02:32:59.992 +00:00] [INFO] [mod.rs:74] ["cgroup quota: memory=Some(9223372036854771712), cpu=None, cores={22, 5, 20, 17, 0, 16, 11, 18, 1, 7, 10, 2, 3, 8, 13, 12, 21, 23, 15, 4, 6, 14, 19, 9}"]
advanced-tidb-tikv-0 [2022/11/13 02:32:59.998 +00:00] [INFO] [mod.rs:81] ["memory limit in bytes: 69030949888, cpu cores quota: 24"]
advanced-tidb-tikv-0 [2022/11/13 02:32:59.998 +00:00] [WARN] [server.rs:1533] ["check: kernel"] [err="kernel parameters net.core.somaxconn got 4096, expect 32768"]
advanced-tidb-tikv-0 [2022/11/13 02:32:59.998 +00:00] [WARN] [server.rs:1533] ["check: kernel"] [err="kernel parameters net.ipv4.tcp_syncookies got 1, expect 0"]
advanced-tidb-tikv-0 [2022/11/13 02:32:59.998 +00:00] [WARN] [server.rs:1533] ["check: kernel"] [err="kernel parameters vm.swappiness got 60, expect 0"]
advanced-tidb-tikv-0 [2022/11/13 02:33:00.000 +00:00] [DEBUG] [<unknown>] ["Using polling engine: epollex"]
advanced-tidb-tikv-0 [2022/11/13 02:33:00.000 +00:00] [DEBUG] [<unknown>] ["registering LB policy factory for \"grpclb\""]
advanced-tidb-tikv-0 [2022/11/13 02:33:00.000 +00:00] [DEBUG] [<unknown>] ["registering LB policy factory for \"priority_experimental\""]
advanced-tidb-tikv-0 [2022/11/13 02:33:00.000 +00:00] [DEBUG] [<unknown>] ["registering LB policy factory for \"weighted_target_experimental\""]
advanced-tidb-tikv-0 [2022/11/13 02:33:00.000 +00:00] [DEBUG] [<unknown>] ["registering LB policy factory for \"pick_first\""]
advanced-tidb-tikv-0 [2022/11/13 02:33:00.000 +00:00] [DEBUG] [<unknown>] ["registering LB policy factory for \"round_robin\""]
advanced-tidb-tikv-0 [2022/11/13 02:33:00.000 +00:00] [DEBUG] [<unknown>] ["registering LB policy factory for \"ring_hash_experimental\""]
advanced-tidb-tikv-0 [2022/11/13 02:33:00.000 +00:00] [DEBUG] [<unknown>] ["Using ares dns resolver"]
advanced-tidb-tikv-0 [2022/11/13 02:33:00.000 +00:00] [DEBUG] [<unknown>] ["registering certificate provider factory for \"file_watcher\""]
advanced-tidb-tikv-0 [2022/11/13 02:33:00.000 +00:00] [DEBUG] [<unknown>] ["registering LB policy factory for \"cds_experimental\""]
advanced-tidb-tikv-0 [2022/11/13 02:33:00.000 +00:00] [DEBUG] [<unknown>] ["registering LB policy factory for \"xds_cluster_impl_experimental\""]
advanced-tidb-tikv-0 [2022/11/13 02:33:00.000 +00:00] [DEBUG] [<unknown>] ["registering LB policy factory for \"xds_cluster_resolver_experimental\""]
advanced-tidb-tikv-0 [2022/11/13 02:33:00.000 +00:00] [DEBUG] [<unknown>] ["registering LB policy factory for \"xds_cluster_manager_experimental\""]
advanced-tidb-tikv-0 [2022/11/13 02:33:00.007 +00:00] [INFO] [util.rs:575] ["connecting to PD endpoint"] [endpoints=http://advanced-tidb-pd:2379]
advanced-tidb-tikv-0 [2022/11/13 02:33:02.009 +00:00] [INFO] [util.rs:537] ["PD failed to respond"] [err="Grpc(RpcFailure(RpcStatus { code: 4-DEADLINE_EXCEEDED, message: \"Deadline Exceeded\", details: [] }))"] [endpoints=http://advanced-tidb-pd:2379]
advanced-tidb-tikv-0 [2022/11/13 02:33:02.009 +00:00] [WARN] [client.rs:163] ["validate PD endpoints failed"] [err="Other(\"[components/pd_client/src/util.rs:570]: PD cluster failed to respond\")"]
advanced-tidb-tikv-0 [2022/11/13 02:33:02.311 +00:00] [INFO] [util.rs:575] ["connecting to PD endpoint"] [endpoints=http://advanced-tidb-pd:2379]
advanced-tidb-tikv-0 [2022/11/13 02:33:04.313 +00:00] [INFO] [util.rs:537] ["PD failed to respond"] [err="Grpc(RpcFailure(RpcStatus { code: 4-DEADLINE_EXCEEDED, message: \"Deadline Exceeded\", details: [] }))"] [endpoints=http://advanced-tidb-pd:2379]
advanced-tidb-tikv-0 [2022/11/13 02:33:04.613 +00:00] [INFO] [util.rs:575] ["connecting to PD endpoint"] [endpoints=http://advanced-tidb-pd:2379]
advanced-tidb-tikv-0 [2022/11/13 02:33:06.615 +00:00] [INFO] [util.rs:537] ["PD failed to respond"] [err="Grpc(RpcFailure(RpcStatus { code: 4-DEADLINE_EXCEEDED, message: \"Deadline Exceeded\", details: [] }))"] [endpoints=http://advanced-tidb-pd:2379]
advanced-tidb-tikv-0 [2022/11/13 02:33:06.915 +00:00] [INFO] [util.rs:575] ["connecting to PD endpoint"] [endpoints=http://advanced-tidb-pd:2379]
advanced-tidb-tikv-0 [2022/11/13 02:33:08.917 +00:00] [INFO] [util.rs:537] ["PD failed to respond"] [err="Grpc(RpcFailure(RpcStatus { code: 4-DEADLINE_EXCEEDED, message: \"Deadline Exceeded\", details: [] }))"] [endpoints=http://advanced-tidb-pd:2379]
advanced-tidb-tikv-0 [2022/11/13 02:33:09.217 +00:00] [INFO] [util.rs:575] ["connecting to PD endpoint"] [endpoints=http://advanced-tidb-pd:2379]
Not sure if this is the right place to comment. But I think this issue not only for tidb-operator but also for others.
I also face this issue running v5.4.2/v6.1.0 on my machine.
For deeper context:
Environment: CentOS 7
TiDB/TiKV/PD version: v5.4.2/v6.1.0
Architecture: PD pod (run by kind with image v5.4.2) and TiKV pods (run by kind with image v5.4.2)
[2022/11/14 16:03:06.108 +00:00] [INFO] [util.rs:575] ["connecting to PD endpoint"] [endpoints=pd0:2379]
[2022/11/14 16:03:08.109 +00:00] [INFO] [util.rs:537] ["PD failed to respond"] [err="Grpc(RpcFailure(RpcStatus { code: 4-DEADLINE_EXCEEDED, message: \"Deadline Exceeded\", details: [] }))"] [endpoints=pd0:2379]
[2022/11/14 16:03:08.410 +00:00] [INFO] [util.rs:575] ["connecting to PD endpoint"] [endpoints=pd0:2379]
[2022/11/14 16:03:10.411 +00:00] [INFO] [util.rs:537] ["PD failed to respond"] [err="Grpc(RpcFailure(RpcStatus { code: 4-DEADLINE_EXCEEDED, message: \"Deadline Exceeded\", details: [] }))"] [endpoints=pd0:2379]
[2022/11/14 16:03:10.712 +00:00] [INFO] [util.rs:575] ["connecting to PD endpoint"] [endpoints=pd0:2379]
[2022/11/14 16:03:12.713 +00:00] [INFO] [util.rs:537] ["PD failed to respond"] [err="Grpc(RpcFailure(RpcStatus { code: 4-DEADLINE_EXCEEDED, message: \"Deadline Exceeded\", details: [] }))"] [endpoints=pd0:2379]
[2022/11/14 16:03:13.014 +00:00] [INFO] [util.rs:575] ["connecting to PD endpoint"] [endpoints=pd0:2379]
[2022/11/14 16:03:15.015 +00:00] [INFO] [util.rs:537] ["PD failed to respond"] [err="Grpc(RpcFailure(RpcStatus { code: 4-DEADLINE_EXCEEDED, message: \"Deadline Exceeded\", details: [] }))"] [endpoints=pd0:2379]
[2022/11/14 16:03:15.316 +00:00] [INFO] [util.rs:575] ["connecting to PD endpoint"] [endpoints=pd0:2379]
[2022/11/14 16:03:17.317 +00:00] [INFO] [util.rs:537] ["PD failed to respond"] [err="Grpc(RpcFailure(RpcStatus { code: 4-DEADLINE_EXCEEDED, message: \"Deadline Exceeded\", details: [] }))"] [endpoints=pd0:2379]
PD config:
apiVersion: apps/v1
kind: Deployment
metadata:
name: pd0
spec:
selector:
matchLabels:
name: pd0
template:
metadata:
labels:
name: pd0
spec:
containers:
- name: pd0
image: pingcap/pd:v5.4.2
args:
- --name=pd0
- --client-urls=http://0.0.0.0:2379
- --peer-urls=http://0.0.0.0:2380
- --advertise-client-urls=http://pd0:2379
- --advertise-peer-urls=http://pd0:2380
- --initial-cluster=pd0=http://pd0:2380
- --data-dir=/data/pd0
- --log-file=/logs/pd0.log
ports:
- containerPort: 2379
TiKV config:
apiVersion: apps/v1
kind: Deployment
metadata:
name: tikv0
spec:
selector:
matchLabels:
name: tikv0
template:
metadata:
labels:
name: tikv0
spec:
containers:
- name: tikv0
image: pingcap/tikv:v5.4.2
args:
- --addr=0.0.0.0:20160
- --advertise-addr=tikv0:20160
- --data-dir=/data/tikv0
- --pd=pd0:2379
- --log-file=/logs/tikv0.log
ports:
- containerPort: 20160
The strangest thing here is, on the same machine, when I change from using v5.4.2 to v3.0.12. The connection go through and my application worked well.
@mikechengwei @DanielZhangQD PTAL.