tidb-operator

unable to start TiKV due to DNS resolution

Open vanhtuan0409 opened this issue 2 years ago • 3 comments

Bug Report

What version of Kubernetes are you using?

1.27.3 under k3s distribution

What version of TiDB Operator are you using?

1.4.5

What storage classes exist in the Kubernetes cluster and what are used for PD/TiKV pods?

local-path

What's the status of the TiDB cluster pods?

Running

What did you do?

What did you expect to see?

TiKV successfully started and able to connect to PD server

What did you see instead?

TiKV was unable to start and produced the following logs. Although running curl directly within the TiKV pod succeeds, TiKV itself is unable to connect to the PD server.

[2023/07/05 04:21:36.755 +00:00] [INFO] [util.rs:598] ["connecting to PD endpoint"] [endpoints=http://tidb-pd:2379]
[2023/07/05 04:21:38.756 +00:00] [INFO] [util.rs:560] ["PD failed to respond"] [err="Grpc(RpcFailure(RpcStatus { code: 4-DEADLINE_EXCEEDED, message: \"Deadline Exceeded\", details: [] }))"] [endpoints=http://tidb-pd:2379]
[2023/07/05 04:21:39.057 +00:00] [INFO] [util.rs:598] ["connecting to PD endpoint"] [endpoints=http://tidb-pd:2379]
[2023/07/05 04:21:41.058 +00:00] [INFO] [util.rs:560] ["PD failed to respond"] [err="Grpc(RpcFailure(RpcStatus { code: 4-DEADLINE_EXCEEDED, message: \"Deadline Exceeded\", details: [] }))"] [endpoints=http://tidb-pd:2379]
[2023/07/05 04:21:41.359 +00:00] [INFO] [util.rs:598] ["connecting to PD endpoint"] [endpoints=http://tidb-pd:2379]

My hypothesis is that the single-label DNS name (tidb-pd) somehow could not be resolved. I tried editing the TiKV ConfigMap to change the endpoint to ${CLUSTER_NAME}-pd.${NAMESPACE}.svc:2379, after which TiKV connected successfully, but the change was later reverted by the operator.

Proposed fix at https://github.com/pingcap/tidb-operator/pull/5145

vanhtuan0409 avatar Jul 05 '23 04:07 vanhtuan0409

The tidb-cluster Helm chart is deprecated, and we recommend using the TidbCluster CRD now.

csuzhangxc avatar Jul 05 '23 06:07 csuzhangxc

The TidbCluster CRD also suffers from this issue. The operator creates a ConfigMap for the TiKV startup scripts. I am willing to contribute an update; could you point me to the snippet where the operator creates the startup scripts ConfigMap?

vanhtuan0409 avatar Jul 06 '23 04:07 vanhtuan0409

For the connectivity issue, could you add the following environment variable for TiKV?

env:
  - name: GRPC_DNS_RESOLVER
    value: native
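
For context, here is a minimal sketch of where this fragment could sit in a TidbCluster manifest. It assumes the cluster is named `tidb` (inferred from the `tidb-pd` endpoint in the logs above) and that the `env` list is placed under `spec.tikv`; all other required fields are omitted, so treat it as an illustration rather than a complete manifest.

```yaml
apiVersion: pingcap.com/v1alpha1
kind: TidbCluster
metadata:
  name: tidb   # assumption: inferred from the tidb-pd endpoint in the logs
spec:
  tikv:
    # other TiKV fields (image, replicas, storage requests, ...) omitted
    env:
      # ask the gRPC core library to use the system resolver for DNS lookups
      - name: GRPC_DNS_RESOLVER
        value: native
```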

Currently, we have two versions of StartScripts for CRDs:

https://github.com/pingcap/tidb-operator/blob/3279ab51394c0e18638b6c7b1da7ac5b5a67d5bd/pkg/manager/member/startscript/v1/template.go#L251

https://github.com/pingcap/tidb-operator/blob/3279ab51394c0e18638b6c7b1da7ac5b5a67d5bd/pkg/manager/member/startscript/v2/tikv_start_script.go#L92

But if we change the StartScript directly, then after TiDB Operator is upgraded the ConfigMap will be updated as well, and all existing clusters will be restarted.

csuzhangxc avatar Jul 06 '23 07:07 csuzhangxc