
LeaseKeepAlive Unavailable

Open · Joseph94m opened this issue 3 years ago · 3 comments

Introduction (tl;dr at the bottom)

Hello, I am unsure whether this is an etcd bug, a gRPC bug, or something else entirely, so I'll try to be as explicit as possible.

As you may know, there have been a few reports/issues related to grpc_code="Unavailable",grpc_method="Watch" (#12196) and the interpretation of such metrics by Alertmanager (#13127). It has also been discussed in other issues and even in other GitHub repositories.

I come to you today with either a similar problem or a misconfiguration: the pair grpc_code="Unavailable",grpc_method="LeaseKeepAlive" shows up a lot in the etcd metrics.


Environments:

The etcds run outside of the Kubernetes clusters.

  • Env1: 1 etcd version 3.3.15, k8s 1.18.20 ,runtime docker 18.06.3-ce
  • Env2: 3 etcd version 3.4.3, k8s 1.18.20, runtime docker 18.06.3-ce
  • Env3: 5 etcd version 3.4.3, k8s 1.17.17, runtime docker 18.06.3-ce

The etcd configuration is pretty similar in each env:

[Screenshot: etcd configuration flags]

We've had the fabled Watch Unavailable for a while now, but it hasn't bothered me much since I knew it was being looked at, so I've sort of been ignoring the following alert for that error:

- alert: EtcdHighNumberOfFailedGrpcRequests
  expr: (sum(rate(grpc_server_handled_total{grpc_code=~"Unknown|FailedPrecondition|ResourceExhausted|Internal|Unavailable|DataLoss|DeadlineExceeded"}[5m])) without (grpc_type, grpc_code) / sum(rate(grpc_server_handled_total{}[5m])) without (grpc_type, grpc_code)) > 0.01
  for: 10m
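
As an aside, a quick way to see which method/code pairs are actually driving that ratio is to pull the raw counters straight from the etcd metrics endpoint. This is only a sketch: the listen address and the client certificate paths below are assumptions, adjust them to your own configuration.

```sh
# Dump the per-method handled counters that feed the alert above and keep only the
# Unavailable ones (endpoint and certificate paths are placeholders for this sketch).
curl -s --cacert /etc/etcd/pki/ca.crt \
        --cert   /etc/etcd/pki/client.crt \
        --key    /etc/etcd/pki/client.key \
        https://127.0.0.1:2379/metrics \
  | grep '^grpc_server_handled_total' \
  | grep 'grpc_code="Unavailable"'
```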

Issue

Symptoms of the issue

Recently, however, I started receiving EtcdHighNumberOfFailedGrpcRequests alerts triggered by LeaseKeepAlive coupled with the Unavailable code, on all 3 environments.

Example from Env2 (but it's the same on Env1 and Env3):

[Screenshot: the firing alerts in Env2]

In-depth look

The metrics generated by the etcds confirm this:

[Screenshot: grpc_server_* metrics for LeaseKeepAlive]

How come the Unavailable code is at 71 while received/sent are at 70 and started is at 71? Either I do not understand these metrics or the numbers don't add up.
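
For context, these are the grpc_server_started_total, grpc_server_handled_total, grpc_server_msg_received_total and grpc_server_msg_sent_total counters that etcd exposes. A minimal sketch for pulling them side by side, assuming a plain HTTP metrics endpoint on localhost (add TLS options if your client URL requires them):

```sh
# Show the stream lifecycle counters for LeaseKeepAlive next to each other
# (the endpoint is an assumption; adjust the address and add certificates as needed).
curl -s http://127.0.0.1:2379/metrics \
  | grep -E 'grpc_server_(started|handled|msg_received|msg_sent)_total' \
  | grep 'grpc_method="LeaseKeepAlive"'
```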

I added --log-level=debug to one of the etcds to see what is actually happening, and I was able to spot multiple requests marked with "failure":

[Screenshot: debug log lines containing "failure"]

As you can see, there are some failures related to the lease resource (and other things, but I am not sure this is the right venue to discuss them now :D).
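
For completeness, enabling that debug output is just a matter of adding the flag (or the equivalent environment variable) to however the etcd process is launched; the snippet below is only a sketch, with all of the existing flags left implicit.

```sh
# Either pass the flag on the existing command line...
etcd --log-level=debug            # ...plus the existing flags from the configuration screenshot
# ...or, when etcd is configured through environment variables:
export ETCD_LOG_LEVEL=debug
```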

Attempts at fixing the issue

I researched the issue a bit online and decided to try a newer etcd image, since there have been multiple fixes since then.

I upgraded Env1 from etcd 3.3.15 to 3.5.1 as this is my test environment.

As expected, grpc_code="Unavailable",grpc_method="Watch" disappeared, but I am still seeing grpc_code="Unavailable",grpc_method="LeaseKeepAlive" at the same rate as before the upgrade.
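
The before/after comparison was done with a rate query along these lines. This is only a sketch against the Prometheus HTTP API; the Prometheus address is an assumption, and any PromQL console works just as well.

```sh
# Per-instance rate of Unavailable LeaseKeepAlive responses over the last 5 minutes
# (the Prometheus address below is a placeholder).
curl -sG http://prometheus.example.internal:9090/api/v1/query \
  --data-urlencode 'query=sum by (instance) (rate(grpc_server_handled_total{grpc_method="LeaseKeepAlive",grpc_code="Unavailable"}[5m]))'
```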

Confusion

The Lease resources in Kubernetes look correct, as do the kube-scheduler, kube-apiserver, and kube-controller-manager, and, well... the whole cluster seems healthy.

[Screenshot: Lease resources]

Not expired: [screenshot]

All nodes ready:

[Screenshot: all nodes Ready]
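
For reference, the checks behind these screenshots were just the standard kubectl ones; a sketch, with namespaces and output flags as they happen to be in my clusters:

```sh
# Lease objects across all namespaces (node heartbeats live in kube-node-lease;
# control-plane leader-election leases are typically in kube-system).
kubectl get leases -A

# Control-plane component health (componentstatuses is deprecated in newer releases
# but still available on 1.17/1.18).
kubectl get componentstatuses

# All nodes should report Ready.
kubectl get nodes -o wide
```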

tl;dr

grpc_code="Unavailable",grpc_method="LeaseKeepAlive" is popping up a lot on the cluster. Don't know why; not fixed by upgrading etcd; doesn't seem to have an impact on the cluster.

Joseph94m · Jan 21 '22 17:01

bump?

Joseph94m · Mar 23 '22 09:03

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

stale[bot] · Jul 10 '22 13:07

any work on this?

Alvaro-Campesino · Jul 27 '22 06:07

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

stale[bot] · Oct 29 '22 04:10

It's unfortunate that nobody even replied. Still happening (v3.5.8) 😞

liry · Aug 30 '23 19:08

Contributions are welcome!

serathius · Aug 31 '23 10:08

> It's unfortunate that nobody even replied. Still happening (v3.5.8) 😞

Could you provide a complete etcd log with the debug level (--log-level=debug)?

ahrtr · Aug 31 '23 15:08

> It's unfortunate that nobody even replied. Still happening (v3.5.8) 😞

> Could you provide a complete etcd log with the debug level (--log-level=debug)?

Hello, we are facing the same issue, so I'll answer instead (also v3.5.8).

I'm not sure the failures mentioned in this issue are what results in the Unavailable gRPC code. Running with GODEBUG=http2debug=1 reveals:

{"level":"debug","ts":"2023-10-31T15:29:28.638134Z","caller":"auth/store.go:1151","msg":"found command name","common-name":"<<redacted>>","user-name":"<<redacted>>","revision":88}
{"level":"debug","ts":"2023-10-31T15:29:28.638531Z","caller":"etcdserver/server.go:2142","msg":"Applying entries","num-entries":1}
{"level":"debug","ts":"2023-10-31T15:29:28.638547Z","caller":"etcdserver/server.go:2145","msg":"Applying entry","index":549010374,"term":31,"type":"EntryNormal"}
{"level":"debug","ts":"2023-10-31T15:29:28.638556Z","caller":"etcdserver/server.go:2204","msg":"apply entry normal","consistent-index":549010373,"entry-index":549010374,"should-applyV3":true}
{"level":"debug","ts":"2023-10-31T15:29:28.638569Z","caller":"etcdserver/server.go:2231","msg":"applyEntryNormal","raftReq":"header:<ID:16718641124967440555 username:\"<<redacted>>\" auth_revision:88 > txn:<compare:<target:MOD key:\"/registry/masterleases/10.95.235.36\" mod_revision:86363570 > success:<request_put:<key:\"/registry/masterleases/10.95.235.36\" value:\"k8s\\000\\n\\017\\n\\002v1\\022\\tEndpoints\\022)\\n\\023\\n\\000\\022\\000\\032\\000\\\"\\000*\\0002\\0008\\321\\002B\\000z\\000\\022\\022\\n\\020\\n\\01410.95.235.36\\032\\000\\032\\000\\\"\\000\" lease:7495269088112664745 > > failure:<request_range:<key:\"/registry/masterleases/10.95.235.36\" > > > "}
{"level":"debug","ts":"2023-10-31T15:29:28.638775Z","caller":"v3rpc/interceptor.go:182","msg":"request stats","start time":"2023-10-31T15:29:28.638103Z","time spent":"642.116ยตs","remote":"10.95.235.36:58620","response type":"/etcdserverpb.KV/Txn","request count":1,"request size":119,"response count":0,"response size":43,"request content":"compare:<target:MOD key:\"/registry/masterleases/10.95.235.36\" mod_revision:86363570 > success:<request_put:<key:\"/registry/masterleases/10.95.235.36\" value_size:68 lease:7495269088112664745 >> failure:<request_range:<key:\"/registry/masterleases/10.95.235.36\" > >"}
2023/10/31 15:29:28 http2: server encoding header ":status" = "200"
2023/10/31 15:29:28 http2: server encoding header "access-control-allow-headers" = "accept, content-type, authorization"
2023/10/31 15:29:28 http2: server encoding header "access-control-allow-methods" = "POST, GET, OPTIONS, PUT, DELETE"
2023/10/31 15:29:28 http2: server encoding header "access-control-allow-origin" = "*"
2023/10/31 15:29:28 http2: server encoding header "content-type" = "application/grpc"
2023/10/31 15:29:28 http2: server encoding header "trailer" = "Grpc-Status"
2023/10/31 15:29:28 http2: server encoding header "trailer" = "Grpc-Message"
2023/10/31 15:29:28 http2: server encoding header "trailer" = "Grpc-Status-Details-Bin"
2023/10/31 15:29:28 http2: server encoding header "grpc-status" = "0"

Worse, there is no mention of "grpc-status" = "14" whatsoever in the logs, so it's difficult to pinpoint the reason with logs alone.

I tried enabling go-grpc messages as described in the go-grpc documentation, but they are not propagated to the etcd output.
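
Concretely, what was tried is along these lines; a sketch, with the launch command reduced to the debug flag only (the grpclog environment variables are the ones from the grpc-go logging documentation):

```sh
# grpc-go's own logger is controlled via these environment variables; combined here
# with the http2 frame debugging that produced the trace above.
GODEBUG=http2debug=1 \
GRPC_GO_LOG_SEVERITY_LEVEL=info \
GRPC_GO_LOG_VERBOSITY_LEVEL=99 \
etcd --log-level=debug            # plus the rest of the usual flags
```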

Any further guidance would be appreciated.

vladorf · Oct 31 '23 15:10

@tjungblu Have you ever seen a similar issue in OpenShift?

ahrtr · Oct 31 '23 16:10

Never seen this before.

tjungblu · Oct 31 '23 16:10

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

stale[bot] · Mar 17 '24 12:03