origin etcd - All grpc_code for grpc_method "Watch" is "Unavailable"

Hi, I noticed that every grpc_code for grpc_method "Watch" is "Unavailable" in my okd cluster. My plan is to monitor the etcd instances with the default Prometheus alerts from the etcd project. Maybe the watch connection is not closed correctly and runs into a timeout.

Version

Client Version: 4.7.18
Server Version: 4.7.0-0.okd-2021-08-22-163618
Kubernetes Version: v1.20.0-1093+4593a24e8fd58d-dirty

Steps To Reproduce

1. Install okd 4.7.
2. Switch to the etcd project: oc project openshift-etcd
3. Log in to the first etcd member: oc rsh etcd-master1.mycompany.com
4. Fetch the metrics: curl -s --cacert "/etc/kubernetes/static-pod-certs/configmaps/etcd-serving-ca/ca-bundle.crt" --cert "/etc/kubernetes/static-pod-certs/secrets/etcd-all-peer/etcd-peer-master1.mycompany.com.crt" --key "/etc/kubernetes/static-pod-certs/secrets/etcd-all-peer/etcd-peer-master1.mycompany.com.key" https://localhost:2379/metrics

Current Result

grpc_server_handled_total{grpc_code="Unavailable",grpc_method="Watch",grpc_service="etcdserverpb.Watch",grpc_type="bidi_stream"} 1434

Expected Result

grpc_server_handled_total{grpc_code="OK",grpc_method="Watch",grpc_service="etcdserverpb.Watch",grpc_type="bidi_stream"} 1434
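To see at a glance which codes each method reports, the /metrics output from the curl command in the steps above can be tallied per method and code. This is only a minimal Python sketch, assuming the plain Prometheus text format without trailing timestamps. The function name is made up for illustration and is not part of etcd or Prometheus.

```python
import re
from collections import defaultdict

# One sample line of the text exposition format: name{label="value",...} number
# Assumption: no trailing timestamp after the value.
SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)\{(?P<labels>.*)\}\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def handled_by_method_and_code(metrics_text):
    """Sum grpc_server_handled_total per (grpc_method, grpc_code)."""
    totals = defaultdict(float)
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if not match or match.group("name") != "grpc_server_handled_total":
            continue  # some other metric, or a line without labels
        labels = dict(LABEL_RE.findall(match.group("labels")))
        key = (labels.get("grpc_method"), labels.get("grpc_code"))
        totals[key] += float(match.group("value"))
    return dict(totals)
```

Saving the curl output to a file and passing its contents to handled_by_method_and_code() should show whether "Watch" ever appears with a code other than "Unavailable".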

Additional Information

If that behavior is already fixed or it's a false positive, let me know.

Jul 13 '18 13:07 Reamer

@openshift/sig-master

Jul 25 '18 19:07 jwforres

Still present with 3.10

oc v3.10.0+0c4577e-1
kubernetes v1.10.0+b81c8f8
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://s-cp-lb-01.cloud.example.de:443
openshift v3.10.0+7eee6f8-2
kubernetes v1.10.0+b81c8f8

Aug 09 '18 09:08 Reamer

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

Nov 07 '18 10:11 openshift-bot

+1 on this. We've disabled this alert on our setup because it's just flapping and not indicating any failures.

Nov 07 '18 11:11 vsliouniaev

/remove-lifecycle stale

Nov 07 '18 13:11 Reamer

+1 on this. I also see it on an etcd cluster master node after adding etcd3_alert.rules.

The alert cycles about every five minutes, but we can't find anything wrong with k8s.

Nov 13 '18 03:11 gaopeiliang

/remove-lifecycle stale

Nov 13 '18 03:11 gaopeiliang

+1. I ran etcd with debug log level and found this error:

etcdserver/api/v3rpc: failed to receive watch request from gRPC stream ("rpc error: code = Unavailable desc = stream error: stream ID 71; CANCEL")

The error occurs about once every 5 minutes, and the stream ID is unique each time.

Seen on etcd 3.2.24 / 3.2.25 / 3.3.10, monitored with Prometheus (that is where I get this alert).
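One way to see why a single cancelled stream every few minutes can trip an alert is to compute the share of non-OK completions per method between two counter snapshots. The sketch below is only an illustration in Python. It assumes the alert compares non-OK completions to all completions per method; the alert's exact expression is not quoted in this thread. The function name and the sample numbers in the usage note are made up.

```python
def non_ok_share(start, end):
    """Return {grpc_method: share of non-OK completions} between two snapshots.

    `start` and `end` map (grpc_method, grpc_code) to counter values taken
    at the beginning and at the end of the window.
    """
    total = {}
    failed = {}
    for (method, code), end_value in end.items():
        delta = end_value - start.get((method, code), 0.0)
        if delta < 0:
            delta = end_value  # counter reset (e.g. a restart): count from zero
        total[method] = total.get(method, 0.0) + delta
        if code != "OK":
            failed[method] = failed.get(method, 0.0) + delta
    # Methods with no completions in the window are left out.
    return {m: failed.get(m, 0.0) / t for m, t in total.items() if t > 0}
```

With hypothetical snapshots where "Watch" goes from 1433 to 1434 "Unavailable" completions and never reports "OK", the share for "Watch" is 1.0 even though only one stream ended in the window. A ratio-based alert would treat that as 100% failures.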

Any updates?

Nov 29 '18 12:11 arslanbekov

+1, etcd 3.3.10 with Prometheus Operator on Kubernetes 1.11.5.

I have 5 nodes, but only one node has the alert; the others seem fine.

The etcd cluster runs well without issue.

Dec 19 '18 00:12 judexzhu

Feb 16 '19 12:02 zqyangchn

/remove-lifecycle stale

Feb 16 '19 12:02 zqyangchn

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

May 17 '19 12:05 openshift-bot

Still reproducible on Origin 3.11

May 20 '19 07:05 Reamer

/remove-lifecycle stale

May 20 '19 07:05 Reamer

Relates to https://github.com/openshift/cluster-monitoring-operator/pull/340 and https://github.com/etcd-io/etcd/issues/10289

Jun 21 '19 12:06 Reamer

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

Sep 19 '19 13:09 openshift-bot

/remove-lifecycle stale Still present

Sep 19 '19 14:09 Reamer

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

Dec 18 '19 16:12 openshift-bot

/lifecycle frozen /remove-lifecycle stale

Dec 19 '19 08:12 Reamer

/assign

Jan 02 '20 17:01 hexfusion

Any news about this?

Oct 06 '21 15:10 Joseph94m

At the moment I am using okd 4.7 and this bug is still present. Prometheus query:

grpc_server_handled_total{grpc_code="Unavailable",grpc_service="etcdserverpb.Watch"}
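For anyone who wants to check this from a script, the same query can be sent to the instant-query endpoint of the Prometheus HTTP API (/api/v1/query). A minimal Python sketch follows. The base URL is a placeholder, the helper names are made up, and authentication (which the cluster's Prometheus may require) is left out.

```python
import json
import urllib.parse
import urllib.request

QUERY = 'grpc_server_handled_total{grpc_code="Unavailable",grpc_service="etcdserverpb.Watch"}'


def build_query_url(base_url, query):
    """Build an instant-query URL for the Prometheus HTTP API."""
    params = urllib.parse.urlencode({"query": query})
    return base_url.rstrip("/") + "/api/v1/query?" + params


def parse_vector(body):
    """Extract (labels, value) pairs from an instant-query JSON response."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise RuntimeError("query failed: %s" % payload.get("error"))
    return [
        (item["metric"], float(item["value"][1]))  # value is [timestamp, "number"]
        for item in payload["data"]["result"]
    ]


def run_query(base_url, query=QUERY):
    """Run the query; base_url is a placeholder for your Prometheus address."""
    with urllib.request.urlopen(build_query_url(base_url, query)) as response:
        return parse_vector(response.read().decode("utf-8"))
```

Each returned pair holds the label set of one series and its current counter value, so every etcd member shows up as a separate entry.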

Oct 07 '21 07:10 Reamer
