etcd - All grpc_code for grpc_method "Watch" is "Unavailable"

Open Reamer opened this issue 6 years ago • 22 comments

Hi, I noticed that every grpc_code for grpc_method "Watch" is "Unavailable" in my OKD cluster. I plan to monitor the etcd instances with the default Prometheus alerts from the etcd project. Maybe the watch connection is not closed correctly and runs into a timeout.

Version
Client Version: 4.7.18
Server Version: 4.7.0-0.okd-2021-08-22-163618
Kubernetes Version: v1.20.0-1093+4593a24e8fd58d-dirty
Steps To Reproduce
  1. Install OKD 4.7.
  2. Switch to the etcd project: oc project openshift-etcd
  3. Log in to the first etcd member: oc rsh etcd-master1.mycompany.com
  4. Fetch the metrics: curl -s --cacert "/etc/kubernetes/static-pod-certs/configmaps/etcd-serving-ca/ca-bundle.crt" --cert "/etc/kubernetes/static-pod-certs/secrets/etcd-all-peer/etcd-peer-master1.mycompany.com.crt" --key "/etc/kubernetes/static-pod-certs/secrets/etcd-all-peer/etcd-peer-master1.mycompany.com.key" https://localhost:2379/metrics
Current Result
grpc_server_handled_total{grpc_code="Unavailable",grpc_method="Watch",grpc_service="etcdserverpb.Watch",grpc_type="bidi_stream"} 1434
Expected Result
grpc_server_handled_total{grpc_code="OK",grpc_method="Watch",grpc_service="etcdserverpb.Watch",grpc_type="bidi_stream"} 1434

Additional Information

If this behavior is already fixed or is a false positive, let me know.
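
To check how the Watch counters are distributed across status codes, the output of the curl command above can be summarized with a small script. This is only a sketch: the function name watch_codes is hypothetical, the second sample line (Range) is invented for illustration, and the parser handles just the simple "name{labels} value" lines of the Prometheus text format.

```python
import re
from collections import defaultdict

# One sample line of the Prometheus text format: name{labels} value [timestamp]
SAMPLE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)\{(?P<labels>.*)\}\s+(?P<value>\S+)(?:\s+-?\d+)?$'
)
# One label pair inside the braces: key="value"
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def watch_codes(metrics_text):
    """Sum grpc_server_handled_total per grpc_code for grpc_method "Watch"."""
    totals = defaultdict(float)
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None or match.group("name") != "grpc_server_handled_total":
            continue
        labels = dict(LABEL_RE.findall(match.group("labels")))
        if labels.get("grpc_method") == "Watch":
            totals[labels.get("grpc_code", "")] += float(match.group("value"))
    return dict(totals)


# First line is taken from the report above; the second is made up.
sample = (
    'grpc_server_handled_total{grpc_code="Unavailable",grpc_method="Watch",'
    'grpc_service="etcdserverpb.Watch",grpc_type="bidi_stream"} 1434\n'
    'grpc_server_handled_total{grpc_code="OK",grpc_method="Range",'
    'grpc_service="etcdserverpb.KV",grpc_type="unary"} 10\n'
)
print(watch_codes(sample))
```

If the only key in the result is "Unavailable", every finished Watch stream was counted as failed, which matches the behavior described here.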

Reamer avatar Jul 13 '18 13:07 Reamer

@openshift/sig-master

jwforres avatar Jul 25 '18 19:07 jwforres

Still present with 3.10

oc v3.10.0+0c4577e-1
kubernetes v1.10.0+b81c8f8
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://s-cp-lb-01.cloud.example.de:443
openshift v3.10.0+7eee6f8-2
kubernetes v1.10.0+b81c8f8

Reamer avatar Aug 09 '18 09:08 Reamer

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale. Stale issues rot after an additional 30d of inactivity and eventually close. Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

openshift-bot avatar Nov 07 '18 10:11 openshift-bot

+1 on this. We've disabled this alert on our setup because it's just flapping and not indicating any failures.

vsliouniaev avatar Nov 07 '18 11:11 vsliouniaev

/remove-lifecycle stale

Reamer avatar Nov 07 '18 13:11 Reamer

+1 on this. I also see it on an etcd cluster master node after adding etcd3_alert.rules.

image

The alert cycles every five minutes, but we can't find anything wrong with k8s.

gaopeiliang avatar Nov 13 '18 03:11 gaopeiliang

/remove-lifecycle stale

gaopeiliang avatar Nov 13 '18 03:11 gaopeiliang

+1. I ran etcd with the debug log level and found this error:

etcdserver/api/v3rpc: failed to receive watch request from gRPC stream ("rpc error: code = Unavailable desc = stream error: stream ID 71; CANCEL")

The error occurs about once every 5 minutes, and the stream ID is different each time.

Seen with etcd 3.2.24 / 3.2.25 / 3.3.10, monitored with Prometheus (which is where I get this alert).

Any updates?
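
For anyone who wants to count these log lines, here is a rough sketch that pulls the gRPC status out of the message quoted above. The helper names are hypothetical. Treating a trailing "CANCEL" as a sign that the stream was reset by the peer rather than failing on the server is an assumption based on the HTTP/2 CANCEL error code, and nothing in this thread confirms it.

```python
import re

# Extracts "code = X desc = Y" from the etcd debug log message quoted above.
STATUS_RE = re.compile(r'rpc error: code = (\w+) desc = (.*)')


def parse_status(log_line):
    """Return (code, description) from a log line, or None if absent."""
    match = STATUS_RE.search(log_line)
    if match is None:
        return None
    # Drop the closing quote and parenthesis that wrap the quoted error.
    description = match.group(2).rstrip().rstrip('")')
    return match.group(1), description


def looks_like_stream_cancel(log_line):
    """Heuristic (assumption): Unavailable status whose description ends in CANCEL."""
    status = parse_status(log_line)
    return (
        status is not None
        and status[0] == "Unavailable"
        and status[1].endswith("CANCEL")
    )


line = (
    'etcdserver/api/v3rpc: failed to receive watch request from gRPC stream '
    '("rpc error: code = Unavailable desc = stream error: stream ID 71; CANCEL")'
)
print(parse_status(line))
print(looks_like_stream_cancel(line))
```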

arslanbekov avatar Nov 29 '18 12:11 arslanbekov

+1, etcd 3.3.10 with Prometheus Operator on Kubernetes 1.11.5

I have 5 nodes, but only one node has the alert; the others seem fine.

The etcd cluster runs well without issues.

image

judexzhu avatar Dec 19 '18 00:12 judexzhu

image

zqyangchn avatar Feb 16 '19 12:02 zqyangchn

/remove-lifecycle stale

zqyangchn avatar Feb 16 '19 12:02 zqyangchn

/lifecycle stale

openshift-bot avatar May 17 '19 12:05 openshift-bot

Still reproducible on Origin 3.11

Reamer avatar May 20 '19 07:05 Reamer

/remove-lifecycle stale

Reamer avatar May 20 '19 07:05 Reamer

Relates to https://github.com/openshift/cluster-monitoring-operator/pull/340 and https://github.com/etcd-io/etcd/issues/10289

Reamer avatar Jun 21 '19 12:06 Reamer

/lifecycle stale

openshift-bot avatar Sep 19 '19 13:09 openshift-bot

/remove-lifecycle stale

Still present

Reamer avatar Sep 19 '19 14:09 Reamer

/lifecycle stale

openshift-bot avatar Dec 18 '19 16:12 openshift-bot

/lifecycle frozen

/remove-lifecycle stale

Reamer avatar Dec 19 '19 08:12 Reamer

/assign

hexfusion avatar Jan 02 '20 17:01 hexfusion

Any news about this?

Joseph94m avatar Oct 06 '21 15:10 Joseph94m

I am currently using OKD 4.7 and this bug is still present. Prometheus query:

grpc_server_handled_total{grpc_code="Unavailable",grpc_service="etcdserverpb.Watch"}
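
As a possible workaround for the flapping alert mentioned earlier in the thread, a failure-ratio query can leave out the Watch method. The sketch below only builds the PromQL string and the request URL for the Prometheus HTTP API endpoint /api/v1/query. The function names, the 5m window and the host prometheus.example:9090 are assumptions, and the expression is not claimed to be the one used by the etcd project's alert rules. Note that it also hides genuine Watch failures.

```python
from urllib.parse import urlencode


def failed_ratio_query(excluded_methods=("Watch",), window="5m"):
    """Build a PromQL ratio of non-OK gRPC calls, skipping the given methods."""
    method_filter = 'grpc_method!~"{}"'.format("|".join(excluded_methods))
    failed = 'sum(rate(grpc_server_handled_total{{grpc_code!="OK",{}}}[{}]))'.format(
        method_filter, window
    )
    total = 'sum(rate(grpc_server_handled_total{{{}}}[{}]))'.format(
        method_filter, window
    )
    return "{} / {}".format(failed, total)


def query_url(base_url, query):
    """Build the instant-query URL; base_url is a placeholder for your Prometheus."""
    return base_url.rstrip("/") + "/api/v1/query?" + urlencode({"query": query})


query = failed_ratio_query()
print(query)
print(query_url("http://prometheus.example:9090", query))
```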

Reamer avatar Oct 07 '21 07:10 Reamer