CSMO-ICP
ICPetcdHighNumberOfFailedGRPCRequests triggers meaningless alert
Hi, all,
I'm from the IBM Cloud Private team. In our environment, the alert ICPetcdHighNumberOfFailedGRPCRequests fires frequently (every 5 minutes).
In my investigation, I found that the alert is triggered by a normal operation.
Every time etcdctl lease keep-alive $lease interacts with the etcd cluster, it produces the log entry below, which then triggers the ICPetcdHighNumberOfFailedGRPCRequests alert.
{"log":"2019-07-05 08:20:02.426997 D | etcdserver/api/v3rpc: failed to receive lease keepalive request from gRPC stream (\"rpc error: code = Unavailable desc = client disconnected\")\n","stream":"stderr","time":"2019-07-05T08:20:02.427136464Z"}
It seems the alert rule is not meaningful when grpc_code="Unavailable" or grpc_method="LeaseKeepAlive", so we would like to change https://github.com/ibm-cloud-architecture/CSMO-ICP/blob/master/prometheus/alerts_icp_2.1.0.2-3.1.1/alert-rules-icp311.yaml#L34 to the content below.
- alert: ICPetcdHighNumberOfFailedGRPCRequests
  annotations:
    message: 'etcd cluster "{{ $labels.job }}": {{ $value }}% of requests for {{
      $labels.grpc_method }} failed on etcd instance {{ $labels.instance }}.'
  expr: |
    100 * sum(rate(grpc_server_handled_total{job=~".*etcd.*", grpc_code!="OK", grpc_code!="Unavailable", grpc_method!="LeaseKeepAlive"}[5m])) BY (job, instance, grpc_service, grpc_method)
      /
    sum(rate(grpc_server_handled_total{job=~".*etcd.*"}[5m])) BY (job, instance, grpc_service, grpc_method)
      > 1
  for: 10m
  labels:
    severity: warning
I am submitting this issue to ask for your opinion. We hope to make this change to avoid meaningless alerts.
Someone hit a similar issue in: https://github.com/openshift/cluster-monitoring-operator/issues/248
@RayStoner @RobertJBarron @rafal-szypulka can you help on this?
@haoqing0110 it looks like the source of the problem is a still-unresolved etcd bug: https://github.com/etcd-io/etcd/issues/10289. The problem exists not only in ICP but also in OpenShift: https://github.com/etcd-io/etcd/pull/10629
In my opinion, this alert should be disabled until the bug is resolved in etcd. Another option would be to filter with grpc_code!="Unavailable", grpc_method!="LeaseKeepAlive" as you did, but I am not completely sure it would give us meaningful results. I would just disable this alert rule for now.
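One reason the filtered rule may be less meaningful: label matchers inside a single selector are ANDed, so grpc_code!="Unavailable", grpc_method!="LeaseKeepAlive" drops every Unavailable failure on any method and every LeaseKeepAlive failure with any code, not just the noisy combination. If the filter route were taken anyway, a narrower variant might exclude only the series that carry both labels. The following is a minimal sketch under that assumption, using the PromQL `unless` operator; it is a hypothetical rewrite, not the rule shipped in the repository, and it has not been run against a live cluster.

```yaml
# Hypothetical narrower variant (assumption, untested): only series with BOTH
# grpc_code="Unavailable" AND grpc_method="LeaseKeepAlive" are removed from the
# numerator; every other failure still counts toward the alert.
- alert: ICPetcdHighNumberOfFailedGRPCRequests
  annotations:
    message: 'etcd cluster "{{ $labels.job }}": {{ $value }}% of requests for {{ $labels.grpc_method }} failed on etcd instance {{ $labels.instance }}.'
  expr: |
    100 * sum(
      rate(grpc_server_handled_total{job=~".*etcd.*", grpc_code!="OK"}[5m])
      unless
      rate(grpc_server_handled_total{job=~".*etcd.*", grpc_code="Unavailable", grpc_method="LeaseKeepAlive"}[5m])
    ) BY (job, instance, grpc_service, grpc_method)
      /
    sum(rate(grpc_server_handled_total{job=~".*etcd.*"}[5m])) BY (job, instance, grpc_service, grpc_method)
      > 1
  for: 10m
  labels:
    severity: warning
```

With this form, a LeaseKeepAlive stream that fails with some other code would still be counted, while the known-noisy "client disconnected" case would not. It would also hide genuine Unavailable errors on LeaseKeepAlive, so it remains a trade-off rather than a fix.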
@rafal-szypulka Thanks! Disabling it works for our case.