
create/add prometheus rules that monitor operator problems

Open Morriz opened this issue 5 years ago • 2 comments

So far, when an operator fails we are not notified. Examples:

  • istio operator: when a resource already exists, or the config is invalid, it fails to reconcile
  • ... ?

I think we need to hunt for Prometheus rules that monitor such errors. We could ultimately create these rules ourselves, but I would prefer a community solution.
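
If we do end up writing our own, one starting point could be a rule on reconcile errors. This is only a minimal sketch and it assumes the operators expose the standard controller-runtime metrics (not verified for the istio operator):

  - alert: OperatorReconcileErrorsHigh
    annotations:
      summary: 'Operator reconcile errors detected'
      description: 'Controller {{ $labels.controller }} reported reconcile errors during the last 10m. The operator may be failing to apply its configuration.'
    expr: >
      # assumes controller-runtime metrics are scraped; label set may differ per operator
      sum(increase(controller_runtime_reconcile_errors_total[10m])) by (controller) > 0
    for: 10m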

Morriz avatar Jul 20 '20 15:07 Morriz

Some stuff found on the world wild web (pun intended):

  - alert: IstioPilotAvailabilityDrop
    annotations:
      summary: 'Istio Pilot Availability Drop'
      description: 'Pilot pods have dropped during the last 5m (current value: *{{ printf "%2.0f%%" $value }}*). Envoy sidecars might have outdated configuration'
    expr: >
      avg(avg_over_time(up{job="pilot"}[1m])) < 0.5
    for: 5m

  - alert: IstioMixerTelemetryAvailabilityDrop
    annotations:
      summary: 'Istio Mixer Telemetry Drop'
      description: 'Mixer pods have dropped during the last 5m (current value: *{{ printf "%2.0f%%" $value }}*). Istio metrics will not work correctly'
    expr: >
      avg(avg_over_time(up{job="mixer", service="istio-telemetry", endpoint="http-monitoring"}[5m])) < 0.5
    for: 5m

  - alert: IstioGalleyAvailabilityDrop
    annotations:
      summary: 'Istio Galley Availability Drop'
      description: 'Galley pods have dropped during the last 5m (current value: *{{ printf "%2.0f%%" $value }}*). Istio config ingestion and processing will not work'
    expr: >
      avg(avg_over_time(up{job="galley"}[5m])) < 0.5
    for: 5m

  - alert: IstioGatewayAvailabilityDrop
    annotations:
      summary: 'Istio Gateway Availability Drop'
      description: 'Gateway pods have dropped during the last 5m (current value: *{{ printf "%2.0f%%" $value }}*). Inbound traffic will likely be affected'
    expr: >
      min(kube_deployment_status_replicas_available{deployment="istio-ingressgateway", namespace="istio-system"}) without (instance, pod) < 2
    for: 5m

  - alert: IstioPilotPushErrorsHigh
    annotations:
      summary: 'Number of Istio Pilot push errors is too high'
      description: 'Pilot has too many push errors during the last 5m (current value: *{{ printf "%2.0f%%" $value }}*). Envoy sidecars might have outdated configuration'
    expr: >
      sum(irate(pilot_xds_push_errors{job="pilot"}[5m])) / sum(irate(pilot_xds_pushes{job="pilot"}[5m])) > 0.05
    for: 5m

  - alert: IstioMixerPrometheusDispatchesLow
    annotations:
      summary: 'Number of Mixer dispatches to Prometheus is too low'
      description: 'Mixer dispatches to Prometheus have dropped below normal levels during the last 5m (current value: *{{ printf "%2.0f%%" $value }}*). Istio metrics might not be exported properly'
    expr: >
      sum(irate(mixer_runtime_dispatches_total{adapter=~"prometheus"}[5m])) < 180
    for: 5m

  - alert: IstioGlobalRequestRateHigh
    annotations:
      summary: 'Istio Global Request Rate High'
      description: 'Istio global request rate is unusually high during the last 5m (current value: *{{ printf "%2.0f%%" $value }}*). The amount of traffic being generated inside the service mesh is higher than normal'
    expr: >
      round(sum(irate(istio_requests_total{reporter="destination"}[5m])), 0.001) > 1200
    for: 5m

  - alert: IstioGlobalRequestRateLow
    annotations:
      summary: 'Istio global request rate too low'
      description: 'Istio global request rate is unusually low during the last 5m (current value: *{{ printf "%2.0f%%" $value }}*). The amount of traffic being generated inside the service mesh has dropped below usual levels'
    expr: >
      round(sum(irate(istio_requests_total{reporter="destination"}[5m])), 0.001) < 300
    for: 5m

  - alert: IstioGlobalHTTP5xxRateHigh
    annotations:
      summary: 'Istio Percentage of HTTP 5xx responses is too high'
      description: 'Istio global HTTP 5xx rate is too high in the last 5m (current value: *{{ printf "%2.0f%%" $value }}*). The rate of HTTP 5xx errors within the service mesh is unusually high'
    expr: >
       sum(irate(istio_requests_total{reporter="destination", response_code=~"5.*"}[5m])) / sum(irate(istio_requests_total{reporter="destination"}[5m])) > 0.01
    for: 5m

  - alert: IstioGatewayOutgoingSuccessLow
    annotations:
      summary: 'Istio Gateway outgoing success rate is too low'
      description: 'Istio Gateway success rate to outbound destinations is too low in the last 5m (current value: *{{ printf "%2.0f%%" $value }}*). Inbound traffic may be affected'
    expr: >
      sum(irate(istio_requests_total{reporter="source", source_workload="istio-ingressgateway",source_workload_namespace="istio-system", connection_security_policy!="mutual_tls",response_code!~"5.*"}[5m])) /  sum(irate(istio_requests_total{reporter="source", source_workload="istio-ingressgateway",source_workload_namespace="istio-system", connection_security_policy!="mutual_tls"}[5m])) < 0.995
    for: 5m

source
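
If we adopt rules like these in otomi-core, with kube-prometheus-stack they could be shipped as a PrometheusRule resource so the prometheus-operator picks them up. A rough sketch; the namespace and the `release` label are assumptions that depend on how the operator's ruleSelector is configured:

  apiVersion: monitoring.coreos.com/v1
  kind: PrometheusRule
  metadata:
    name: istio-operator-alerts
    namespace: monitoring
    labels:
      release: prometheus   # assumed ruleSelector label; depends on the Helm release name
  spec:
    groups:
      - name: istio.rules
        rules:
          - alert: IstioPilotAvailabilityDrop
            annotations:
              summary: 'Istio Pilot Availability Drop'
              description: 'Pilot pods have dropped during the last 5m. Envoy sidecars might have outdated configuration'
            expr: avg(avg_over_time(up{job="pilot"}[1m])) < 0.5
            for: 5m
          # ... remaining alerts from the list above go here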

martijncalker avatar Sep 10 '20 08:09 martijncalker

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Dec 26 '20 17:12 stale[bot]