Document metrics/alerts required for multicluster

Open steeling opened this issue 1 year ago • 2 comments

What are new metrics or recommended alerts that we may need for multicluster?

steeling avatar Aug 09 '22 00:08 steeling

Some of these metrics should be provided by the MCS API implementation rather than by the multicluster service mesh, but I'm listing them all together here.

Metrics

In participating cluster

  • Number of ServiceImport and ServiceExport resources. This should equal the number of services imported and exported.
  • Rate of sync requests to the broker, grouped by status
  • Cross-cluster request metrics, grouped by service, source cluster, and destination cluster:
    • Request rate
    • Error rate
    • Latency
  • For each imported service:
    • Number of endpoint slices, by exporting cluster
    • Total number of endpoints
  • Gateway data transfer rate
  • Gateway request rate
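
As a rough illustration of the cross-cluster request metrics above, here is a minimal Python sketch that groups request records by service, source cluster, and destination cluster, then derives request rate, error rate, and latency for each group. The `Request` record, its field names, and the `summarize` function are hypothetical and are not part of OSM or any MCS API implementation. Mean latency is used only to keep the sketch short; a real setup would more likely export histograms.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    # Hypothetical record of one cross-cluster request.
    service: str
    source_cluster: str
    destination_cluster: str
    latency_ms: float
    ok: bool


def summarize(requests, window_seconds):
    """Group requests by (service, source, destination) and compute basic metrics."""
    groups = defaultdict(list)
    for r in requests:
        groups[(r.service, r.source_cluster, r.destination_cluster)].append(r)

    summary = {}
    for key, items in groups.items():
        errors = sum(1 for r in items if not r.ok)
        summary[key] = {
            # Requests per second over the observation window.
            "request_rate": len(items) / window_seconds,
            # Fraction of requests in the group that failed.
            "error_rate": errors / len(items),
            # Mean latency, an assumption made for brevity.
            "mean_latency_ms": sum(r.latency_ms for r in items) / len(items),
        }
    return summary
```

Calling `summarize(records, 60)` on one minute of records returns one entry per (service, source cluster, destination cluster) triple.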

In broker

  • Number of joined clusters, grouped by status
  • Broker sync request rate

Alerts

  • Cross-cluster request error rate is higher than the threshold, e.g. 1% of total requests or 1 rpm
  • Cross-cluster request latency is higher than normal (e.g. the p99 over the most recent 5m is more than 50% above the average p99 over the last 24h)
  • Sync request error rate is higher than the threshold
  • Cluster in connecting status for too long
  • Cluster failed to join the multicluster mesh
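
The alert conditions above could be sketched as simple predicates, for example in Python. This is an illustration under stated assumptions, not a proposed implementation: the function names, the `"connecting"` status string, and the 300-second limit are hypothetical. The latency check reads the example as "recent p99 is more than 50% above the 24h average p99", and the error check covers only the ratio form of the threshold.

```python
def error_rate_alert(errors, total, max_ratio=0.01):
    """Fire when the share of failed requests exceeds max_ratio (1% here)."""
    if total == 0:
        return False  # no traffic, nothing to alert on
    return errors / total > max_ratio


def latency_alert(recent_p99_ms, baseline_p99s_ms, factor=1.5):
    """Fire when the recent p99 exceeds factor times the average baseline p99.

    factor=1.5 is an assumed reading of "50% higher than the average".
    """
    if not baseline_p99s_ms:
        return False  # no baseline to compare against
    average = sum(baseline_p99s_ms) / len(baseline_p99s_ms)
    return recent_p99_ms > factor * average


def stuck_connecting(status, seconds_in_status, max_seconds=300):
    """Fire when a cluster has stayed in the connecting status too long.

    Both the status string and the 300s limit are hypothetical values.
    """
    return status == "connecting" and seconds_in_status > max_seconds
```

In practice these checks would more likely be written as alerting rules in the monitoring system itself; the functions only make the comparison logic explicit.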

References

https://submariner.io/operations/monitoring/

allenlsy avatar Aug 12 '22 21:08 allenlsy

ping @steeling for comments here.

allenlsy avatar Sep 08 '22 19:09 allenlsy

This issue will be closed due to a long period of inactivity. If you would like this issue to remain open then please comment or update.

github-actions[bot] avatar Dec 12 '22 00:12 github-actions[bot]

Issue closed due to inactivity.

github-actions[bot] avatar Dec 20 '22 00:12 github-actions[bot]