osm
Document metrics/alerts required for multicluster
What new metrics or recommended alerts might we need for multicluster?
Some of these metrics should be provided by the MCS API implementation rather than by the multicluster service mesh, but I'm listing them all together here.
Metrics
In participating cluster
- Number of ServiceImport and ServiceExport resources. This should be equivalent to the number of services imported and exported
- Rate of sync requests with the broker, grouped by status
- Cross cluster request metrics, grouped by service, source cluster, and destination cluster
  - Request rate
  - Error rate
  - Latency
- For each imported service
  - Number of endpoint slices, by exporting cluster
  - Total number of endpoints
- Gateway data transfer rate
- Gateway request rate
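To make the label grouping for the cross cluster request metrics concrete, here is a minimal in-memory sketch. It is only an illustration of the proposed grouping, not an OSM or MCS API implementation: the class names, method names, and the example service and cluster names are all hypothetical, and a real implementation would more likely expose these through the mesh's existing metrics pipeline.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class CrossClusterStats:
    """Running totals for one (service, source, destination) label set."""
    requests: int = 0
    errors: int = 0
    latencies_ms: list = field(default_factory=list)


class CrossClusterMetrics:
    """Hypothetical collector; all names here are assumptions for illustration."""

    def __init__(self):
        # Key: (service, source_cluster, destination_cluster)
        self._stats = defaultdict(CrossClusterStats)

    def observe(self, service, source_cluster, destination_cluster,
                latency_ms, is_error):
        """Record one cross cluster request under its label set."""
        stats = self._stats[(service, source_cluster, destination_cluster)]
        stats.requests += 1
        stats.errors += int(is_error)
        stats.latencies_ms.append(latency_ms)

    def error_ratio(self, service, source_cluster, destination_cluster):
        """Fraction of failed requests for a label set (0.0 if none seen)."""
        stats = self._stats.get((service, source_cluster, destination_cluster))
        if stats is None or stats.requests == 0:
            return 0.0
        return stats.errors / stats.requests

    def request_rate(self, service, source_cluster, destination_cluster,
                     window_seconds):
        """Requests per second, assuming all observations fall in the window."""
        stats = self._stats.get((service, source_cluster, destination_cluster))
        if stats is None:
            return 0.0
        return stats.requests / window_seconds


metrics = CrossClusterMetrics()
metrics.observe("bookstore", "cluster-a", "cluster-b", 12.0, False)
metrics.observe("bookstore", "cluster-a", "cluster-b", 30.0, True)
print(metrics.error_ratio("bookstore", "cluster-a", "cluster-b"))
```

Keeping the source and destination cluster as separate labels is what lets an alert point at one failing cluster pair rather than at the service as a whole.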
In broker
- Number of joined clusters, grouped by status
- Broker sync request rate
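As a rough sketch of the first broker metric, counting clusters by status could look like the following. The status strings ("joined", "connecting", "failed") and the function name are assumptions made for illustration; the issue does not define the actual set of statuses.

```python
from collections import Counter


def clusters_by_status(cluster_status):
    """Count clusters per status.

    cluster_status maps a cluster name to its current status string.
    The status values used below are hypothetical.
    """
    return Counter(cluster_status.values())


counts = clusters_by_status({
    "cluster-a": "joined",
    "cluster-b": "connecting",
    "cluster-c": "joined",
})
print(dict(counts))
```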
Alerts
- Cross cluster request error rate is higher than the threshold, e.g. 1% of total requests or 1 rpm
- Cross cluster request latency is higher than normal (e.g. the p99 over the most recent 5m is more than 50% above the average p99 over the last 24h)
- Sync request error rate is higher than the threshold
- Cluster in connecting status for too long
- Cluster failed to join the multicluster mesh
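The alert conditions above could be expressed roughly as the checks below. This is a sketch under stated assumptions rather than a proposed rule set: the function names and default values are hypothetical, the error alert treats "1% of total requests or 1 rpm" as firing when either limit is exceeded, the latency alert reads the 50% figure as "50% above the 24h baseline", and the 300 second limit for the connecting status is a placeholder since the issue only says "too long".

```python
def error_rate_alert(total_requests, failed_requests, window_minutes,
                     max_error_ratio=0.01, max_errors_per_minute=1.0):
    """Fire if failures exceed either the ratio or the per-minute limit."""
    if total_requests == 0:
        return False
    error_ratio = failed_requests / total_requests
    errors_per_minute = failed_requests / window_minutes
    return (error_ratio > max_error_ratio
            or errors_per_minute > max_errors_per_minute)


def latency_alert(recent_p99_ms, baseline_p99_ms, max_increase=0.5):
    """Fire if the recent p99 is more than max_increase above the baseline."""
    return recent_p99_ms > baseline_p99_ms * (1 + max_increase)


def stuck_connecting(status, seconds_in_status, max_connecting_seconds=300):
    """Fire if a cluster has stayed in the assumed "connecting" status too long."""
    return status == "connecting" and seconds_in_status > max_connecting_seconds


print(error_rate_alert(total_requests=1000, failed_requests=20, window_minutes=5))
print(latency_alert(recent_p99_ms=160.0, baseline_p99_ms=100.0))
print(stuck_connecting("connecting", seconds_in_status=600))
```

In practice these thresholds would likely live in the alerting system's rule configuration; the functions only spell out the comparisons.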
References
https://submariner.io/operations/monitoring/
ping @steeling for comments here.
This issue will be closed due to a long period of inactivity. If you would like this issue to remain open then please comment or update.
Issue closed due to inactivity.