cloudstate
Observability story
Define what metrics can and should be exposed by the platform.
I'd say we should leverage other solutions for metrics collection as much as possible. For example, allow Knative and/or Istio to collect metrics on requests. We should only supply metrics to fill the gaps, including:
- Database access times, for reads and writes
- Snapshot read/write times
- Active entity hit ratio
- Active entities
- Active entity lifetime (i.e., how long an entity lives before it gets passivated)
- Recovery event counts
- Entity shard distribution (i.e., number of entities per shard)
- Local shard hit ratio (should be 1:n, where n is the number of nodes, but it could differ if we try to implement shard affinities with the load balancer)
- User function latency
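To make a few of the ratios above concrete, here is a rough sketch of how they could be derived from plain counters. This is not how the proxy does it; `GapMetrics` and all of its method names are made up for illustration, and a real implementation would more likely feed an existing metrics library than keep counters by hand.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class GapMetrics:
    """Hypothetical in-memory recorder for some of the listed metrics."""

    active_hits: int = 0    # commands whose entity was already in memory
    active_misses: int = 0  # commands that had to recover the entity first
    local_hits: int = 0     # commands handled on the node that received them
    remote_hits: int = 0    # commands forwarded to another node
    entities_per_shard: Counter = field(default_factory=Counter)

    def record_command(self, entity_active: bool, shard_local: bool) -> None:
        """Count one command against both hit ratios."""
        if entity_active:
            self.active_hits += 1
        else:
            self.active_misses += 1
        if shard_local:
            self.local_hits += 1
        else:
            self.remote_hits += 1

    def entity_activated(self, shard: int) -> None:
        self.entities_per_shard[shard] += 1

    def entity_passivated(self, shard: int) -> None:
        self.entities_per_shard[shard] -= 1

    def active_entities(self) -> int:
        """Total entities currently held in memory across all shards."""
        return sum(self.entities_per_shard.values())

    def active_entity_hit_ratio(self) -> float:
        total = self.active_hits + self.active_misses
        return self.active_hits / total if total else 0.0

    def local_shard_hit_ratio(self) -> float:
        total = self.local_hits + self.remote_hits
        return self.local_hits / total if total else 0.0
```

With no shard affinity, `local_shard_hit_ratio()` on each node should hover around 1/n for n nodes, so a value well away from that would indicate either affinity kicking in or skewed routing.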
Telemetry for event-sourced entities is in #349, which covers persistence and entity metrics.