scylla-manager
Manager has too many metrics
Taken from a cluster with 18 nodes and 540 cores:
This number of metrics is not useful. It adds too much load for little gain; by default, the number of manager metrics should grow no faster than the number of nodes.
@amnonh what is the source of so many metrics? Do they scale with the number of tasks, cores, or something else?
All SM backup metrics visible in this picture are labeled by:
- cluster ID
- keyspace
- table
- host
This would mean that there are about 2381 tables in the mentioned cluster. Is that the case?
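To make the label arithmetic above concrete, here is a rough sketch of how the series count per metric grows. It assumes every (host, table) pair produces one series, that "table" means a keyspace-qualified table, and it uses the 18 nodes and the estimated 2381 tables from this thread; the function name is made up for illustration.

```python
def series_per_metric(label_cardinalities):
    """Upper bound on series for one metric: product of independent label cardinalities."""
    total = 1
    for count in label_cardinalities.values():
        total *= count
    return total


# Figures from the thread: 18 nodes and roughly 2381 tables (an estimate, not a measurement).
# The cluster ID is constant within one cluster and the keyspace is implied by the table,
# so neither of them multiplies the count.
per_table_and_host = series_per_metric({"table": 2381, "host": 18})
per_host_only = series_per_metric({"host": 18})

print(per_table_and_host)  # 42858
print(per_host_only)       # 18
```

Under these assumptions a single per-table, per-host metric already yields tens of thousands of series, while a per-host metric stays at the node count.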
@amnonh @tzach @vladzcloudius @karol-kokoszka so what level of granularity would be ok?
Note that `sctool progress` also returns per host/keyspace/table progress, so maybe it's ok to decrease metric granularity.
Knowing that both backup (per host) and repair (in general) work table by table, maybe it would be ok to get rid of the `host` label in those metrics?
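As an illustration of what dropping a label means for the exported data, the sketch below sums hypothetical per-host, per-table samples over the removed labels. The sample values, label values, and the `drop_labels` helper are all made up; this is not how Scylla Manager implements its metrics, only a model of the effect on series count.

```python
from collections import defaultdict


def drop_labels(samples, labels_to_drop):
    """Sum sample values over the dropped labels; the remaining labels form the series key."""
    merged = defaultdict(float)
    for labels, value in samples:
        kept = tuple(sorted((k, v) for k, v in labels.items() if k not in labels_to_drop))
        merged[kept] += value
    return dict(merged)


# Hypothetical counter samples (for example, bytes uploaded); all values are invented.
samples = [
    ({"cluster": "c1", "keyspace": "ks", "table": "t1", "host": "10.0.0.1"}, 100.0),
    ({"cluster": "c1", "keyspace": "ks", "table": "t1", "host": "10.0.0.2"}, 50.0),
    ({"cluster": "c1", "keyspace": "ks", "table": "t2", "host": "10.0.0.1"}, 25.0),
]

per_table = drop_labels(samples, {"host"})
per_cluster = drop_labels(samples, {"host", "keyspace", "table"})

print(len(samples), len(per_table), len(per_cluster))  # 3 2 1
```

Summing is only meaningful for additive values such as byte or file counts; a percentage-style progress metric would need a weighted average instead.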
ping @amnonh, @tzach - what's the verdict here? Let's improve the situation for 3.2.7
Think about the situation of having a thousand tables on a 60-node cluster; we want to show the repair/backup progress status and pause. Can we do it with ten metrics or less? A hundred? If the number of tasks is limited, having it per task is fine. But we must remember that we don't show tasks and tables per user. This level of granularity could be a table in Scylla or a log.
The agreement is to disable the per-table and per-node manager metrics by default. In other words, only cluster-level metrics are exposed by default.
Optimistically setting this to 3.2.7 - if we miss it, that's OK.