scylla-manager Manager has too many metrics

Taken from a cluster with 18 nodes and 540 cores:

This number of metrics is not useful. It adds too much load for little gain; by default, the number of Manager metrics should grow no faster than the number of nodes.

Feb 26 '24 13:02 amnonh

@amnonh what is the source of so many metrics? Does the count scale with the number of tasks, cores, or something else?

Feb 26 '24 13:02 tzach

All SM backup metrics visible in this picture are labeled by:

cluster ID
keyspace
table
host

This would mean that there are about 2381 tables in the mentioned cluster. Is that the case?
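To make the label arithmetic concrete, here is a rough sketch of how the series count for a single metric name grows with its labels. It uses the numbers from this thread (18 nodes, about 2381 tables) and assumes one cluster, one series per table per host, and that "table" counts distinct keyspace/table pairs so the keyspace label adds no extra factor. The function name `series_per_metric` is made up for illustration and is not part of Scylla Manager.

```python
from math import prod


def series_per_metric(label_cardinalities: dict[str, int]) -> int:
    """Upper bound on time series for one metric name:
    the product of the number of distinct values of each label."""
    return prod(label_cardinalities.values())


# Numbers taken from this thread: 18 nodes, about 2381 tables.
# Assumption: a single cluster, and every table reports on every host.
per_table_and_host = series_per_metric({"cluster": 1, "table": 2381, "host": 18})
per_host_only = series_per_metric({"cluster": 1, "host": 18})
cluster_only = series_per_metric({"cluster": 1})

print(per_table_and_host, per_host_only, cluster_only)
```

Under these assumptions, dropping the table label makes the count proportional to the number of nodes, and dropping the host label as well leaves one series per metric per cluster.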

Feb 26 '24 18:02 Michal-Leszczynski

@amnonh @tzach @vladzcloudius @karol-kokoszka so what level of granularity would be ok? Note that sctool progress also returns per host/keyspace/table progress, so maybe it's ok to decrease metric granularity.

Knowing that both backup (per host) and repair (in general) work table by table, maybe it would be ok to get rid of the host label in those metrics?

Mar 01 '24 10:03 Michal-Leszczynski

ping @amnonh, @tzach - what's the verdict here? Let's improve the situation for 3.2.7.

Mar 11 '24 07:03 mykaul

Think about the situation of having a thousand tables on a 60-node cluster; we want to show the repair/backup progress status and pause. Can we do it with ten metrics or less? A hundred? If the number of tasks is limited, having it per task is fine. But we must remember that we don't show tasks and tables per user. This level of granularity could be a table in Scylla or a log.

Mar 11 '24 09:03 amnonh

The agreement is to disable Manager per-table and per-node metrics by default. In other words, only cluster-level metrics are exposed by default.
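As an illustration of what "cluster-level only" could mean for the data, the sketch below collapses per-table, per-host samples by dropping labels and summing values. The sample data, the `aggregate` helper, and the choice of summing are all assumptions for this example; summing suits counters such as bytes uploaded, while a percentage metric would need a weighted average instead. This is not how Scylla Manager implements it.

```python
from collections import defaultdict


def aggregate(samples, keep):
    """Sum sample values after dropping every label not listed in `keep`."""
    totals = defaultdict(float)
    for labels, value in samples:
        # The reduced label set becomes the identity of the output series.
        key = tuple((name, labels[name]) for name in keep)
        totals[key] += value
    return dict(totals)


# Hypothetical per-table, per-host samples (for example, bytes uploaded by a backup).
samples = [
    ({"cluster": "c1", "keyspace": "ks", "table": "t1", "host": "h1"}, 100.0),
    ({"cluster": "c1", "keyspace": "ks", "table": "t2", "host": "h1"}, 50.0),
    ({"cluster": "c1", "keyspace": "ks", "table": "t1", "host": "h2"}, 25.0),
]

cluster_level = aggregate(samples, keep=("cluster",))
host_level = aggregate(samples, keep=("cluster", "host"))

print(len(samples), len(host_level), len(cluster_level))
```

The per-table detail is still available through sctool progress, as noted earlier in the thread, so the aggregated series only need to answer "how far along is the task for this cluster".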

Mar 11 '24 10:03 tzach

Optimistically setting this to 3.2.7 - if we miss it, that's OK.

Mar 13 '24 12:03 mykaul
