
cassandra_stats{name=xxx} not following prometheus naming best practices

Open PedroMSantosD opened this issue 4 years ago • 7 comments

Hi,

We have deployed your exporter on our Cassandra infrastructure, but it is bringing down Prometheus due to the large memory footprint caused by the high cardinality of the "name" label.

According to the Prometheus documentation, the current implementation of this exporter does not follow Prometheus naming conventions: each metric name should represent one measured "thing", whereas this exporter encodes that information in the "name" label instead.

Is it possible to replace the single metric name cassandra_stats with something more naming-compliant, e.g. cassandra_%yourCurrentNameLabel%_units?

The metric name specifies the general feature of a system that is measured (e.g. http_requests_total - the total number of HTTP requests received)....

Labels enable Prometheus's dimensional data model: any given combination of labels for the same metric name identifies a particular dimensional instantiation of that metric ....

CAUTION: Remember that every unique combination of key-value label pairs represents a new time series, which can dramatically increase the amount of data stored. Do not use labels to store dimensions with high cardinality (many different label values), such as user IDs, email addresses, or other unbounded sets of values.

[Screenshot: Screen Shot 2020-06-02 at 08 28 47]

As a reference, a label with more than 8 unique values is considered highly cardinal, causing high memory needs and slow performance when querying it. Prometheus supports millions of metric names, but it is highly sensitive to cardinality.

Thanks in advance,

PedroMSantosD avatar Jun 05 '20 09:06 PedroMSantosD

You can check https://github.com/criteo/cassandra_exporter#why-not-make-more-use-of-labels-be-more-prometheus-way- The main reason is that Cassandra exposes its metrics in its own path-based format through JMX, and the exporter tries to do the minimum work to make them available (extract some labels, convert the values to float64). Converting to Prometheus best practices would require special parsing to transform path-based metrics into label-based metrics, which has a lot of edge cases.

geobeau avatar Jun 05 '20 10:06 geobeau

What do you mean by bringing down your Prometheus?

One alternative is to use relabeling in the scrape config to replace __name__ with the value of the name label (and replace special characters with _).
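
For illustration, a minimal sketch of such a scrape config (the job name and target below are placeholders, not values from this thread):

```yaml
scrape_configs:
  - job_name: cassandra                       # hypothetical job name
    static_configs:
      - targets: ['cassandra-host:8080']      # hypothetical exporter address
    metric_relabel_configs:
      # Copy the value of the "name" label into the metric name, e.g.
      #   cassandra_stats{name="org:apache:cassandra:metrics:..."}
      # becomes
      #   org:apache:cassandra:metrics:...{}
      - source_labels: [name]
        target_label: __name__
      # Drop the now-redundant "name" label.
      - action: labeldrop
        regex: name
```

Note that a plain replace rule cannot do a global character substitution, so JMX paths containing characters outside [a-zA-Z0-9_:] (e.g. hyphens) would still need extra sanitizing before the resulting metric name is valid.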

geobeau avatar Jun 05 '20 10:06 geobeau

Hi, thanks for your prompt reply.

By 'bringing down' I mean that the cardinality of the "name" label on the cassandra_stats metric runs my node out of memory.

That being said, relabeling on Prometheus makes the memory consumption worse, as the head block must hold the scraped series plus the relabeled ones, as shown on the performance graph after enforcing the relabeling of the metrics: [Screenshot: Screen Shot 2020-06-08 at 09 20 35]

It is true that only the series left after relabeling are persisted to disk, but the memory issue remains.

Is it not feasible to replace cassandra_stats{name='.*'} with an appropriate cassandra_compliant_metric_name_unit{other_labels='...'} in the exporter code?

Thanks in advance,

PedroMSantosD avatar Jun 08 '20 07:06 PedroMSantosD

Are you sure your memory issue is not simply because the exporter exposes a lot of metrics? It's possible to blacklist some high-cardinality metrics in the exporter's configuration to save memory.
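
For illustration, a sketch of what that could look like in the exporter's config.yml, assuming the blacklist entries are regexes matched against the JMX metric path as described in the exporter's README; the patterns below are made-up examples, not recommendations:

```yaml
blacklist:
  # Example: drop per-table metrics, which often dominate the series count.
  - org:apache:cassandra:metrics:table:.*
  # Example: drop fine-grained percentile gauges of client request histograms.
  - org:apache:cassandra:metrics:clientrequest:.*percentile.*
```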

I can try to measure the difference in memory usage between 1 metric with 5000 label values and 5000 separate metrics, but you may have to wait a bit.

geobeau avatar Jun 08 '20 16:06 geobeau

Hi, the graph in the previous post shows the memory spike on Prometheus when the relabeling rules were put in place. For comparison, this graph shows the effect of deploying the exporter on our production infrastructure: [Screenshot: Screen Shot 2020-06-09 at 08 56 01]

This spike and the alerts it triggered are what prompted this issue. Hope that helps.

PedroMSantosD avatar Jun 09 '20 06:06 PedroMSantosD

Hello, sorry for the delay. I tested locally the difference between 50 metrics with 1000 series each vs 5000 metrics with 10 series each and didn't notice any particular difference in memory usage.

I think your increase in memory is expected given the additional number of series generated by the exporter. The memory usage is driven by the total number of series, independent of the cardinality of each individual metric.

My advice is to increase your memory limit or blacklist metrics using the blacklist feature of the exporter.

geobeau avatar Jul 02 '20 14:07 geobeau

Hi,

This causes material issues with hitting per-metric series limits for cassandra_stats. An example from my logs (lightly redacted):

caller=dedupe.go:112 component=remote level=error remote_name=mimir url=https://prom/api/v1/push msg="non-recoverable error" count=500 exemplarCount=0 err="server returned HTTP status 400 Bad Request: user=anonymous: per-metric series limit of 2000 exceeded, please contact administrator to raise it (per-ingester local limit: 1500) for series {__name__=\"cassandra_stats\", app_kubernetes_io_instance=\"c\", app_kubernetes_io_managed_by=\"Helm\", app_kubernetes_io_name=\"cassandra\", ciq_cluster=\"management\", cluster=\"cassandra\", controller_revision_hash=\"cassie-cassandra-7dffd6c9bd\", datacenter=\"datacenter1\", helm_sh_chart=\"cassandra-9.1.19\", instance=\"10.1.1.1:8080\", job=\"kubernetes-pods\", name=\"org:apache:cassandra:metrics:clientrequest:write-node_local:mutationsizehistogram:75thpercentile\", namespace=\"cassandra\", pod=\"c-cassandra-3\", statefulset_kubernetes_io_pod_name=\"c-cassandra-3\"}"

I would encourage the authors of the exporter to properly adhere to Prometheus guidance here.

pnathan avatar Jul 22 '22 21:07 pnathan