
Metrics are lost after 5 minutes

Open jpicara opened this issue 3 years ago • 10 comments

Hi, I would like to ask for support with the following issue: we are losing the metrics from the collector in our Prometheus instance.

Prometheus scrape configuration:

- job_name: "mcac"
  scrape_interval: 30s
  scrape_timeout:  20s
  honor_labels: true
  metrics_path: /metrics
  consul_sd_configs:
  - server: consul.service.consul:8500
    services:
      - 'cassandra'
  relabel_configs:
  - source_labels: [__address__]
    action: replace
    regex: ([^:]+):.*
    replacement: $1:9103
    target_label: __address__
  metric_relabel_configs:
   #drop metrics we can calculate from prometheus directly
   - source_labels: [__name__]
     regex: .*rate_(mean|1m|5m|15m)
     action: drop
     #save the original name for all metrics
   - source_labels: [__name__]
     regex: (collectd_mcac_.+)
     target_label: prom_name
     replacement: ${1}

Note: we know that more relabel configs are needed, but we omitted them to test this out. In our case we cannot define static IPs for the Cassandra nodes, as instances can take any IP from a range; we use Consul service discovery to create the endpoints needed.
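As a sanity check on the relabel rule above, here is a small Python sketch (not part of the setup, just illustrative) that mimics Prometheus's anchored-regex replace semantics for rewriting the discovered `__address__` to MCAC's port 9103:

```python
import re

# Prometheus anchors relabel regexes on both ends; "$1" in the
# replacement corresponds to the first capture group.
PATTERN = re.compile(r"^([^:]+):.*$")

def relabel_address(address: str, port: int = 9103) -> str:
    """Rewrite a discovered __address__ to point at the MCAC exporter port."""
    m = PATTERN.match(address)
    if m is None:
        # With action "replace", targets that don't match keep their address.
        return address
    return f"{m.group(1)}:{port}"

# A Consul-discovered Cassandra endpoint gets redirected to port 9103:
print(relabel_address("10.0.12.7:9042"))  # -> 10.0.12.7:9103
```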

The job is properly created and points to the right Cassandra instances. However, we first have to restart the Cassandra node before any metrics arrive, because of the following error in the logs: "Error on ingesting samples that are too old or are too far into the future" num_dropped=198729. After restarting, metrics start coming into Prometheus, but after exactly 5 minutes they all go away.

collectd.log

cassandra@cluster-datacenter1-default-sts-0:/var/log/cassandra$ tail -f cassandra-collectd.log
[2022-07-01 07:30:37] plugin_load: plugin "uptime" successfully loaded.
[2022-07-01 07:30:37] plugin_load: plugin "processes" successfully loaded.
[2022-07-01 07:30:37] plugin_load: plugin "tcpconns" successfully loaded.
[2022-07-01 07:30:37] plugin_load: plugin "match_regex" successfully loaded.
[2022-07-01 07:30:37] plugin_load: plugin "target_set" successfully loaded.
[2022-07-01 07:30:37] plugin_load: plugin "target_replace" successfully loaded.
[2022-07-01 07:30:37] unixsock plugin: Successfully deleted socket file "/tmp/ds-8466130650012646173.sock".
[2022-07-01 07:30:37] cpufreq plugin: Found 0 CPUs
[2022-07-01 07:30:37] Initialization complete, entering read-loop.
[2022-07-01 07:30:38] tcpconns plugin: Reading from netlink succeeded. Will use the netlink method from now on.
[2022-07-01 07:35:53] plugin_dispatch_values: Low water mark reached. Dropping 100% of metrics.
[2022-07-01 07:35:54] plugin_dispatch_values: Low water mark reached. Dropping 100% of metrics.
[2022-07-01 07:35:57] plugin_dispatch_values: Low water mark reached. Dropping 100% of metrics.
[2022-07-01 07:36:07] plugin_dispatch_values: Low water mark reached. Dropping 100% of metrics.
[2022-07-01 07:36:17] plugin_dispatch_values: Low water mark reached. Dropping 100% of metrics.
[2022-07-01 07:36:20] plugin_dispatch_values: Low water mark reached. Dropping 100% of metrics.


Could someone help us? Thanks in advance.

jpicara avatar Jul 01 '22 07:07 jpicara

Hi,

You probably have too many metrics extracted, possibly due to the number of tables in your cluster. Try enabling the following filtering rules to drop some table level metrics and see how it works: https://github.com/datastax/metric-collector-for-apache-cassandra/blob/master/config/metric-collector.yaml#L30-L72
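For reference, the linked metric-collector.yaml expresses those defaults as MCAC filtering rules. A shortened, illustrative sketch is below; the exact patterns and key names should be taken from the linked file rather than from this fragment:

```yaml
filtering_rules:
  # Deny all table-level metrics by default...
  - policy: deny
    pattern: org.apache.cassandra.metrics.table
    scope: global
  # ...then allow back a small set of high-value ones.
  - policy: allow
    pattern: org.apache.cassandra.metrics.table.live_ss_table_count
    scope: global
```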

adejanovski avatar Jul 04 '22 10:07 adejanovski

Hi @adejanovski, thanks for your fast response. However, how could we change metric-collector.yaml and restart the MCAC process, which runs inside the Cassandra pod deployed by cass-operator? If we delete the pod, it will be recreated by the operator and the new config will be lost. Thanks in advance, BR

jpicara avatar Jul 04 '22 11:07 jpicara

@jpicara,

I hadn't noticed that you were using K8ssandra. Which version are you using? K8ssandra v1.x or K8ssandra-operator? For the former, filters are in place by default.

For the latter, it's still a WIP and should be released in K8ssandra-operator v1.2 which we're planning to release during the second half of July: https://github.com/k8ssandra/k8ssandra-operator/issues/573

adejanovski avatar Jul 04 '22 11:07 adejanovski

It's the old cass-operator. We are planning the migration to k8ssandra-operator, but haven't performed it yet. Thanks!

jpicara avatar Jul 04 '22 12:07 jpicara

In cass-operator, you'll need to set the METRIC_FILTERS env variable on the cassandra container in the pod template spec, with the desired filters:

  podTemplateSpec:
    spec:
      containers:
        - env:
            - name: LOCAL_JMX
              value: 'no'
            - name: METRIC_FILTERS
              value: >-
                deny:org.apache.cassandra.metrics.Table
                deny:org.apache.cassandra.metrics.table
                allow:org.apache.cassandra.metrics.table.live_ss_table_count
                allow:org.apache.cassandra.metrics.Table.LiveSSTableCount
                allow:org.apache.cassandra.metrics.table.live_disk_space_used
                allow:org.apache.cassandra.metrics.table.LiveDiskSpaceUsed
                allow:org.apache.cassandra.metrics.Table.Pending
                allow:org.apache.cassandra.metrics.Table.Memtable
                allow:org.apache.cassandra.metrics.Table.Compaction
                allow:org.apache.cassandra.metrics.table.read
                allow:org.apache.cassandra.metrics.table.write
                allow:org.apache.cassandra.metrics.table.range
                allow:org.apache.cassandra.metrics.table.coordinator
                allow:org.apache.cassandra.metrics.table.dropped_mutations
          name: cassandra
          securityContext:
            runAsNonRoot: true

This feature is supported by the Management API v0.1.32 and onwards. If you're using a floating image tag such as 3.11.11, it gets rebuilt with the latest version of the Management API whenever one is released. Otherwise, you need to be on at least 3.11.11-v0.1.32 (3.11.11 being an example Cassandra version; you can use any supported version).

adejanovski avatar Jul 04 '22 13:07 adejanovski

We are currently using this image: datastax/cassandra-mgmtapi-3_11_7:v0.1.22, so we don't have the Management API version you mentioned. Should we switch to the images from the k8ssandra Docker Hub org, like k8ssandra/cass-management-api:3.11.7-v0.1.42? Thanks!

jpicara avatar Jul 04 '22 14:07 jpicara

Yes, I'd recommend upgrading to 3.11.7-v0.1.42, which will allow you to pass the metric filters. We stopped publishing images to the datastax org a while ago, so you should indeed switch to the k8ssandra org images.

adejanovski avatar Jul 04 '22 14:07 adejanovski

@jpicara Were you able to upgrade the management-api and apply the filters? If so, did it resolve your issue?

jsanda avatar Aug 04 '22 16:08 jsanda

@Miles-Garnsey Can you look into this?

The collectd log reports that metrics were dropped after the low water mark was reached. I am curious as to whether this is related to the WriteQueueLimitHigh and WriteQueueLimitLow settings for collectd. The docs say this:

You can set the limits using WriteQueueLimitHigh and WriteQueueLimitLow. Each of them takes a numerical argument which is the number of metrics in the queue. If there are HighNum metrics in the queue, any new metrics will be dropped. If there are less than LowNum metrics in the queue, all new metrics will be enqueued. If the number of metrics currently in the queue is between LowNum and HighNum, the metric is dropped with a probability that is proportional to the number of metrics in the queue (i.e. it increases linearly until it reaches 100%).

These settings are hard-coded in collectd.conf.tmpl. That might be OK for non-k8s deployments, but it definitely seems like something that should be tunable in k8s deployments.
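The documented behaviour between the two water marks can be sketched as follows; the queue sizes here are made-up illustrative numbers, not the values hard-coded in collectd.conf.tmpl:

```python
def drop_probability(queue_len: int, low: int, high: int) -> float:
    """Probability that collectd drops a new metric, per the documented
    WriteQueueLimitLow/WriteQueueLimitHigh behaviour: 0 below the low
    water mark, 1 at or above the high water mark, linear in between."""
    if queue_len < low:
        return 0.0
    if queue_len >= high:
        return 1.0
    return (queue_len - low) / (high - low)

# Halfway between the (illustrative) water marks, half of new metrics
# are dropped; at or past the high mark, everything is dropped.
print(drop_probability(750_000, 500_000, 1_000_000))    # -> 0.5
print(drop_probability(1_200_000, 500_000, 1_000_000))  # -> 1.0
```

This matches the "Dropping 100% of metrics" log lines above: once the queue stays at or above the high water mark, the drop probability is pinned at 1.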

jsanda avatar Aug 07 '22 20:08 jsanda

@jsanda it seems that the original issue has been resolved, unless @jpicara has encountered additional issues after upgrading the Management API and applying the metric filters suggested by @adejanovski?

Miles-Garnsey avatar Aug 08 '22 05:08 Miles-Garnsey

Hello, sorry for my delay. We were finally able to upgrade to k8ssandra/k8ssandra-operator:v1.2.0 and k8ssandra/cass-operator:v1.12.0, and could apply some filtering to drop table-level metrics, like:

  - name: METRIC_FILTERS
    value: >-
      deny:org.apache.cassandra.metrics.table
      deny:org.apache.cassandra.metrics.table.live_ss_table_count
      deny:org.apache.cassandra.metrics.Table.LiveSSTableCount
      deny:org.apache.cassandra.metrics.table.live_disk_space_used
      deny:org.apache.cassandra.metrics.table.LiveDiskSpaceUsed
      deny:org.apache.cassandra.metrics.Table.Pending
      deny:org.apache.cassandra.metrics.Table.Memtable
      deny:org.apache.cassandra.metrics.Table.Compaction
      deny:org.apache.cassandra.metrics.table.read
      deny:org.apache.cassandra.metrics.table.write
      deny:org.apache.cassandra.metrics.table.range
      deny:org.apache.cassandra.metrics.table.coordinator
      deny:org.apache.cassandra.metrics.table.dropped_mutations

With that in place, metrics seem to be stable and the Grafana dashboards are now working fine (some graphs are empty because of these filtering rules, of course). I think the issue can be closed. Thanks for your support!

jpicara avatar Sep 13 '22 06:09 jpicara

That's great news, thanks for the update @jpicara

adejanovski avatar Sep 13 '22 06:09 adejanovski