metric-collector-for-apache-cassandra
Metrics are lost after 5 minutes
Hi, I would like to ask for support with the following issue: we are losing the metric collector's metrics in our Prometheus instance.
Prometheus scraping config:
- job_name: "mcac"
  scrape_interval: 30s
  scrape_timeout: 20s
  honor_labels: true
  metrics_path: /metrics
  consul_sd_configs:
    - server: consul.service.consul:8500
      services:
        - 'cassandra'
  relabel_configs:
    - source_labels: [__address__]
      action: replace
      regex: ([^:]+):.*
      replacement: $1:9103
      target_label: __address__
  metric_relabel_configs:
    # drop metrics we can calculate from prometheus directly
    - source_labels: [__name__]
      regex: .*rate_(mean|1m|5m|15m)
      action: drop
    # save the original name for all metrics
    - source_labels: [__name__]
      regex: (collectd_mcac_.+)
      target_label: prom_name
      replacement: ${1}
Note: We know that more relabel configs are needed, but we omitted them to test this out. In our case we cannot define static IPs for the Cassandra nodes, as instances can take any IP from a range, so we use Consul service discovery to create the endpoints needed.
The job is created properly and points to the right Cassandra instances. However, we first need to restart the Cassandra node to start getting metrics, because of the following error in the log output:
"Error on ingesting samples that are too old or are too far into the future" num_dropped=198729
After doing so, metrics start coming into Prometheus, but after exactly 5 minutes they all go away.
collectd.log
cassandra@cluster-datacenter1-default-sts-0:/var/log/cassandra$ tail -f cassandra-collectd.log
[2022-07-01 07:30:37] plugin_load: plugin "uptime" successfully loaded.
[2022-07-01 07:30:37] plugin_load: plugin "processes" successfully loaded.
[2022-07-01 07:30:37] plugin_load: plugin "tcpconns" successfully loaded.
[2022-07-01 07:30:37] plugin_load: plugin "match_regex" successfully loaded.
[2022-07-01 07:30:37] plugin_load: plugin "target_set" successfully loaded.
[2022-07-01 07:30:37] plugin_load: plugin "target_replace" successfully loaded.
[2022-07-01 07:30:37] unixsock plugin: Successfully deleted socket file "/tmp/ds-8466130650012646173.sock".
[2022-07-01 07:30:37] cpufreq plugin: Found 0 CPUs
[2022-07-01 07:30:37] Initialization complete, entering read-loop.
[2022-07-01 07:30:38] tcpconns plugin: Reading from netlink succeeded. Will use the netlink method from now on.
[2022-07-01 07:35:53] plugin_dispatch_values: Low water mark reached. Dropping 100% of metrics.
[2022-07-01 07:35:54] plugin_dispatch_values: Low water mark reached. Dropping 100% of metrics.
[2022-07-01 07:35:57] plugin_dispatch_values: Low water mark reached. Dropping 100% of metrics.
[2022-07-01 07:36:07] plugin_dispatch_values: Low water mark reached. Dropping 100% of metrics.
[2022-07-01 07:36:17] plugin_dispatch_values: Low water mark reached. Dropping 100% of metrics.
[2022-07-01 07:36:20] plugin_dispatch_values: Low water mark reached. Dropping 100% of metrics.

Could someone help us? Thanks in advance.
Hi,
You probably have too many metrics being extracted, possibly due to the number of tables in your cluster. Try enabling the following filtering rules to drop some table-level metrics and see how it works: https://github.com/datastax/metric-collector-for-apache-cassandra/blob/master/config/metric-collector.yaml#L30-L72
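For illustration, those rules are deny/allow patterns on metric name prefixes; a rough sketch of enabling a subset is below. The key names (filtering_rules, policy, pattern, scope) are my reading of that file, so please verify them against the linked metric-collector.yaml rather than copying this verbatim.

filtering_rules:
  # drop all table-level metrics by default
  - policy: deny
    pattern: org.apache.cassandra.metrics.table
    scope: global
  # then re-allow a few key ones, e.g. the live SSTable count
  - policy: allow
    pattern: org.apache.cassandra.metrics.table.live_ss_table_count
    scope: global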
Hi @adejanovski, thanks for your fast response. However, how could we change metric-collector.yaml and restart the mcac process running inside the Cassandra pod deployed by cass-operator? If we delete the pod, it will be recreated by the operator and the new config will be lost. Thanks in advance, BR
@jpicara,
I hadn't noticed that you were using K8ssandra. Which version are you using? K8ssandra v1.x or K8ssandra-operator? For the former, filters are in place by default.
For the latter, it's still a WIP and should be released in K8ssandra-operator v1.2 which we're planning to release during the second half of July: https://github.com/k8ssandra/k8ssandra-operator/issues/573
It's the old cass-operator. We are planning the migration to k8ssandra-operator but haven't performed it yet. Thanks!
In cass-operator, you'll need to set the METRIC_FILTERS env variable for the cassandra container in the pod template spec, with the desired filters:
podTemplateSpec:
  spec:
    containers:
      - env:
          - name: LOCAL_JMX
            value: 'no'
          - name: METRIC_FILTERS
            value: >-
              deny:org.apache.cassandra.metrics.Table
              deny:org.apache.cassandra.metrics.table
              allow:org.apache.cassandra.metrics.table.live_ss_table_count
              allow:org.apache.cassandra.metrics.Table.LiveSSTableCount
              allow:org.apache.cassandra.metrics.table.live_disk_space_used
              allow:org.apache.cassandra.metrics.table.LiveDiskSpaceUsed
              allow:org.apache.cassandra.metrics.Table.Pending
              allow:org.apache.cassandra.metrics.Table.Memtable
              allow:org.apache.cassandra.metrics.Table.Compaction
              allow:org.apache.cassandra.metrics.table.read
              allow:org.apache.cassandra.metrics.table.write
              allow:org.apache.cassandra.metrics.table.range
              allow:org.apache.cassandra.metrics.table.coordinator
              allow:org.apache.cassandra.metrics.table.dropped_mutations
        name: cassandra
        securityContext:
          runAsNonRoot: true
This feature is supported by the management API v0.1.32 and onwards. If you're using the image version 3.11.11, for example, it gets rebuilt with the latest version of the management API whenever one is made available. Otherwise, you need to be on at least 3.11.11-v0.1.32 (3.11.11 being an example Cassandra version; you can use any supported version).
We are currently using this image: datastax/cassandra-mgmtapi-3_11_7:v0.1.22, so we don't have the management API version you mentioned. Should we move to the images from the k8ssandra Docker Hub, like this one: k8ssandra/cass-management-api:3.11.7-v0.1.42? Thanks!
Yes, I'd recommend upgrading to 3.11.7-v0.1.42, which will allow you to pass the metric filters. We stopped publishing images to the datastax org a while ago, so you should indeed switch to the k8ssandra org images.
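For reference, the image is selected on the CassandraDatacenter resource managed by cass-operator; a minimal sketch, assuming the usual serverImage/serverVersion fields (the names and versions below are examples, adjust to your cluster):

apiVersion: cassandra.datastax.com/v1beta1
kind: CassandraDatacenter
metadata:
  name: datacenter1        # example name
spec:
  clusterName: cluster     # example name
  serverType: cassandra
  serverVersion: "3.11.7"
  # point at the k8ssandra-built image that bundles management API v0.1.42
  serverImage: k8ssandra/cass-management-api:3.11.7-v0.1.42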
@jpicara Were you able to upgrade the management-api and apply the filters? If so, did it resolve your issue?
@Miles-Garnsey Can you look into this?
The collectd log reports that metrics were dropped after the low water mark was reached. I am curious as to whether this is related to the WriteQueueLimitHigh and WriteQueueLimitLow settings for collectd. The docs say this:
You can set the limits using WriteQueueLimitHigh and WriteQueueLimitLow. Each of them takes a numerical argument which is the number of metrics in the queue. If there are HighNum metrics in the queue, any new metrics will be dropped. If there are less than LowNum metrics in the queue, all new metrics will be enqueued. If the number of metrics currently in the queue is between LowNum and HighNum, the metric is dropped with a probability that is proportional to the number of metrics in the queue (i.e. it increases linearly until it reaches 100%).
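To make that linear ramp concrete, here is a small illustrative sketch (not collectd source code) of the drop probability the docs describe, using made-up water marks:

# Illustrative only: the drop probability collectd's docs describe between
# the low and high water marks of the write queue.
def drop_probability(queue_len, low, high):
    if queue_len >= high:
        return 1.0   # at or above the high water mark: drop every new metric
    if queue_len <= low:
        return 0.0   # at or below the low water mark: enqueue everything
    return (queue_len - low) / (high - low)  # linear ramp from 0% to 100%

# Example: with low=400000 and high=500000, a queue holding 450000 metrics
# drops roughly half of the new ones.
print(drop_probability(450_000, 400_000, 500_000))  # 0.5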
These settings are hard-coded in collectd.conf.tmpl. That might be OK for non-k8s deployments, but it definitely seems like something that should be tunable in k8s deployments.
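For reference, these are plain global options in collectd.conf, so making them tunable would just mean templating something like the following (the numbers are placeholders, not MCAC's actual hard-coded values):

# collectd.conf write-queue water marks (illustrative values only)
WriteQueueLimitHigh 1000000
WriteQueueLimitLow   800000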
@jsanda it seems that the original issue has been resolved, unless @jpicara has encountered additional issues after upgrading management API and implementing the metrics filters suggested by @adejanovski?
Hello,
Sorry for my delay. We were finally able to upgrade to k8ssandra/k8ssandra-operator:v1.2.0 and k8ssandra/cass-operator:v1.12.0 and could apply some filtering to drop table-level metrics, like the following:
- name: METRIC_FILTERS
  value: >-
    deny:org.apache.cassandra.metrics.table
    deny:org.apache.cassandra.metrics.table.live_ss_table_count
    deny:org.apache.cassandra.metrics.Table.LiveSSTableCount
    deny:org.apache.cassandra.metrics.table.live_disk_space_used
    deny:org.apache.cassandra.metrics.table.LiveDiskSpaceUsed
    deny:org.apache.cassandra.metrics.Table.Pending
    deny:org.apache.cassandra.metrics.Table.Memtable
    deny:org.apache.cassandra.metrics.Table.Compaction
    deny:org.apache.cassandra.metrics.table.read
    deny:org.apache.cassandra.metrics.table.write
    deny:org.apache.cassandra.metrics.table.range
    deny:org.apache.cassandra.metrics.table.coordinator
    deny:org.apache.cassandra.metrics.table.dropped_mutations
With this in place, metrics seem to be stable and the Grafana dashboards are now working fine (of course, some graphs are empty because of these filtering rules).
I think the issue can be closed.
Thanks for your support!
That's great news, thanks for the update @jpicara