metric-collector-for-apache-cassandra
Remove timestamp from the metrics
The timestamp in the metrics is 2 hours behind the system time.
# HELP collectd_collectd_cache_size write_prometheus plugin: 'collectd' Type: 'cache_size', Dstype: 'gauge', Dsname: 'value'
# TYPE collectd_collectd_cache_size gauge
collectd_collectd_cache_size{collectd="cache",instance="10.0.1.1",cluster="CassCluster",dc="DAL",rack="rack1"} 11969 1617137796120
Here's the system time and the date the metric timestamp translates to:
$ date -d @1617137796
Tue Mar 30 13:56:36 GMT+7 2021
$ date
Tue Mar 30 15:39:22 GMT+7 2021
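For reference, the trailing value on the exposed metric line is a Unix timestamp in milliseconds, so it has to be divided by 1000 before decoding; the same instant decoded in UTC:
$ date -u -d @$((1617137796120 / 1000))
Tue Mar 30 06:56:36 UTC 2021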
The time reported in the metrics is 2 hours behind, and I can't figure out a way to disable the timestamps.
This is causing the following error when scraping in Prometheus:
msg="Error on ingesting samples that are too old or are too far into the future" num_dropped=51908
I've also got a similar issue... Our k8ssandra nodes ran out of disk space; we fixed that, but ever since, we've had no Grafana metrics from k8ssandra. I've restarted, deleted and recreated every pod and ServiceMonitor, and removed countless directories/caches, and it just keeps happening. The timestamps are out by about 5 minutes immediately after deleting mcac_data and restarting, then get older and older until they are about 4 hours old, then start moving forwards...
Any advice or help about what to grab for diagnosis would be great, but this really feels like a bug of some sort, induced by an unexpected state...
Same here.
I got the following metrics with timestamps from 2 months ago:
collectd_tcpconns_tcp_connections{tcpconns="9999-local",type="SYN_SENT",instance="172.17.47.22",cluster="V2",dc="F1",rack="D1"} 0 1624932662653
collectd_uptime{instance="172.17.47.22",cluster="V2",dc="F1",rack="D1"} 10100889 1624932662650
collectd_vmem_vmpage_action_total{vmem="dirtied",instance="172.17.47.22",cluster="V2",dc="F1",rack="D1"} 20647291651 1624932662647
1624932662650 → GMT: Tuesday, June 29, 2021 2:11:02.650 AM (relative: 2 months ago)
This causes Prometheus to drop those metrics. Not sure why MCAC doesn't update the timestamp.
Please advise.
We got the same issue. It was working fine, but after leaving it running for a couple of days, MCAC started reporting the wrong time, causing Prometheus to fail:
level=warn ts=2021-08-20T15:29:17.688Z caller=scrape.go:1375 component="scrape manager" scrape_pool=k8ssandra/k8ssandra-prometheus-k8ssandra/0 target=http://xxxxx:9103/metrics msg="Error on ingesting samples that are too old or are too far into the future" num_dropped=205
Please fix.
I'm unable to reproduce the issue on GKE. I've let the cluster run for a few days and Prometheus isn't complaining about metrics that are too old. Could you compare the clocks in the Prometheus container and the Cassandra containers to see if there's a drift? Same question for the clocks on all K8s worker nodes, to check that they're in sync.
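For example, something along these lines would show any drift at a glance (the namespace, pod names, and container name below are placeholders for your own deployment):
$ kubectl exec -n k8ssandra prometheus-k8ssandra-0 -- date -u +%s
$ kubectl exec -n k8ssandra cassandra-dc1-default-sts-0 -c cassandra -- date -u +%s
Running date -u +%s on each K8s worker node as well should give epoch values that match within a second or two if the clocks are in sync.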
Hi @adejanovski, no drift: both the Prometheus and Cassandra containers report the same time (UTC). I did notice that with the fix for out-of-order timestamps (#969), I had no problems with the timestamps as long as I had a smaller number of tables (~100) in the DB. After our production upgrade, I now have 326 tables spread across keyspaces and the problem has reappeared. Our dev env also has a similar number of tables, so it appears this happens if you've got a large number of tables in your DB, but that's just an observation...
Hi @tah-mas,
that's an interesting observation. Each table comes with a large set of metrics, which could mean they take too long to be ingested and end up being processed once they're already outside the accepted timestamp range. The solution here would be to filter some metrics to reduce the overall volume. I'm not even sure the current set of dashboards uses any table-specific metrics. I'll investigate to see how easily this could be achieved.
Thank you @adejanovski! Much appreciated
@adejanovski Any ETA on making the default config usable?
We just switched from the instaclustr exporter to MCAC and are winding up with no metrics/blank dashboards from our main cluster due to this issue, despite it working fine on a smaller cluster with fewer tables.
Hi @eriksw,
we actually merged the changes a while ago to let you filter metrics more easily. Check this commit for some examples, and let me know how it works for you.
@adejanovski Glad to see some rules documented here! I had looked around and found https://github.com/k8ssandra/k8ssandra/pull/1149/files and derived the following rule set:
filtering_rules:
  - policy: deny
    pattern: org.apache.cassandra.metrics.Table
    scope: global
  - policy: deny
    pattern: org.apache.cassandra.metrics.table
    scope: global
  - policy: allow
    pattern: org.apache.cassandra.metrics.table.live_ss_table_count
    scope: global
  - policy: allow
    pattern: org.apache.cassandra.metrics.Table.LiveSSTableCount
    scope: global
  - policy: allow
    pattern: org.apache.cassandra.metrics.table.live_disk_space_used
    scope: global
  - policy: allow
    pattern: org.apache.cassandra.metrics.table.LiveDiskSpaceUsed
    scope: global
  - policy: allow
    pattern: org.apache.cassandra.metrics.Table.Pending
    scope: global
  - policy: allow
    pattern: org.apache.cassandra.metrics.Table.Memtable
    scope: global
  - policy: allow
    pattern: org.apache.cassandra.metrics.Table.Compaction
    scope: global
  - policy: allow
    pattern: org.apache.cassandra.metrics.table.read
    scope: global
  - policy: allow
    pattern: org.apache.cassandra.metrics.table.write
    scope: global
  - policy: allow
    pattern: org.apache.cassandra.metrics.table.range
    scope: global
  - policy: allow
    pattern: org.apache.cassandra.metrics.table.coordinator
    scope: global
  - policy: allow
    pattern: org.apache.cassandra.metrics.table.dropped_mutations
    scope: global
The bad news: with those rules, on our main cluster we still ran into wildly out-of-date metric timestamps and all the other issues of https://github.com/datastax/metric-collector-for-apache-cassandra/issues/39
Has MCAC ever been used in actual production on a cluster with >300 tables on 60 nodes? If so, how?
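As a side note for anyone applying the rules above: a quick way to confirm the filters actually shrink the payload after restarting the nodes is to count the exposed samples before and after. The address is a placeholder; 9103 is the exporter port from the scrape error earlier in the thread:
$ curl -s http://10.0.1.1:9103/metrics | grep -vc '^#'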
Hi everyone,
the rate of Prometheus warnings about out-of-order samples did indeed decrease with the setup above.
Increasing metric_sampling_interval_in_seconds to 120 also helps a bit:
I went from getting scrape warnings every minute to every 3-4 minutes.
I'm testing MCAC on a 3-node cluster with 100+ tables. Prometheus/ServiceMonitor are deployed in the k8s cluster; Cassandra runs on VM instances.
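For anyone else hunting for that knob: it appears to sit alongside the filtering rules in MCAC's own config file; a minimal sketch, assuming a stock install where the agent reads config/metric-collector.yaml (the path is an assumption, adjust for your deployment):
# config/metric-collector.yaml (path assumed from a stock MCAC install)
metric_sampling_interval_in_seconds: 120  # sampling less often reduces the number of points queued between scrapes
filtering_rules:
  # ... the deny/allow rules from the earlier comment ...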
Hi everyone, we're having the same issue. The MCAC exporter metrics are timestamped 2 hours in the past compared to our current time in France (UTC+2). All servers are NTP-synced, so I think the exporter gets the time from Cassandra and not from the system. If there's no way to configure it, would the simplest workaround be to change the Prometheus server timezone to match UTC?
@Miles-Garnsey can you investigate this? Could this be related to #73?