metric-collector-for-apache-cassandra
Remove timestamp from the metrics
The timestamp in the metrics is 2 hours behind the system time.
# HELP collectd_collectd_cache_size write_prometheus plugin: 'collectd' Type: 'cache_size', Dstype: 'gauge', Dsname: 'value'
# TYPE collectd_collectd_cache_size gauge
collectd_collectd_cache_size{collectd="cache",instance="10.0.1.1",cluster="CassCluster",dc="DAL",rack="rack1"} 11969 1617137796120
Here's the system time and the date the metric timestamp translates to:
$ date -d @1617137796
Tue Mar 30 13:56:36 GMT+7 2021
$ date
Tue Mar 30 15:39:22 GMT+7 2021
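For reference, the trailing value on the exposed metric line is a Unix timestamp in milliseconds, so it has to be divided by 1000 before decoding; the same instant decoded in UTC:
$ date -u -d @$((1617137796120 / 1000))
Tue Mar 30 06:56:36 UTC 2021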
The time reported in the metrics is 2 hours behind, and I can't figure out a way to disable the timestamps.
This is causing the following error when scraping in Prometheus:
msg="Error on ingesting samples that are too old or are too far into the future" num_dropped=51908
I've also got a similar issue... Our k8ssandra nodes ran out of disk space; we fixed that, but ever since, we've had no Grafana metrics from k8ssandra. I've restarted, deleted and recreated every pod and ServiceMonitor, and removed countless directories/caches, and it just keeps happening. The timestamps are out by about 5 minutes immediately after deleting mcac_data and restarting, then get older and older until they are about 4 hours old, then start moving forwards...
Any advice or help about what to grab for diagnosis would be great, but this really feels like a bug of some sort, induced by an unexpected state...
Same here.
I got the following metrics with timestamps from 2 months ago:
collectd_tcpconns_tcp_connections{tcpconns="9999-local",type="SYN_SENT",instance="172.17.47.22",cluster="V2",dc="F1",rack="D1"} 0 1624932662653
collectd_uptime{instance="172.17.47.22",cluster="V2",dc="F1",rack="D1"} 10100889 1624932662650
collectd_vmem_vmpage_action_total{vmem="dirtied",instance="172.17.47.22",cluster="V2",dc="F1",rack="D1"} 20647291651 1624932662647
1624932662650 → GMT: Tuesday, June 29, 2021 2:11:02.650 AM (relative: 2 months ago)
This causes Prometheus to drop those metrics. Not sure why MCAC doesn't update the timestamp.
Please advise.
We got the same issue. It was working fine, but after leaving it running for a couple of days, MCAC started reporting the wrong time, causing Prometheus to fail:
level=warn ts=2021-08-20T15:29:17.688Z caller=scrape.go:1375 component="scrape manager" scrape_pool=k8ssandra/k8ssandra-prometheus-k8ssandra/0 target=http://xxxxx:9103/metrics msg="Error on ingesting samples that are too old or are too far into the future" num_dropped=205
Please fix.
I'm unable to reproduce the issue on GKE. I've let the cluster run for a few days and Prometheus isn't complaining about metrics that are too old. Could you compare the clocks in the Prometheus container and the Cassandra containers to see if there's a drift? Same question for the clocks on all K8s worker nodes, to check that they're in sync.
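For example, something along these lines would show any drift at a glance (the namespace, pod names, and container name below are placeholders for your own deployment):
$ kubectl exec -n k8ssandra prometheus-k8ssandra-0 -- date -u +%s
$ kubectl exec -n k8ssandra cassandra-dc1-default-sts-0 -c cassandra -- date -u +%s
Running date -u +%s on each K8s worker node as well should give epoch values that match within a second or two if the clocks are in sync.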
Hi @adejanovski, no drift: both the Prometheus and Cassandra containers report the same time (UTC). I did notice that with the fix for out-of-order timestamps (#969), I had no problems with the timestamps as long as I had a smaller number of tables (~100) in the DB. After our production upgrade, I now have 326 tables spread across keyspaces and the problem has reappeared. Our dev env also has a similar number of tables, so it appears this happens if you've got a large number of tables in your DB, but that's just an observation...
Hi @tah-mas,
that's an interesting observation. Each table comes with a large set of metrics, which could mean they take too long to be ingested and end up being processed once they're already outside the accepted timestamp range. The solution here would be to filter some metrics to reduce the overall volume. I'm not even sure the current set of dashboards uses any table-specific metrics. I'll investigate to see how easily this could be achieved.
Thank you @adejanovski! Much appreciated
@adejanovski Any ETA on making the default config usable?
We just switched from the instaclustr exporter to MCAC and are winding up with no metrics/blank dashboards from our main cluster due to this issue, despite it working fine on a smaller cluster with fewer tables.
Hi @eriksw,
we actually merged the changes a while ago to let you filter metrics more easily. Check this commit for some examples, and let me know how it works for you.
@adejanovski Glad to see some rules documented here! I had looked around and found https://github.com/k8ssandra/k8ssandra/pull/1149/files and derived the following rule set:
filtering_rules:
  - policy: deny
    pattern: org.apache.cassandra.metrics.Table
    scope: global
  - policy: deny
    pattern: org.apache.cassandra.metrics.table
    scope: global
  - policy: allow
    pattern: org.apache.cassandra.metrics.table.live_ss_table_count
    scope: global
  - policy: allow
    pattern: org.apache.cassandra.metrics.Table.LiveSSTableCount
    scope: global
  - policy: allow
    pattern: org.apache.cassandra.metrics.table.live_disk_space_used
    scope: global
  - policy: allow
    pattern: org.apache.cassandra.metrics.table.LiveDiskSpaceUsed
    scope: global
  - policy: allow
    pattern: org.apache.cassandra.metrics.Table.Pending
    scope: global
  - policy: allow
    pattern: org.apache.cassandra.metrics.Table.Memtable
    scope: global
  - policy: allow
    pattern: org.apache.cassandra.metrics.Table.Compaction
    scope: global
  - policy: allow
    pattern: org.apache.cassandra.metrics.table.read
    scope: global
  - policy: allow
    pattern: org.apache.cassandra.metrics.table.write
    scope: global
  - policy: allow
    pattern: org.apache.cassandra.metrics.table.range
    scope: global
  - policy: allow
    pattern: org.apache.cassandra.metrics.table.coordinator
    scope: global
  - policy: allow
    pattern: org.apache.cassandra.metrics.table.dropped_mutations
    scope: global
The bad news: with those rules, on our main cluster we still ran into wildly out-of-date metric timestamps and all the other issues of https://github.com/datastax/metric-collector-for-apache-cassandra/issues/39
Has MCAC ever been used in actual production on a cluster with >300 tables on 60 nodes? If so, how?
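As a side note for anyone applying the rules above: a quick way to confirm the filters actually shrink the payload after restarting the nodes is to count the exposed samples before and after. The address is a placeholder; 9103 is the exporter port from the scrape error earlier in the thread:
$ curl -s http://10.0.1.1:9103/metrics | grep -vc '^#'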
Hi everyone,
the rate of Prometheus warnings about out-of-order samples did indeed decrease with the setup above.
Increasing metric_sampling_interval_in_seconds to 120 also helps a bit:
I went from getting scrape warnings every minute to every 3-4 minutes.
I'm testing MCAC on a 3-node cluster with 100+ tables. Prometheus/ServiceMonitor are deployed in the k8s cluster; Cassandra runs on VM instances.
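For anyone else hunting for that knob: it appears to sit alongside the filtering rules in MCAC's own config file; a minimal sketch, assuming a stock install where the agent reads config/metric-collector.yaml (the path is an assumption, adjust for your deployment):
# config/metric-collector.yaml (path assumed from a stock MCAC install)
metric_sampling_interval_in_seconds: 120  # sampling less often reduces the number of points queued between scrapes
filtering_rules:
  # ... the deny/allow rules from the earlier comment ...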
Hi everyone, we're having the same issue. The MCAC exporter metrics are timestamped 2 hours in the past compared to our current time in France (UTC+2). All servers are NTP-synced, so I think the exporter gets the time from Cassandra and not from the system. If there's no way to configure it, would the simplest workaround be to change the Prometheus server timezone to match UTC?
@Miles-Garnsey can you investigate this? Could this be related to #73?