metric-collector-for-apache-cassandra
metric-collector-for-apache-cassandra copied to clipboard
cannot start collectd
I tried installing this on a sample cluster I have and couple issues i came across running on apache-cassandra 3.11.5 and centos 7.5 (CentOS Linux release 7.5.1804 (Core)). install is simple and easy
- null pointer exception on startup because broadcast_address wasn't specifically set in cassandra.yaml.... after some digging and trying couple things i found that if i just set broadcast_address same as listen_address in the yaml that cleared up this null pointer issue
INFO [CompactionExecutor:30] 2020-05-07 14:07:08,918 CompactionEndedInterceptor.java:71 - Problem collection compaction end report
java.lang.NullPointerException: null
at org.apache.cassandra.locator.GossipingPropertyFileSnitch.getDatacenter(GossipingPropertyFileSnitch.java:89) ~[apache-cassandra-3.11.5.jar:3.11.5]
at com.datastax.mcac.UnixSocketClient.getDataCenter(UnixSocketClient.java:227) ~[datastax-mcac-agent.jar:na]
at com.datastax.mcac.UnixSocketClient.<init>(UnixSocketClient.java:169) ~[datastax-mcac-agent.jar:na]
at com.datastax.mcac.UnixSocketClient.<init>(UnixSocketClient.java:152) ~[datastax-mcac-agent.jar:na]
at com.datastax.mcac.interceptors.AbstractInterceptor.lambda$static$0(AbstractInterceptor.java:13) ~[datastax-mcac-agent.jar:na]
at com.google.common.base.Suppliers$MemoizingSupplier.get(Suppliers.java:125) ~[guava-18.0.jar:na]
at com.datastax.mcac.interceptors.CompactionEndedInterceptor.intercept(CompactionEndedInterceptor.java:65) ~[datastax-mcac-agent.jar:na]
at org.apache.cassandra.db.compaction.CompactionIterator.getMergedRowCounts(CompactionIterator.java) [apache-cassandra-3.11.5.jar:3.11.5]
at org.apache.cassandra.db.compaction.CompactionTask.runMayThrow(CompactionTask.java:227) [apache-cassandra-3.11.5.jar:3.11.5]
at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) [apache-cassandra-3.11.5.jar:3.11.5]
at org.apache.cassandra.db.compaction.CompactionTask.executeInternal(CompactionTask.java:85) [apache-cassandra-3.11.5.jar:3.11.5]
at org.apache.cassandra.db.compaction.AbstractCompactionTask.execute(AbstractCompactionTask.java:61) [apache-cassandra-3.11.5.jar:3.11.5]
at org.apache.cassandra.db.compaction.CompactionManager$BackgroundCompactionCandidate.run(CompactionManager.java:268) [apache-cassandra-3.11.5.jar:3.11.5]
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_241]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_241]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [na:1.8.0_241]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [na:1.8.0_241]
at org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:84) [apache-cassandra-3.11.5.jar:3.11.5]
- collectd cannot start
INFO [main] 2020-05-07 17:06:28,244 Server.java:159 - Starting listening for CQL clients on /0.0.0.0:9042 (unencrypted)...
INFO [main] 2020-05-07 17:06:28,388 ThriftServer.java:116 - Binding thrift service to /0.0.0.0:9160
INFO [Thread-2] 2020-05-07 17:06:28,399 ThriftServer.java:133 - Listening for thrift clients...
INFO [main] 2020-05-07 17:06:28,411 CassandraDaemonInterceptor.java:49 - Starting DataStax Metric Collector for Apache Cassandra 0.1.1
INFO [main] 2020-05-07 17:06:28,426 CollectdController.java:471 - /home/cassandra/datastax-mcac-agent-0.1.1/config/collectd.conf.tmpl
ERROR [main] 2020-05-07 17:06:33,669 CollectdController.java:355 - Collectd start failed
WARN [main] 2020-05-07 17:06:33,670 UnixSocketClient.java:299 - Not able to start collectd, will keep trying: UNKNOWN
INFO [main] 2020-05-07 17:06:33,898 UnixSocketClient.java:777 - Starting metric reporting with 30 sec interval
INFO [main] 2020-05-07 17:06:33,901 UnixSocketClient.java:697 - Starting event reporting with 300 sec interval
WARN [main] 2020-05-07 17:06:33,904 NoSpamLogger.java:94 - Connection to Collectd not established
INFO [insights-3-1] 2020-05-07 17:06:38,907 UnixSocketClient.java:707 - Event reporting skipped due to connection to collectd not being established
WARN [insights-3-1] 2020-05-07 17:06:38,993 UnixSocketClient.java:719 - didn't report node config
WARN [insights-3-1] 2020-05-07 17:06:39,011 UnixSocketClient.java:738 - didn't report system info
WARN [insights-3-1] 2020-05-07 17:06:39,061 UnixSocketClient.java:757 - didn't report schema info
in syslog i see messages like
May 7 17:18:47 cass-410322564-1-946521678 kernel: ld-2.23.so[4566]: segfault at 2addc02f5000 ip 00002addbaa4faf8 sp 00002addc02ee978 error 6 in libc.so.6[2addba8dd000+1c0000]
hi @patrickclee0207, thanks for the report! I'll take a look at both issues and update this ticket
@patrickclee0207 I'm trying to build the bundle with an earlier libc version, I think the issue is centos7 uses 2.17.
Do you see anything in the /var/log/cassandra/cassandra-collectd.log file?
all i see in that log is, same thing every 30 secs.. nothing that really stood out.. until i saw that kernel message in syslog.
[2020-05-07 20:54:45] type = logfile, key = LogLevel, value = info
[2020-05-07 20:54:45] plugin_load: plugin "write_scribe" successfully loaded.
[2020-05-07 20:54:45] write_scribe plugin: Ignoring config option `SocketFile'.
[2020-05-07 20:54:45] plugin_load: plugin "write_prometheus" successfully loaded.
[2020-05-07 20:54:45] plugin_load: plugin "cpu" successfully loaded.
[2020-05-07 20:54:45] plugin_load: plugin "df" successfully loaded.
[2020-05-07 20:54:45] plugin_load: plugin "vmem" successfully loaded.
[2020-05-07 20:54:45] plugin_load: plugin "numa" successfully loaded.
[2020-05-07 20:54:45] plugin_load: plugin "contextswitch" successfully loaded.
[2020-05-07 20:54:45] plugin_load: plugin "cpufreq" successfully loaded.
[2020-05-07 20:54:45] plugin_load: plugin "irq" successfully loaded.
[2020-05-07 20:54:45] plugin_load: plugin "protocols" successfully loaded.
[2020-05-07 20:54:45] plugin_load: plugin "ethstat" successfully loaded.
[2020-05-07 20:54:45] plugin_load: plugin "disk" successfully loaded.
[2020-05-07 20:54:45] plugin_load: plugin "interface" successfully loaded.
[2020-05-07 20:54:45] plugin_load: plugin "load" successfully loaded.
[2020-05-07 20:54:45] plugin_load: plugin "memory" successfully loaded.
[2020-05-07 20:54:45] plugin_load: plugin "swap" successfully loaded.
[2020-05-07 20:54:45] plugin_load: plugin "uptime" successfully loaded.
[2020-05-07 20:54:45] plugin_load: plugin "processes" successfully loaded.
[2020-05-07 20:54:45] plugin_load: plugin "tcpconns" successfully loaded.
[2020-05-07 20:54:45] plugin_load: plugin "match_regex" successfully loaded.
[2020-05-07 20:54:45] plugin_load: plugin "target_set" successfully loaded.
[2020-05-07 20:54:45] plugin_load: plugin "target_replace" successfully loaded.
[2020-05-07 20:54:45] unixsock plugin: Successfully deleted socket file "/tmp/ds-1294574779185184502.sock".
[2020-05-07 20:54:45] cpufreq plugin: Found 0 CPUs
[2020-05-07 20:54:45] Initialization complete, entering read-loop.
[2020-05-07 20:54:45] tcpconns plugin: Reading from netlink succeeded. Will use the netlink method from now on.
I ran cassandra 3.11.5 and centos 7.5 (CentOS Linux release 7.5.1804 (Core)) with no issues ... could you edit the collectd.conf.tmpl file and change the logfile loglevel to debug?
i was a little worried that it is probably something specific with mine, i tried on a few different nodes but all are setup the same way.
thats a lot.. this is all of the logs from 1 of the runs..
[2020-05-07 21:27:31] type = logfile, key = LogLevel, value = debug
[2020-05-07 21:27:31] plugin_load: plugin "write_scribe" successfully loaded.
[2020-05-07 21:27:31] write_scribe plugin: Ignoring config option `SocketFile'.
[2020-05-07 21:27:31] plugin_load: plugin "write_prometheus" successfully loaded.
[2020-05-07 21:27:31] plugin_load: plugin "cpu" successfully loaded.
[2020-05-07 21:27:31] plugin_load: plugin "df" successfully loaded.
[2020-05-07 21:27:31] type = df, key = FsType, value = ext3
[2020-05-07 21:27:31] type = df, key = FsType, value = ext4
[2020-05-07 21:27:31] type = df, key = FsType, value = xfs
[2020-05-07 21:27:31] type = df, key = FsType, value = zfs
[2020-05-07 21:27:31] type = df, key = IgnoreSelected, value = False
[2020-05-07 21:27:31] plugin_load: plugin "vmem" successfully loaded.
[2020-05-07 21:27:31] plugin_load: plugin "numa" successfully loaded.
[2020-05-07 21:27:31] plugin_load: plugin "contextswitch" successfully loaded.
[2020-05-07 21:27:31] plugin_load: plugin "cpufreq" successfully loaded.
[2020-05-07 21:27:31] plugin_load: plugin "irq" successfully loaded.
[2020-05-07 21:27:31] plugin_load: plugin "protocols" successfully loaded.
[2020-05-07 21:27:31] plugin_load: plugin "ethstat" successfully loaded.
[2020-05-07 21:27:31] plugin_load: plugin "disk" successfully loaded.
[2020-05-07 21:27:31] type = disk, key = Disk, value = /^loopd+$/
[2020-05-07 21:27:31] type = disk, key = Disk, value = /^dm-d+$/
[2020-05-07 21:27:31] type = disk, key = IgnoreSelected, value = true
[2020-05-07 21:27:31] plugin_load: plugin "interface" successfully loaded.
[2020-05-07 21:27:31] type = interface, key = Interface, value = /^lod*$/
[2020-05-07 21:27:31] type = interface, key = Interface, value = /^docker.*/
[2020-05-07 21:27:31] type = interface, key = Interface, value = /^t(un|ap)d*$/
[2020-05-07 21:27:31] type = interface, key = Interface, value = /^veth.*$/
[2020-05-07 21:27:31] type = interface, key = IgnoreSelected, value = true
[2020-05-07 21:27:31] plugin_load: plugin "load" successfully loaded.
[2020-05-07 21:27:31] plugin_load: plugin "memory" successfully loaded.
[2020-05-07 21:27:31] plugin_load: plugin "swap" successfully loaded.
[2020-05-07 21:27:31] plugin_load: plugin "uptime" successfully loaded.
[2020-05-07 21:27:31] plugin_load: plugin "processes" successfully loaded.
[2020-05-07 21:27:31] plugin_load: plugin "tcpconns" successfully loaded.
[2020-05-07 21:27:31] ProcessMatch: adding ".*" as criteria to process all.
[2020-05-07 21:27:31] type = tcpconns, key = ListeningPorts, value = true
[2020-05-07 21:27:31] fc_register_match (regex);
[2020-05-07 21:27:31] plugin_load: plugin "match_regex" successfully loaded.
[2020-05-07 21:27:31] fc_register_target (set);
[2020-05-07 21:27:31] plugin_load: plugin "target_set" successfully loaded.
[2020-05-07 21:27:31] fc_register_target (replace);
[2020-05-07 21:27:31] plugin_load: plugin "target_replace" successfully loaded.
[2020-05-07 21:27:31] option = PreCacheChain; value = PreCache;
[2020-05-07 21:27:31] fc_register_target (jump);
[2020-05-07 21:27:31] fc_register_target (stop);
[2020-05-07 21:27:31] fc_register_target (return);
[2020-05-07 21:27:31] fc_register_target (write);
[2020-05-07 21:27:31] interval_g = 10.000;
[2020-05-07 21:27:31] timeout_g = 10;
[2020-05-07 21:27:31] hostname_g = 10.12.49.179;
[2020-05-07 21:27:31] scribe_backend: "Parsing scribe config file /tmp/mcac-collectd-scribe-5608148392903113769.conf"
[2020-05-07 21:27:31] scribe_backend: "got configuration data from file </tmp/mcac-collectd-scribe-5608148392903113769.conf>"
[2020-05-07 21:27:31] scribe_backend: "metric_update_interval_secs 300"
[2020-05-07 21:27:31] scribe_backend: "CATEGORY : default"
[2020-05-07 21:27:31] scribe_backend: "Loading http"
[2020-05-07 21:27:31] scribe_backend: "Creating default store"
[2020-05-07 21:27:31] scribe_backend: "configured <1> stores"
[2020-05-07 21:27:31] Created new plugin context.
[2020-05-07 21:27:31] unixsock plugin: socket path = /tmp/ds-5857296976520851383.sock
[2020-05-07 21:27:31] write_prometheus plugin: Successfully started microhttpd 0.9.59
[2020-05-07 21:27:31] unixsock plugin: Successfully deleted socket file "/tmp/ds-5857296976520851383.sock".
[2020-05-07 21:27:31] numa plugin: Found 2 nodes.
[2020-05-07 21:27:31] unixsock plugin: Calling accept..
[2020-05-07 21:27:31] cpufreq plugin: Found 0 CPUs
[2020-05-07 21:27:31] plugin_unregister_read: Marked `cpufreq' for removal.
[2020-05-07 21:27:31] pagesize_g = 4096; CONFIG_HZ = 100;
[2020-05-07 21:27:31] plugin_read_thread: Handling `write_scribe//app/cassandra/current/bin/../logs/cassandra-collectd.log'.
[2020-05-07 21:27:31] Created new plugin context.
[2020-05-07 21:27:31] plugin_read_thread: Handling `df'.
[2020-05-07 21:27:31] Created new plugin context.
[2020-05-07 21:27:31] plugin_read_thread: Handling `vmem'.
[2020-05-07 21:27:31] Created new plugin context.
[2020-05-07 21:27:31] plugin_read_thread: Handling `numa'.
[2020-05-07 21:27:31] Created new plugin context.
[2020-05-07 21:27:31] plugin_read_thread: Handling `cpu'.
[2020-05-07 21:27:31] Created new plugin context.
[2020-05-07 21:27:31] Created new plugin context.
[2020-05-07 21:27:31] plugin_read_thread: read-function of the `numa' plugin took 0.000267 seconds.
[2020-05-07 21:27:31] plugin_read_thread: read-function of the `write_scribe//app/cassandra/current/bin/../logs/cassandra-collectd.log' plugin took 0.000326 seconds.
[2020-05-07 21:27:31] plugin_read_thread: read-function of the `vmem' plugin took 0.000543 seconds.
[2020-05-07 21:27:31] plugin_read_thread: Effective interval of the `vmem' plugin is 10.000 seconds.
[2020-05-07 21:27:31] plugin_read_thread: Next read of the `vmem' plugin at 1588886861.414.
[2020-05-07 21:27:31] plugin_read_thread: Handling `contextswitch'.
[2020-05-07 21:27:31] plugin_dispatch_values: time = 1588886851.423; interval = 10.000; host = 10.12.49.179; plugin = numa; plugin_instance = node0; type = vmpage_action; type_instance = numa_hit;
[2020-05-07 21:27:31] fc_process_chain (chain = PreCache);
[2020-05-07 21:27:31] fc_process_chain (PreCache): Testing the `add_topology_tags' rule.
[2020-05-07 21:27:31] utils_mount: (void); COLLECTD_MNTTAB = /etc/mtab
[2020-05-07 21:27:31] plugin_read_thread: Next read of the `write_scribe//app/cassandra/current/bin/../logs/cassandra-collectd.log' plugin at 1588886861.413.
[2020-05-07 21:27:31] utils_mount: new = {dir = /, spec_device = rootfs, type = rootfs, options = rw, device = rw}
[2020-05-07 21:27:31] utils_mount: new = {dir = /sys, spec_device = sysfs, type = sysfs, options = rw,nosuid,nodev,noexec,relatime, device = rw,nosuid,nodev,noexec,relatime}
[2020-05-07 21:27:31] utils_mount: new = {dir = /proc, spec_device = proc, type = proc, options = rw,nosuid,nodev,noexec,relatime, device = rw,nosuid,nodev,noexec,relatime}
[2020-05-07 21:27:31] utils_mount: new = {dir = /dev, spec_device = devtmpfs, type = devtmpfs, options = rw,nosuid,size=57740664k,nr_inodes=14435166,mode=755, device = rw,nosuid,size=57740664k,nr_inodes=14435166,mode=755}
[2020-05-07 21:27:31] utils_mount: new = {dir = /sys/kernel/security, spec_device = securityfs, type = securityfs, options = rw,nosuid,nodev,noexec,relatime, device = rw,nosuid,nodev,noexec,relatime}
[2020-05-07 21:27:31] utils_mount: new = {dir = /dev/shm, spec_device = tmpfs, type = tmpfs, options = rw,nosuid,nodev, device = rw,nosuid,nodev}
[2020-05-07 21:27:31] utils_mount: new = {dir = /dev/pts, spec_device = devpts, type = devpts, options = rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000, device = rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000}
[2020-05-07 21:27:31] utils_mount: new = {dir = /run, spec_device = tmpfs, type = tmpfs, options = rw,nosuid,nodev,mode=755, device = rw,nosuid,nodev,mode=755}
[2020-05-07 21:27:31] utils_mount: new = {dir = /sys/fs/cgroup, spec_device = tmpfs, type = tmpfs, options = ro,nosuid,nodev,noexec,mode=755, device = ro,nosuid,nodev,noexec,mode=755}
[2020-05-07 21:27:31] utils_mount: new = {dir = /sys/fs/cgroup/systemd, spec_device = cgroup, type = cgroup, options = rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd, device = rw,nosuid,nodev,noexec,relatime,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd}
[2020-05-07 21:27:31] utils_mount: new = {dir = /sys/fs/pstore, spec_device = pstore, type = pstore, options = rw,nosuid,nodev,noexec,relatime, device = rw,nosuid,nodev,noexec,relatime}
[2020-05-07 21:27:31] utils_mount: new = {dir = /sys/fs/cgroup/cpu,cpuacct, spec_device = cgroup, type = cgroup, options = rw,nosuid,nodev,noexec,relatime,cpuacct,cpu, device = rw,nosuid,nodev,noexec,relatime,cpuacct,cpu}
[2020-05-07 21:27:31] utils_mount: new = {dir = /sys/fs/cgroup/pids, spec_device = cgroup, type = cgroup, options = rw,nosuid,nodev,noexec,relatime,pids, device = rw,nosuid,nodev,noexec,relatime,pids}
[2020-05-07 21:27:31] utils_mount: new = {dir = /sys/fs/cgroup/cpuset, spec_device = cgroup, type = cgroup, options = rw,nosuid,nodev,noexec,relatime,cpuset, device = rw,nosuid,nodev,noexec,relatime,cpuset}
[2020-05-07 21:27:31] Initialization complete, entering read-loop.
[2020-05-07 21:27:31] plugin_read_thread: Effective interval of the `numa' plugin is 10.000 seconds.
[2020-05-07 21:27:31] plugin_read_thread: Next read of the `numa' plugin at 1588886861.414.
[2020-05-07 21:27:31] plugin_read_thread: Handling `irq'.
[2020-05-07 21:27:31] regex match: Regular expression `.*' matches `numa'.
[2020-05-07 21:27:31] fc_process_chain (PreCache): Rule `add_topology_tags' matches.
[2020-05-07 21:27:31] plugin_read_thread: read-function of the `cpu' plugin took 0.000292 seconds.
[2020-05-07 21:27:31] plugin_read_thread: Effective interval of the `cpu' plugin is 10.000 seconds.
[2020-05-07 21:27:31] plugin_read_thread: Next read of the `cpu' plugin at 1588886861.413.
[2020-05-07 21:27:31] plugin_read_thread: Handling `protocols'.
[2020-05-07 21:27:31] utils_mount: new = {dir = /sys/fs/cgroup/hugetlb, spec_device = cgroup, type = cgroup, options = rw,nosuid,nodev,noexec,relatime,hugetlb, device = rw,nosuid,nodev,noexec,relatime,hugetlb}
[2020-05-07 21:27:31] utils_mount: new = {dir = /sys/fs/cgroup/net_cls,net_prio, spec_device = cgroup, type = cgroup, options = rw,nosuid,nodev,noexec,relatime,net_prio,net_cls, device = rw,nosuid,nodev,noexec,relatime,net_prio,net_cls}
[2020-05-07 21:27:31] utils_mount: new = {dir = /sys/fs/cgroup/memory, spec_device = cgroup, type = cgroup, options = rw,nosuid,nodev,noexec,relatime,memory, device = rw,nosuid,nodev,noexec,relatime,memory}
[2020-05-07 21:27:31] utils_mount: new = {dir = /sys/fs/cgroup/blkio, spec_device = cgroup, type = cgroup, options = rw,nosuid,nodev,noexec,relatime,blkio, device = rw,nosuid,nodev,noexec,relatime,blkio}
[2020-05-07 21:27:31] utils_mount: new = {dir = /sys/fs/cgroup/freezer, spec_device = cgroup, type = cgroup, options = rw,nosuid,nodev,noexec,relatime,freezer, device = rw,nosuid,nodev,noexec,relatime,freezer}
[2020-05-07 21:27:31] Created new plugin context.
[2020-05-07 21:27:31] Created new plugin context.
[2020-05-07 21:27:31] Created new plugin context.
[2020-05-07 21:27:31] Created new plugin context.
[2020-05-07 21:27:31] plugin_read_thread: read-function of the `contextswitch' plugin took 0.000158 seconds.
[2020-05-07 21:27:31] plugin_read_thread: Effective interval of the `contextswitch' plugin is 10.000 seconds.
[2020-05-07 21:27:31] plugin_read_thread: Next read of the `contextswitch' plugin at 1588886861.414.
[2020-05-07 21:27:31] plugin_read_thread: Handling `ethstat'.
[2020-05-07 21:27:31] utils_mount: new = {dir = /sys/fs/cgroup/perf_event, spec_device = cgroup, type = cgroup, options = rw,nosuid,nodev,noexec,relatime,perf_event, device = rw,nosuid,nodev,noexec,relatime,perf_event}
[2020-05-07 21:27:31] utils_mount: new = {dir = /sys/fs/cgroup/devices, spec_device = cgroup, type = cgroup, options = rw,nosuid,nodev,noexec,relatime,devices, device = rw,nosuid,nodev,noexec,relatime,devices}
[2020-05-07 21:27:31] utils_mount: new = {dir = /sys/kernel/config, spec_device = configfs, type = configfs, options = rw,relatime, device = rw,relatime}
[2020-05-07 21:27:31] utils_mount: new = {dir = /, spec_device = /dev/sda2, type = xfs, options = rw,relatime,attr2,inode64,noquota, device = rw,relatime,attr2,inode64,noquota}
[2020-05-07 21:27:31] utils_mount: new = {dir = /proc/sys/fs/binfmt_misc, spec_device = systemd-1, type = autofs, options = rw,relatime,fd=30,pgrp=1,timeout=0,minproto=5,maxproto=5,direct,pipe_ino=25686, device = rw,relatime,fd=30,pgrp=1,timeout=0,minproto=5,maxproto=5,direct,pipe_ino=25686}
[2020-05-07 21:27:31] utils_mount: new = {dir = /dev/mqueue, spec_device = mqueue, type = mqueue, options = rw,relatime, device = rw,relatime}
[2020-05-07 21:27:31] utils_mount: new = {dir = /sys/kernel/debug, spec_device = debugfs, type = debugfs, options = rw,relatime, device = rw,relatime}
[2020-05-07 21:27:31] utils_mount: new = {dir = /dev/hugepages, spec_device = hugetlbfs, type = hugetlbfs, options = rw,relatime, device = rw,relatime}
[2020-05-07 21:27:31] utils_mount: new = {dir = /app/cassandra, spec_device = /dev/mapper/cass-volume--410322564--1, type = xfs, options = rw,relatime,attr2,inode64,sunit=256,swidth=1024,noquota, device = rw,relatime,attr2,inode64,sunit=256,swidth=1024,noquota}
[2020-05-07 21:27:31] plugin_dispatch_values: time = 1588886851.423; interval = 10.000; host = 10.12.49.179; plugin = vmem; plugin_instance = ; type = vmpage_number; type_instance = inactive_anon;
[2020-05-07 21:27:31] fc_process_chain (chain = PreCache);
[2020-05-07 21:27:31] fc_process_chain (PreCache): Testing the `add_topology_tags' rule.
[2020-05-07 21:27:31] regex match: Regular expression `.*' matches `vmem'.
[2020-05-07 21:27:31] fc_process_chain (PreCache): Rule `add_topology_tags' matches.
[2020-05-07 21:27:31] target_set: ts_invoke: setting metadata value for key `cluster': `pcleeperf-stg'.
[2020-05-07 21:27:31] target_set: ts_invoke: setting metadata value for key `cluster': `pcleeperf-stg'.
[2020-05-07 21:27:31] target_set: ts_invoke: setting metadata value for key `dc': `westus'.
[2020-05-07 21:27:31] target_set: ts_invoke: setting metadata value for key `rack': `0'.
[2020-05-07 21:27:31] fc_process_chain (PreCache): Testing the `add_series_tag' rule.
[2020-05-07 21:27:31] regex match: Regular expression `^(cpu|interfaces|disk)$' does not match `vmem'.
[2020-05-07 21:27:31] fc_process_chain (PreCache): Executing the default targets.
[2020-05-07 21:27:31] fc_process_chain (PreCache): Signaling `continue' at end of chain.
[2020-05-07 21:27:31] uc_insert: Added 10.12.49.179/vmem/vmpage_number-inactive_anon to the cache.
[2020-05-07 21:27:31] plugin: plugin_write: Writing values via write_scribe.
[2020-05-07 21:27:31] plugin_read_thread: read-function of the `ethstat' plugin took 0.000010 seconds.
[2020-05-07 21:27:31] plugin_read_thread: Effective interval of the `ethstat' plugin is 10.000 seconds.
[2020-05-07 21:27:31] plugin_read_thread: Next read of the `ethstat' plugin at 1588886861.415.
[2020-05-07 21:27:31] plugin_read_thread: Handling `disk'.
[2020-05-07 21:27:31] plugin_dispatch_values: time = 1588886851.423; interval = 10.000; host = 10.12.49.179; plugin = vmem; plugin_instance = ; type = vmpage_number; type_instance = active_anon;
[2020-05-07 21:27:31] fc_process_chain (chain = PreCache);
[2020-05-07 21:27:31] fc_process_chain (PreCache): Testing the `add_topology_tags' rule.
[2020-05-07 21:27:31] target_set: ts_invoke: setting metadata value for key `dc': `westus'.
[2020-05-07 21:27:31] target_set: ts_invoke: setting metadata value for key `rack': `0'.
[2020-05-07 21:27:31] fc_process_chain (PreCache): Testing the `add_series_tag' rule.
[2020-05-07 21:27:31] regex match: Regular expression `^(cpu|interfaces|disk)$' does not match `numa'.
[2020-05-07 21:27:31] fc_process_chain (PreCache): Executing the default targets.
[2020-05-07 21:27:31] fc_process_chain (PreCache): Signaling `continue' at end of chain.
[2020-05-07 21:27:31] uc_insert: Added 10.12.49.179/numa-node0/vmpage_action-numa_hit to the cache.
[2020-05-07 21:27:31] plugin: plugin_write: Writing values via write_scribe.
[2020-05-07 21:27:31] plugin_dispatch_values: time = 1588886851.423; interval = 10.000; host = 10.12.49.179; plugin = vmem; plugin_instance = ; type = vmpage_number; type_instance = alloc_batch;
[2020-05-07 21:27:31] fc_process_chain (chain = PreCache);
[2020-05-07 21:27:31] fc_process_chain (PreCache): Testing the `add_topology_tags' rule.
[2020-05-07 21:27:31] regex match: Regular expression `.*' matches `vmem'.
[2020-05-07 21:27:31] fc_process_chain (PreCache): Rule `add_topology_tags' matches.
[2020-05-07 21:27:31] plugin_dispatch_values: time = 1588886851.423; interval = 10.000; host = 10.12.49.179; plugin = vmem; plugin_instance = ; type = vmpage_number; type_instance = free_pages;
[2020-05-07 21:27:31] fc_process_chain (chain = PreCache);
[2020-05-07 21:27:31] fc_process_chain (PreCache): Testing the `add_topology_tags' rule.
[2020-05-07 21:27:31] regex match: Regular expression `.*' matches `vmem'.
[2020-05-07 21:27:31] fc_process_chain (PreCache): Rule `add_topology_tags' matches.
[2020-05-07 21:27:31] plugin_read_thread: read-function of the `irq' plugin took 0.001966 seconds.
[2020-05-07 21:27:31] plugin_read_thread: Effective interval of the `irq' plugin is 10.000 seconds.
[2020-05-07 21:27:31] plugin_read_thread: Next read of the `irq' plugin at 1588886861.414.
[2020-05-07 21:27:31] plugin_read_thread: Handling `interface'.
[2020-05-07 21:27:31] regex match: Regular expression `.*' matches `vmem'.
[2020-05-07 21:27:31] fc_process_chain (PreCache): Rule `add_topology_tags' matches.
[2020-05-07 21:27:31] utils_mount: new = {dir = /boot, spec_device = /dev/sda1, type = xfs, options = rw,relatime,attr2,inode64,noquota, device = rw,relatime,attr2,inode64,noquota}
[2020-05-07 21:27:31] plugin_read_thread: read-function of the `interface' plugin took 0.000628 seconds.
[2020-05-07 21:27:31] plugin_read_thread: Effective interval of the `interface' plugin is 10.000 seconds.
[2020-05-07 21:27:31] plugin_read_thread: Next read of the `interface' plugin at 1588886861.415.
[2020-05-07 21:27:31] plugin_read_thread: Handling `load'.
[2020-05-07 21:27:31] disk plugin: disk_name = fd0; read_ops = 0; ds->read_ops = 0; diff_read_ops = 0;
[2020-05-07 21:27:31] plugin_read_thread: read-function of the `load' plugin took 0.000045 seconds.
[2020-05-07 21:27:31] plugin_read_thread: Effective interval of the `load' plugin is 10.000 seconds.
[2020-05-07 21:27:31] plugin_read_thread: Next read of the `load' plugin at 1588886861.415.
[2020-05-07 21:27:31] plugin_read_thread: Handling `memory'.
[2020-05-07 21:27:31] target_set: ts_invoke: setting metadata value for key `cluster': `pcleeperf-stg'.
[2020-05-07 21:27:31] target_set: ts_invoke: setting metadata value for key `dc': `westus'.
[2020-05-07 21:27:31] target_set: ts_invoke: setting metadata value for key `rack': `0'.
[2020-05-07 21:27:31] fc_process_chain (PreCache): Testing the `add_series_tag' rule.
[2020-05-07 21:27:31] plugin_read_thread: read-function of the `memory' plugin took 0.000199 seconds.
[2020-05-07 21:27:31] plugin_read_thread: Effective interval of the `memory' plugin is 10.000 seconds.
[2020-05-07 21:27:31] plugin_read_thread: Next read of the `memory' plugin at 1588886861.416.
[2020-05-07 21:27:31] plugin_read_thread: Handling `swap'.
[2020-05-07 21:27:31] disk plugin: (ds->poll_count = 1) <= (min_poll_count = 2); => Not writing.
[2020-05-07 21:27:31] disk plugin: disk_name = sr0; read_ops = 0; ds->read_ops = 0; diff_read_ops = 0;
[2020-05-07 21:27:31] disk plugin: (ds->poll_count = 1) <= (min_poll_count = 2); => Not writing.
few lines from the cassandra system.log after restarting
INFO [main] 2020-05-07 21:27:31,335 CollectdController.java:471 - /home/cassandra/datastax-mcac-agent-0.1.1/config/collectd.conf.tmpl
WARN [main] 2020-05-07 21:27:31,468 UnixSocketClient.java:299 - Not able to start collectd, will keep trying: BROKEN
INFO [main] 2020-05-07 21:27:31,665 UnixSocketClient.java:777 - Starting metric reporting with 30 sec interval
INFO [main] 2020-05-07 21:27:31,668 UnixSocketClient.java:697 - Starting event reporting with 300 sec interval
WARN [main] 2020-05-07 21:27:31,672 NoSpamLogger.java:94 - Connection to Collectd not established
INFO [insights-3-1] 2020-05-07 21:27:36,674 UnixSocketClient.java:707 - Event reporting skipped due to connection to collectd not being established
WARN [insights-3-1] 2020-05-07 21:27:36,755 UnixSocketClient.java:719 - didn't report node config
WARN [insights-3-1] 2020-05-07 21:27:36,771 UnixSocketClient.java:738 - didn't report system info
WARN [insights-3-1] 2020-05-07 21:27:36,814 UnixSocketClient.java:757 - didn't report schema info
INFO [ScheduledTasks:1] 2020-05-07 21:28:01,667 CollectdController.java:471 - /home/cassandra/datastax-mcac-agent-0.1.1/config/collectd.conf.tmpl
INFO [insights-3-1] 2020-05-07 21:28:01,669 UnixSocketClient.java:788 - Metric reporting skipped due to connection to collectd not being established
WARN [insights-3-1] 2020-05-07 21:28:01,690 NoSpamLogger.java:94 - Connection to Collectd not established
ERROR [ScheduledTasks:1] 2020-05-07 21:28:06,870 CollectdController.java:355 - Collectd start failed
WARN [ScheduledTasks:1] 2020-05-07 21:28:06,913 UnixSocketClient.java:504 - Error connecting to collectd
INFO [insights-3-1] 2020-05-07 21:28:31,849 UnixSocketClient.java:788 - Metric reporting skipped due to connection to collectd not being established
WARN [insights-3-1] 2020-05-07 21:28:31,852 NoSpamLogger.java:94 - Connection to Collectd not established
INFO [ScheduledTasks:1] 2020-05-07 21:28:36,915 CollectdController.java:471 - /home/cassandra/datastax-mcac-agent-0.1.1/config/collectd.conf.tmpl
ERROR [ScheduledTasks:1] 2020-05-07 21:28:42,119 CollectdController.java:355 - Collectd start failed
WARN [ScheduledTasks:1] 2020-05-07 21:28:42,120 UnixSocketClient.java:504 - Error connecting to collectd
I wonder if it's one of the plugins causing the error. Could be permissions... Could you try disabling them perhaps the disk one? (except for the write_scribe one, thats how c* and collectd talk).
yeah, i just commented out all the plugin's except for a couple (cpu and mem) and that worked. so it's one of the plugins for sure. so now i just need to figure out which one
great!
well..weird..., i'm still looking around but... collectd tends to work with everything enabled.. but it's the <Chain "PreCache"> section that when i have that enabled collectd just won't start up..
and i don't see any cassandra metrics if i comment that out and get it running. just all the os metrics if i hit the prometheus endpoint..
it seems to be a bit flaky
INFO [main] 2020-05-07 23:26:45,407 CassandraDaemonInterceptor.java:49 - Starting DataStax Metric Collector for Apache Cassandra 0.1.1
INFO [main] 2020-05-07 23:26:45,423 CollectdController.java:471 - /home/cassandra/datastax-mcac-agent-0.1.1/config/collectd.conf.tmpl
INFO [main] 2020-05-07 23:26:46,594 UnixSocketClient.java:499 - Connection to collectd established
INFO [main] 2020-05-07 23:26:46,863 UnixSocketClient.java:777 - Starting metric reporting with 30 sec interval
INFO [main] 2020-05-07 23:26:46,865 UnixSocketClient.java:697 - Starting event reporting with 300 sec interval
INFO [insights-3-1] 2020-05-07 23:27:16,886 UnixSocketClient.java:788 - Metric reporting skipped due to connection to collectd not being established
WARN [insights-3-1] 2020-05-07 23:27:16,890 NoSpamLogger.java:94 - Connection to Collectd not established
INFO [ScheduledTasks:1] 2020-05-07 23:27:16,894 CollectdController.java:471 - /home/cassandra/datastax-mcac-agent-0.1.1/config/collectd.conf.tmpl
INFO [ScheduledTasks:1] 2020-05-07 23:27:18,002 UnixSocketClient.java:499 - Connection to collectd established
INFO [insights-3-1] 2020-05-07 23:27:47,055 UnixSocketClient.java:788 - Metric reporting skipped due to connection to collectd not being established
WARN [insights-3-1] 2020-05-07 23:27:47,056 NoSpamLogger.java:94 - Connection to Collectd not established
INFO [ScheduledTasks:1] 2020-05-07 23:27:48,031 CollectdController.java:471 - /home/cassandra/datastax-mcac-agent-0.1.1/config/collectd.conf.tmpl
INFO [ScheduledTasks:1] 2020-05-07 23:27:49,152 UnixSocketClient.java:499 - Connection to collectd established
INFO [insights-3-1] 2020-05-07 23:28:17,120 UnixSocketClient.java:788 - Metric reporting skipped due to connection to collectd not being established
WARN [insights-3-1] 2020-05-07 23:28:17,121 NoSpamLogger.java:94 - Connection to Collectd not established
INFO [ScheduledTasks:1] 2020-05-07 23:28:19,185 CollectdController.java:471 - /home/cassandra/datastax-mcac-agent-0.1.1/config/collectd.conf.tmpl
INFO [ScheduledTasks:1] 2020-05-07 23:28:19,286 UnixSocketClient.java:499 - Connection to collectd established
INFO [insights-3-1] 2020-05-07 23:28:47,185 UnixSocketClient.java:788 - Metric reporting skipped due to connection to collectd not being established
WARN [insights-3-1] 2020-05-07 23:28:47,186 NoSpamLogger.java:94 - Connection to Collectd not established
INFO [ScheduledTasks:1] 2020-05-07 23:28:49,318 CollectdController.java:471 - /home/cassandra/datastax-mcac-agent-0.1.1/config/collectd.conf.tmpl
INFO [ScheduledTasks:1] 2020-05-07 23:28:50,459 UnixSocketClient.java:499 - Connection to collectd established
if i comment out this section of the config file, this is where i get to the state i'm at now..
# <Rule "add_topology_tags">
# <Match regex>
# Plugin ".*"
# </Match>
# <Target "set">
# MetaData "cluster" "{{cluster}}"
# MetaData "dc" "{{dataCenter}}"
# MetaData "rack" "{{rack}}"
# </Target>
# </Rule>
is the "Ignoring config option `SocketFile' normal?
[2020-05-07 23:39:30] type = logfile, key = LogLevel, value = info
[2020-05-07 23:39:30] plugin_load: plugin "write_scribe" successfully loaded.
[2020-05-07 23:39:30] write_scribe plugin: Ignoring config option `SocketFile'.
[2020-05-07 23:39:30] plugin_load: plugin "write_prometheus" successfully loaded.
[2020-05-07 23:39:30] plugin_load: plugin "match_regex" successfully loaded.
[2020-05-07 23:39:30] plugin_load: plugin "target_set" successfully loaded.
[2020-05-07 23:39:30] plugin_load: plugin "target_replace" successfully loaded.
[2020-05-07 23:39:30] unixsock plugin: Successfully deleted socket file "/tmp/ds-7036458465289524625.sock".
[2020-05-07 23:39:30] Initialization complete, entering read-loop.
Can you run ps -ef and find the process args for collectd? It should point to the template rendering of the collectd.conf file. Please post that. it looks like something bad is going into as rack ?
Also, I released a fix with the rack/dc name stuff. could you try that build (v0.1.2)? It could be related to that since it's the first thing you hit
i thought maybe something with the rack name too. i tried several different things of commenting out juts that rack metadata , just dc etc.. i tore this down and redeployed with a different rack name (instead of a 1 , 0 or 2.. i did r0) and still same thing.
so i updated to the newer version and tried it, still same..
so i commented out that section so i can get it to run..
$ ps -ef | grep collectd
cassand+ 95081 1 0 15:55 ? 00:00:00 collectd_wrapper /home/cassandra/datastax-mcac-agent-0.1.2/lib/collectd/usr/sbin/collectd -C /tmp/mcac-collectd-7404448595816808467.conf -P /app/cassandra/current/bin/../logs/mcac-collectd.pid
i went and grabbed the temp conf file.. the dc, rack and cluster all look correct there.
TypesDB "/home/cassandra/datastax-mcac-agent-0.1.2/lib/collectd/usr/share/collectd/types.db"
PluginDir "/home/cassandra/datastax-mcac-agent-0.1.2/lib/collectd/usr/lib/collectd"
BaseDir "/home/cassandra/datastax-mcac-agent-0.1.2/lib/collectd"
Hostname "10.12.49.63"
FQDNLookup false
#Timeout needs to be high otherwise we can
#cause a invalidation storm on large # of tables
Interval 10
Timeout 10
ReadThreads 5
WriteThreads 5
WriteQueueLimitHigh 500000
WriteQueueLimitLow 500000
CollectInternalStats true
LoadPlugin logfile
<Plugin "logfile">
File "/app/cassandra/current/bin/../logs/cassandra-collectd.log"
LogLevel "debug"
</Plugin>
LoadPlugin write_scribe
<Plugin "write_scribe">
ConfigFile "/tmp/mcac-collectd-scribe-5715359051930293352.conf"
<File "/app/cassandra/current/bin/../logs/cassandra-collectd.log">
Instance "collectd.log"
</File>
SocketFile "/tmp/ds-288985274633111343.sock"
</Plugin>
LoadPlugin write_prometheus
<Plugin write_prometheus>
Port "9103"
</Plugin>
LoadPlugin cpu
LoadPlugin df
<Plugin df>
FsType "ext3"
FsType "ext4"
FsType "xfs"
FsType "zfs"
IgnoreSelected False
</Plugin>
LoadPlugin vmem
LoadPlugin numa
LoadPlugin contextswitch
LoadPlugin cpufreq
LoadPlugin irq
LoadPlugin protocols
LoadPlugin ethstat
LoadPlugin disk
<Plugin "disk">
Disk "/^loop\d+$/"
Disk "/^dm-\d+$/"
IgnoreSelected "true"
</Plugin>
LoadPlugin interface
<Plugin "interface">
Interface "/^lo\d*$/"
Interface "/^docker.*/"
Interface "/^t(un|ap)\d*$/"
Interface "/^veth.*$/"
IgnoreSelected "true"
</Plugin>
LoadPlugin load
LoadPlugin memory
LoadPlugin swap
LoadPlugin uptime
LoadPlugin processes
LoadPlugin tcpconns
<Plugin "processes">
ProcessMatch "all" ".*"
</Plugin>
<Plugin "tcpconns">
ListeningPorts true
</Plugin>
LoadPlugin match_regex
LoadPlugin target_set
LoadPlugin target_replace
PreCacheChain "PreCache"
<Chain "PreCache">
# <Rule "add_topology_tags">
# <Match regex>
# Plugin ".*" #Match all
# </Match>
# <Target "set">
# MetaData "cluster" "pcleeperf-stg"
# MetaData "dc" "westus"
# MetaData "rack" "2"
# </Target>
# </Rule>
<Rule "add_series_tag">
<Match regex>
Plugin "^(cpu|interfaces|disk)$"
</Match>
<Target "set">
MetaData "insight_series" "1"
</Target>
</Rule>
</Chain>
Include "/etc/cassandra/collectd/*.conf"
Include "/home/cassandra/datastax-mcac-agent-0.1.2/lib/collectd/etc/collectd/*.conf"
I'm wondering if there's a problem with duplicate metadata names but for some reason I can't reproduce this. Could you try changing that metadata in the template and change the names from cluster -> cluster1 dc -> dc1 rack -> rack1
Also a core dump if you can get one would help!
ok, i tried what you mentioned above and still same thing. what i'm doing is trying to break things down to figure out what actually causes it. i was able to spin up a plan centos 7.5 vm and just lay the tarball down and the agent and just start it with no problem, metrics work and all. but i've tried to install this exporter on a few different test clusters i have and all have this same problem. so i'm not sure if it's with how i have some things setup when my automation installs cassandra or some other package etc.. so i'll spend some time trying to break that down and see if i can find out what causes it to see how i may be able to have steps to reproduce it. i'll see if i can get a dump also.
Thanks so much
seems like it's a ulimit setting that causes it to crash, we have a set of settings we always typically apply and it looks like the one that causes the problem is
* - stack unlimited
if i remove this everything works fine.
Fantastic detective work! I'll repro and find the bug.