
Diamond Ceph Stats not received in calamari

Open drolfe opened this issue 9 years ago • 7 comments

Everything is working except the Ceph cluster and pool graphs in the Calamari GUI; the host stats are working fine.

(three screenshots attached)

root@calamari:~# dpkg -l | egrep -i "calamari|salt" | awk '{print $2 "\t\t" $3}'
calamari-clients        1.3.1.1-1trusty
calamari-server         1.3.0.1-11-g9fb65ae
salt-common             2014.7.5+ds-1ubuntu1
salt-master             2014.7.5+ds-1ubuntu1
salt-minion             2014.7.5+ds-1ubuntu1
root@calamari:~#
root@ceph1:~# dpkg -l | egrep -i "ceph|salt|diamond" | awk '{print $2 "\t\t" $3}'
ceph                    9.2.0-1trusty
ceph-common             9.2.0-1trusty
ceph-mds                9.2.0-1trusty
diamond                 3.4.67
libcephfs1              9.2.0-1trusty
python-cephfs           9.2.0-1trusty
python-rados            9.2.0-1trusty
python-rbd              9.2.0-1trusty
salt-common             0.17.5+ds-1
salt-minion             0.17.5+ds-1
root@ceph1:~#

Let me know what more I should be checking

Regards, Daniel

drolfe, Jan 18 '16 10:01

Note: I've built the latest stable deb packages from git via Vagrant, and the issue is still the same.

root@calamari:~# dpkg -l | egrep -i "calamari|salt|romana" | awk '{print $2 "\t\t" $3}'
calamari-server         1.3.1.1-105-g79c8df2-1trusty
romana                  1.2.2-36-gc62bb5b
salt-common             2014.7.5+ds-1ubuntu1
salt-master             2014.7.5+ds-1ubuntu1
salt-minion             2014.7.5+ds-1ubuntu1
root@calamari:~#

I've also matched the salt versions on the client to those on the server, as recommended:

root@ceph1:~# dpkg -l | egrep -i "salt|diamond" | awk '{print $2 "\t\t" $3}'
diamond                 3.4.67
salt-common             2014.7.5+ds-1ubuntu1
salt-minion             2014.7.5+ds-1ubuntu1
root@ceph1:~#

drolfe, Jan 24 '16 04:01

After a diamond service restart, the log shows the following:

root@ceph1:~# tail -f /var/log/diamond/diamond.log
[2016-01-24 04:19:21,039] [MainThread] pysnmp.entity.rfc3413.oneliner.cmdgen failed to load
[2016-01-24 04:19:21,043] [MainThread] pysnmp.entity.rfc3413.oneliner.cmdgen failed to load
[2016-01-24 04:19:21,044] [MainThread] pysnmp.entity.rfc3413.oneliner.cmdgen failed to load
[2016-01-24 04:19:21,046] [MainThread] pysnmp.entity.rfc3413.oneliner.cmdgen failed to load
[2016-01-24 04:19:21,056] [MainThread] pysnmp.entity.rfc3413.oneliner.cmdgen failed to load
[2016-01-24 04:19:21,074] [MainThread] pysnmp.entity.rfc3413.oneliner.cmdgen failed to load
[2016-01-24 04:19:22,252] [Thread-1] Traceback (most recent call last):
  File "/usr/lib/pymodules/python2.7/diamond/collector.py", line 412, in _run
    self.collect()
  File "/usr/share/diamond/collectors/ceph/ceph.py", line 464, in collect
    self._collect_service_stats(path)
  File "/usr/share/diamond/collectors/ceph/ceph.py", line 450, in _collect_service_stats
    self._publish_stats(counter_prefix, stats, schema, GlobalName)
  File "/usr/share/diamond/collectors/ceph/ceph.py", line 305, in _publish_stats
    assert path[-1] == 'type'
AssertionError
^C

root@ceph1:~# md5sum /usr/lib/pymodules/python2.7/diamond/collector.py
08bb05a483fa3d1d64c0ebf690259a05  /usr/lib/pymodules/python2.7/diamond/collector.py
root@ceph1:~# md5sum /usr/share/diamond/collectors/ceph/ceph.py
aeb3915f8ac7fdea61495805d2c99f33  /usr/share/diamond/collectors/ceph/ceph.py
root@ceph1:~#

drolfe, Jan 24 '16 04:01

Looking at calamari.log, I can see it is requesting graphite metric data that is missing:


root@calamari:/var/log/calamari# tail -f calamari.log
2016-01-23 22:44:54,040 - metric_access - django.request No graphite data for ceph.cluster.85895b09-7e2d-4290-b053-e7a71f8b5e08.df.total_space
2016-01-23 22:44:54,041 - metric_access - django.request No graphite data for ceph.cluster.85895b09-7e2d-4290-b053-e7a71f8b5e08.df.total_avail
2016-01-23 22:44:58,560 - metric_access - django.request No graphite data for ceph.cluster.85895b09-7e2d-4290-b053-e7a71f8b5e08.pool.0.num_objects
2016-01-23 22:44:58,561 - metric_access - django.request No graphite data for ceph.cluster.85895b09-7e2d-4290-b053-e7a71f8b5e08.pool.0.num_bytes
2016-01-23 22:44:58,835 - metric_access - django.request No graphite data for ceph.cluster.85895b09-7e2d-4290-b053-e7a71f8b5e08.df.total_used_bytes
2016-01-23 22:44:58,835 - metric_access - django.request No graphite data for ceph.cluster.85895b09-7e2d-4290-b053-e7a71f8b5e08.df.total_used
2016-01-23 22:44:58,836 - metric_access - django.request No graphite data for ceph.cluster.85895b09-7e2d-4290-b053-e7a71f8b5e08.df.total_space
2016-01-23 22:44:58,836 - metric_access - django.request No graphite data for ceph.cluster.85895b09-7e2d-4290-b053-e7a71f8b5e08.df.total_avail
2016-01-23 22:44:58,893 - metric_access - django.request No graphite data for ceph.cluster.85895b09-7e2d-4290-b053-e7a71f8b5e08.pool.0.num_objects
2016-01-23 22:44:58,894 - metric_access - django.request No graphite data for ceph.cluster.85895b09-7e2d-4290-b053-e7a71f8b5e08.pool.0.num_bytes
2016-01-23 22:45:14,440 - metric_access - django.request No graphite data for ceph.cluster.85895b09-7e2d-4290-b053-e7a71f8b5e08.df.total_used_bytes
2016-01-23 22:45:14,441 - metric_access - django.request No graphite data for ceph.cluster.85895b09-7e2d-4290-b053-e7a71f8b5e08.df.total_used
2016-01-23 22:45:14,442 - metric_access - django.request No graphite data for ceph.cluster.85895b09-7e2d-4290-b053-e7a71f8b5e08.df.total_space
2016-01-23 22:45:14,442 - metric_access - django.request No graphite data for ceph.cluster.85895b09-7e2d-4290-b053-e7a71f8b5e08.df.total_avail
2016-01-23 22:45:18,373 - metric_access - django.request No graphite data for ceph.cluster.85895b09-7e2d-4290-b053-e7a71f8b5e08.pool.0.num_objects
2016-01-23 22:45:18,377 - metric_access - django.request No graphite data for ceph.cluster.85895b09-7e2d-4290-b053-e7a71f8b5e08.pool.0.num_bytes
2016-01-23 22:45:18,878 - metric_access - django.request No graphite data for ceph.cluster.85895b09-7e2d-4290-b053-e7a71f8b5e08.pool.0.num_objects
2016-01-23 22:45:18,879 - metric_access - django.request No graphite data for ceph.cluster.85895b09-7e2d-4290-b053-e7a71f8b5e08.pool.0.num_bytes
2016-01-23 22:45:19,269 - metric_access - django.request No graphite data for ceph.cluster.85895b09-7e2d-4290-b053-e7a71f8b5e08.df.total_used_bytes
2016-01-23 22:45:19,270 - metric_access - django.request No graphite data for ceph.cluster.85895b09-7e2d-4290-b053-e7a71f8b5e08.df.total_used
2016-01-23 22:45:19,275 - metric_access - django.request No graphite data for ceph.cluster.85895b09-7e2d-4290-b053-e7a71f8b5e08.df.total_space
2016-01-23 22:45:19,276 - metric_access - django.request No graphite data for ceph.cluster.85895b09-7e2d-4290-b053-e7a71f8b5e08.df.total_avail
^C
root@calamari:/var/log/calamari#
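
A log like the one above repeats the same few metric paths many times. The following is a minimal sketch, not part of Calamari, that reduces such lines to the distinct missing metrics. The function names are hypothetical, and it assumes every relevant line contains the literal text "No graphite data for" followed by the metric path, as in the excerpt.

```python
import re
from collections import Counter

# Matches the "No graphite data for <metric>" lines seen in calamari.log.
MISSING_RE = re.compile(r"No graphite data for (\S+)")


def missing_metrics(lines):
    """Count how often each metric path is reported as missing."""
    counts = Counter()
    for line in lines:
        match = MISSING_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def strip_cluster_prefix(metric):
    """Drop the leading 'ceph.cluster.<fsid>.' so only the metric name remains."""
    parts = metric.split(".")
    if parts[:2] == ["ceph", "cluster"] and len(parts) > 3:
        return ".".join(parts[3:])
    return metric


def summarize(lines):
    """Return sorted (metric name, count) pairs for the missing metrics."""
    counts = missing_metrics(lines)
    return sorted((strip_cluster_prefix(m), n) for m, n in counts.items())
```

Passing the lines of calamari.log to `summarize` should show that only the cluster-level `df.*` and `pool.*` series are missing, which matches the symptom that host graphs work while Ceph graphs do not.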

drolfe, Jan 24 '16 04:01

I can see the asok (admin socket) files are there:

root@ceph1:/var/run/ceph# ls -la
total 0
drwxrwx---  2 ceph ceph  80 Feb  1 10:51 .
drwxr-xr-x 18 root root 640 Feb  1 10:52 ..
srwxr-xr-x  1 ceph ceph   0 Feb  1 10:51 ceph-mon.ceph1.asok
srwxr-xr-x  1 root root   0 Jan 27 15:08 ceph-osd.0.asok
root@ceph1:/var/run/ceph#
root@ceph1:/var/run/ceph#
root@ceph1:/var/run/ceph#
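
The collector reads its counters from these admin sockets, so it can help to query them directly and look at what the daemon returns. The sketch below is an assumption-laden helper, not code from Diamond: the function names are hypothetical, and it assumes the `ceph` CLI is installed on the node and accepts `--admin-daemon <socket> perf schema`.

```python
import glob
import json
import os
import subprocess


def find_admin_sockets(run_dir="/var/run/ceph"):
    """Return the *.asok paths found in run_dir, sorted by name."""
    return sorted(glob.glob(os.path.join(run_dir, "*.asok")))


def perf_schema_command(socket_path):
    """Build the CLI call that asks one daemon for its perf counter schema."""
    return ["ceph", "--admin-daemon", socket_path, "perf", "schema"]


def read_perf_schema(socket_path):
    """Run the command and parse the JSON the daemon prints (needs ceph installed)."""
    output = subprocess.check_output(perf_schema_command(socket_path))
    return json.loads(output.decode("utf-8"))


def dump_schemas(run_dir="/var/run/ceph"):
    """Print the top-level counter groups each local daemon reports."""
    for path in find_admin_sockets(run_dir):
        schema = read_perf_schema(path)
        print(path, sorted(schema.keys()))
```

Calling `dump_schemas()` as a user with access to the sockets prints one line per daemon; inspecting a single entry of the returned schema shows which fields each counter carries.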

Running diamond in debug mode shows the following:

[2016-02-01 10:55:23,774] [Thread-1] Collecting data from: NetworkCollector
[2016-02-01 10:56:23,484] [Thread-1] Collecting data from: CPUCollector
[2016-02-01 10:56:23,487] [Thread-6] Collecting data from: MemoryCollector
[2016-02-01 10:56:23,489] [Thread-7] Collecting data from: SockstatCollector
[2016-02-01 10:56:23,768] [Thread-1] Collecting data from: CephCollector
[2016-02-01 10:56:23,768] [Thread-1] gathering service stats for /var/run/ceph/ceph-mon.ceph1.asok
[2016-02-01 10:56:24,094] [Thread-1] Traceback (most recent call last):
  File "/usr/lib/pymodules/python2.7/diamond/collector.py", line 412, in _run
    self.collect()
  File "/usr/share/diamond/collectors/ceph/ceph.py", line 464, in collect
    self._collect_service_stats(path)
  File "/usr/share/diamond/collectors/ceph/ceph.py", line 450, in _collect_service_stats
    self._publish_stats(counter_prefix, stats, schema, GlobalName)
  File "/usr/share/diamond/collectors/ceph/ceph.py", line 305, in _publish_stats
    assert path[-1] == 'type'
AssertionError

[2016-02-01 10:56:24,096] [Thread-8] Collecting data from: LoadAverageCollector
[2016-02-01 10:56:24,098] [Thread-1] Collecting data from: VMStatCollector
[2016-02-01 10:56:24,099] [Thread-1] Collecting data from: DiskUsageCollector
[2016-02-01 10:56:24,104] [Thread-9] Collecting data from: DiskSpaceCollector

Checking the md5 of the file returns the following:

root@ceph1:/var/run/ceph# md5sum /usr/share/diamond/collectors/ceph/ceph.py
aeb3915f8ac7fdea61495805d2c99f33  /usr/share/diamond/collectors/ceph/ceph.py
root@ceph1:/var/run/ceph#

I've found that replacing the ceph.py file with the one below stops the diamond error:

Diamond version 3.4.67

https://raw.githubusercontent.com/BrightcoveOS/Diamond/master/src/collectors/ceph/ceph.py

root@ceph1:/usr/share/diamond/collectors/ceph# md5sum ceph.py
13ac74ce0df39a5def879cb5fc530015  ceph.py


[2016-02-01 11:14:33,116] [Thread-42] Collecting data from: MemoryCollector
[2016-02-01 11:14:33,117] [Thread-1] Collecting data from: CPUCollector
[2016-02-01 11:14:33,123] [Thread-43] Collecting data from: SockstatCollector
[2016-02-01 11:14:35,453] [Thread-1] Collecting data from: CephCollector
[2016-02-01 11:14:35,454] [Thread-1] checking /var/run/ceph/ceph-mon.ceph1.asok
[2016-02-01 11:14:35,552] [Thread-1] checking /var/run/ceph/ceph-osd.0.asok
[2016-02-01 11:14:35,685] [Thread-44] Collecting data from: LoadAverageCollector
[2016-02-01 11:14:35,686] [Thread-1] Collecting data from: VMStatCollector
[2016-02-01 11:14:35,687] [Thread-1] Collecting data from: DiskUsageCollector
[2016-02-01 11:14:35,692] [Thread-45] Collecting data from: DiskSpaceCollector
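
Two different ceph.py checksums appear in this thread, so it is easy to lose track of which file a node is running. A small sketch for telling them apart is shown here; the helper names and the descriptive labels are my own, and only the two MD5 values come from the output above.

```python
import hashlib

# Checksums quoted in this thread; the labels are informal descriptions.
KNOWN_CEPH_PY = {
    "aeb3915f8ac7fdea61495805d2c99f33": "packaged collector that raised AssertionError",
    "13ac74ce0df39a5def879cb5fc530015": "replacement ceph.py from BrightcoveOS/Diamond master",
}


def md5_of_file(path, chunk_size=65536):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def identify_collector(path):
    """Map a ceph.py file to one of the known versions, if its checksum matches."""
    return KNOWN_CEPH_PY.get(md5_of_file(path), "unknown version")
```

For example, `identify_collector("/usr/share/diamond/collectors/ceph/ceph.py")` reports which of the two files is installed, or "unknown version" for anything else.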

But after all that, it's still not working.

drolfe, Feb 01 '16 12:02

OK, thanks to the reply below from the mailing list:

John Spray Mon, 01 Feb 2016 04:23:24 -0800

The "assert path[-1] == 'type'" is the error you get when using the
calamari diamond branch with a >= infernalis version of Ceph (where
new fields were added to the perf schema output).  No idea if anyone
has worked on updating Calamari+Diamond for latest ceph.

John
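
To illustrate the explanation above, here is a hedged sketch of why a strict check on the schema breaks once counters carry more than a `type` field. It is not the actual collector code: the walker and both sample schemas are hypothetical, the counter values are made up, and the extra field names `description` and `nick` are my assumption about what newer Ceph releases add.

```python
def walk_leaves(node, path=()):
    """Yield (path, value) for every leaf of a nested dict."""
    if isinstance(node, dict):
        for key, value in node.items():
            for item in walk_leaves(value, path + (key,)):
                yield item
    else:
        yield path, node


def check_schema_strict(schema):
    """Mimic the failing check: every leaf must be named 'type'."""
    for path, _ in walk_leaves(schema):
        assert path[-1] == 'type'


def counter_types_tolerant(schema):
    """Keep only the 'type' leaves and ignore any other per-counter field."""
    return {
        path[:-1]: value
        for path, value in walk_leaves(schema)
        if path[-1] == 'type'
    }


# Illustrative schemas only; real perf schema output has many more counters.
HAMMER_STYLE = {"osd": {"op_w": {"type": 10}}}
INFERNALIS_STYLE = {
    "osd": {"op_w": {"type": 10, "description": "example text", "nick": ""}}
}

check_schema_strict(HAMMER_STYLE)  # every leaf is 'type', so this passes
try:
    check_schema_strict(INFERNALIS_STYLE)
except AssertionError:
    print("strict check fails on the extra fields")
print(counter_types_tolerant(INFERNALIS_STYLE))
```

The tolerant variant shows one possible way to cope with extra fields; it is not necessarily what any of the patches linked in this thread do.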

I've downgraded to hammer, and now everything is working.

I've built the latest calamari server, diamond, and the new calamari clients (now called romana).

Feel free to use them on your trusty deployments:

http://bladeservers.net.au/calamari-server_1.3.1.1-105-g79c8df2-1trusty_amd64.deb
http://bladeservers.net.au/romana_1.2.2-36-gc62bb5b_all.deb
http://bladeservers.net.au/diamond_3.4.725_all.deb

Calamari is all working.

drolfe, Feb 04 '16 12:02

@drolfe Thanks a lot for your packages. To make IOPS and usage data appear in Graphite / Calamari when running Ceph Infernalis, a small change to the ceph.py collector script is required. See https://github.com/luinnar/Diamond/commit/a9fcc62097b82e1df5cc16fb78bcdebf0ab11d0d

kaazoo, Feb 12 '16 13:02

I've been waiting to get this upstream: https://github.com/python-diamond/Diamond/pull/321. That will get you a newer Diamond 4.X and fix Infernalis.

ChristinaMeno, Feb 13 '16 06:02