Server Error 500 : ERROR - django.request Internal Server Error
Hello Developers,
Could you help me fix this issue?
Calamari.log
2015-06-23 13:53:41,317 - metric_access - django.request No graphite data for ceph.cluster.9609b429-eee2-4e23-af31-28a24fcf5cbc.df.total_used_bytes
2015-06-23 13:53:41,329 - metric_access - django.request No graphite data for ceph.cluster.9609b429-eee2-4e23-af31-28a24fcf5cbc.df.total_used
2015-06-23 13:53:41,330 - metric_access - django.request No graphite data for ceph.cluster.9609b429-eee2-4e23-af31-28a24fcf5cbc.df.total_space
2015-06-23 13:53:41,330 - metric_access - django.request No graphite data for ceph.cluster.9609b429-eee2-4e23-af31-28a24fcf5cbc.df.total_avail
2015-06-23 13:53:41,394 - ERROR - django.request Internal Server Error: /api/v1/cluster/9609b429-eee2-4e23-af31-28a24fcf5cbc/health_counters
Traceback (most recent call last):
File "/opt/calamari/venv/lib/python2.7/site-packages/django/core/handlers/base.py", line 115, in get_response
response = callback(request, *callback_args, **callback_kwargs)
File "/opt/calamari/venv/lib/python2.7/site-packages/rest_framework/viewsets.py", line 78, in view
return self.dispatch(request, *args, **kwargs)
File "/opt/calamari/venv/lib/python2.7/site-packages/calamari_rest_api-0.1-py2.7.egg/calamari_rest/views/rpc_view.py", line 94, in dispatch
self.client.close()
File "/opt/calamari/venv/lib/python2.7/site-packages/zerorpc/core.py", line 292, in close
ClientBase.close(self)
File "/opt/calamari/venv/lib/python2.7/site-packages/zerorpc/core.py", line 194, in close
self._multiplexer.close()
File "/opt/calamari/venv/lib/python2.7/site-packages/zerorpc/channel.py", line 61, in close
self._channel_dispatcher_task.kill()
File "/opt/calamari/venv/lib/python2.7/site-packages/gevent/greenlet.py", line 235, in kill
waiter.get()
File "/opt/calamari/venv/lib/python2.7/site-packages/gevent/hub.py", line 575, in get
return self.hub.switch()
File "/opt/calamari/venv/lib/python2.7/site-packages/gevent/hub.py", line 338, in switch
return greenlet.switch(self)
LostRemote: Lost remote after 10s heartbeat
My environment details
[root@ceph-node1 ~]# rpm -qa | grep -i supervisor
supervisor-3.0-1.el7.noarch
[root@ceph-node1 ~]#
[root@ceph-node1 ~]# rpm -qa | grep -i calamari
calamari-server-1.3.0.1-49_g828960a.el7.centos.x86_64
calamari-clients-1.2.2-32_g931ee58.el7.centos.x86_64
[root@ceph-node1 ~]#
[root@ceph-node1 ~]# ceph -v
ceph version 0.87.1 (283c2e7cfa2457799f534744d7d549f83ea1335e)
[root@ceph-node1 ~]#
[root@ceph-node1 ~]# cat /etc/redhat-release
CentOS Linux release 7.0.1406 (Core)
[root@ceph-node1 ~]#
[root@ceph-node1 ~]# rpm -qa | grep -i salt
salt-2015.5.0-1.el7.noarch
salt-master-2015.5.0-1.el7.noarch
salt-minion-2015.5.0-1.el7.noarch
[root@ceph-node1 ~]#
IIRC this is one of the most annoying error messages with the calamari dashboard; I have been seeing this error on mailing lists for a very long time, and most of the people involved with calamari have hit it at some point.
So I hope we can fix this once and for all (#dream).
This looks...possibly relevant. From: https://github.com/ceph/calamari/blob/c64121ab01aef0be6dfc3bef1940e21fe09af45f/rest-api/calamari_rest/views/v1.py#L58
# In case the cluster has been offline for some time, try looking progressively
# further back in time for data. This would not be necessary if graphite simply
# let us ask for the latest value (Calamari issue #6876)
for trange in ['-1min', '-10min', '-60min', '-1d', '-7d']:
val = _get(parseATTime(trange, tzinfo))
if val is not None:
return val
I could at least see that causing the timeout. Now, what the correct workaround is... it looks like the calamari folks would like graphite's functionality expanded a bit. A quick hack to try would be to shorten the trange list drastically, or remove that for loop altogether. If you stop timing out, it seems like the only repercussion would be that you'd lose the data it's logging about. Not sure whether that would be a tragic loss to you or not.
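A minimal sketch of that hack, with `get_at` / `parse_at_time` standing in for the module's `_get` / `parseATTime` (they are Calamari/graphite internals, so here they are just injected callables):

```python
def latest_metric(get_at, parse_at_time, tzinfo,
                  ranges=('-1min', '-10min')):
    """Shortened lookback: probe only a couple of recent windows
    instead of walking all the way back to -7d, so a slow graphite
    can't pile up long-running requests.  Returns None when no data
    exists in any window; the caller must tolerate that."""
    for trange in ranges:
        val = get_at(parse_at_time(trange, tzinfo))
        if val is not None:
            return val
    return None
```

The trade-off is exactly what @dmick describes: a cluster that has been offline longer than the shortest remaining window simply reports no data instead of stalling the request.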
All "500" means is "something went wrong". It's basically what happens when anything in a hugely complicated multistep process fails.
Chances are good that all this means is something's broken in cthulhu, either in talking to the cluster or in running on its own. Basic troubleshooting:
- does salt work to the minions?
- is cthulhu running without errors? Check all logs in /var/log/calamari
- increase cthulhu's debug level in calamari.conf
- try talking to the Calamari API directly from the browser while watching the logs
Tracing through the code, it looks like this is the guy instigating the graphite operations that are being logged about:
https://github.com/ceph/calamari/blob/c64121ab01aef0be6dfc3bef1940e21fe09af45f/rest-api/calamari_rest/views/v1.py#L100
Looks like that code should just handle the case where the requests for the missing data fail. It seems suspicious that the timeout log coincides with actual failures to recover those values. @ksingh7, definitely give @dmick's suggestions a shot.
@dmick Thanks for your answer.
History: the Calamari server, client, and diamond package builds were successful. After that, the initial calamari configuration, including the salt-keys steps, went fine. However, calamari-ctl initialize gave some errors but eventually worked.
The dashboard was working nicely with the first node; when I added the remaining nodes to calamari (salt-minion, then diamond, then salt-key -A), the dashboard broke and threw this error.
- Yes, salt-master and the minions work
[root@ceph-node1 views]# salt-key -L
Accepted Keys:
ceph-node1
ceph-node2
ceph-node3
Denied Keys:
Unaccepted Keys:
Rejected Keys:
[root@ceph-node1 views]#
- cthulhu is running, BUT with errors. @dmick, you guessed right.
[root@ceph-node1 views]# supervisorctl status
carbon-cache RUNNING pid 29279, uptime 0:37:26
cthulhu RUNNING pid 29284, uptime 0:37:18
[root@ceph-node1 views]#
Repeatedly getting these messages in cthulhu.log
2015-06-23 22:49:46,088 - WARNING - cthulhu Abandoning fetch for mon_map started at 2015-06-23 19:48:54.057496+00:00
2015-06-23 22:49:46,088 - ERROR - cthulhu Exception handling message with tag ceph/cluster/9609b429-eee2-4e23-af31-28a24fcf5cbc
Traceback (most recent call last):
File "/opt/calamari/venv/lib/python2.7/site-packages/calamari_cthulhu-0.1-py2.7.egg/cthulhu/manager/cluster_monitor.py", line 245, in _run
self.on_heartbeat(data['id'], data['data'])
File "/opt/calamari/venv/lib/python2.7/site-packages/calamari_cthulhu-0.1-py2.7.egg/cthulhu/gevent_util.py", line 35, in wrapped
return func(*args, **kwargs)
File "/opt/calamari/venv/lib/python2.7/site-packages/calamari_cthulhu-0.1-py2.7.egg/cthulhu/manager/cluster_monitor.py", line 347, in on_heartbeat
cluster_data['versions'][sync_type.str])
File "/opt/calamari/venv/lib/python2.7/site-packages/calamari_cthulhu-0.1-py2.7.egg/cthulhu/manager/cluster_monitor.py", line 99, in on_version
self.fetch(reported_by, sync_type)
File "/opt/calamari/venv/lib/python2.7/site-packages/calamari_cthulhu-0.1-py2.7.egg/cthulhu/manager/cluster_monitor.py", line 109, in fetch
client = LocalClient(config.get('cthulhu', 'salt_config_path'))
File "/usr/lib/python2.7/site-packages/salt/client/__init__.py", line 126, in __init__
self.opts = salt.config.client_config(c_path)
File "/usr/lib/python2.7/site-packages/salt/config.py", line 2176, in client_config
File "/usr/lib/python2.7/site-packages/salt/utils/xdg.py", line 13, in xdg_config_dir
File "/opt/calamari/venv/lib64/python2.7/posixpath.py", line 269, in expanduser
KeyError: 'getpwuid(): uid not found: 0'
2015-06-23 22:49:56,392 - WARNING - cthulhu Abandoning fetch for osd_map started at 2015-06-23 19:49:36.709092+00:00
2015-06-23 22:49:56,393 - ERROR - cthulhu Exception handling message with tag ceph/cluster/9609b429-eee2-4e23-af31-28a24fcf5cbc
2015-06-23 22:53:49,138 - WARNING - cthulhu Abandoning fetch for mon_map started at 2015-06-23 19:53:19.073702+00:00
2015-06-23 22:53:49,288 - ERROR - cthulhu Exception handling message with tag ceph/cluster/9609b429-eee2-4e23-af31-28a24fcf5cbc
Traceback (most recent call last):
File "/opt/calamari/venv/lib/python2.7/site-packages/calamari_cthulhu-0.1-py2.7.egg/cthulhu/manager/cluster_monitor.py", line 245, in _run
self.on_heartbeat(data['id'], data['data'])
File "/opt/calamari/venv/lib/python2.7/site-packages/calamari_cthulhu-0.1-py2.7.egg/cthulhu/gevent_util.py", line 35, in wrapped
return func(*args, **kwargs)
File "/opt/calamari/venv/lib/python2.7/site-packages/calamari_cthulhu-0.1-py2.7.egg/cthulhu/manager/cluster_monitor.py", line 347, in on_heartbeat
cluster_data['versions'][sync_type.str])
File "/opt/calamari/venv/lib/python2.7/site-packages/calamari_cthulhu-0.1-py2.7.egg/cthulhu/manager/cluster_monitor.py", line 99, in on_version
self.fetch(reported_by, sync_type)
File "/opt/calamari/venv/lib/python2.7/site-packages/calamari_cthulhu-0.1-py2.7.egg/cthulhu/manager/cluster_monitor.py", line 109, in fetch
client = LocalClient(config.get('cthulhu', 'salt_config_path'))
File "/usr/lib/python2.7/site-packages/salt/client/__init__.py", line 136, in __init__
listen=not self.opts.get('__worker', False))
File "/usr/lib/python2.7/site-packages/salt/utils/event.py", line 112, in get_event
return MasterEvent(sock_dir or opts.get('sock_dir', None))
File "/usr/lib/python2.7/site-packages/salt/utils/event.py", line 510, in __init__
super(MasterEvent, self).__init__('master', sock_dir)
File "/usr/lib/python2.7/site-packages/salt/utils/event.py", line 176, in __init__
self.get_event(wait=1)
File "/usr/lib/python2.7/site-packages/salt/utils/event.py", line 361, in get_event
ret = self._get_event(wait, tag, pending_tags)
File "/usr/lib/python2.7/site-packages/salt/utils/event.py", line 305, in _get_event
socks = dict(self.poller.poll(wait * 1000))
File "/opt/calamari/venv/lib/python2.7/site-packages/zmq/green/poll.py", line 81, in poll
select.select(rlist, wlist, xlist)
File "/opt/calamari/venv/lib/python2.7/site-packages/gevent/select.py", line 68, in select
result.event.wait(timeout=timeout)
File "/opt/calamari/venv/lib/python2.7/site-packages/gevent/event.py", line 77, in wait
result = self.hub.switch()
File "/opt/calamari/venv/lib/python2.7/site-packages/gevent/hub.py", line 337, in switch
switch_out()
File "/opt/calamari/venv/lib/python2.7/site-packages/calamari_cthulhu-0.1-py2.7.egg/cthulhu/gevent_util.py", line 15, in asserter
raise ForbiddenYield("Context switch during `nosleep` region!")
ForbiddenYield: Context switch during `nosleep` region!
- Some of the API commands work, but some don't
@dmick
- I increased the logging level on cthulhu
What I found is that when it gets to ceph-node2 / ceph-node3, it cannot use their cluster data and logs the message "cthulhu Ignoring cluster data from ceph-node2, it is not my favourite (ceph-node1)".
Hope these logs point us to something.
2015-06-24 00:33:08,271 - DEBUG - cthulhu _run.ev: ceph-node2/tag=ceph/server
2015-06-24 00:33:08,272 - DEBUG - cthulhu.server_monitor ServerMonitor got ceph/server message from ceph-node2
2015-06-24 00:33:08,272 - DEBUG - cthulhu.server_monitor ServerMonitor.on_server_heartbeat: ceph-node2
2015-06-24 00:33:08,273 - DEBUG - cthulhu.server_monitor ServerMonitor._register_service: ServiceId(fsid='9609b429-eee2-4e23-af31-28a24fcf5cbc', service_type='osd', service_id='5')
2015-06-24 00:33:08,273 - DEBUG - cthulhu.server_monitor ServerMonitor._register_service: ServiceId(fsid='9609b429-eee2-4e23-af31-28a24fcf5cbc', service_type='osd', service_id='4')
2015-06-24 00:33:08,273 - DEBUG - cthulhu.server_monitor ServerMonitor._register_service: ServiceId(fsid='9609b429-eee2-4e23-af31-28a24fcf5cbc', service_type='mon', service_id='ceph-node2')
2015-06-24 00:33:08,273 - DEBUG - cthulhu.server_monitor ServerMonitor._register_service: ServiceId(fsid='9609b429-eee2-4e23-af31-28a24fcf5cbc', service_type='osd', service_id='3')
2015-06-24 00:33:08,274 - DEBUG - cthulhu.server_monitor ServerMonitor._register_service: ServiceId(fsid='9609b429-eee2-4e23-af31-28a24fcf5cbc', service_type='mds', service_id='ceph-node2')
2015-06-24 00:33:08,275 - DEBUG - cthulhu TopLevelEvents: ignoring ceph/server
2015-06-24 00:33:08,326 - DEBUG - cthulhu _run.ev: ceph-node2/tag=ceph/cluster/9609b429-eee2-4e23-af31-28a24fcf5cbc
2015-06-24 00:33:08,327 - DEBUG - cthulhu Ignoring cluster data from ceph-node2, it is not my favourite (ceph-node1)
2015-06-24 00:33:08,329 - DEBUG - cthulhu TopLevelEvents: heartbeat from existing cluster 9609b429-eee2-4e23-af31-28a24fcf5cbc
@ksingh7 The latest cthulhu logs are not the source of the problem; the previous ones are:
ForbiddenYield: Context switch during `nosleep` region!
- This issue is under discussion at https://github.com/saltstack/salt/issues/24613; until we find a resolution, I suggest using salt-2014.7
- KeyError: 'getpwuid(): uid not found: 0' might be worth tracking down
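For reference, the mechanics of that KeyError: when $HOME is unset (supervisor-launched daemons often have a stripped environment), Python's posixpath.expanduser falls back to pwd.getpwuid(os.getuid()), and that lookup raises KeyError when NSS cannot resolve the uid. A small sketch of a tolerant wrapper (a hypothetical helper for illustration, not Calamari code; it only handles the bare "~" form):

```python
import os
import pwd

def safe_expanduser(path):
    """Like os.path.expanduser for '~/...' paths, but degrades
    gracefully when the current uid has no resolvable passwd entry
    (the 'getpwuid(): uid not found: 0' case), returning the path
    unexpanded instead of raising KeyError."""
    if not path.startswith("~"):
        return path
    home = os.environ.get("HOME")
    if home is None:
        try:
            home = pwd.getpwuid(os.getuid()).pw_dir
        except KeyError:
            return path  # no passwd entry for this uid
    return home + path[1:]
```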
I made a change trying to fix the getpwuid problem, and I have not seen it again, so this might have worked: in /etc/apparmor.d/abstractions/python I added one line at the end, /etc/passwd r,
EDIT: That did not help. I am still getting the getpwuid error.
When I see the error 500 on screen, the logs are showing getpwuid errors at the same time. It works for a while, then cthulhu starts throwing KeyError: 'getpwuid(): uid not found: 0'. If I kill -HUP the process, it works again for some time until the same error starts showing up in the logs.
Also, when cthulhu-manager hangs and the 500 error starts showing up on screen, I ran lsof on the pid: there is a pretty big number of anon_inode entries, around 800. That number gradually increases until CPU utilization goes to 100% and the 500 error appears on screen.
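That anon_inode growth looks like a file-descriptor leak (gevent/zerorpc epoll and eventfd handles show up as anon_inode in lsof). On Linux you can watch it without lsof by counting /proc/&lt;pid&gt;/fd entries; a tiny sketch (hypothetical helper, Linux-only):

```python
import os

def fd_count(pid):
    """Number of open file descriptors for a process, read from
    /proc.  anon_inode handles are included in this count, so a
    steadily climbing value matches the lsof observation above."""
    return len(os.listdir("/proc/%d/fd" % pid))
```

Polling this for the cthulhu-manager pid should show the same climb toward ~800 before the 500s start.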
Is there any fix for this? I've applied the patch for salt_wrapper.py from git to work with salt 2015.5.2.
Thanks
2015-07-15 04:12:45,765 - ERROR - cthulhu Exception handling message with tag ceph/cluster/db9c01f8-14a6-11e5-8515-2e924e5027c2
Traceback (most recent call last):
File "/opt/calamari/venv/local/lib/python2.7/site-packages/calamari_cthulhu-0.1-py2.7.egg/cthulhu/manager/cluster_monitor.py", line 245, in _run
self.on_heartbeat(data['id'], data['data'])
File "/opt/calamari/venv/local/lib/python2.7/site-packages/calamari_cthulhu-0.1-py2.7.egg/cthulhu/gevent_util.py", line 35, in wrapped
return func(*args, **kwargs)
File "/opt/calamari/venv/local/lib/python2.7/site-packages/calamari_cthulhu-0.1-py2.7.egg/cthulhu/manager/cluster_monitor.py", line 347, in on_heartbeat
cluster_data['versions'][sync_type.str])
File "/opt/calamari/venv/local/lib/python2.7/site-packages/calamari_cthulhu-0.1-py2.7.egg/cthulhu/manager/cluster_monitor.py", line 99, in on_version
self.fetch(reported_by, sync_type)
File "/opt/calamari/venv/local/lib/python2.7/site-packages/calamari_cthulhu-0.1-py2.7.egg/cthulhu/manager/cluster_monitor.py", line 109, in fetch
client = LocalClient(config.get('cthulhu', 'salt_config_path'))
File "/usr/lib/python2.7/dist-packages/salt/client/__init__.py", line 126, in __init__
self.opts = salt.config.client_config(c_path)
File "/usr/lib/python2.7/dist-packages/salt/config.py", line 2180, in client_config
File "/usr/lib/python2.7/dist-packages/salt/utils/xdg.py", line 13, in xdg_config_dir
File "/opt/calamari/venv/lib/python2.7/posixpath.py", line 269, in expanduser
KeyError: 'getpwuid(): uid not found: 0'
Does your system really not have a uid 0 account installed?
Ubuntu 14.04.2 LTS
We are using Ubuntu 14; the KeyError: 'getpwuid(): uid not found: 0' error doesn't show up until the anon_inode count gets really high.
root@radosgw-openstack-01:/opt/calamari# lsof -p 22427 | grep anon_inod | wc -l
288
root@radosgw-openstack-01:/opt/calamari# lsof -p 22427 | grep anon_inod | wc -l
295
root@radosgw-openstack-01:/opt/calamari# lsof -p 22427 | grep anon_inod | wc -l
309
root@radosgw-openstack-01:/opt/calamari# lsof -p 22427 | grep anon_inod | wc -l
337
cthulhu-m 22427 root 351u 0000 0,9 0 6289 anon_inode
cthulhu-m 22427 root 352u 0000 0,9 0 6289 anon_inode
cthulhu-m 22427 root 353u 0000 0,9 0 6289 anon_inode
cthulhu-m 22427 root 354u 0000 0,9 0 6289 anon_inode
cthulhu-m 22427 root 355u 0000 0,9 0 6289 anon_inode
root@radosgw-openstack-01:/opt/calamari# id
uid=0(root) gid=0(root) groups=0(root)
root@cephnode02:~# id
uid=0(root) gid=0(root) groups=0(root)
@ivanoch79, you said you applied a patch from github to work with salt 2015.5.2. Which patch is that? Are you sure it worked for you?
I mean, if you are still facing this error, try a salt 2014 version instead.
I had been puttering around with fixing all the parts that are incompatible with salt 2015.*, but I hit a wall recently and started encountering more confusing problems. The best advice I can give is to drop back to 2014 until we can spend more time triaging everything that changed in 2015.
Joe
This is indeed nasty; I wasted half a day chasing down these exact issues before I found this thread. I finally reverted my saltstack to 2014.7.4 and now it all works as expected.
Where did you guys find salt 2014 packages? The official repo only has 2015 builds available.
Has anyone tried salt 2015.8?
I tried 2015.5 today and immediately got a pile of 500s and "ForbiddenYield: Context switch during `nosleep` region!" messages, but after upgrading to salt 2015.8 this problem seems to have evaporated.
I have the same error on 2015.8. The problem is, as the post above says, that 2014 is not available in EPEL. I found the packages, however, on this Korean mirror, which probably isn't syncing with rsync --delete (good luck downloading). I installed 2014.7.5 on all nodes and the calamari server.
http://mirror.oasis.onnetcorp.com/epel/testing/7/x86_64/s/
After fixing this error, it now leads on to the next one...
Just to clarify: you actually need 2014.1.11 for it to work; otherwise the cluster is not found in calamari. I did see a reference above to 2014.7.4; I initially tried 2014.7.5 and it did not work (no cluster found). On installing 2014.1.11 I noticed it pulled in two dependencies (python-libcloud and sshpass). Not sure whether something in those made it work or not; I have not tested upgrading to a newer salt to validate.
There is a reference to the same at http://lists.ceph.com/pipermail/ceph-calamari-ceph.com/2015-July/000236.html
You may also have to give it a kick with the following:
salt '*' ceph.heartbeat