kazoo
Deadlock with TreeCache and reconnection
Hi there!
First of all, thank you for such a great recipe as TreeCache. We recently tried to use it, but started experiencing issues upon reconnection with large trees. After a reconnect, every request fails with a ConnectionLoss error. I managed to capture a stack trace:
  File "/venv/lib/python2.7/site-packages/gevent/greenlet.py", line 327, in run
    result = self._run(*self.args, **self.kwargs)
  File "/venv/lib/python2.7/site-packages/kazoo/protocol/connection.py", line 473, in zk_loop
    if retry(self._connect_loop, retry) is STOP_CONNECTING:
  File "/venv/lib/python2.7/site-packages/kazoo/retry.py", line 123, in __call__
    return func(*args, **kwargs)
  File "/venv/lib/python2.7/site-packages/kazoo/protocol/connection.py", line 512, in _connect_loop
    status = self._connect_attempt(host, port, retry)
  File "/venv/lib/python2.7/site-packages/kazoo/protocol/connection.py", line 539, in _connect_attempt
    read_timeout, connect_timeout = self._connect(host, port)
  File "/venv/lib/python2.7/site-packages/kazoo/protocol/connection.py", line 646, in _connect
    client._session_callback(KeeperState.CONNECTED)
  File "/venv/lib/python2.7/site-packages/kazoo/client.py", line 467, in _session_callback
    self._make_state_change(KazooState.CONNECTED)
  File "/venv/lib/python2.7/site-packages/kazoo/client.py", line 440, in _make_state_change
    remove = listener(state)
  File "/venv/lib/python2.7/site-packages/kazoo/recipe/cache.py", line 179, in _session_watcher
    self._root.on_reconnected()
  File "/venv/lib/python2.7/site-packages/kazoo/recipe/cache.py", line 217, in on_reconnected
    child.on_reconnected()
  File "/venv/lib/python2.7/site-packages/kazoo/recipe/cache.py", line 215, in on_reconnected
    self._refresh()
  File "/venv/lib/python2.7/site-packages/kazoo/recipe/cache.py", line 247, in _refresh
    self._refresh_data()
  File "/venv/lib/python2.7/site-packages/kazoo/recipe/cache.py", line 251, in _refresh_data
    self._call_client('get', self._path)
  File "/venv/lib/python2.7/site-packages/kazoo/recipe/cache.py", line 264, in _call_client
    method(path, *args, **kwargs).rawlink(callback)
  File "/venv/lib/python2.7/site-packages/kazoo/client.py", line 1065, in get_async
    async_result)
  File "/venv/lib/python2.7/site-packages/kazoo/client.py", line 547, in _call
    write_sock.send(b'\0')
  File "/venv/lib/python2.7/site-packages/gevent/socket.py", line 443, in send
    self._wait(self._write_event)
  File "/venv/lib/python2.7/site-packages/gevent/socket.py", line 300, in _wait
    self.hub.wait(watcher)
  File "/venv/lib/python2.7/site-packages/gevent/hub.py", line 348, in wait
    result = waiter.get()
  File "/venv/lib/python2.7/site-packages/gevent/hub.py", line 575, in get
    return self.hub.switch()
  File "/venv/lib/python2.7/site-packages/gevent/hub.py", line 338, in switch
    return greenlet.switch(self)
Here is what I think happens:
- kazoo calls the session callback
- the tree calls self._root.on_reconnected(), which in turn issues get and get_children calls in kazoo
- these calls are put in the queue and '\0' is written into the "wake up" socket
- the socket is "blocking" (it appears as blocking in gevent), so after several requests the whole thread (greenlet) blocks because no one is reading from it; reading is normally performed in the connection thread (the same stack) right after the session callback returns
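To make the last two steps concrete, here is a minimal standalone sketch (Python 3, standard library only, not kazoo's actual code). It writes one-byte wake-up tokens into a socket pair that nobody reads from and counts how many fit before the send buffer is full. The name count_writes_until_full and the limit parameter are made up for this illustration, and the socket is set to non-blocking here only so the demo raises instead of hanging.

```python
import socket


def count_writes_until_full(limit=10000000):
    """Write one-byte tokens into a socket pair that nobody reads from.

    Returns how many tokens were accepted before the send buffer filled up
    (or `limit`, if the buffer never filled, which is not expected).
    """
    read_sock, write_sock = socket.socketpair()
    # Non-blocking so that a full buffer raises BlockingIOError.
    # With a blocking socket, the same send() would wait for a reader,
    # which is the hang visible at the bottom of the trace.
    write_sock.setblocking(False)
    sent = 0
    try:
        while sent < limit:
            try:
                write_sock.send(b"\0")
            except BlockingIOError:
                break  # buffer is full and no reader is draining it
            sent += 1
    finally:
        read_sock.close()
        write_sock.close()
    return sent


if __name__ == "__main__":
    print("tokens accepted before the buffer filled:",
          count_writes_until_full())
```

The exact count depends on the platform and its socket buffer settings, but it is finite. That would be consistent with small trees reconnecting fine while large trees get stuck: the reader only runs after the session callback returns.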
I see two issues here:
- performing heavy lifting in the session callback stack (recipe issue)
  This is strongly discouraged. I patched this locally with self._root.on_reconnected => self._in_background(self._root.on_reconnected)
- using a blocking socket for wake-up
  Usually a non-blocking socket is used, and the thread, once woken up, drains both the queue and the socket. There can be issues with long sending batches and other stuff. Not sure if it is worth fixing right now.
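For the first issue, the general shape of the local patch can be sketched as follows. This is a standalone illustration in plain Python 3 with threads, not the TreeCache code: BackgroundRunner and make_session_listener are hypothetical names, BackgroundRunner only stands in for whatever _in_background does in the recipe, and the string "CONNECTED" stands in for KazooState.CONNECTED.

```python
import queue
import threading
import traceback


class BackgroundRunner:
    """Hypothetical stand-in for the recipe's _in_background helper."""

    def __init__(self):
        self._tasks = queue.Queue()
        worker = threading.Thread(target=self._run, daemon=True)
        worker.start()

    def submit(self, func, *args):
        # Only enqueues the work; never runs func in the caller's stack.
        self._tasks.put((func, args))

    def join(self):
        # Block until every submitted task has finished.
        self._tasks.join()

    def _run(self):
        while True:
            func, args = self._tasks.get()
            try:
                func(*args)
            except Exception:
                # Keep the worker alive if a task fails.
                traceback.print_exc()
            finally:
                self._tasks.task_done()


def make_session_listener(runner, on_reconnected):
    """Build a listener that defers the refresh instead of running it inline."""
    def listener(state):
        if state == "CONNECTED":  # stand-in for KazooState.CONNECTED
            runner.submit(on_reconnected)
        # Returns right away, so the connection loop can go back to
        # reading the wake-up socket and the request queue.
    return listener


if __name__ == "__main__":
    runner = BackgroundRunner()
    listener = make_session_listener(
        runner, lambda: print("refreshing tree in the background"))
    listener("CONNECTED")
    runner.join()
```

The point of the design is only that the listener does no I/O itself. With kazoo under gevent the handoff would presumably be a greenlet rather than a thread, but the idea is the same.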
Maybe this got resolved (at least partially) in https://github.com/python-zk/kazoo/commit/111941371daec00a2ecb5d8c29b9b1d35d6aa4ff ?
This should have been fixed.