Deadlock with TreeCache and reconnection

Open nekto0n opened this issue 8 years ago • 1 comments

Hi there! First of all, thank you for such a great recipe as TreeCache. We recently tried to use it, but started experiencing issues upon reconnection with large trees. After reconnecting, every request fails with a ConnectionLoss error. I managed to capture a stack trace:

  File "/venv/lib/python2.7/site-packages/gevent/greenlet.py", line 327, in run
    result = self._run(*self.args, **self.kwargs)
  File "/venv/lib/python2.7/site-packages/kazoo/protocol/connection.py", line 473, in zk_loop
    if retry(self._connect_loop, retry) is STOP_CONNECTING:
  File "/venv/lib/python2.7/site-packages/kazoo/retry.py", line 123, in __call__
    return func(*args, **kwargs)
  File "/venv/lib/python2.7/site-packages/kazoo/protocol/connection.py", line 512, in _connect_loop
    status = self._connect_attempt(host, port, retry)
  File "/venv/lib/python2.7/site-packages/kazoo/protocol/connection.py", line 539, in _connect_attempt
    read_timeout, connect_timeout = self._connect(host, port)
  File "/venv/lib/python2.7/site-packages/kazoo/protocol/connection.py", line 646, in _connect
    client._session_callback(KeeperState.CONNECTED)
  File "/venv/lib/python2.7/site-packages/kazoo/client.py", line 467, in _session_callback
    self._make_state_change(KazooState.CONNECTED)
  File "/venv/lib/python2.7/site-packages/kazoo/client.py", line 440, in _make_state_change
    remove = listener(state)
  File "/venv/lib/python2.7/site-packages/kazoo/recipe/cache.py", line 179, in _session_watcher
    self._root.on_reconnected()
  File "/venv/lib/python2.7/site-packages/kazoo/recipe/cache.py", line 217, in on_reconnected
    child.on_reconnected()
  File "/venv/lib/python2.7/site-packages/kazoo/recipe/cache.py", line 215, in on_reconnected
    self._refresh()
  File "/venv/lib/python2.7/site-packages/kazoo/recipe/cache.py", line 247, in _refresh
    self._refresh_data()
  File "/venv/lib/python2.7/site-packages/kazoo/recipe/cache.py", line 251, in _refresh_data
    self._call_client('get', self._path)
  File "/venv/lib/python2.7/site-packages/kazoo/recipe/cache.py", line 264, in _call_client
    method(path, *args, **kwargs).rawlink(callback)
  File "/venv/lib/python2.7/site-packages/kazoo/client.py", line 1065, in get_async
    async_result)
  File "/venv/lib/python2.7/site-packages/kazoo/client.py", line 547, in _call
    write_sock.send(b'\0')
  File "/venv/lib/python2.7/site-packages/gevent/socket.py", line 443, in send
    self._wait(self._write_event)
  File "/venv/lib/python2.7/site-packages/gevent/socket.py", line 300, in _wait
    self.hub.wait(watcher)
  File "/venv/lib/python2.7/site-packages/gevent/hub.py", line 348, in wait
    result = waiter.get()
  File "/venv/lib/python2.7/site-packages/gevent/hub.py", line 575, in get
    return self.hub.switch()
  File "/venv/lib/python2.7/site-packages/gevent/hub.py", line 338, in switch
    return greenlet.switch(self)

Here is what I think happens:

  • kazoo calls the session callback
  • the tree calls self._root.on_reconnected()
  • which in turn issues get and get_children calls in kazoo
  • these calls are put in the queue and '\0' is written to the "wake up" socket for each of them
  • the socket is blocking (it appears as blocking in gevent), so after several requests the whole thread (greenlet) blocks because no one is reading from it: reading is normally performed in the connection thread (the same stack) right after the session callback returns
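
To illustrate the last two points, here is a minimal sketch (not kazoo code) of why the writes eventually stall. It assumes a plain socket.socketpair() as a stand-in for the "wake up" socket and Python 3 rather than the 2.7 from the trace. The write end is set non-blocking only so the demo can observe the full buffer instead of hanging, which is what a blocking send would do at that point:

```python
import socket

# Stand-in for the "wake up" socket pair: callers write to one end,
# and the connection loop is supposed to read from the other.
read_sock, write_sock = socket.socketpair()

# Non-blocking only for the demo, so a full buffer raises instead of hanging.
write_sock.setblocking(False)

sent = 0
buffer_full = False
try:
    # Nobody reads from read_sock, just like in the session callback stack.
    for _ in range(10_000_000):
        write_sock.send(b"\0")
        sent += 1
except BlockingIOError:
    # A blocking socket would wait here forever: the only reader is the
    # very thread that is busy sending.
    buffer_full = True

print("wake-up bytes accepted before the buffer filled:", sent)

read_sock.close()
write_sock.close()
```

The exact number of accepted bytes depends on the OS socket buffer, which is why a small tree reconnects fine and a large one does not.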

I see two issues here:

  • performing heavy lifting in the session callback stack (recipe issue). This is strongly discouraged. I patched this locally with self._root.on_reconnected() => self._in_background(self._root.on_reconnected)
  • using a blocking socket for wake-up. Usually a non-blocking socket is used, and the thread, once woken up, drains both the queue and the socket. There can be issues with long send batches and other things. Not sure if it is worth fixing right now.
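
Both ideas can be sketched roughly as follows. This is not kazoo's implementation: WakeUpChannel, make_listener and refresh_tree are hypothetical names, a plain thread stands in for _in_background, and the string "CONNECTED" stands in for KazooState.CONNECTED.

```python
import queue
import socket
import threading


class WakeUpChannel:
    """Hypothetical request queue with a non-blocking wake-up socket."""

    def __init__(self):
        self.requests = queue.Queue()
        self._read_sock, self._write_sock = socket.socketpair()
        self._read_sock.setblocking(False)
        self._write_sock.setblocking(False)

    def submit(self, request):
        self.requests.put(request)
        try:
            self._write_sock.send(b"\0")
        except BlockingIOError:
            # Buffer is full, so a wake-up is already pending; skip this byte.
            pass

    def drain(self):
        """Called by the reader after waking up: empty the socket, then the queue."""
        try:
            while self._read_sock.recv(4096):
                pass
        except BlockingIOError:
            pass  # socket is empty
        drained = []
        while True:
            try:
                drained.append(self.requests.get_nowait())
            except queue.Empty:
                return drained


def make_listener(on_reconnected):
    """Build a session listener that only schedules the heavy work."""

    def listener(state):
        if state == "CONNECTED":
            # Hand off to a background thread and return right away, so the
            # connection loop can go back to reading the wake-up socket.
            threading.Thread(target=on_reconnected, daemon=True).start()

    return listener


channel = WakeUpChannel()
done = threading.Event()


def refresh_tree():
    # Stand-in for the recipe re-issuing get/get_children for a large tree.
    for i in range(5000):
        channel.submit(("get", "/node/%d" % i))
    done.set()


listener = make_listener(refresh_tree)
listener("CONNECTED")  # does not wait for refresh_tree to finish
done.wait(timeout=10)
pending = channel.drain()
print("requests queued without blocking:", len(pending))
```

Either change alone would avoid the hang in this sketch; doing both keeps the callback cheap and makes the wake-up path safe regardless of how many requests are queued before the reader runs.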

nekto0n, Jul 19 '17 13:07

Maybe this got resolved (at least partially) in https://github.com/python-zk/kazoo/commit/111941371daec00a2ecb5d8c29b9b1d35d6aa4ff?

teeeg, Jun 06 '19 14:06

This should have been fixed.

nekto0n, Aug 06 '23 09:08