aiida-core icon indicating copy to clipboard operation
aiida-core copied to clipboard

🐛 Engine: daemon restart fails with `circus.exc.ConflictError`

Open mbercx opened this issue 2 years ago • 4 comments
trafficstars

Describe the bug

From time to time, I find my processes are no longer updating. When trying to restart the daemon, stopping the daemon fails due to a timeout:

❯ verdi daemon restart --reset
Profile: dev
Stopping the daemon... FAILED
Critical: Connection to the daemon timed out.

Restarting afterwards is not a problem:

❯ verdi daemon start
Starting the daemon with 1 workers... OK

I find no errors in the daemon logs, but in the circus logs I find the following (trimmed for brevity, full error messages log below):

2023-05-29 21:48:37 circus[2237] [INFO] Arbiter exiting
2023-05-29 21:48:37 circus[2246] [INFO] Stats streamer stopped
2023-05-29 21:48:37 tornado.general[2246] [WARNING] Got events for stream <zmq.eventloop.zmqstream.ZMQStream object at 0x104370580> attached to closed socket: Socket operation on non-socket
2023-05-29 21:48:37 circus[2246] [INFO] Stats streamer stopped
2023-05-29 21:48:37 circus[2246] [INFO] Stats streamer stopped
2023-05-29 21:48:37 circus[2237] [INFO] circusd-stats stopped
2023-05-29 21:48:38 tornado.application[2237] [ERROR] Exception in callback <bound method Arbiter.manage_watchers of <circus.arbiter.Arbiter object at 0x102866dc0>>
Traceback (most recent call last):
  File "/Users/mbercx/.virtualenvs/super/lib/python3.9/site-packages/tornado/ioloop.py", line 919, in _run
    val = self.callback()
  File "/Users/mbercx/.virtualenvs/super/lib/python3.9/site-packages/circus/util.py", line 1038, in wrapper
    raise ConflictError("arbiter is already running %s command"
circus.exc.ConflictError: arbiter is already running arbiter_stop command
2023-05-29 21:48:39 tornado.application[2237] [ERROR] Exception in callback <bound method Arbiter.manage_watchers of <circus.arbiter.Arbiter object at 0x102866dc0>>
[...]
2023-05-29 21:48:53 tornado.application[2237] [ERROR] Exception in callback <bound method Arbiter.manage_watchers of <circus.arbiter.Arbiter object at 0x102866dc0>>
Traceback (most recent call last):
  File "/Users/mbercx/.virtualenvs/super/lib/python3.9/site-packages/tornado/ioloop.py", line 919, in _run
    val = self.callback()
  File "/Users/mbercx/.virtualenvs/super/lib/python3.9/site-packages/circus/util.py", line 1038, in wrapper
    raise ConflictError("arbiter is already running %s command"
circus.exc.ConflictError: arbiter is already running arbiter_stop command
2023-05-29 21:48:53 circus[2237] [INFO] aiida-dev stopped
Full Error log
2023-05-29 21:48:37 circus[2237] [INFO] Arbiter exiting
2023-05-29 21:48:37 circus[2246] [INFO] Stats streamer stopped
2023-05-29 21:48:37 tornado.general[2246] [WARNING] Got events for stream <zmq.eventloop.zmqstream.ZMQStream object at 0x104370580> attached to closed socket: Socket operation on non-socket
2023-05-29 21:48:37 circus[2246] [INFO] Stats streamer stopped
2023-05-29 21:48:37 circus[2246] [INFO] Stats streamer stopped
2023-05-29 21:48:37 circus[2237] [INFO] circusd-stats stopped
2023-05-29 21:48:38 tornado.application[2237] [ERROR] Exception in callback <bound method Arbiter.manage_watchers of <circus.arbiter.Arbiter object at 0x102866dc0>>
Traceback (most recent call last):
  File "/Users/mbercx/.virtualenvs/super/lib/python3.9/site-packages/tornado/ioloop.py", line 919, in _run
    val = self.callback()
  File "/Users/mbercx/.virtualenvs/super/lib/python3.9/site-packages/circus/util.py", line 1038, in wrapper
    raise ConflictError("arbiter is already running %s command"
circus.exc.ConflictError: arbiter is already running arbiter_stop command
2023-05-29 21:48:39 tornado.application[2237] [ERROR] Exception in callback <bound method Arbiter.manage_watchers of <circus.arbiter.Arbiter object at 0x102866dc0>>
Traceback (most recent call last):
  File "/Users/mbercx/.virtualenvs/super/lib/python3.9/site-packages/tornado/ioloop.py", line 919, in _run
    val = self.callback()
  File "/Users/mbercx/.virtualenvs/super/lib/python3.9/site-packages/circus/util.py", line 1038, in wrapper
    raise ConflictError("arbiter is already running %s command"
circus.exc.ConflictError: arbiter is already running arbiter_stop command
2023-05-29 21:48:40 tornado.application[2237] [ERROR] Exception in callback <bound method Arbiter.manage_watchers of <circus.arbiter.Arbiter object at 0x102866dc0>>
Traceback (most recent call last):
  File "/Users/mbercx/.virtualenvs/super/lib/python3.9/site-packages/tornado/ioloop.py", line 919, in _run
    val = self.callback()
  File "/Users/mbercx/.virtualenvs/super/lib/python3.9/site-packages/circus/util.py", line 1038, in wrapper
    raise ConflictError("arbiter is already running %s command"
circus.exc.ConflictError: arbiter is already running arbiter_stop command
2023-05-29 21:48:41 tornado.application[2237] [ERROR] Exception in callback <bound method Arbiter.manage_watchers of <circus.arbiter.Arbiter object at 0x102866dc0>>
Traceback (most recent call last):
  File "/Users/mbercx/.virtualenvs/super/lib/python3.9/site-packages/tornado/ioloop.py", line 919, in _run
    val = self.callback()
  File "/Users/mbercx/.virtualenvs/super/lib/python3.9/site-packages/circus/util.py", line 1038, in wrapper
    raise ConflictError("arbiter is already running %s command"
circus.exc.ConflictError: arbiter is already running arbiter_stop command
2023-05-29 21:48:42 tornado.application[2237] [ERROR] Exception in callback <bound method Arbiter.manage_watchers of <circus.arbiter.Arbiter object at 0x102866dc0>>
Traceback (most recent call last):
  File "/Users/mbercx/.virtualenvs/super/lib/python3.9/site-packages/tornado/ioloop.py", line 919, in _run
    val = self.callback()
  File "/Users/mbercx/.virtualenvs/super/lib/python3.9/site-packages/circus/util.py", line 1038, in wrapper
    raise ConflictError("arbiter is already running %s command"
circus.exc.ConflictError: arbiter is already running arbiter_stop command
2023-05-29 21:48:43 tornado.application[2237] [ERROR] Exception in callback <bound method Arbiter.manage_watchers of <circus.arbiter.Arbiter object at 0x102866dc0>>
Traceback (most recent call last):
  File "/Users/mbercx/.virtualenvs/super/lib/python3.9/site-packages/tornado/ioloop.py", line 919, in _run
    val = self.callback()
  File "/Users/mbercx/.virtualenvs/super/lib/python3.9/site-packages/circus/util.py", line 1038, in wrapper
    raise ConflictError("arbiter is already running %s command"
circus.exc.ConflictError: arbiter is already running arbiter_stop command
2023-05-29 21:48:44 tornado.application[2237] [ERROR] Exception in callback <bound method Arbiter.manage_watchers of <circus.arbiter.Arbiter object at 0x102866dc0>>
Traceback (most recent call last):
  File "/Users/mbercx/.virtualenvs/super/lib/python3.9/site-packages/tornado/ioloop.py", line 919, in _run
    val = self.callback()
  File "/Users/mbercx/.virtualenvs/super/lib/python3.9/site-packages/circus/util.py", line 1038, in wrapper
    raise ConflictError("arbiter is already running %s command"
circus.exc.ConflictError: arbiter is already running arbiter_stop command
2023-05-29 21:48:45 tornado.application[2237] [ERROR] Exception in callback <bound method Arbiter.manage_watchers of <circus.arbiter.Arbiter object at 0x102866dc0>>
Traceback (most recent call last):
  File "/Users/mbercx/.virtualenvs/super/lib/python3.9/site-packages/tornado/ioloop.py", line 919, in _run
    val = self.callback()
  File "/Users/mbercx/.virtualenvs/super/lib/python3.9/site-packages/circus/util.py", line 1038, in wrapper
    raise ConflictError("arbiter is already running %s command"
circus.exc.ConflictError: arbiter is already running arbiter_stop command
2023-05-29 21:48:46 tornado.application[2237] [ERROR] Exception in callback <bound method Arbiter.manage_watchers of <circus.arbiter.Arbiter object at 0x102866dc0>>
Traceback (most recent call last):
  File "/Users/mbercx/.virtualenvs/super/lib/python3.9/site-packages/tornado/ioloop.py", line 919, in _run
    val = self.callback()
  File "/Users/mbercx/.virtualenvs/super/lib/python3.9/site-packages/circus/util.py", line 1038, in wrapper
    raise ConflictError("arbiter is already running %s command"
circus.exc.ConflictError: arbiter is already running arbiter_stop command
2023-05-29 21:48:47 tornado.application[2237] [ERROR] Exception in callback <bound method Arbiter.manage_watchers of <circus.arbiter.Arbiter object at 0x102866dc0>>
Traceback (most recent call last):
  File "/Users/mbercx/.virtualenvs/super/lib/python3.9/site-packages/tornado/ioloop.py", line 919, in _run
    val = self.callback()
  File "/Users/mbercx/.virtualenvs/super/lib/python3.9/site-packages/circus/util.py", line 1038, in wrapper
    raise ConflictError("arbiter is already running %s command"
circus.exc.ConflictError: arbiter is already running arbiter_stop command
2023-05-29 21:48:48 tornado.application[2237] [ERROR] Exception in callback <bound method Arbiter.manage_watchers of <circus.arbiter.Arbiter object at 0x102866dc0>>
Traceback (most recent call last):
  File "/Users/mbercx/.virtualenvs/super/lib/python3.9/site-packages/tornado/ioloop.py", line 919, in _run
    val = self.callback()
  File "/Users/mbercx/.virtualenvs/super/lib/python3.9/site-packages/circus/util.py", line 1038, in wrapper
    raise ConflictError("arbiter is already running %s command"
circus.exc.ConflictError: arbiter is already running arbiter_stop command
2023-05-29 21:48:49 tornado.application[2237] [ERROR] Exception in callback <bound method Arbiter.manage_watchers of <circus.arbiter.Arbiter object at 0x102866dc0>>
Traceback (most recent call last):
  File "/Users/mbercx/.virtualenvs/super/lib/python3.9/site-packages/tornado/ioloop.py", line 919, in _run
    val = self.callback()
  File "/Users/mbercx/.virtualenvs/super/lib/python3.9/site-packages/circus/util.py", line 1038, in wrapper
    raise ConflictError("arbiter is already running %s command"
circus.exc.ConflictError: arbiter is already running arbiter_stop command
2023-05-29 21:48:50 tornado.application[2237] [ERROR] Exception in callback <bound method Arbiter.manage_watchers of <circus.arbiter.Arbiter object at 0x102866dc0>>
Traceback (most recent call last):
  File "/Users/mbercx/.virtualenvs/super/lib/python3.9/site-packages/tornado/ioloop.py", line 919, in _run
    val = self.callback()
  File "/Users/mbercx/.virtualenvs/super/lib/python3.9/site-packages/circus/util.py", line 1038, in wrapper
    raise ConflictError("arbiter is already running %s command"
circus.exc.ConflictError: arbiter is already running arbiter_stop command
2023-05-29 21:48:51 tornado.application[2237] [ERROR] Exception in callback <bound method Arbiter.manage_watchers of <circus.arbiter.Arbiter object at 0x102866dc0>>
Traceback (most recent call last):
  File "/Users/mbercx/.virtualenvs/super/lib/python3.9/site-packages/tornado/ioloop.py", line 919, in _run
    val = self.callback()
  File "/Users/mbercx/.virtualenvs/super/lib/python3.9/site-packages/circus/util.py", line 1038, in wrapper
    raise ConflictError("arbiter is already running %s command"
circus.exc.ConflictError: arbiter is already running arbiter_stop command
2023-05-29 21:48:52 tornado.application[2237] [ERROR] Exception in callback <bound method Arbiter.manage_watchers of <circus.arbiter.Arbiter object at 0x102866dc0>>
Traceback (most recent call last):
  File "/Users/mbercx/.virtualenvs/super/lib/python3.9/site-packages/tornado/ioloop.py", line 919, in _run
    val = self.callback()
  File "/Users/mbercx/.virtualenvs/super/lib/python3.9/site-packages/circus/util.py", line 1038, in wrapper
    raise ConflictError("arbiter is already running %s command"
circus.exc.ConflictError: arbiter is already running arbiter_stop command
2023-05-29 21:48:53 tornado.application[2237] [ERROR] Exception in callback <bound method Arbiter.manage_watchers of <circus.arbiter.Arbiter object at 0x102866dc0>>
Traceback (most recent call last):
  File "/Users/mbercx/.virtualenvs/super/lib/python3.9/site-packages/tornado/ioloop.py", line 919, in _run
    val = self.callback()
  File "/Users/mbercx/.virtualenvs/super/lib/python3.9/site-packages/circus/util.py", line 1038, in wrapper
    raise ConflictError("arbiter is already running %s command"
circus.exc.ConflictError: arbiter is already running arbiter_stop command
2023-05-29 21:48:53 circus[2237] [INFO] aiida-dev stopped

Steps to reproduce

I haven't found a way to consistently reproduce the problem yet, but it seems to occur more often when I am running processes that involve large data transfers. For the event above, a dozen orso calculation jobs had paused due to connection issues, which I then had restarted with verdi process play -a.

Your environment

  • Operating system [e.g. Linux]: macOS Monterey v12.5
  • Python version [e.g. 3.7.1]: Python 3.9.16
  • aiida-core version [e.g. 1.2.1]: sph/fix/6013/verdi-computer-test branch, it seems. Commit 47cd5155e, but I've also had it happen when running on main.
  • circus: 0.18.0

Additional context

From an offline discussion I know @unkcpz has also run into this issue.

mbercx avatar May 29 '23 20:05 mbercx

I've been running into this one quite consistently this morning, so reproduction has become less of an issue. 😅 Now also running on main.

One of my processes seems to have also excepted due to this issue:

  File "/Users/mbercx/.virtualenvs/super/lib/python3.9/site-packages/plumpy/events.py", line 97, in run
    await self._callback(*self._args, **self._kwargs)
  File "/Users/mbercx/.virtualenvs/super/lib/python3.9/site-packages/plumpy/processes.py", line 567, in _run_task
    result = await coro(*args, **kwargs)
  File "/Users/mbercx/.virtualenvs/super/lib/python3.9/site-packages/plumpy/utils.py", line 245, in wrap
    return coro_or_fn(*args, **kwargs)
  File "/Users/mbercx/project/super/code/aiida-core/aiida/engine/processes/workchains/workchain.py", line 413, in _on_awaitable_finished
    self.resume()
  File "/Users/mbercx/.virtualenvs/super/lib/python3.9/site-packages/plumpy/base/state_machine.py", line 98, in transition
    result = wrapped(self, *a, **kw)
  File "/Users/mbercx/.virtualenvs/super/lib/python3.9/site-packages/plumpy/processes.py", line 1112, in resume
    return self._state.resume(*args)  # type: ignore
  File "/Users/mbercx/.virtualenvs/super/lib/python3.9/site-packages/plumpy/process_states.py", line 339, in resume
    self._waiting_future.set_result(value)
asyncio.exceptions.InvalidStateError: invalid state

mbercx avatar May 30 '23 07:05 mbercx

Thanks for the report @mbercx . I am pretty sure that the error in the circus log is unrelated to any Processes excepting. So the second stack trace you mentioned should be independent. I think the circus exception is mostly innocent and should not have any negative effects to the system functioning.

The second stack trace is problematic however as it actually causes AiiDA processes to except. I am not sure what could cause this. Apparently the value that is getting set on the future is invalid, but annoyingly the exception raised by asyncio doesn't say what that value is.

If you say that you can consistently reproduce it, what would be super useful is to add some print statements to plumpy/process_states printing value on line 339. This way we get an idea what is getting passed and we can retrace where it is coming from and why it is missing a value.

sphuber avatar Jun 02 '23 10:06 sphuber

If you say that you can consistently reproduce it

This was for the circus-related error. I'll see if I will still bump into the other one.

mbercx avatar Jun 03 '23 09:06 mbercx

This may be of interest: https://github.com/circus-tent/circus/issues/1202

sphuber avatar Sep 22 '23 13:09 sphuber