aiida-core
WorkChains get excepted, possibly because the daemons get overwhelmed
Describe the bug
I am not sure whether the daemons getting overwhelmed is the reason behind it, but when I launch ~200 calculations together, they get excepted with an aiormq.exceptions.ChannelInvalidStateError: <Channel: "4"> closed error. It is similar to this issue that I opened previously.
The full error report follows:
(aiida168) tthakur@theospc31:~$ verdi process report 918283
2023-02-03 18:05:18 [564668 | REPORT]: [918283|LinDiffusionWorkChain|setup]: launching WorkChain with pinball coefficients defined by <813418>
2023-02-03 18:05:18 [564669 | REPORT]: [918283|LinDiffusionWorkChain|run_process]: launching ReplayMDWorkChain<918287>
2023-02-03 18:05:21 [564673 | REPORT]: [918287|ReplayMDWorkChain|run_process]: launching FlipperCalculation<918302> iteration #1
2023-02-09 03:27:35 [572361 | REPORT]: [918287|ReplayMDWorkChain|report_error_handled]: FlipperCalculation<918302> failed with exit status 312: The stdout output file was incomplete probably because the calculation got interrupted.
2023-02-09 03:27:35 [572362 | REPORT]: [918287|ReplayMDWorkChain|report_error_handled]: Action taken: Restarting calculation...
2023-02-09 03:27:35 [572363 | REPORT]: [918287|ReplayMDWorkChain|inspect_process]: FlipperCalculation<918302> failed but a handler dealt with the problem, restarting
2023-02-09 03:27:35 [572364 | REPORT]: [918287|ReplayMDWorkChain|check_energy_fluctuations]: FlipperCalculation<918302> [check_energy_fluctuations]: Total energy fluctuations = 0.004842710000957595 < threshold (uuid: dd4cc43d-935f-4abe-abc9-d2646a108927 (pk: 918276) value: 180.0) OK
2023-02-09 03:27:35 [572365 | REPORT]: [918287|ReplayMDWorkChain|update_mdsteps]: FlipperCalculation<918302> ran 109190 steps (109190 done - 890810 to go).
2023-02-09 03:31:09 [572501 | REPORT]: [918287|ReplayMDWorkChain|run_process]: launching FlipperCalculation<924759> iteration #2
2023-02-14 21:24:05 [637726 | ERROR]: Traceback (most recent call last):
File "/home/tthakur/miniconda3/envs/aiida168/lib/python3.9/site-packages/aiida/manage/external/rmq.py", line 208, in _continue
result = await super()._continue(communicator, pid, nowait, tag)
File "/home/tthakur/miniconda3/envs/aiida168/lib/python3.9/site-packages/plumpy/process_comms.py", line 607, in _continue
proc = cast('Process', saved_state.unbundle(self._load_context))
File "/home/tthakur/miniconda3/envs/aiida168/lib/python3.9/site-packages/plumpy/persistence.py", line 60, in unbundle
return Savable.load(self, load_context)
File "/home/tthakur/miniconda3/envs/aiida168/lib/python3.9/site-packages/plumpy/persistence.py", line 452, in load
return load_cls.recreate_from(saved_state, load_context)
File "/home/tthakur/miniconda3/envs/aiida168/lib/python3.9/site-packages/plumpy/processes.py", line 239, in recreate_from
call_with_super_check(process.init)
File "/home/tthakur/miniconda3/envs/aiida168/lib/python3.9/site-packages/plumpy/base/utils.py", line 29, in call_with_super_check
wrapped(*args, **kwargs)
File "/home/tthakur/miniconda3/envs/aiida168/lib/python3.9/site-packages/aiida/engine/processes/process.py", line 159, in init
super().init()
File "/home/tthakur/miniconda3/envs/aiida168/lib/python3.9/site-packages/plumpy/base/utils.py", line 16, in wrapper
wrapped(self, *args, **kwargs)
File "/home/tthakur/miniconda3/envs/aiida168/lib/python3.9/site-packages/plumpy/processes.py", line 298, in init
identifier = self._communicator.add_rpc_subscriber(self.message_receive, identifier=str(self.pid))
File "/home/tthakur/miniconda3/envs/aiida168/lib/python3.9/site-packages/plumpy/communications.py", line 141, in add_rpc_subscriber
return self._communicator.add_rpc_subscriber(converted, identifier)
File "/home/tthakur/miniconda3/envs/aiida168/lib/python3.9/site-packages/kiwipy/rmq/threadcomms.py", line 215, in add_rpc_subscriber
return self._loop_scheduler.await_(
File "/home/tthakur/miniconda3/envs/aiida168/lib/python3.9/site-packages/pytray/aiothreads.py", line 159, in await_
return self.await_submit(awaitable).result(timeout=self.task_timeout)
File "/home/tthakur/miniconda3/envs/aiida168/lib/python3.9/concurrent/futures/_base.py", line 446, in result
return self.__get_result()
File "/home/tthakur/miniconda3/envs/aiida168/lib/python3.9/concurrent/futures/_base.py", line 391, in __get_result
raise self._exception
File "/home/tthakur/miniconda3/envs/aiida168/lib/python3.9/site-packages/pytray/aiothreads.py", line 36, in done
result = done_future.result()
File "/home/tthakur/miniconda3/envs/aiida168/lib/python3.9/asyncio/futures.py", line 201, in result
raise self._exception
File "/home/tthakur/miniconda3/envs/aiida168/lib/python3.9/asyncio/tasks.py", line 258, in __step
result = coro.throw(exc)
File "/home/tthakur/miniconda3/envs/aiida168/lib/python3.9/site-packages/pytray/aiothreads.py", line 178, in proxy
return await awaitable
File "/home/tthakur/miniconda3/envs/aiida168/lib/python3.9/site-packages/kiwipy/rmq/communicator.py", line 482, in add_rpc_subscriber
identifier = await msg_subscriber.add_rpc_subscriber(subscriber, identifier)
File "/home/tthakur/miniconda3/envs/aiida168/lib/python3.9/site-packages/kiwipy/rmq/communicator.py", line 123, in add_rpc_subscriber
rpc_queue = await self._channel.declare_queue(exclusive=True, arguments=self._rmq_queue_arguments)
File "/home/tthakur/miniconda3/envs/aiida168/lib/python3.9/site-packages/aio_pika/robust_channel.py", line 173, in declare_queue
queue = await super().declare_queue(
File "/home/tthakur/miniconda3/envs/aiida168/lib/python3.9/site-packages/aio_pika/channel.py", line 325, in declare_queue
await queue.declare(timeout=timeout)
File "/home/tthakur/miniconda3/envs/aiida168/lib/python3.9/site-packages/aio_pika/queue.py", line 92, in declare
self.declaration_result = await asyncio.wait_for(
File "/home/tthakur/miniconda3/envs/aiida168/lib/python3.9/asyncio/tasks.py", line 442, in wait_for
return await fut
File "/home/tthakur/miniconda3/envs/aiida168/lib/python3.9/site-packages/aiormq/channel.py", line 703, in queue_declare
return await self.rpc(
File "/home/tthakur/miniconda3/envs/aiida168/lib/python3.9/site-packages/aiormq/base.py", line 168, in wrap
return await self.create_task(func(self, *args, **kwargs))
File "/home/tthakur/miniconda3/envs/aiida168/lib/python3.9/site-packages/aiormq/base.py", line 25, in __inner
return await self.task
File "/home/tthakur/miniconda3/envs/aiida168/lib/python3.9/asyncio/futures.py", line 284, in __await__
yield self # This tells Task to wait for completion.
File "/home/tthakur/miniconda3/envs/aiida168/lib/python3.9/asyncio/tasks.py", line 328, in __wakeup
future.result()
File "/home/tthakur/miniconda3/envs/aiida168/lib/python3.9/asyncio/futures.py", line 201, in result
raise self._exception
File "/home/tthakur/miniconda3/envs/aiida168/lib/python3.9/asyncio/tasks.py", line 256, in __step
result = coro.send(None)
File "/home/tthakur/miniconda3/envs/aiida168/lib/python3.9/site-packages/aiormq/channel.py", line 121, in rpc
raise ChannelInvalidStateError("writer is None")
aiormq.exceptions.ChannelInvalidStateError: writer is None
2023-02-14 21:24:06 [637729 | ERROR]: Traceback (most recent call last):
File "/home/tthakur/miniconda3/envs/aiida168/lib/python3.9/site-packages/aiida/manage/external/rmq.py", line 208, in _continue
result = await super()._continue(communicator, pid, nowait, tag)
File "/home/tthakur/miniconda3/envs/aiida168/lib/python3.9/site-packages/plumpy/process_comms.py", line 613, in _continue
await proc.step_until_terminated()
File "/home/tthakur/miniconda3/envs/aiida168/lib/python3.9/site-packages/plumpy/processes.py", line 1230, in step_until_terminated
await self.step()
File "/home/tthakur/miniconda3/envs/aiida168/lib/python3.9/site-packages/plumpy/processes.py", line 1216, in step
self.transition_to(next_state)
File "/home/tthakur/miniconda3/envs/aiida168/lib/python3.9/site-packages/plumpy/base/state_machine.py", line 335, in transition_to
self.transition_failed(initial_state_label, label, *sys.exc_info()[1:])
File "/home/tthakur/miniconda3/envs/aiida168/lib/python3.9/site-packages/plumpy/base/state_machine.py", line 351, in transition_failed
raise exception.with_traceback(trace)
File "/home/tthakur/miniconda3/envs/aiida168/lib/python3.9/site-packages/plumpy/base/state_machine.py", line 320, in transition_to
self._enter_next_state(new_state)
File "/home/tthakur/miniconda3/envs/aiida168/lib/python3.9/site-packages/plumpy/base/state_machine.py", line 386, in _enter_next_state
self._fire_state_event(StateEventHook.ENTERED_STATE, last_state)
File "/home/tthakur/miniconda3/envs/aiida168/lib/python3.9/site-packages/plumpy/base/state_machine.py", line 299, in _fire_state_event
callback(self, hook, state)
File "/home/tthakur/miniconda3/envs/aiida168/lib/python3.9/site-packages/plumpy/processes.py", line 326, in <lambda>
lambda _s, _h, from_state: self.on_entered(cast(Optional[process_states.State], from_state)),
File "/home/tthakur/miniconda3/envs/aiida168/lib/python3.9/site-packages/aiida/engine/processes/process.py", line 390, in on_entered
super().on_entered(from_state)
File "/home/tthakur/miniconda3/envs/aiida168/lib/python3.9/site-packages/plumpy/processes.py", line 700, in on_entered
self._communicator.broadcast_send(body=None, sender=self.pid, subject=subject)
File "/home/tthakur/miniconda3/envs/aiida168/lib/python3.9/site-packages/plumpy/communications.py", line 175, in broadcast_send
return self._communicator.broadcast_send(body, sender, subject, correlation_id)
File "/home/tthakur/miniconda3/envs/aiida168/lib/python3.9/site-packages/kiwipy/rmq/threadcomms.py", line 258, in broadcast_send
result = self._loop_scheduler.await_(
File "/home/tthakur/miniconda3/envs/aiida168/lib/python3.9/site-packages/pytray/aiothreads.py", line 159, in await_
return self.await_submit(awaitable).result(timeout=self.task_timeout)
File "/home/tthakur/miniconda3/envs/aiida168/lib/python3.9/concurrent/futures/_base.py", line 446, in result
return self.__get_result()
File "/home/tthakur/miniconda3/envs/aiida168/lib/python3.9/concurrent/futures/_base.py", line 391, in __get_result
raise self._exception
File "/home/tthakur/miniconda3/envs/aiida168/lib/python3.9/site-packages/pytray/aiothreads.py", line 36, in done
result = done_future.result()
File "/home/tthakur/miniconda3/envs/aiida168/lib/python3.9/asyncio/futures.py", line 201, in result
raise self._exception
File "/home/tthakur/miniconda3/envs/aiida168/lib/python3.9/asyncio/tasks.py", line 256, in __step
result = coro.send(None)
File "/home/tthakur/miniconda3/envs/aiida168/lib/python3.9/site-packages/pytray/aiothreads.py", line 178, in proxy
return await awaitable
File "/home/tthakur/miniconda3/envs/aiida168/lib/python3.9/site-packages/kiwipy/rmq/communicator.py", line 522, in broadcast_send
result = await publisher.broadcast_send(body, sender, subject, correlation_id)
File "/home/tthakur/miniconda3/envs/aiida168/lib/python3.9/site-packages/kiwipy/rmq/communicator.py", line 66, in broadcast_send
return await self.publish(message, routing_key=defaults.BROADCAST_TOPIC, mandatory=False)
File "/home/tthakur/miniconda3/envs/aiida168/lib/python3.9/site-packages/kiwipy/rmq/messages.py", line 209, in publish
result = await self._exchange.publish(message, routing_key=routing_key, mandatory=mandatory)
File "/home/tthakur/miniconda3/envs/aiida168/lib/python3.9/site-packages/aio_pika/exchange.py", line 233, in publish
return await asyncio.wait_for(
File "/home/tthakur/miniconda3/envs/aiida168/lib/python3.9/asyncio/tasks.py", line 442, in wait_for
return await fut
File "/home/tthakur/miniconda3/envs/aiida168/lib/python3.9/site-packages/aiormq/channel.py", line 508, in basic_publish
async with self.lock:
File "/home/tthakur/miniconda3/envs/aiida168/lib/python3.9/site-packages/aiormq/channel.py", line 90, in lock
raise ChannelInvalidStateError("%r closed" % self)
aiormq.exceptions.ChannelInvalidStateError: <Channel: "4"> closed
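The report above contains two separate tracebacks that end in the same exception type with different messages ("writer is None" and "<Channel: "4"> closed"). For anyone triaging many such reports, here is a minimal sketch that tallies the final exception lines from saved `verdi process report` output. The names `EXCEPTION_LINE` and `summarize_exceptions` are made up for illustration and are not part of aiida-core. It assumes the final line of each traceback has the form `dotted.ExceptionName: message`.

```python
import re
from collections import Counter

# Hypothetical helper, not part of aiida-core.
# Matches the last line of a Python traceback, for example
# "aiormq.exceptions.ChannelInvalidStateError: writer is None".
EXCEPTION_LINE = re.compile(
    r"^(?P<type>[A-Za-z_][\w.]*(?:Error|Exception)): (?P<message>.*)$"
)


def summarize_exceptions(report_text: str) -> Counter:
    """Count (exception type, message) pairs found in a process report."""
    counts = Counter()
    for line in report_text.splitlines():
        match = EXCEPTION_LINE.match(line.strip())
        if match:
            counts[(match.group("type"), match.group("message"))] += 1
    return counts


if __name__ == "__main__":
    # A few lines copied from the report above; in practice the text could
    # come from the saved output of `verdi process report <pk>`.
    sample = """
2023-02-14 21:24:05 [637726 | ERROR]: Traceback (most recent call last):
raise ChannelInvalidStateError("writer is None")
aiormq.exceptions.ChannelInvalidStateError: writer is None
2023-02-14 21:24:06 [637729 | ERROR]: Traceback (most recent call last):
raise ChannelInvalidStateError("%r closed" % self)
aiormq.exceptions.ChannelInvalidStateError: <Channel: "4"> closed
"""
    for (exc_type, message), count in summarize_exceptions(sample).items():
        print(f"{count} x {exc_type}: {message}")
```

Grouping by message as well as type keeps the two failure points visible separately: one is raised while declaring the RPC queue when the process is reloaded, the other while broadcasting a state change.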
Steps to reproduce
Steps to reproduce the behavior:
- Launch a lot of WorkChains (>100) within a few hours.
- Wait for all the processes to leave the Created state and start running properly.
- Optionally restart the daemon, but this is not strictly required.
- Most WorkChains will except at this point.
Expected behavior
Nothing unusual should happen: the WorkChains should run normally and not get excepted.
Your environment
- Operating system: Ubuntu 22.04
- Python version: 3.9.13
- aiida-core version: 1.6.8
- RabbitMQ: 3.7.28
- PostgreSQL: 14.5
Additional context
For some reason I am seeing this issue much more frequently now. It used to happen only rarely, and only if I restarted the daemons, but the last time it happened I didn't do anything: the WorkChains just got excepted after I left the machine alone over the weekend. My environment is still the same; only my AiiDA database has become bigger.
Hi @tsthakur, I'm experiencing the same issue.
My environment details are:
- Operating system: MacOS
- Python version: 3.10.9
- aiida-core version: 2.2.2
- RabbitMQ version: 3.11.8
I get a similar report message:
2023-02-21 15:56:14 [3967 | REPORT]: [14092|VaspWorkChain|run_process]: launching VaspCalculation<15027> iteration #1
2023-02-21 16:25:23 [4125 | ERROR]: Traceback (most recent call last):
File "/Users/ireaml/miniconda3/envs/aiida/lib/python3.10/site-packages/aiida/manage/external/rmq/launcher.py", line 90, in _continue
result = await super()._continue(communicator, pid, nowait, tag)
File "/Users/ireaml/miniconda3/envs/aiida/lib/python3.10/site-packages/plumpy/process_comms.py", line 604, in _continue
proc = cast('Process', saved_state.unbundle(self._load_context))
File "/Users/ireaml/miniconda3/envs/aiida/lib/python3.10/site-packages/plumpy/persistence.py", line 58, in unbundle
return Savable.load(self, load_context)
File "/Users/ireaml/miniconda3/envs/aiida/lib/python3.10/site-packages/plumpy/persistence.py", line 450, in load
return load_cls.recreate_from(saved_state, load_context)
File "/Users/ireaml/miniconda3/envs/aiida/lib/python3.10/site-packages/plumpy/processes.py", line 244, in recreate_from
call_with_super_check(process.init)
File "/Users/ireaml/miniconda3/envs/aiida/lib/python3.10/site-packages/plumpy/base/utils.py", line 29, in call_with_super_check
wrapped(*args, **kwargs)
File "/Users/ireaml/miniconda3/envs/aiida/lib/python3.10/site-packages/aiida/engine/processes/process.py", line 185, in init
super().init()
File "/Users/ireaml/miniconda3/envs/aiida/lib/python3.10/site-packages/plumpy/base/utils.py", line 16, in wrapper
wrapped(self, *args, **kwargs)
File "/Users/ireaml/miniconda3/envs/aiida/lib/python3.10/site-packages/plumpy/processes.py", line 303, in init
identifier = self._communicator.add_rpc_subscriber(self.message_receive, identifier=str(self.pid))
File "/Users/ireaml/miniconda3/envs/aiida/lib/python3.10/site-packages/plumpy/communications.py", line 141, in add_rpc_subscriber
return self._communicator.add_rpc_subscriber(converted, identifier)
File "/Users/ireaml/miniconda3/envs/aiida/lib/python3.10/site-packages/kiwipy/rmq/threadcomms.py", line 215, in add_rpc_subscriber
return self._loop_scheduler.await_(
File "/Users/ireaml/miniconda3/envs/aiida/lib/python3.10/site-packages/pytray/aiothreads.py", line 164, in await_
return self.await_submit(awaitable).result(timeout=self.task_timeout)
File "/Users/ireaml/miniconda3/envs/aiida/lib/python3.10/concurrent/futures/_base.py", line 458, in result
return self.__get_result()
File "/Users/ireaml/miniconda3/envs/aiida/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
raise self._exception
File "/Users/ireaml/miniconda3/envs/aiida/lib/python3.10/asyncio/tasks.py", line 234, in __step
result = coro.throw(exc)
File "/Users/ireaml/miniconda3/envs/aiida/lib/python3.10/site-packages/pytray/aiothreads.py", line 178, in coro
res = await awaitable
File "/Users/ireaml/miniconda3/envs/aiida/lib/python3.10/site-packages/kiwipy/rmq/communicator.py", line 482, in add_rpc_subscriber
identifier = await msg_subscriber.add_rpc_subscriber(subscriber, identifier)
File "/Users/ireaml/miniconda3/envs/aiida/lib/python3.10/site-packages/kiwipy/rmq/communicator.py", line 123, in add_rpc_subscriber
rpc_queue = await self._channel.declare_queue(exclusive=True, arguments=self._rmq_queue_arguments)
File "/Users/ireaml/miniconda3/envs/aiida/lib/python3.10/site-packages/aio_pika/robust_channel.py", line 173, in declare_queue
queue = await super().declare_queue(
File "/Users/ireaml/miniconda3/envs/aiida/lib/python3.10/site-packages/aio_pika/channel.py", line 325, in declare_queue
await queue.declare(timeout=timeout)
File "/Users/ireaml/miniconda3/envs/aiida/lib/python3.10/site-packages/aio_pika/queue.py", line 92, in declare
self.declaration_result = await asyncio.wait_for(
File "/Users/ireaml/miniconda3/envs/aiida/lib/python3.10/asyncio/tasks.py", line 408, in wait_for
return await fut
File "/Users/ireaml/miniconda3/envs/aiida/lib/python3.10/site-packages/aiormq/channel.py", line 703, in queue_declare
return await self.rpc(
File "/Users/ireaml/miniconda3/envs/aiida/lib/python3.10/site-packages/aiormq/base.py", line 168, in wrap
return await self.create_task(func(self, *args, **kwargs))
File "/Users/ireaml/miniconda3/envs/aiida/lib/python3.10/site-packages/aiormq/base.py", line 25, in __inner
return await self.task
File "/Users/ireaml/miniconda3/envs/aiida/lib/python3.10/asyncio/futures.py", line 285, in __await__
yield self # This tells Task to wait for completion.
File "/Users/ireaml/miniconda3/envs/aiida/lib/python3.10/asyncio/tasks.py", line 304, in __wakeup
future.result()
File "/Users/ireaml/miniconda3/envs/aiida/lib/python3.10/asyncio/futures.py", line 201, in result
raise self._exception.with_traceback(self._exception_tb)
File "/Users/ireaml/miniconda3/envs/aiida/lib/python3.10/asyncio/tasks.py", line 232, in __step
result = coro.send(None)
File "/Users/ireaml/miniconda3/envs/aiida/lib/python3.10/site-packages/aiormq/channel.py", line 121, in rpc
raise ChannelInvalidStateError("writer is None")
aiormq.exceptions.ChannelInvalidStateError: writer is None