aiida-core SSH issues lead to processes becoming unreachable

After restarting my local machine, I started some daemons before running ssh-add -K on any appropriate ssh keys.

As such, all waiting processes ended up with errors of the form:

+-> ERROR at 2021-03-10 14:15:04.968898+00:00
 | Traceback (most recent call last):
 |   File "/Users/matt/.pyenv/versions/3.9.0/lib/python3.9/site-packages/aiida/engine/utils.py", line 188, in exponential_backoff_retry
 |     result = await coro()
...

 |   File "/Users/matt/.pyenv/versions/3.9.0/lib/python3.9/site-packages/paramiko/ed25519key.py", line 96, in _parse_signing_key_data
 |     raise PasswordRequiredException(
 | paramiko.ssh_exception.PasswordRequiredException: Private key file is encrypted

That is fine in itself. There are multiple errors like this per process, so I assume the exponential backoff mechanism is preventing it from trying and failing to connect repeatedly. I stopped the daemon, fixed the SSH keys, restarted the daemon, and the connection issues were resolved.

The issue now is that any processes that had that error are showing as "unreachable". I have since created new processes, which can be paused and played with no issue, but all the processes that were queued up when I made the mistake with the SSH key can no longer be paused or played.

I'm unsure whether this is expected behaviour with the backoff mechanism or a bug. I'm also unsure whether or not this means all my jobs queued on the HPC need cancelling and resubmitting so they have an associated "reachable" process with them.

Mar 10 '21 18:03 mjclarke94

Thanks for reporting. @sphuber @chrisjsewell any ideas why this would lead processes to become unreachable?

Mar 11 '21 12:03 ltalirz

When you say they are unreachable, that is when you try to run verdi process play/pause on them, right? This should not happen in principle, so it would most likely point to a bug. Can I ask what version of aiida-core you are using? You can run verdi --version to determine this (as long as you did not install from a particular branch directly from the repository).

Try to restart the daemon once more with verdi daemon restart --reset and wait a bit for things to get running again (a minute or so). Then try to play them again. If they are still marked as unreachable, here is a trick that you can use to get them running again. Disclaimer: this should not be used regularly, as it can cause problems if used incorrectly.

In a verdi shell, do the following:

from aiida.manage.manager import get_manager
controller = get_manager().get_process_controller()
# Add the pks of the processes that have become unreachable to this list.
# Warning: do **not** add processes that are actually running and are reachable.
pks = []
for pk in pks:
    controller.continue_process(pk, no_reply=True, nowait=True)

Mar 11 '21 14:03 sphuber

Sorry for the slow reply. I've installed from develop, specifically commit d762522. The daemon restart didn't help, nor did restarting my machine and the backend services (postgres/rabbitmq).

I was in a rush so just deleted and resubmitted the processes (which in hindsight is probably not all that helpful for bug hunting....sorry!), but will try the snippet above if I inadvertently recreate it!

Mar 17 '21 11:03 mjclarke94

Unsure if related, but I seem to very frequently be getting the following on many of my jobs...

+-> ERROR at 2021-03-18 08:05:18.028135+00:00
 | Traceback (most recent call last):
 |   File "/Users/matt/.pyenv/versions/3.9.0/lib/python3.9/site-packages/paramiko/transport.py", line 2211, in _check_banner
 |     buf = self.packetizer.readline(timeout)
 |   File "/Users/matt/.pyenv/versions/3.9.0/lib/python3.9/site-packages/paramiko/packet.py", line 380, in readline
 |     buf += self._read_timeout(timeout)
 |   File "/Users/matt/.pyenv/versions/3.9.0/lib/python3.9/site-packages/paramiko/packet.py", line 607, in _read_timeout
 |     x = self.__socket.recv(128)
 | ConnectionResetError: [Errno 54] Connection reset by peer
 |
 | During handling of the above exception, another exception occurred:
 |
 | Traceback (most recent call last):
 |   File "/Users/matt/.pyenv/versions/3.9.0/lib/python3.9/site-packages/aiida/engine/utils.py", line 188, in exponential_backoff_retry
 |     result = await coro()
 |   File "/Users/matt/.pyenv/versions/3.9.0/lib/python3.9/site-packages/aiida/engine/processes/calcjobs/tasks.py", line 190, in do_update
 |     job_info = await cancellable.with_interrupt(update_request)
 |   File "/Users/matt/.pyenv/versions/3.9.0/lib/python3.9/site-packages/aiida/engine/utils.py", line 95, in with_interrupt
 |     result = await next(wait_iter)
 |   File "/Users/matt/.pyenv/versions/3.9.0/lib/python3.9/asyncio/tasks.py", line 609, in _wait_for_one
 |     return f.result()  # May raise f.exception().
 |   File "/Users/matt/.pyenv/versions/3.9.0/lib/python3.9/asyncio/futures.py", line 201, in result
 |     raise self._exception
 |   File "/Users/matt/.pyenv/versions/3.9.0/lib/python3.9/asyncio/tasks.py", line 258, in __step
 |     result = coro.throw(exc)
 |   File "/Users/matt/.pyenv/versions/3.9.0/lib/python3.9/site-packages/aiida/engine/processes/calcjobs/manager.py", line 180, in updating
 |     await self._update_job_info()
 |   File "/Users/matt/.pyenv/versions/3.9.0/lib/python3.9/site-packages/aiida/engine/processes/calcjobs/manager.py", line 132, in _update_job_info
 |     self._jobs_cache = await self._get_jobs_from_scheduler()
 |   File "/Users/matt/.pyenv/versions/3.9.0/lib/python3.9/site-packages/aiida/engine/processes/calcjobs/manager.py", line 98, in _get_jobs_from_scheduler
 |     transport = await request
 |   File "/Users/matt/.pyenv/versions/3.9.0/lib/python3.9/asyncio/futures.py", line 284, in __await__
 |     yield self  # This tells Task to wait for completion.
 |   File "/Users/matt/.pyenv/versions/3.9.0/lib/python3.9/asyncio/tasks.py", line 328, in __wakeup
 |     future.result()
 |   File "/Users/matt/.pyenv/versions/3.9.0/lib/python3.9/asyncio/futures.py", line 201, in result
 |     raise self._exception
 |   File "/Users/matt/.pyenv/versions/3.9.0/lib/python3.9/site-packages/aiida/engine/transports.py", line 89, in do_open
 |     transport.open()
 |   File "/Users/matt/.pyenv/versions/3.9.0/lib/python3.9/site-packages/aiida/transports/plugins/ssh.py", line 438, in open
 |     self._client.connect(self._machine, **connection_arguments)
 |   File "/Users/matt/.pyenv/versions/3.9.0/lib/python3.9/site-packages/paramiko/client.py", line 406, in connect
 |     t.start_client(timeout=timeout)
 |   File "/Users/matt/.pyenv/versions/3.9.0/lib/python3.9/site-packages/paramiko/transport.py", line 660, in start_client
 |     raise e
 |   File "/Users/matt/.pyenv/versions/3.9.0/lib/python3.9/site-packages/paramiko/transport.py", line 2039, in run
 |     self._check_banner()
 |   File "/Users/matt/.pyenv/versions/3.9.0/lib/python3.9/site-packages/paramiko/transport.py", line 2215, in _check_banner
 |     raise SSHException(
 | paramiko.ssh_exception.SSHException: Error reading SSH protocol banner[Errno 54] Connection reset by peer

I'm running on a Mac rather than a persistent server, so I am wondering whether it going to sleep overnight is interrupting backend processes in a way AiiDA can't handle safely. I suggest that primarily because things run fine during the day but I tend to wake up to a big stack of errors, rather than because of any technical insight I have to offer!

Mar 18 '21 10:03 mjclarke94

When your laptop goes to sleep, the daemon should not actually be running. This should not be a problem as AiiDA is designed to be able to deal with this and simply continue the processes where it left off last time the daemon was stopped. That being said, it can be the case that when your computer wakes up and the daemon restarts, there is a problem with the SSH agent or keys, causing connections to the remote machine to fail, which is the exception that you see here. Ultimately, this should not be a problem since the exponential backoff mechanism will retry. If it keeps failing and the process is paused as a result, you can try to restart the daemon.
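
To illustrate the retry behaviour mentioned above, here is a minimal, generic sketch of an exponential backoff loop. It is not the actual aiida-core code (that is exponential_backoff_retry in aiida/engine/utils.py, as seen in the tracebacks); the names retry_with_backoff and flaky_connect, and the interval and attempt values, are hypothetical and chosen only for illustration.

```python
import asyncio


async def retry_with_backoff(coro_factory, initial_interval=0.01, max_attempts=5):
    """Await coro_factory() until it succeeds, doubling the wait after each failure.

    Hypothetical helper for illustration only; not AiiDA's implementation.
    """
    interval = initial_interval
    for attempt in range(1, max_attempts + 1):
        try:
            return await coro_factory()
        except Exception:
            if attempt == max_attempts:
                # Out of attempts: re-raise so the caller can react
                # (the thread says AiiDA may pause the process at this point).
                raise
            await asyncio.sleep(interval)
            interval *= 2  # wait twice as long before the next attempt


attempts = []


async def flaky_connect():
    """Hypothetical stand-in for opening an SSH transport: fails twice, then succeeds."""
    attempts.append(len(attempts) + 1)
    if len(attempts) < 3:
        raise ConnectionResetError("Connection reset by peer")
    return "connected"


result = asyncio.run(retry_with_backoff(flaky_connect))
print(result, attempts)
```

In this sketch a transient failure, such as a connection reset right after the machine wakes up, is absorbed by the retries, and only a failure that persists through every attempt reaches the caller.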

Mar 18 '21 11:03 sphuber

@mjclarke94 This is only semi-related, since processes should not become unreachable either way, but are you using the proxy_command configuration (to have a jump host)?

May 18 '21 13:05 dev-zero

Nope, just plain old SSH.

May 18 '21 13:05 mjclarke94

Just to log it: I also got unreachable processes in AiiDA version 1.5.0 (sorry for the old version) after a single process crashed with:

File "/u/r/rbertoss/.virtualenvs/aiida/lib/python3.7/site-packages/aiida/orm/nodes/data/array/trajectory.py", line 209, in _validate
    f'The TrajectoryData did not validate. Error: {type(exception).__name__} with message {exception}'
aiida.common.exceptions.ValidationError: The TrajectoryData did not validate. Error: MemoryError with message Unable to allocate 395. MiB for an array with shape (51710400,) and data type float64

The snippet posted above was able to get the processes started again.

Aug 22 '22 12:08 rikigigi
