aiida-core

Authentication timeout with ssh

Open ramirezfranciscof opened this issue 2 years ago • 0 comments

Describe the bug

Calculations end up hanging in the waiting state because of an AuthenticationException. The verdi process report command returns the following traceback:

+-> ERROR at 2022-06-13 17:28:48.468143+02:00
 | Traceback (most recent call last):
 |   File "/home/framirez/Workenvs/aiida_qdens/aiida-core/aiida/engine/utils.py", line 188, in exponential_backoff_retry
 |     result = await coro()
 |   File "/home/framirez/Workenvs/aiida_qdens/aiida-core/aiida/engine/processes/calcjobs/tasks.py", line 190, in do_update
 |     job_info = await cancellable.with_interrupt(update_request)
 |   File "/home/framirez/Workenvs/aiida_qdens/aiida-core/aiida/engine/utils.py", line 95, in with_interrupt
 |     result = await next(wait_iter)
 |   File "/home/framirez/miniconda3/envs/aiida_qdens/lib/python3.8/asyncio/tasks.py", line 619, in _wait_for_one
 |     return f.result()  # May raise f.exception().
 |   File "/home/framirez/miniconda3/envs/aiida_qdens/lib/python3.8/asyncio/futures.py", line 178, in result
 |     raise self._exception
 |   File "/home/framirez/Workenvs/aiida_qdens/aiida-core/aiida/engine/utils.py", line 188, in exponential_backoff_retry
 |     result = await coro()
 |   File "/home/framirez/Workenvs/aiida_qdens/aiida-core/aiida/engine/processes/calcjobs/tasks.py", line 190, in do_update
 |     job_info = await cancellable.with_interrupt(update_request)
 |   File "/home/framirez/Workenvs/aiida_qdens/aiida-core/aiida/engine/utils.py", line 95, in with_interrupt
 |     result = await next(wait_iter)
 |   File "/home/framirez/miniconda3/envs/aiida_qdens/lib/python3.8/asyncio/tasks.py", line 619, in _wait_for_one
 |     return f.result()  # May raise f.exception().
 |   File "/home/framirez/miniconda3/envs/aiida_qdens/lib/python3.8/asyncio/futures.py", line 178, in result
 |     raise self._exception
 |   File "/home/framirez/miniconda3/envs/aiida_qdens/lib/python3.8/asyncio/tasks.py", line 282, in __step
 |     result = coro.throw(exc)
 |   File "/home/framirez/Workenvs/aiida_qdens/aiida-core/aiida/engine/processes/calcjobs/manager.py", line 180, in updating
 |     await self._update_job_info()
 |   File "/home/framirez/Workenvs/aiida_qdens/aiida-core/aiida/engine/processes/calcjobs/manager.py", line 132, in _update_job_info
 |     self._jobs_cache = await self._get_jobs_from_scheduler()
 |   File "/home/framirez/Workenvs/aiida_qdens/aiida-core/aiida/engine/processes/calcjobs/manager.py", line 98, in _get_jobs_from_scheduler
 |     transport = await request
 |   File "/home/framirez/miniconda3/envs/aiida_qdens/lib/python3.8/asyncio/futures.py", line 260, in __await__
 |     yield self  # This tells Task to wait for completion.
 |   File "/home/framirez/miniconda3/envs/aiida_qdens/lib/python3.8/asyncio/tasks.py", line 349, in __wakeup
 |     future.result()
 |   File "/home/framirez/miniconda3/envs/aiida_qdens/lib/python3.8/asyncio/futures.py", line 178, in result
 |     raise self._exception
 |   File "/home/framirez/Workenvs/aiida_qdens/aiida-core/aiida/engine/transports.py", line 89, in do_open
 |     transport.open()
 |   File "/home/framirez/Workenvs/aiida_qdens/aiida-core/aiida/transports/plugins/ssh.py", line 522, in open
 |     self._client.connect(self._machine, **connection_arguments)
 |   File "/home/framirez/miniconda3/envs/aiida_qdens/lib/python3.8/site-packages/paramiko/client.py", line 435, in connect
 |     self._auth(
 |   File "/home/framirez/miniconda3/envs/aiida_qdens/lib/python3.8/site-packages/paramiko/client.py", line 766, in _auth
 |     raise saved_exception
 |   File "/home/framirez/miniconda3/envs/aiida_qdens/lib/python3.8/site-packages/paramiko/client.py", line 682, in _auth
 |     self._transport.auth_publickey(username, key)
 |   File "/home/framirez/miniconda3/envs/aiida_qdens/lib/python3.8/site-packages/paramiko/transport.py", line 1634, in auth_publickey
 |     return self.auth_handler.wait_for_response(my_event)
 |   File "/home/framirez/miniconda3/envs/aiida_qdens/lib/python3.8/site-packages/paramiko/auth_handler.py", line 248, in wait_for_response
 |     raise AuthenticationException("Authentication timeout.")
 | paramiko.ssh_exception.AuthenticationException: Authentication timeout.

Steps to reproduce

This is the difficult part: I don't have a clear, reproducible way to trigger this. It used to happen when I logged into my AiiDA computer via ssh, submitted the calculations, and logged out (the process would then hang almost immediately). Now it also happens when I submit the calculations directly from my AiiDA computer, after waiting for some time.

Expected behavior

Depending on what is actually causing this, it may be a bug or a problem with my setup or environment. In any case, I would expect at least some of the following:

  1. Except the calculation instead of leaving it stuck in waiting. If excepting is itself a problem (maybe this state is recoverable via "pause/play", and excepting would rule that out), then we need to create another state, or something similar, to call the user's attention to the stuck process.
  2. A clearer error message; maybe catch the AuthenticationException in line 522 of aiida/transports/plugins/ssh.py (the connect call shown in the traceback) and raise a more detailed exception that prints the connection parameters or something like that.
  3. Fix the bug / automatically handle the situation that is producing the error.
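
A minimal sketch of the idea in point 2, not the actual aiida-core code. To keep it runnable without paramiko installed, AuthenticationException here is a local stand-in for paramiko.ssh_exception.AuthenticationException, and DetailedAuthenticationError, SENSITIVE_KEYS, and connect_with_context are hypothetical names introduced only for illustration. The wrapper re-raises the failure with the machine name and the connection arguments, hiding anything that looks like a secret:

```python
class AuthenticationException(Exception):
    """Stand-in for paramiko.ssh_exception.AuthenticationException (assumption)."""


class DetailedAuthenticationError(Exception):
    """Hypothetical exception that carries the connection parameters."""


# Argument names whose values should never end up in a log or report.
SENSITIVE_KEYS = {"password", "passphrase", "pkey"}


def connect_with_context(connect, machine, **connection_arguments):
    """Call connect(machine, **connection_arguments).

    Authentication failures are re-raised with the target machine and the
    (redacted) connection arguments in the message; the original exception
    is kept as the cause so the full traceback is still available.
    """
    try:
        return connect(machine, **connection_arguments)
    except AuthenticationException as exc:
        safe_args = {
            key: ("<hidden>" if key in SENSITIVE_KEYS else value)
            for key, value in connection_arguments.items()
        }
        raise DetailedAuthenticationError(
            f"SSH authentication to '{machine}' failed ({exc}); "
            f"connection arguments: {safe_args}"
        ) from exc
```

In the SSH transport, connect would correspond to self._client.connect and machine to self._machine, as in the traceback. Whether the engine should then except the process or keep retrying (point 1) is a separate decision that this sketch does not address.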

Your environment

  • Operating system: Ubuntu 18.04.6 LTS
  • Python version: Python 3.8.10
  • aiida-core version: AiiDA v2.0.0a1
  • PostgreSQL 13.3
  • RabbitMQ 3.8.19

(but it has happened in other setups as well)

ramirezfranciscof, Jun 14 '22 13:06