
`verdi process kill` takes a long time for erroneous processes

Open GeigerJ2 opened this issue 1 year ago • 9 comments

As noted by @khsrali and me when trying out the new FirecREST implementation. So maybe not the most general example, but this is where it occurred for me: say one installs aiida-firecrest and submits a job using it, but forgets to run `verdi daemon restart` beforehand, leading to the requested transport plugin not being available. `verdi daemon logshow` shows the expected exception (though that doesn't actually matter):

Error: iteration 1 of do_upload excepted, retrying after 20 seconds
Traceback (most recent call last):
  File "/home/geiger_j/aiida_projects/aiida-icon-clm/git-repos/aiida-core/src/aiida/orm/authinfos.py", line 175, in get_transport
    transport_class = TransportFactory(transport_type)
  File "/home/geiger_j/aiida_projects/aiida-icon-clm/git-repos/aiida-core/src/aiida/plugins/factories.py", line 432, in TransportFactory
    entry_point = BaseFactory(entry_point_group, entry_point_name, load=load)
  File "/home/geiger_j/aiida_projects/aiida-icon-clm/git-repos/aiida-core/src/aiida/plugins/factories.py", line 75, in BaseFactory
    return load_entry_point(group, name)
  File "/home/geiger_j/aiida_projects/aiida-icon-clm/git-repos/aiida-core/src/aiida/plugins/entry_point.py", line 276, in load_entry_point
    entry_point = get_entry_point(group, name)
  File "/home/geiger_j/aiida_projects/aiida-icon-clm/git-repos/aiida-core/src/aiida/plugins/entry_point.py", line 324, in get_entry_point
    raise MissingEntryPointError(f"Entry point '{name}' not found in group '{group}'")
aiida.common.exceptions.MissingEntryPointError: Entry point 'firecrest' not found in group 'aiida.transports'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/geiger_j/aiida_projects/aiida-icon-clm/git-repos/aiida-core/src/aiida/engine/utils.py", line 203, in exponential_backoff_retry
    result = await coro()
  File "/home/geiger_j/aiida_projects/aiida-icon-clm/git-repos/aiida-core/src/aiida/engine/processes/calcjobs/tasks.py", line 85, in do_upload
    with transport_queue.request_transport(authinfo) as request:
  File "/usr/lib/python3.10/contextlib.py", line 135, in __enter__
    return next(self.gen)
  File "/home/geiger_j/aiida_projects/aiida-icon-clm/git-repos/aiida-core/src/aiida/engine/transports.py", line 78, in request_transport
    transport = authinfo.get_transport()
  File "/home/geiger_j/aiida_projects/aiida-icon-clm/git-repos/aiida-core/src/aiida/orm/authinfos.py", line 177, in get_transport
    raise exceptions.ConfigurationError(f'transport type `{transport_type}` could not be loaded: {exception}')
aiida.common.exceptions.ConfigurationError: transport type `firecrest` could not be loaded: Entry point 'firecrest' not found in group 'aiida.transports'

and the job is stuck in the state

⏵ Waiting Waiting for transport task: upload

as expected. (though, actually, here it could also show that the job excepted, rather than being stuck in the waiting state?)

Now, running `verdi process kill` leads to the command being stuck for minutes on end, while `verdi daemon logshow` gives the following output:

plumpy.process_states.KillInterruption: Killed through `verdi process kill`

Task exception was never retrieved
future: <Task finished name='Task-49' coro=<interruptable_task.<locals>.execute_coroutine() done, defined at /home/geiger_j/aiida_projects/aiida-icon-clm/git-repos/aiida-core/src/aiida/engine/utils.py:132> exception=KillInterruption('Killed through `verdi process kill`')>
Traceback (most recent call last):
  File "/usr/lib/python3.10/asyncio/tasks.py", line 232, in __step
    result = coro.send(None)
  File "/home/geiger_j/aiida_projects/aiida-icon-clm/git-repos/aiida-core/src/aiida/engine/utils.py", line 142, in execute_coroutine
    future.result(),
plumpy.process_states.KillInterruption: Killed through `verdi process kill`

showing that plumpy reports that it has killed the process, but it's not being picked up by the daemon. Opening this issue for future reference. Please correct me if I got something wrong, @khsrali, or if you have another, simpler and more general example where this occurs.

EDIT: To add here, running `verdi daemon stop` then leads to an update in `verdi daemon logshow`:

Exception in callback Process.kill.<locals>.done(<_GatheringFu...celledError()>) at /home/geiger_j/aiida_projects/aiida-icon-clm/git-repos/aiida-core/src/aiida/engine/processes/process.py:369
handle: <Handle Process.kill.<locals>.done(<_GatheringFu...celledError()>) at /home/geiger_j/aiida_projects/aiida-icon-clm/git-repos/aiida-core/src/aiida/engine/processes/process.py:369>
Traceback (most recent call last):
  File "/usr/lib/python3.10/asyncio/events.py", line 80, in _run
    self._context.run(self._callback, *self._args)
  File "/home/geiger_j/aiida_projects/aiida-icon-clm/git-repos/aiida-core/src/aiida/engine/processes/process.py", line 370, in done
    is_all_killed = all(done_future.result())
asyncio.exceptions.CancelledError

and starting the daemon again then leads to the process actually entering the QUEUED state (rather than being killed), so the verdi process kill command is somewhat lost.

GeigerJ2 avatar Jul 09 '24 14:07 GeigerJ2

> though, actually, here it could also show that the job excepted, rather than being stuck in the waiting state?

I think it should still say that it is in the waiting state, although it would be better if it was updated with the number of failures. So ideally it would show:

⏵ Waiting         Waiting for transport task: upload (failed 2 times)

or something like that. Once it hits the maximum number of retries (5 by default) it will go into the paused state. The reason for this functionality is that transient problems, such as temporary connection failures, won't fully except the calcjob. Once the problem is fixed, it can be resumed again.
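
For illustration, the retry logic is conceptually something like the sketch below (a simplified sketch of the idea behind `exponential_backoff_retry` in `src/aiida/engine/utils.py`; the parameter names and defaults are illustrative, not the exact aiida-core implementation):

```python
import asyncio
import logging

LOGGER = logging.getLogger(__name__)

async def exponential_backoff_retry(coro_factory, initial_interval=20.0, max_attempts=5):
    """Call ``coro_factory`` until it succeeds, doubling the wait after each failure.

    Illustrative sketch only: in aiida-core the final failure pauses the process
    instead of letting the exception simply bubble up.
    """
    interval = initial_interval
    for attempt in range(1, max_attempts + 1):
        try:
            return await coro_factory()
        except Exception as exception:
            if attempt == max_attempts:
                raise  # aiida-core would pause the calcjob at this point
            LOGGER.warning('iteration %d excepted (%s), retrying after %s seconds', attempt, exception, interval)
            await asyncio.sleep(interval)
            interval *= 2
```

Because the coroutine is only retried after the sleep, a transient connection failure merely delays the task instead of excepting the whole process.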

> and starting the daemon again then leads to the process actually entering the QUEUED state (rather than being killed), so the verdi process kill command is somewhat lost.

The `verdi process kill` command performs a Remote Procedure Call (RPC), which means it is trying to execute a method of the Process. Since that process is running inside a daemon worker (which is a different system process), this needs to be done as an RPC over RabbitMQ. This then only works if that Process is actually running inside a daemon worker (the reason why the daemon needs to be running for `verdi process kill` to work). This is why the kill command seems "to get lost" when the daemon is restarted.
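
For illustration, the kill RPC is conceptually close to the sketch below; the controller method names (`get_process_controller`, `kill_process`) are assumptions standing in for whatever the installed aiida-core version exposes, not a verified public API:

```python
# Conceptual sketch of the RPC behind `verdi process kill`; the controller
# method names are assumptions for illustration, not a verified public API.
from aiida.manage import get_manager

def kill_over_rpc(pk: int, timeout: float = 10.0):
    controller = get_manager().get_process_controller()  # RabbitMQ-backed controller
    # The message is delivered to whichever daemon worker holds the process task.
    future = controller.kill_process(pk, msg='Killed through `verdi process kill`')
    # Without a running daemon worker nobody consumes the RPC, so this wait hangs.
    return future.result(timeout=timeout)
```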

The real problem I suspect here is in the handling of the Process.kill RPC call. https://github.com/aiidateam/aiida-core/blob/a5da4eda131f844c3639bdb01a256b9e9a7873a2/src/aiida/engine/processes/process.py#L332

Since the process is already waiting for the next try of the failed upload task, the RPC to call kill is not getting through. The `verdi process kill` command waits for the RPC to be completed by default, so if that call is not going through, it seems to be hanging. I think one thing we should already fix is that the kill procedure for `CalcJob` should be subclassed from the `Process` base class. If the `CalcJob` is in the waiting state with a failed attempt of one of the transport tasks and it gets a kill command, it should just kill immediately and not wait for the next attempt.
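
A possible shape for such an override, purely as an illustrative sketch of the suggestion above and not existing aiida-core code (the `_pending_task` attribute is hypothetical, standing in for the real transport-task machinery):

```python
# Hypothetical sketch: a CalcJob.kill override that cancels a pending EBM retry
# so a kill does not have to wait for the next upload attempt. `_pending_task`
# is an invented attribute standing in for the real waiting machinery.
from aiida.engine import CalcJob

class EagerKillCalcJob(CalcJob):

    def kill(self, msg=None):
        pending = getattr(self, '_pending_task', None)
        if pending is not None and not pending.done():
            pending.cancel()  # interrupt the backoff wait immediately
        return super().kill(msg)
```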

sphuber avatar Jul 10 '24 11:07 sphuber

Thanks for the clarification, @sphuber, and sorry for the slow reply.

> I think it should still say that it is in the waiting state, although it would be better if it was updated with the number of failures. So ideally it would show:
>
> ⏵ Waiting         Waiting for transport task: upload (failed 2 times)
>
> or something like that. Once it hits the maximum number of retries (5 by default) it will go into the paused state. The reason for this functionality is that transient problems, such as temporary connection failures, won't fully except the calcjob. Once the problem is fixed, it can be resumed again.

Actually, that makes sense, yes.

> The `verdi process kill` command performs a Remote Procedure Call (RPC), which means it is trying to execute a method of the Process. Since that process is running inside a daemon worker (which is a different system process), this needs to be done as an RPC over RabbitMQ. This then only works if that Process is actually running inside a daemon worker (the reason why the daemon needs to be running for `verdi process kill` to work). This is why the kill command seems "to get lost" when the daemon is restarted.

Pinging @khsrali here for one example how and why RMQ is used, interesting!

> The real problem I suspect here is in the handling of the Process.kill RPC call.
>
> https://github.com/aiidateam/aiida-core/blob/a5da4eda131f844c3639bdb01a256b9e9a7873a2/src/aiida/engine/processes/process.py#L332
>
> Since the process is already waiting for the next try of the failed upload task, the RPC to call kill is not getting through. The `verdi process kill` command waits for the RPC to be completed by default, so if that call is not going through, it seems to be hanging. I think one thing we should already fix is that the kill procedure for `CalcJob` should be subclassed from the `Process` base class. If the `CalcJob` is in the waiting state with a failed attempt of one of the transport tasks and it gets a kill command, it should just kill immediately and not wait for the next attempt.

Will look into this!

GeigerJ2 avatar Jul 21 '24 10:07 GeigerJ2

I just faced this issue again, in a different scenario, but apparently the same cause (connection):

Steps to reproduce:

  1. Create a fresh profile
  2. Submit 16 jobs to daemon
  3. ~Immediately after, delete the profile with `verdi profile delete`~ [UPDATE: I had a wrong profile name in my script, please skip this step]
  4. Delete all groups as well with `verdi group delete`
  5. `verdi process kill -a --no-wait` got stuck for 10 minutes until I pressed Ctrl+C. No process was killed.

Note: In my particular case, `verdi daemon logshow` shows:

  File "/usr/lib/python3.11/asyncio/futures.py", line 203, in result
    raise self._exception.with_traceback(self._exception_tb)
  File "/home/khosra_a/development/repo/aiida-core/src/aiida/engine/transports.py", line 87, in do_open
    transport.open()
  File "/home/khosra_a/development/repo/aiida-core/src/aiida/transports/plugins/ssh.py", line 488, in open
    connection_arguments['sock'] = proxy_client.get_transport().open_channel(
                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/khosra_a/development/venv/aiida-firecrest/lib/python3.11/site-packages/paramiko/transport.py", line 1115, in open_channel
    raise e
paramiko.ssh_exception.ChannelException: ChannelException(2, 'Connect failed')

And the following reports that no job between RMQ and the daemon was found:

verdi daemon stop
verdi process repair
verdi daemon start

There is just no way to get rid of these parasitic jobs. [UPDATE: it is still possible to delete a running process node, I'm not sure what consequences it might have]

khsrali avatar Sep 23 '24 08:09 khsrali

Hold on, how can you do `verdi group delete` and `verdi process kill` after you delete the profile?! That doesn't make any sense.

sphuber avatar Sep 23 '24 17:09 sphuber

Ah sorry, my script used a different profile than the one I deleted, stupid :rofl:

Anyways, the bug is reproducible.

Note: It's possible to get rid of them with `verdi group delete <my_group> --delete-nodes` or `verdi node delete <pk>`

@sphuber What is the consequence of deleting a process node instead of killing it? Does it remain somewhere as a ghost, or is it really dead?

khsrali avatar Sep 24 '24 10:09 khsrali

I added breakpoints here and there; it seems the sticking point is here: https://github.com/aiidateam/aiida-core/blob/72a6b183b8048d5c31b2b827efe6b8b969038e28/src/aiida/engine/processes/control.py#L282

khsrali avatar Sep 24 '24 12:09 khsrali

Remember that a node is just a proxy for the state of a process. When a process is running, it just persists its state on the node in the database. What actually determines the lifetime of a process is its associated task with RabbitMQ. If you delete the node, the task is not (immediately) affected. If the task was not yet with a daemon worker, the task will be destroyed as soon as a daemon worker picks it up, as it won't be able to load the associated node. If the process was already running with a worker, it will hit an exception as soon as it tries to store a change in state on the node as it no longer exists. At that point the process task is destroyed.

So, in short: you can delete a process node and, if it had an existing task with RMQ, it will eventually be destroyed. NOTE: if it is a CalcJob, deleting the node means that the job won't be cancelled with the scheduler, so this approach may not be desirable for calcjobs.

> There is just no way to get rid of these parasitic jobs.

I don't understand what you mean with "parasitic jobs" here.

> I added breakpoints here and there; it seems the sticking point is here:

In your test case here, the problem is that you have deleted the node, and most likely by the time that you call `verdi process kill` the task will already have been destroyed according to the mechanism explained above. The `verdi process kill` command needs to actually call Process.kill on the active process, but that is clearly impossible if that process is no longer running. This is why it is hanging. I am not quite sure why it is not raising an UnreachableEntity error. We may have to update the command to first check that it can actually reach the process, or somehow make sure it doesn't wait for the process to respond if a task for it no longer actually exists.
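
Such a pre-check could look roughly like the following sketch (an assumption on my part: it only inspects the node state, and detecting whether a RabbitMQ task still exists would need an additional check that is not shown here):

```python
# Sketch of a guard `verdi process kill` could run before sending the RPC.
# Detecting whether a RabbitMQ task still exists is deliberately left out.
from aiida.common.exceptions import NotExistent
from aiida.orm import load_node

def can_attempt_kill(pk: int) -> bool:
    try:
        node = load_node(pk)
    except NotExistent:
        return False  # the node was deleted, there is nothing left to reach
    if node.is_terminated:
        return False  # already finished, excepted or killed
    return True
```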

sphuber avatar Sep 24 '24 13:09 sphuber

Thanks a lot for the explanation. Very nice.

> I don't understand what you mean with "parasitic jobs" here.

sorry, a wrong term.

> In your test case here, the problem is that you have deleted the node, and most likely by the time that you call `verdi process kill` the task will already have been destroyed according to the mechanism explained above. The `verdi process kill` command needs to actually call Process.kill on the active process, but that is clearly impossible if that process is no longer running. This is why it is hanging. I am not quite sure why it is not raising an UnreachableEntity error. We may have to update the command to first check that it can actually reach the process, or somehow make sure it doesn't wait for the process to respond if a task for it no longer actually exists.

I see. To understand this, I tried again, this time without deleting any nodes. And the problem is still there.

Steps to reproduce:

  1. Have a wrong setup with paramiko such that a connection won't be possible.
  2. Submit a few jobs to the daemon
  3. `verdi process kill -a --no-wait` gets stuck at aiida-core/src/aiida/engine/processes/control.py#L282 for a very long time. (I pressed Ctrl+C)

verdi process report 9167:

*** 9167: None
*** Scheduler output: N/A
*** Scheduler errors: N/A
*** 1 LOG MESSAGES:
+-> ERROR at 2024-09-25 10:37:43.885004+02:00
 | Traceback (most recent call last):
 |   File "/home/khosra_a/development/repo/aiida-core/src/aiida/engine/utils.py", line 205, in exponential_backoff_retry
 |     result = await coro()
 |              ^^^^^^^^^^^^
 |   File "/home/khosra_a/development/repo/aiida-core/src/aiida/engine/processes/calcjobs/tasks.py", line 86, in do_upload
 |     transport = await cancellable.with_interrupt(request)
 |                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 |   File "/home/khosra_a/development/repo/aiida-core/src/aiida/engine/utils.py", line 115, in with_interrupt
 |     result = await next(wait_iter)
 |              ^^^^^^^^^^^^^^^^^^^^^
 |   File "/usr/lib/python3.11/asyncio/tasks.py", line 615, in _wait_for_one
 |     return f.result()  # May raise f.exception().
 |            ^^^^^^^^^^
 |   File "/usr/lib/python3.11/asyncio/futures.py", line 203, in result
 |     raise self._exception.with_traceback(self._exception_tb)
 |   File "/home/khosra_a/development/repo/aiida-core/src/aiida/engine/transports.py", line 87, in do_open
 |     transport.open()
 |   File "/home/khosra_a/development/repo/aiida-profiling/src/profiling/inject_tool.py", line 20, in wrapper
 |     result = func(self, *args, **kwargs)
 |              ^^^^^^^^^^^^^^^^^^^^^^^^^^^
 |   File "/home/khosra_a/development/repo/aiida-core/src/aiida/transports/plugins/ssh.py", line 493, in open
 |     if proxycmdstring:
 |                        
 |   File "/home/khosra_a/development/venv/aiida-firecrest/lib/python3.11/site-packages/paramiko/transport.py", line 1115, in open_channel
 |     raise e
 | paramiko.ssh_exception.ChannelException: ChannelException(2, 'Connect failed')

khsrali avatar Sep 25 '24 08:09 khsrali

Ok, that is a good concrete example. What is happening here is that an exception occurs during one of the CalcJob's tasks, which are upload, submit, update and retrieve. Instead of letting the exception bubble up and except the entire process, the exception is caught and the task is scheduled to be retried later on. When it fails 5 times in a row, the process is paused. This is called the exponential backoff mechanism (EBM) and is used because these failures are often transient (most often due to connection issues).

When the process is paused, I think the code may contain a bug where the kill action can never actually be called, because the process needs to be playing for that to happen. But I am not 100% sure; we would have to check.

In your example though, it looks like the process hasn't hit the 5 retries yet and is just waiting for the next try. Perhaps here there is also already something preventing the kill command from being executed. Again, not sure exactly what, but certainly plausible.

In both cases, we have to verify what (if anything) is blocking the RPC from being executed and when we find it, find a way to let the RPC take precedence and still go through, despite the process being handled by the EBM.
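
For illustration, one way to let the kill take precedence, sketched with plain asyncio rather than the actual aiida-core machinery: race the backoff sleep against a kill signal and abandon the retry as soon as the kill arrives.

```python
import asyncio

async def backoff_wait_or_kill(interval: float, kill_event: asyncio.Event) -> bool:
    """Sleep for ``interval`` seconds unless ``kill_event`` fires first.

    Returns True if the kill arrived during the wait, False if the sleep completed.
    """
    sleep_task = asyncio.create_task(asyncio.sleep(interval))
    kill_task = asyncio.create_task(kill_event.wait())
    done, pending = await asyncio.wait({sleep_task, kill_task}, return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()  # don't leave the losing task dangling
    return kill_task in done
```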

I am not sure when I will have time to look into this myself, but hopefully this gives you more information to start digging into how this works a bit more yourself.

sphuber avatar Sep 25 '24 20:09 sphuber

This should be now fixed with PR #6793. The reason `verdi process kill` was getting stuck is that the timeout was passed to the wrong `kiwipy.Future`. It was a nested future (a future whose result is another future) and the timeout was passed to the outer future. Now it is passed to the inner one, so the timeout works properly. We also now cancel the old kill command if it is resent. This gives the user the option to adapt the EBM parameters in the verdi config and to send the kill command again (not the most user-friendly way, but it works). There is also now a force-kill option to just kill the process without waiting for a response at all.
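
For illustration, the nested-future pitfall can be reproduced with plain `concurrent.futures`, independent of kiwipy: the outer future resolves almost immediately with another future as its result, so a timeout applied to the outer future never bounds the actual wait.

```python
import concurrent.futures

outer = concurrent.futures.Future()
inner = concurrent.futures.Future()

# The broker acknowledges the request quickly: the outer future resolves, but
# its result is just another future that only completes when the process reacts.
outer.set_result(inner)

unwrapped = outer.result(timeout=5)  # returns immediately, the timeout never bites
assert unwrapped is inner

try:
    unwrapped.result(timeout=1)  # this is the wait the timeout must apply to
except concurrent.futures.TimeoutError:
    print('the process never responded within the timeout')
```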

agoscinski avatar May 02 '25 05:05 agoscinski