aiida-core
aiida-core copied to clipboard
Respect disabled computer and set processes to pause
simple fix for #1550
ok, it seems to be a bit trickier: when submitting a job using verdi run submit.py
, the process will hang. When trying to kill it, it will hang again and does not terminate on ctrl+c
or SIGTERM
.
$ verdi run submit.py
^C12/12/2019 10:20:43 PM <404578> aiida.engine.runners: [CRITICAL] runner received interrupt, killing process 29
^C12/12/2019 10:21:58 PM <404578> aiida.engine.runners: [CRITICAL] runner received interrupt, killing process 29
[...]
ok, it seems to be a bit trickier: when submitting a job using
verdi run submit.py
, the process will hang. When trying to kill it, it will hang again and does not terminate onctrl+c
orSIGTERM
.
And the same script can be ran and then killed with CTRL+C in develop
so without your changes?
ok, it seems to be a bit trickier: when submitting a job using
verdi run submit.py
, the process will hang. When trying to kill it, it will hang again and does not terminate onctrl+c
orSIGTERM
.And the same script can be ran and then killed with CTRL+C in
develop
so without your changes?
yes
This looks ready to go, thanks a lot @dev-zero . Also thanks for adapting the doc string of the upload task that I sloppily forgot about :) I will try today to test it out in the wild as well. The only thing that still worries me a bit is your report that you could not kill a running process properly with these changes. I have not been able to reproduce this. Could you maybe still share what you ran in that submit.py
script? Or have the problems gone away since?
Some minor changes. I checked out your branch locally and launched a calculation job with
run
and then interrupted it and in my case it works just fine. The process ends up in the killed state. Can you share thesubmit.py
contents?
The following runs without problems without the patch (well, obvious) and hangs with it:
from aiida.plugins import CalculationFactory
from aiida.engine import run
from aiida import orm
code = load_code(14)
inputs = {
'code': code,
'x': orm.Int(1),
'y': orm.Int(1),
'metadata': {
'options': {
'resources': {'num_machines': 1, 'num_mpiprocs_per_machine': 1}
}
}
}
result = run(CalculationFactory('arithmetic.add'), **inputs)
what I get:
(aiida) tiziano@scarlet ~/work/aiida $ verdi run submit.py
(aiida) tiziano@scarlet ~/work/aiida $ verdi process list -a
PK Created Process label Process State Process status
---- --------- ------------------------ --------------- -----------------------------------------------------------
[...]
57 10s ago ArithmeticAddCalculation ⏹ Finished [0]
Info: last time an entry changed state: 9s ago (at 09:14:20 on 2019-12-18)
(aiida) tiziano@scarlet ~/work/aiida $ verdi computer disable localhost [email protected]
Info: Computer 'localhost' disabled for user Tiziano Müller ([email protected]).
(aiida) tiziano@scarlet ~/work/aiida $ verdi run submit.py
|
^C12/18/2019 10:25:46 AM <512974> aiida.engine.runners: [CRITICAL] runner received interrupt, killing process 63
12/18/2019 10:27:26 AM <512974> aiida.engine.runners: [CRITICAL] runner received interrupt, killing process 63
Killed
... first CRITICAL
is from Ctrl+c, second one from sending SIGTERM
, then SIGKILL
.
When you say that the script runs normally without and hangs with patch, is that when you just run it and let it run? And is that with the computer disabled? Or, finally, do you only see the difference if you CTRL+C it?
If you see the hanging with the patch when you disable the computer, I think I know why. In that case the process is paused and even for the engine to kill it, it needs to be running first. If this is the case, we might need to have a kill action automatically resume a paused process first. However, I also notice that you decorated the task_kill_job
. I realize we decided this because the disable should also be used by users to indicate that the machine is unreachable and the engine should not even try, but this then makes it obvious that a kill command will hang in a blocking run
until the computer is enabled and the process is resumed. I am not sure yet where to "fix" this impasse.
When you say that the script runs normally without and hangs with patch, is that when you just run it and let it run? And is that with the computer disabled? Or, finally, do you only see the difference if you CTRL+C it?
What I try:
- run the script on AiiDA develop with computer enabled. Works.
- run the script on AiiDA develop with computer disabled. Works.
- run the script on AiiDA this branch with computer enabled. Works.
- run the script on AiiDA this branch with computer disabled. Hangs. Meaning the
verdi run
command does not terminate and can't be terminated by Ctrl+C orSIGTERM
.
If you see the hanging with the patch when you disable the computer, I think I know why. In that case the process is paused and even for the engine to kill it, it needs to be running first. If this is the case, we might need to have a kill action automatically resume a paused process first. However, I also notice that you decorated the
task_kill_job
. I realize we decided this because the disable should also be used by users to indicate that the machine is unreachable and the engine should not even try, but this then makes it obvious that a kill command will hang in a blockingrun
until the computer is enabled and the process is resumed. I am not sure yet where to "fix" this impasse.
Yes, I came to the same conclusion. Intuitively I would have said that PAUSED-UPLOAD/SUBMIT/...
need to be full (and visible) states in the state machine rather than some kind of metastate (paused meaning: awaiting state change in the state machine) as it seems to be now.
And the reason why you need multiple states (or a hierarchical) state machine is because you have to remember where to go to on resume.
Submitting jobs to a disabled computer should generate a warning as requested in #664 but still end up in the queue and verdi run
should return.
- run the script on AiiDA this branch with computer disabled. Hangs. Meaning the verdi run command does not terminate and can't be terminated by Ctrl+C or SIGTERM.
This is a tricky one. With the current design and intention of run
it is supposed to hang. It is not supposed to return control to the interpreter until the process has finished. Since it is paused it "hangs". Now, I do think that it should still be interruptable so Ctrl+C should just kill the process and terminate the interpreter, so that is something to fix. But I do not think that if a process that is launched through run
is paused, the interpreter should just close, because that process is lost forever. It won't ever be resumed by anything.
We have put this to draft because although the current code is useful, it is currently broken for unknown reasons. The problem is low priority however, so we will keep this open for a later time.
I am going to close this since it has been open for so long and the code has changed significantly in between.