aiida-core Respect disabled computer and set processes to pause

Respect disabled computer and set processes to pause

Open dev-zero opened this issue 4 years ago • 9 comments

simple fix for #1550

Dec 12 '19 16:12 dev-zero

ok, it seems to be a bit trickier: when submitting a job using verdi run submit.py, the process will hang. When trying to kill it, it will hang again and does not terminate on ctrl+c or SIGTERM.

$ verdi run submit.py 
^C12/12/2019 10:20:43 PM <404578> aiida.engine.runners: [CRITICAL] runner received interrupt, killing process 29
^C12/12/2019 10:21:58 PM <404578> aiida.engine.runners: [CRITICAL] runner received interrupt, killing process 29
[...]

Dec 12 '19 21:12 dev-zero

ok, it seems to be a bit trickier: when submitting a job using verdi run submit.py, the process will hang. When trying to kill it, it will hang again and does not terminate on ctrl+c or SIGTERM.

And the same script can be ran and then killed with CTRL+C in develop so without your changes?

Dec 12 '19 21:12 sphuber

ok, it seems to be a bit trickier: when submitting a job using verdi run submit.py, the process will hang. When trying to kill it, it will hang again and does not terminate on ctrl+c or SIGTERM.

And the same script can be ran and then killed with CTRL+C in develop so without your changes?

yes

Dec 13 '19 08:12 dev-zero

This looks ready to go, thanks a lot @dev-zero . Also thanks for adapting the doc string of the upload task that I sloppily forgot about :) I will try today to test it out in the wild as well. The only thing that still worries me a bit is your report that you could not kill a running process properly with these changes. I have not been able to reproduce this. Could you maybe still share what you ran in that submit.py script? Or have the problems gone away since?

Dec 18 '19 09:12 sphuber

Some minor changes. I checked out your branch locally and launched a calculation job with run and then interrupted it and in my case it works just fine. The process ends up in the killed state. Can you share the submit.py contents?

The following runs without problems without the patch (well, obvious) and hangs with it:

from aiida.plugins import CalculationFactory
from aiida.engine import run
from aiida import orm

code = load_code(14)

inputs = {
        'code': code,
        'x': orm.Int(1),
        'y': orm.Int(1),
        'metadata': {
            'options': {
                'resources': {'num_machines': 1, 'num_mpiprocs_per_machine': 1}
                }
            }
        }

result = run(CalculationFactory('arithmetic.add'), **inputs)

what I get:

(aiida) tiziano@scarlet ~/work/aiida $ verdi run submit.py
(aiida) tiziano@scarlet ~/work/aiida $ verdi process list -a
  PK  Created    Process label             Process State    Process status
----  ---------  ------------------------  ---------------  -----------------------------------------------------------
  [...]
  57  10s ago    ArithmeticAddCalculation  ⏹ Finished [0]
Info: last time an entry changed state: 9s ago (at 09:14:20 on 2019-12-18)
(aiida) tiziano@scarlet ~/work/aiida $ verdi computer disable localhost [email protected]
Info: Computer 'localhost' disabled for user Tiziano Müller ([email protected]).
(aiida) tiziano@scarlet ~/work/aiida $ verdi run submit.py 
|
^C12/18/2019 10:25:46 AM <512974> aiida.engine.runners: [CRITICAL] runner received interrupt, killing process 63
12/18/2019 10:27:26 AM <512974> aiida.engine.runners: [CRITICAL] runner received interrupt, killing process 63
Killed

... first CRITICALis from Ctrl+c, second one from sending SIGTERM, then SIGKILL.

Dec 18 '19 09:12 dev-zero

When you say that the script runs normally without and hangs with patch, is that when you just run it and let it run? And is that with the computer disabled? Or, finally, do you only see the difference if you CTRL+C it?

If you see the hanging with the patch when you disable the computer, I think I know why. In that case the process is paused and even for the engine to kill it, it needs to be running first. If this is the case, we might need to have a kill action automatically resume a paused process first. However, I also notice that you decorated the task_kill_job. I realize we decided this because the disable should also be used by users to indicate that the machine is unreachable and the engine should not even try, but this then makes it obvious that a kill command will hang in a blocking run until the computer is enabled and the process is resumed. I am not sure yet where to "fix" this impasse.

Dec 18 '19 09:12 sphuber

When you say that the script runs normally without and hangs with patch, is that when you just run it and let it run? And is that with the computer disabled? Or, finally, do you only see the difference if you CTRL+C it?

What I try:

run the script on AiiDA develop with computer enabled. Works.
run the script on AiiDA develop with computer disabled. Works.
run the script on AiiDA this branch with computer enabled. Works.
run the script on AiiDA this branch with computer disabled. Hangs. Meaning the verdi run command does not terminate and can't be terminated by Ctrl+C or SIGTERM.

If you see the hanging with the patch when you disable the computer, I think I know why. In that case the process is paused and even for the engine to kill it, it needs to be running first. If this is the case, we might need to have a kill action automatically resume a paused process first. However, I also notice that you decorated the task_kill_job. I realize we decided this because the disable should also be used by users to indicate that the machine is unreachable and the engine should not even try, but this then makes it obvious that a kill command will hang in a blocking run until the computer is enabled and the process is resumed. I am not sure yet where to "fix" this impasse.

Yes, I came to the same conclusion. Intuitively I would have said that PAUSED-UPLOAD/SUBMIT/... need to be full (and visible) states in the state machine rather than some kind of metastate (paused meaning: awaiting state change in the state machine) as it seems to be now. And the reason why you need multiple states (or a hierarchical) state machine is because you have to remember where to go to on resume.

Submitting jobs to a disabled computer should generate a warning as requested in #664 but still end up in the queue and verdi run should return.

Dec 18 '19 10:12 dev-zero

run the script on AiiDA this branch with computer disabled. Hangs. Meaning the verdi run command does not terminate and can't be terminated by Ctrl+C or SIGTERM.

This is a tricky one. With the current design and intention of run it is supposed to hang. It is not supposed to return control to the interpreter until the process has finished. Since it is paused it "hangs". Now, I do think that it should still be interruptable so Ctrl+C should just kill the process and terminate the interpreter, so that is something to fix. But I do not think that if a process that is launched through run is paused, the interpreter should just close, because that process is lost forever. It won't ever be resumed by anything.

Dec 18 '19 10:12 sphuber

We have put this to draft because although the current code is useful, it is currently broken for unknown reasons. The problem is low priority however, so we will keep this open for a later time.

Sep 16 '20 09:09 sphuber

I am going to close this since it has been open for so long and the code has changed significantly in between.

Oct 31 '22 15:10 sphuber

aiida-core aiida-core copied to clipboard

Respect disabled computer and set processes to pause

aiida-core
aiida-core copied to clipboard