aiida-core
`verdi daemon start` claims to fail but, in fact, is successful.
Like a miserable, aggressively-humble inhabitant of the UK and Ireland, `verdi daemon start` claims failure while actually being successful.
When starting the daemon on my system, it reliably fails the first time, every time. Running `verdi daemon start` a second time after a failure works as expected.
This is a fresh install of v1.1.0.
If one monitors `ps -e | grep verdi` directly after a failure, the following process is spawned:
```
59170 ?? 0:00.04 /Users/conrad/.virtualenvs/aiidapy/bin/python3.7 /Users/conrad/.virtualenvs/aiidapy/bin/verdi -p Testing daemon start-circus 1
```
If one waits a moment and then searches the processes again, the following is also shown:
```
59189 ?? 0:01.08 /Users/conrad/.virtualenvs/aiidapy/bin/python3.7 /Users/conrad/.virtualenvs/aiidapy/bin/verdi -p Testing devel run_daemon
```
Seemingly the check for success is happening before the daemon worker has a chance to spin up and so fails spuriously.
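If this diagnosis is right, the race could in principle be closed by polling for the daemon's PID file rather than checking once. A minimal sketch of that idea (the file path, timeout, and poll interval here are illustrative, not aiida-core's actual values):

```python
import os
import time


def wait_for_pid_file(pid_file, timeout=10.0, poll_interval=0.2):
    """Poll until the daemon's PID file appears, instead of checking once.

    Returns the PID on success, or None if the timeout elapses without
    the file appearing. Values are illustrative, not aiida-core's.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if os.path.exists(pid_file):
            with open(pid_file) as handle:
                return int(handle.read().strip())
        time.sleep(poll_interval)
    return None
```

A one-shot existence check fails spuriously whenever the worker takes longer than the check; polling up to a deadline only fails when the daemon genuinely never comes up.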
Have you experienced this since, @ConradJohnston? Is this a regular occurrence, and/or can you make it happen deterministically? If so, it would be good to add an additional print statement in the command itself, to print the full output on which we base whether starting the daemon succeeded or failed. This should be in `aiida.cmdline.commands.cmd_daemon:start`. We call a subprocess and, if that raises, we print `FAILED`. It would be good to also print the full stack trace in that case.
also see here
@hongyi-zhao Since you encountered this issue as well, it would be great if you could help pin down the source of the error.
E.g. replace this line https://github.com/aiidateam/aiida-core/blob/67ae49f37055eaa69aa22c61afdf7802b09e73f3/aiida/cmdline/commands/cmd_daemon.py#L67 by:

```python
import traceback
echo.echo_critical(traceback.format_exc())
```
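For reference, the suggested change amounts to the following pattern, shown here as a standalone sketch (aiida-core's `echo.echo_critical` and `click.secho` are replaced by plain `print` so the snippet runs on its own):

```python
import subprocess
import traceback


def start_subprocess(command):
    """Run a daemon-launching command; on failure, show the full traceback
    rather than only the subprocess's captured output."""
    try:
        return subprocess.check_output(command, stderr=subprocess.STDOUT)
    except subprocess.CalledProcessError:
        print('FAILED')
        # The full stack trace helps pin down *where* the failure was
        # raised, not just what the subprocess printed.
        print(traceback.format_exc())
        return None
```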
With the above code changes, I cannot see any difference in the output messages for this problem.
I also note that starting the daemon may be a time-consuming operation, while the check may be invoked too early to give correct results. When running in blocking mode with the `--foreground` option, this error message disappears, so nothing is actually wrong. See my steps below for details.
On the other hand, I have also noticed that `verdi daemon start` must be the first run after the computer is (re)started; otherwise the problem may not be reproducible.
My test results follow:
In one terminal, activate the AiiDA environment, run the following command and keep it open:
```
(aiida-quantum-mobile) werner@ubuntu-01:~$ verdi daemon start --foreground
Starting the daemon...
```
In the other terminal, activate the AiiDA environment using the same profile and test it as follows:
```
(aiida-quantum-mobile) werner@ubuntu-01:~$ pgrep -x verdi -a
25467 /home/werner/.pyenv/versions/3.6.9/envs/aiida-quantum-mobile/bin/python3.6 /home/werner/.pyenv/versions/aiida-quantum-mobile/bin/verdi daemon start --foreground
25534 /home/werner/.pyenv/versions/3.6.9/envs/aiida-quantum-mobile/bin/python3.6 /home/werner/.pyenv/versions/3.6.9/envs/aiida-quantum-mobile/bin/verdi -p aiida daemon start-circus --foreground 1
25824 /home/werner/.pyenv/versions/3.6.9/envs/aiida-quantum-mobile/bin/python3.6 /home/werner/.pyenv/versions/3.6.9/envs/aiida-quantum-mobile/bin/verdi -p aiida devel run_daemon
```
```
(aiida-quantum-mobile) werner@ubuntu-01:~$ verdi daemon status
Profile: aiida
Daemon is running as PID 25534 since 2020-04-14 17:20:29
Active workers [1]:
  PID    MEM %    CPU %  started
-----  -------  -------  -------------------
25824    0.138        0  2020-04-14 17:20:31
Use verdi daemon [incr | decr] [num] to increase / decrease the amount of workers
```
```
(aiida-quantum-mobile) werner@ubuntu-01:~$ verdi status
✓ config dir: /home/werner/.aiida
✓ profile: On profile aiida
✓ repository: /home/werner/.aiida/repository/aiida
(aiida-quantum-mobile) werner@ubuntu-01:~$ verdi status
✓ config dir: /home/werner/.aiida
✓ profile: On profile aiida
✓ repository: /home/werner/.aiida/repository/aiida
✓ postgres: Connected as aiida@localhost:5432
✓ rabbitmq: Connected to amqp://127.0.0.1?heartbeat=600
✓ daemon: Daemon is running as PID 25534 since 2020-04-14 17:20:29
```
Testing environment:
```
(aiida-quantum-mobile) werner@ubuntu-01:~$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 19.10
Release:        19.10
Codename:       eoan
(aiida-quantum-mobile) werner@ubuntu-01:~/Public/hpc/tools/aiida/aiida-core.git$ pip list | grep aiida-core
aiida-core 1.1.1 /home/werner/Public/hpc/tools/aiida/aiida-core.git
```
```diff
werner@ubuntu-01:~/Public/hpc/tools/aiida/aiida-core.git$ git diff
diff --git a/aiida/cmdline/commands/cmd_daemon.py b/aiida/cmdline/commands/cmd_daemon.py
index 0a59c39f8..3cd859d2f 100644
--- a/aiida/cmdline/commands/cmd_daemon.py
+++ b/aiida/cmdline/commands/cmd_daemon.py
@@ -64,7 +64,10 @@ def start(foreground, number):
         subprocess.check_output(command, env=currenv, stderr=subprocess.STDOUT)  # pylint: disable=unexpected-keyword-arg
     except subprocess.CalledProcessError as exception:
         click.secho('FAILED', fg='red', bold=True)
-        echo.echo_critical(exception.output)
+        # echo.echo_critical(exception.output)
+        # https://github.com/aiidateam/aiida-core/issues/3800
+        import traceback
+        echo.echo_critical(traceback.format_exc())
     # We add a small timeout to give the pid-file a chance to be created
     with spinner():
```
Regards
Just mentioning that we recently encountered an issue where `verdi daemon restart` would time out, while (when checking again after a couple of seconds) the daemon would actually have restarted successfully.
In this particular example we know that the command run by the daemon worker was stuck (it was waiting for input). I.e., the timeout on the `verdi daemon restart` side may be shorter than (or equal to) the timeout related to killing the daemon worker process.
Discussed this with @sphuber, who may have a look into this when he finds time.
I cannot reproduce it (neither for `verdi daemon start` nor `verdi daemon restart`), but I looked at the code and the two phenomena are not necessarily the same. For the `verdi daemon start` case, it launches `verdi daemon start-circus` in a subprocess and, if that returns a non-zero exit status, the `FAILED` is printed. However, for `verdi daemon restart`, the circus daemon is already running and we send a signal to it through the `CircusClient`, asking it to restart the workers. The client call has a timeout, which is what was being observed in the case mentioned by @ltalirz.
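The restart case can be illustrated in isolation: if the client waits a fixed time for an acknowledgement while the worker takes longer to come back, the call reports failure even though the restart completes shortly afterwards. A minimal sketch with hypothetical timings (not the actual circus protocol):

```python
import threading
import time


def restart_with_timeout(restart_duration, client_timeout):
    """Simulate a client that waits `client_timeout` seconds for a worker
    restart that actually takes `restart_duration` seconds.

    Returns (acknowledged, done_event): `acknowledged` is what the client
    reports; `done_event` is eventually set when the restart really finishes.
    """
    done = threading.Event()

    def worker_restart():
        time.sleep(restart_duration)   # the worker is busy coming back up
        done.set()                     # the restart has actually succeeded

    threading.Thread(target=worker_restart, daemon=True).start()
    acknowledged = done.wait(client_timeout)  # client gives up after its timeout
    return acknowledged, done
```

When `restart_duration > client_timeout`, the client reports a failure even though waiting a little longer on `done` shows the restart succeeded, which matches the behaviour @ltalirz described.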
Just encountered another case of `verdi daemon start` returning `FAILED` (but actually starting the daemon successfully) with aiida-core 2.1.2. I'm not able to reproduce it, unfortunately.
Independently of this:
I also notice that, in a setup where the `.aiida` directory is shared between machines, trying to start the daemon on one machine while another machine is already running the daemon results in a different error message:

```
$ verdi daemon start
Starting the daemon with 1 workers... Critical: stopped
```

(without any further explanation of what happened).
In this scenario, `verdi daemon status` returns:

```
$ verdi daemon status
Profile: quicksetup
The daemon is paused
```
See code: https://github.com/aiidateam/aiida-core/blob/e83bacc87a832efbe5227bb0d8de1593d0846e6d/aiida/cmdline/utils/daemon.py#L61-L62
A user receiving this message will wonder:
a) what "paused" means, given that there is no `verdi daemon pause` / `verdi daemon play` command;
b) how "paused" is different from "the daemon is not running".
I believe we should provide some more context to the user here.
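One way to provide that context would be to translate the raw circus status strings into user-facing explanations instead of echoing them verbatim. A hypothetical sketch (the state names other than `stopped`, and the wording, are illustrative, not aiida-core's actual messages):

```python
# Hypothetical mapping from raw circus status strings to messages that
# explain the state rather than echoing it verbatim.
STATUS_MESSAGES = {
    'active': 'The daemon is running.',
    'stopped': (
        'The daemon is paused: the circus controller is alive but its '
        'workers are stopped. This can happen when another machine sharing '
        'this .aiida folder controls the daemon. Try `verdi daemon stop` '
        'followed by `verdi daemon start` to reset it.'
    ),
}


def describe_daemon_status(raw_status):
    """Return an explanatory message for a raw status string, falling back
    to reporting the raw value for unknown states."""
    return STATUS_MESSAGES.get(raw_status, f'Daemon reported unknown state: {raw_status}')
```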
Yeah, I agree, having it say that it is paused makes no sense and also confused me. I have never encountered that state though.
I think this scenario (the same `.aiida` folder used by multiple systems) requires a bit more care. In essence, the instance that is not running the daemon and that calls `verdi daemon status` can still communicate with the daemon, since it goes over sockets. However, it really shouldn't be allowed to, since the circus process is not its process. So I guess we should add a check in the client that the PID file corresponds to a process it owns, and otherwise it cannot use the sockets.
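The ownership check suggested above could look roughly like the following, assuming a POSIX system where `os.kill(pid, 0)` probes existence and signal permission without sending a signal. Note this is only a sketch: a PID recorded by another machine could coincidentally match a signalable local process, so a real fix would need a stronger identity check.

```python
import os


def owns_daemon_process(pid_file):
    """Return True only if the PID in `pid_file` refers to a live process
    that this user may signal, i.e. one that is plausibly 'ours'.

    With a shared `.aiida` folder, a PID recorded by another machine will
    usually not correspond to a signalable local process, so this check
    would refuse to use the sockets in that case.
    """
    try:
        with open(pid_file) as handle:
            pid = int(handle.read().strip())
    except (OSError, ValueError):
        return False  # missing or malformed PID file

    try:
        os.kill(pid, 0)  # signal 0: existence/permission probe only
    except ProcessLookupError:
        return False  # no such process on this machine
    except PermissionError:
        return False  # process exists but belongs to another user
    return True
```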
Just adding this here, since the issue is similar. With `aiida-core==2.4.1` I'm running into the following:
```
❯ verdi daemon restart --reset --timeout 20
Profile: dev
Stopping the daemon... OK
Starting the daemon with 1 workers... FAILED
Critical: The daemon failed to start or is unresponsive after 2 seconds.
```
The message is pretty clear that it's also possible the daemon started but is simply unresponsive, but I would expect it to wait for 20 seconds due to the `--timeout` setting:

```
-t, --timeout INTEGER  Time in seconds to wait for a response before timing out.
```
I then configured the default `daemon.timeout` to 20 seconds and all was well:

```
❯ verdi config set daemon.timeout 20
Success: 'daemon.timeout' set to 20 for 'dev' profile
```
```
❯ verdi daemon restart --reset
Profile: dev
Stopping the daemon... OK
Starting the daemon with 1 workers... OK
```
So two suggested action points would be:
- Fix `--timeout` not being used properly, unless I misunderstand its purpose?
- Increase the default timeout to e.g. 5 or 10 seconds.
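The observed behaviour would be consistent with the CLI option being parsed but never forwarded to the waiting client, which then falls back to the configured default. A hypothetical reduction of that bug pattern (none of these names are the actual aiida-core internals):

```python
# Hypothetical profile configuration with the default daemon timeout.
CONFIG = {'daemon.timeout': 2}


def wait_for_daemon(timeout=None):
    """Return the timeout the client would actually use: the explicit value
    if given, otherwise the configured default."""
    return timeout if timeout is not None else CONFIG['daemon.timeout']


def restart_command_buggy(cli_timeout):
    # Bug pattern: the --timeout value is accepted but never forwarded,
    # so the client silently uses the configured default.
    return wait_for_daemon()


def restart_command_fixed(cli_timeout):
    # Fix: forward the CLI value down to the client call.
    return wait_for_daemon(timeout=cli_timeout)
```

This would also explain why setting `daemon.timeout` in the config "fixes" it: the default is what actually gets used either way.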
Thanks @mbercx. Since this is a different issue, could you please open a new issue for it? The daemon has been refactored significantly since the original report and I am not sure it still applies, so I will be closing this one.