aiida-core
`verdi daemon start` claims to fail but, in fact, is successful.
Like a miserable, aggressively-humble inhabitant of the UK and Ireland, `verdi daemon start` claims failure while actually being successful.
When starting the daemon on my system, it reliably fails the first time, every time. Running `verdi daemon start` a second time after a failure works as expected.
This is a fresh install of v1.1.0.
If one monitors `ps -e | grep verdi` directly after a failure, the following process is spawned:
```
59170 ?? 0:00.04 /Users/conrad/.virtualenvs/aiidapy/bin/python3.7 /Users/conrad/.virtualenvs/aiidapy/bin/verdi -p Testing daemon start-circus 1
```
If one waits a moment and then searches the processes again, the following is also shown:
```
59189 ?? 0:01.08 /Users/conrad/.virtualenvs/aiidapy/bin/python3.7 /Users/conrad/.virtualenvs/aiidapy/bin/verdi -p Testing devel run_daemon
```
Seemingly the check for success is happening before the daemon worker has a chance to spin up and so fails spuriously.
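If this diagnosis is right, the race could in principle be closed by polling for the daemon's PID file rather than checking once. A minimal sketch of that idea (the file path, timeout, and poll interval here are illustrative, not aiida-core's actual values):

```python
import os
import time


def wait_for_pid_file(pid_file, timeout=10.0, poll_interval=0.2):
    """Poll until the daemon's PID file appears, instead of checking once.

    Returns the PID on success, or None if the timeout elapses without
    the file appearing. Values are illustrative, not aiida-core's.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if os.path.exists(pid_file):
            with open(pid_file) as handle:
                return int(handle.read().strip())
        time.sleep(poll_interval)
    return None
```

A one-shot existence check fails spuriously whenever the worker takes longer than the check; polling up to a deadline only fails when the daemon genuinely never comes up.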
Have you experienced this since, @ConradJohnston? Is this a regular occurrence, and/or can you make it happen deterministically? If so, it would be good to add an additional print statement in the command itself, to print the full output on which we base whether starting the daemon succeeded or failed. This should be in `aiida.cmdline.commands.cmd_daemon:start`. We call a subprocess and, if that raises, we print `FAILED`. It would be good to also print the full stack trace in that case.
also see here
@hongyi-zhao Since you encountered this issue as well, it would be great if you could help pin down the source of the error.
E.g. replace this line https://github.com/aiidateam/aiida-core/blob/67ae49f37055eaa69aa22c61afdf7802b09e73f3/aiida/cmdline/commands/cmd_daemon.py#L67 by:

```python
import traceback
echo.echo_critical(traceback.format_exc())
```
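For reference, the suggested change amounts to the following pattern, shown here as a standalone sketch (aiida-core's `echo.echo_critical` and `click.secho` are replaced by plain `print` so the snippet runs on its own):

```python
import subprocess
import traceback


def start_subprocess(command):
    """Run a daemon-launching command; on failure, show the full traceback
    rather than only the subprocess's captured output."""
    try:
        return subprocess.check_output(command, stderr=subprocess.STDOUT)
    except subprocess.CalledProcessError:
        print('FAILED')
        # The full stack trace helps pin down *where* the failure was
        # raised, not just what the subprocess printed.
        print(traceback.format_exc())
        return None
```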
With the above code changes, I cannot see any difference in the output messages for this problem.
I also note that starting the daemon may be a time-consuming operation, while the check may be invoked too early to give correct results. When running in blocking mode with the `--foreground` option, this error message disappears, so nothing is actually wrong. See my steps below for details.
On the other hand, I have also noticed that `verdi daemon start` must be the first run after the computer is (re)started; otherwise the problem may not be reproducible.
My test results follow:
In one terminal, activate the AiiDA environment, run the following command and keep it open:
```
(aiida-quantum-mobile) werner@ubuntu-01:~$ verdi daemon start --foreground
Starting the daemon...
```
In the other terminal, activate the AiiDA environment using the same profile and test it as follows:
```
(aiida-quantum-mobile) werner@ubuntu-01:~$ pgrep -x verdi -a
25467 /home/werner/.pyenv/versions/3.6.9/envs/aiida-quantum-mobile/bin/python3.6 /home/werner/.pyenv/versions/aiida-quantum-mobile/bin/verdi daemon start --foreground
25534 /home/werner/.pyenv/versions/3.6.9/envs/aiida-quantum-mobile/bin/python3.6 /home/werner/.pyenv/versions/3.6.9/envs/aiida-quantum-mobile/bin/verdi -p aiida daemon start-circus --foreground 1
25824 /home/werner/.pyenv/versions/3.6.9/envs/aiida-quantum-mobile/bin/python3.6 /home/werner/.pyenv/versions/3.6.9/envs/aiida-quantum-mobile/bin/verdi -p aiida devel run_daemon
```
```
(aiida-quantum-mobile) werner@ubuntu-01:~$ verdi daemon status
Profile: aiida
Daemon is running as PID 25534 since 2020-04-14 17:20:29
Active workers [1]:
  PID    MEM %    CPU %  started
-----  -------  -------  -------------------
25824    0.138        0  2020-04-14 17:20:31
Use verdi daemon [incr | decr] [num] to increase / decrease the amount of workers
```
```
(aiida-quantum-mobile) werner@ubuntu-01:~$ verdi status
✓ config dir: /home/werner/.aiida
✓ profile: On profile aiida
✓ repository: /home/werner/.aiida/repository/aiida
(aiida-quantum-mobile) werner@ubuntu-01:~$ verdi status
✓ config dir: /home/werner/.aiida
✓ profile: On profile aiida
✓ repository: /home/werner/.aiida/repository/aiida
✓ postgres: Connected as aiida@localhost:5432
✓ rabbitmq: Connected to amqp://127.0.0.1?heartbeat=600
✓ daemon: Daemon is running as PID 25534 since 2020-04-14 17:20:29
```
Testing environment:
```
(aiida-quantum-mobile) werner@ubuntu-01:~$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 19.10
Release:        19.10
Codename:       eoan
(aiida-quantum-mobile) werner@ubuntu-01:~/Public/hpc/tools/aiida/aiida-core.git$ pip list | grep aiida-core
aiida-core 1.1.1 /home/werner/Public/hpc/tools/aiida/aiida-core.git
```
```diff
werner@ubuntu-01:~/Public/hpc/tools/aiida/aiida-core.git$ git diff
diff --git a/aiida/cmdline/commands/cmd_daemon.py b/aiida/cmdline/commands/cmd_daemon.py
index 0a59c39f8..3cd859d2f 100644
--- a/aiida/cmdline/commands/cmd_daemon.py
+++ b/aiida/cmdline/commands/cmd_daemon.py
@@ -64,7 +64,10 @@ def start(foreground, number):
         subprocess.check_output(command, env=currenv, stderr=subprocess.STDOUT)  # pylint: disable=unexpected-keyword-arg
     except subprocess.CalledProcessError as exception:
         click.secho('FAILED', fg='red', bold=True)
-        echo.echo_critical(exception.output)
+        # echo.echo_critical(exception.output)
+        # https://github.com/aiidateam/aiida-core/issues/3800
+        import traceback
+        echo.echo_critical(traceback.format_exc())
     # We add a small timeout to give the pid-file a chance to be created
     with spinner():
```
Regards
Just mentioning that we recently encountered an issue where `verdi daemon restart` would time out, while (when checking again after a couple of seconds) the daemon would actually have restarted successfully.
In this particular example we know that the command run by the daemon worker was stuck (it was waiting for input). I.e., the timeout on the `verdi daemon restart` side may be shorter than (or equal to) the timeout related to killing the daemon worker process.
Discussed this with @sphuber, who may have a look into this when he finds time.
I cannot reproduce it (neither for `verdi daemon start` nor `verdi daemon restart`), but I looked at the code and the two phenomena are not necessarily the same. For the `verdi daemon start` case, it launches `verdi daemon start-circus` in a subprocess and, if that returns a non-zero exit status, the `FAILED` is printed. However, for `verdi daemon restart`, the circus daemon is already running and we send a signal to it through the `CircusClient`, asking it to restart the workers. The client call has a timeout, which is what was being observed in the case mentioned by @ltalirz.
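The restart case can be illustrated in isolation: if the client waits a fixed time for an acknowledgement while the worker takes longer to come back, the call reports failure even though the restart completes shortly afterwards. A minimal sketch with hypothetical timings (not the actual circus protocol):

```python
import threading
import time


def restart_with_timeout(restart_duration, client_timeout):
    """Simulate a client that waits `client_timeout` seconds for a worker
    restart that actually takes `restart_duration` seconds.

    Returns (acknowledged, done_event): `acknowledged` is what the client
    reports; `done_event` is eventually set when the restart really finishes.
    """
    done = threading.Event()

    def worker_restart():
        time.sleep(restart_duration)   # the worker is busy coming back up
        done.set()                     # the restart has actually succeeded

    threading.Thread(target=worker_restart, daemon=True).start()
    acknowledged = done.wait(client_timeout)  # client gives up after its timeout
    return acknowledged, done
```

When `restart_duration > client_timeout`, the client reports a failure even though waiting a little longer on `done` shows the restart succeeded, which matches the behaviour @ltalirz described.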
Just encountered another case of `verdi daemon start` returning `FAILED` (but actually starting the daemon successfully) with aiida-core 2.1.2. I'm not able to reproduce it, unfortunately.
Independently of this:
I also notice that, in a setup where the `.aiida` directory is shared between machines, trying to start the daemon on one machine while another machine is already running the daemon results in a different error message:

```
$ verdi daemon start
Starting the daemon with 1 workers... Critical: stopped
```

(without any further explanation of what happened).
In this scenario, `verdi daemon status` returns:

```
$ verdi daemon status
Profile: quicksetup
The daemon is paused
```
See code: https://github.com/aiidateam/aiida-core/blob/e83bacc87a832efbe5227bb0d8de1593d0846e6d/aiida/cmdline/utils/daemon.py#L61-L62
A user receiving this message will wonder:
a) what "paused" means, given that there is no `verdi daemon pause` / `verdi daemon play` command;
b) how "paused" is different from "the daemon is not running".
I believe we should provide some more context to the user here.
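One way to provide that context would be to translate the raw circus status strings into user-facing explanations instead of echoing them verbatim. A hypothetical sketch (the state names other than `stopped`, and the wording, are illustrative, not aiida-core's actual messages):

```python
# Hypothetical mapping from raw circus status strings to messages that
# explain the state rather than echoing it verbatim.
STATUS_MESSAGES = {
    'active': 'The daemon is running.',
    'stopped': (
        'The daemon is paused: the circus controller is alive but its '
        'workers are stopped. This can happen when another machine sharing '
        'this .aiida folder controls the daemon. Try `verdi daemon stop` '
        'followed by `verdi daemon start` to reset it.'
    ),
}


def describe_daemon_status(raw_status):
    """Return an explanatory message for a raw status string, falling back
    to reporting the raw value for unknown states."""
    return STATUS_MESSAGES.get(raw_status, f'Daemon reported unknown state: {raw_status}')
```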
Yeah, I agree, having it say that it is paused makes no sense and also confused me. I have never encountered that state though.
I think this scenario (the same `.aiida` folder used by multiple systems) requires a bit more care. In essence, the instance that is not running the daemon and that calls `verdi daemon status` can still communicate with the daemon, since it goes over sockets. However, it really shouldn't be allowed to, since the circus process is not its process. So I guess we should add a check in the client that the PID file corresponds to a process it owns, and otherwise it cannot use the sockets.
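The ownership check suggested above could look roughly like the following, assuming a POSIX system where `os.kill(pid, 0)` probes existence and signal permission without sending a signal. Note this is only a sketch: a PID recorded by another machine could coincidentally match a signalable local process, so a real fix would need a stronger identity check.

```python
import os


def owns_daemon_process(pid_file):
    """Return True only if the PID in `pid_file` refers to a live process
    that this user may signal, i.e. one that is plausibly 'ours'.

    With a shared `.aiida` folder, a PID recorded by another machine will
    usually not correspond to a signalable local process, so this check
    would refuse to use the sockets in that case.
    """
    try:
        with open(pid_file) as handle:
            pid = int(handle.read().strip())
    except (OSError, ValueError):
        return False  # missing or malformed PID file

    try:
        os.kill(pid, 0)  # signal 0: existence/permission probe only
    except ProcessLookupError:
        return False  # no such process on this machine
    except PermissionError:
        return False  # process exists but belongs to another user
    return True
```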
Just adding this here, since the issue is similar. With `aiida-core==2.4.1` I'm running into the following:
```
❯ verdi daemon restart --reset --timeout 20
Profile: dev
Stopping the daemon... OK
Starting the daemon with 1 workers... FAILED
Critical: The daemon failed to start or is unresponsive after 2 seconds.
```
The message is pretty clear that it's also possible the daemon started but is simply unresponsive, but I would expect it to wait for 20 seconds due to the `--timeout` setting:

```
-t, --timeout INTEGER  Time in seconds to wait for a response before timing out.
```
I then configured the default `daemon.timeout` to 20 seconds and all was well:

```
❯ verdi config set daemon.timeout 20
Success: 'daemon.timeout' set to 20 for 'dev' profile
```
```
❯ verdi daemon restart --reset
Profile: dev
Stopping the daemon... OK
Starting the daemon with 1 workers... OK
```
So two suggested action points would be:
- Fix `--timeout` not being used properly, unless I misunderstand its purpose?
- Increase the default timeout to e.g. 5 or 10 seconds.
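The observed behaviour would be consistent with the CLI option being parsed but never forwarded to the waiting client, which then falls back to the configured default. A hypothetical reduction of that bug pattern (none of these names are the actual aiida-core internals):

```python
# Hypothetical profile configuration with the default daemon timeout.
CONFIG = {'daemon.timeout': 2}


def wait_for_daemon(timeout=None):
    """Return the timeout the client would actually use: the explicit value
    if given, otherwise the configured default."""
    return timeout if timeout is not None else CONFIG['daemon.timeout']


def restart_command_buggy(cli_timeout):
    # Bug pattern: the --timeout value is accepted but never forwarded,
    # so the client silently uses the configured default.
    return wait_for_daemon()


def restart_command_fixed(cli_timeout):
    # Fix: forward the CLI value down to the client call.
    return wait_for_daemon(timeout=cli_timeout)
```

This would also explain why setting `daemon.timeout` in the config "fixes" it: the default is what actually gets used either way.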
Thanks @mbercx. Since this is a different issue, could you please open a new issue for it? The daemon has been refactored significantly since the original report and I am not sure it still applies, so I will be closing this one.