`verdi status` deletes PID file of running daemon

Open ltalirz opened this issue 3 years ago • 7 comments

Describe the bug

When running verdi status, the condition

https://github.com/aiidateam/aiida-core/blob/2b6b2e9f8797c1b91e2efdbdff7fbb82db5084bf/aiida/cmdline/utils/daemon.py#L130-L131

is triggered, which results in the circus PID file being deleted, killing the running circus process.

This is because process.cmdline() is ['circusd', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', ''] and _START_CIRCUS_COMMAND is start-circus.
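To illustrate why that command line trips the check, here is a minimal sketch. It assumes the check amounts to looking for the `start-circus` marker in the arguments returned by process.cmdline(); the linked lines are not reproduced here, so the helper name and the exact matching rule are hypothetical.

```python
# Assumption: mirrors the constant named above; the real definition lives in aiida-core.
_START_CIRCUS_COMMAND = 'start-circus'


def cmdline_mentions_start_circus(cmdline):
    """Return True if any argument of the command line contains the marker.

    Hypothetical helper for illustration, not the aiida-core implementation.
    """
    return any(_START_CIRCUS_COMMAND in arg for arg in cmdline)


# Command line reported in the environment that works.
good = ['/path/to/python3.9', '/path/to/verdi', '-p', 'local', 'daemon', 'start-circus', '1']

# Command line reported in the failing environment (empty padding shortened here).
bad = ['circusd'] + [''] * 10

print(cmdline_mentions_start_circus(good))  # True
print(cmdline_mentions_start_circus(bad))   # False
```

With the second command line the marker is never found, so the process is not recognised as the AiiDA daemon even though it is still running.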

Steps to reproduce

I'll have to figure out exactly what is causing this - I have a python 3.9 environment that does not show the issue, and a python 3.8.12 environment that does (both using the same AiiDA profile).

On the environment where it runs fine, process.cmdline() is ['/path/to/python3.9', '/path/to/verdi', '-p', 'local', 'daemon', 'start-circus', '1']

Expected behavior

  1. PIDs of running daemons should not be deleted (note: this only kills circus; the workers continue running and accumulating!).

  2. Maybe we should log the stack trace of the exception that was thrown in the daemon log (that would have simplified debugging).

Your environment

  • Operating system [e.g. Linux]: ubuntu 20.04
  • Python version [e.g. 3.7.1]: 3.8.12
  • aiida-core version [e.g. 1.2.1]: 1.6.5

Additional context

ltalirz avatar Jan 27 '22 22:01 ltalirz

It turns out that the error occurs

  • when the daemon was started in the py38 environment and one runs verdi status in either the py39 or the py38 environment
  • when the daemon was started in the py39 environment and one runs verdi status in the py38 environment

ps -ef | grep -i circus does show that

  • starting the daemon on py39 has the command line '/path/to/python3.9', '/path/to/verdi', '-p', 'local', 'daemon', 'start-circus', '1'
  • starting the daemon on py38 has the command line circusd, which is bound to cause the downstream issue

Further notes:

  • Both environments are running circus 0.17.1
  • pip check runs fine

ltalirz avatar Jan 27 '22 23:01 ltalirz

starting the daemon on py38 has the command line circusd

@sphuber @chrisjsewell do you happen to have any ideas where this could originate from?

ltalirz avatar Jan 27 '22 23:01 ltalirz

Honestly I have no idea. You say the only thing that is different is the Python version. Version of circus is identical. Did you create the environments in a similar way? Is this all conda for example? Why does the process.cmdline() on py38 have so many empty strings? Did you just remove those to hide sensitive data or is that actually what is returned?

I created a virtual environment with Python 3.8 on Ubuntu 20.04 and cannot reproduce the behavior:

Python 3.8.10 (default, Nov 26 2021, 20:14:08) 
Type 'copyright', 'credits' or 'license' for more information
IPython 7.31.1 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import psutil

In [2]: psutil.Process(12651).cmdline()
Out[2]: 
['/home/sph/.virtualenvs/aiida_py38/bin/python',
 '/home/sph/.virtualenvs/aiida_py38/bin/verdi',
 '-p',
 'v1.6.5',
 'daemon',
 'start-circus',
 '1']

The command looks as expected and verdi status therefore does not touch the PID file of the running daemon.

sphuber avatar Jan 28 '22 09:01 sphuber

Thanks @sphuber for getting back to me!

Is this all conda for example?

Yes

Why does the process.cmdline() on py38 have so many empty strings? Did you just remove those to hide sensitive data or is that actually what is returned?

This is the actual return value. As mentioned, also ps -ef shows circusd as the command that is being run.

Did you create the environments in a similar way?

There are additional packages in the "bad" environment, which I cannot all show (pip check did not show any issues, though).

I guess I'll need to look at where the circus daemon is started; perhaps some dependency of circus is not working as expected.

ltalirz avatar Jan 28 '22 10:01 ltalirz

Just received another report of this, this time on python 3.9.15 in ubuntu 18.04.6 LTS (Linux kernel 4.15.0). Again, the respective process shows up in ps -ef as circusd

ltalirz avatar Jan 03 '23 16:01 ltalirz

This difference in the appearance of the circus process is independent of AiiDA.

Here is a minimal example to reproduce this (still not clear what exactly is causing the difference):

test.conf:

[watcher:program]
cmd = python myprogram.py
numprocesses = 2

myprogram.py:

import time

time.sleep(100)

$ circusd test.conf
2023-01-09 18:29:08 circus[3750] [INFO] Installing handle_callback_exception to loop
2023-01-09 18:29:08 circus[3750] [INFO] Registering signals...
2023-01-09 18:29:08 circus[3750] [INFO] Starting master on pid 3750
2023-01-09 18:29:08 circus[3750] [INFO] Arbiter now waiting for commands
2023-01-09 18:29:08 circus[3750] [INFO] program started

The master process shows up in ps -ef | grep circus on one machine as

/path/to/circusd test.conf

and as

circusd

on the other

ltalirz avatar Jan 09 '23 18:01 ltalirz

PR #5862 added a workaround, but ideally we would figure out why the command name changes on certain systems and, if possible, find a more robust check that does not suffer from this platform dependency.
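As a rough illustration of what a more tolerant check could look like (this is not a description of what the PR does), the sketch below accepts either the `start-circus` marker or a command line whose first non-empty argument is named `circusd`, which covers both forms seen in ps -ef in this thread. The function name and the fallback rule are assumptions made for this example.

```python
import os

_START_CIRCUS_COMMAND = 'start-circus'
# Assumption: the bare name shown by `ps -ef` on the affected systems.
_CIRCUS_PROCESS_NAME = 'circusd'


def is_probably_circus(cmdline):
    """Guess whether a command line belongs to a circus master process.

    Hypothetical helper for illustration only.
    """
    # Drop the empty-string padding seen on the affected systems.
    args = [arg for arg in cmdline if arg]
    if not args:
        return False
    # Usual case: the verdi invocation is visible in the arguments.
    if any(_START_CIRCUS_COMMAND in arg for arg in args):
        return True
    # Fallback: the command line was reduced to (a path to) `circusd`.
    return os.path.basename(args[0]) == _CIRCUS_PROCESS_NAME


print(is_probably_circus(['/path/to/python3.9', '/path/to/verdi', '-p', 'local',
                          'daemon', 'start-circus', '1']))
print(is_probably_circus(['circusd'] + [''] * 10))
print(is_probably_circus(['/path/to/circusd', 'test.conf']))
print(is_probably_circus(['bash']))
```

The trade-off is that the fallback matches any circusd process, not only the one started by AiiDA, so by itself it cannot tell a reused PID belonging to an unrelated circus instance from the real daemon.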

sphuber avatar Jan 26 '23 08:01 sphuber