salt
salt copied to clipboard
[BUG] service.running hangs for start/restart of daemons with poor fd hygiene
Description
Salt's service state as of 3005 still hangs when handling start or restart for some daemons. This defect was noted as recently as August 2022 (https://github.com/saltstack/salt/issues/56231#issuecomment-1222559491).
I've provided a test case using salt's underlying code paths which demonstrates at least the core of what's happening; it runs fine on both a Linux and FreeBSD machine. This seems to come up for people with some regularity especially around things like PostgreSQL, which has a somewhat peculiar way of starting up. It came up for my with rspamd.
In short, salt's service module relys on salt's standard command runner, which in turn just uses subprocess.Popen.Communicate()
for subprocesses. That's great for one-off executions but a problem for daemonization, where salt can start trying to read from the subprocess's stdout before daemonization is complete (and presumably sometimes instead of calling wait()
? in the process tree for service.running
states I'd see salt waiting on a defunct process. . . which usually is what happens when you don't call wait()
).
Anyway. I suspect any system that's not using systemd is going to have this problem for at least some daemons. The easiest fix is making the service module invoke cmd
with a timeout, but you could also double down on your threading I guess and wait()
on the child process while also trying to communicate()
. . . I suppose.
So, I hope that's a help. It's a bit of a bummer to see the last mention of this just get closed after 2 years even as people are having the problem still. . . after the same auto-close happened earlier, too:
- closed in 2020: https://github.com/saltstack/salt/issues/56231
- also closed in 2020: https://github.com/saltstack/salt/issues/44848#issuecomment-624489416
See also:
- An early characterization of what happens, including replies from Hatch: https://groups.google.com/g/salt-users/c/wymM8NrslNw
- https://github.com/saltstack-formulas/postgres-formula/pull/264/files
- https://bugs.python.org/issue23213
Setup
- [x] VM (kvm)
Steps to Reproduce the behavior
Test case, using sources below:
CLOSE_FDS=1 time python3 ./salt_run.py # works fine, exits immediately
CLOSE_FDs=0 time python3 ./salt_run.py # hangs until the daemonized process exists
Or, try this on a FreeBSD VM with a daemon like rspamd
or postgresql
.
Expected behavior
Salt's service.running
state does not hang indefinitely or until one runs service XXX restart
for non-systemd machines and for services where daemonization doesn't work as expected.
Screenshots
N/A
Versions Report
salt --versions-report
(Provided by running salt --versions-report. Please also mention any differences in master/minion versions.)Salt Version:
Salt: 3005
Dependency Versions:
cffi: 1.15.1
cherrypy: Not Installed
dateutil: Not Installed
docker-py: Not Installed
gitdb: Not Installed
gitpython: Not Installed
Jinja2: 3.0.1
libgit2: Not Installed
M2Crypto: Not Installed
Mako: Not Installed
msgpack: 1.0.4
msgpack-pure: Not Installed
mysql-python: Not Installed
pycparser: 2.21
pycrypto: Not Installed
pycryptodome: 3.15.0
pygit2: Not Installed
Python: 3.9.16 (main, Dec 17 2022, 01:13:33)
python-gnupg: Not Installed
PyYAML: 5.4.1
PyZMQ: 22.3.0
smmap: Not Installed
timelib: Not Installed
Tornado: 4.5.3
ZMQ: 4.3.4
System Versions:
dist: freebsd 13.1
locale: utf-8
machine: amd64
release: 13.1-RELEASE
system: FreeBSD
version: FreeBSD 13.1
Additional context
Test case sources:
==> daemonize.py <==
#!/usr/bin/env python3
"""
Demonstrates a daemonizing process that either does or doesn't close
its parent's stdin/stdout/stderr before starting work.
CLOSE_FDS=1 ./daemonize.py # close fds
CLOSE_FDS=1 ./daemonize.py # don't close fds
"""
import os
import sys
import time
# Riffed from
# http://web.archive.org/web/20131017130434/http://www.jejik.com/articles/2007/02/a_simple_unix_linux_daemon_in_python/
# but it's just the usual daemonization (fork, setsid, fork, maybe close
# fds, and start the work.)
def daemonize(close_fds):
pid = os.fork()
if pid > 0:
sys.exit(0) # exit first parent
# decouple from parent env
os.setsid()
pid = os.fork()
if pid > 0:
sys.exit(1) # exit second parent
if close_fds:
# We have to close both stdout & stderr b/c
# cmdmod.py's retcode() aliases stderr=subprocess.STDOUT.
sys.stdout.flush()
sys.stderr.flush()
fw = open('/dev/null', 'wb')
os.dup2(fw.fileno(), sys.__stdout__.fileno())
os.dup2(fw.fileno(), sys.__stderr__.fileno())
# We don't seem to need to close stdin, though, in order for
# salt_run.py to terminate.
# fr = open('/dev/null', 'rb')
# os.dup2(fr.fileno(), sys.__stdin__.fileno())
print("sleeping for 5s", file=sys.stderr)
try:
time.sleep(5) # pretend our daemon only ran for 5s
except Exception as e:
print("error: %s" % e, file=sys.stderr)
raise
print("done", file=sys.stderr)
if __name__ == '__main__':
close_fds = int(os.environ.get('CLOSE_FDS', '0'))
daemonize(close_fds)
==> salt_run.py <==
#!/usr/bin/env python
"""
Test that salt's timed_subprocess hangs when running a daemon process
that doesn't close off its stdin/stdout/stderr.
Usage:
CLOSE_FDS=1 ./salt_run.py # works fine
CLOSE_FDS=0 ./salt_run.py # hangs
"""
import subprocess
import sys
import salt.utils.timed_subprocess
proc = salt.utils.timed_subprocess.TimedProc(
# use our executable since FreeBSD names python3 python39, etc.
'%s ./daemonize.py' % sys.executable,
shell=True,
stdout=subprocess.PIPE,
stderr=subprocess.STDOUT)
proc.run()
print('done!')