shell command does not terminate on cancel
As a simplified example, I would like to:
- start a shell command
- wait some time
- stop it
The problem is that child subprocesses spawned by the shell won't terminate on cancel and keep running even after the Python process has finished.
The problem is present only when a command consists of several processes, like ping localhost | cat, but not when a command is a single process, like sleep 20.
Tested with: ubuntu-24.04, python 3.12.8, trio 0.29.0
Steps to reproduce:
- check for currently running ping processes: ps aux | grep ping (confirm no ping process is running)
- run the test script (see below): python ./test.py
- check the list of processes again, both while the test script is running and after it terminates
Expected result: the shell + ping + cat all run for only 5 seconds.
Actual result: the shell process runs for 5 seconds, but the ping and cat processes remain running not just beyond 5 seconds, but even after the script terminates. It looks like the subprocesses get detached from the shell and are no longer considered its children. I am not sure whether this is one problem or two separate ones. I am also not sure whether this is the proper way to run a shell command from within trio.
In any case, is there a workaround or a better way to run shell commands?
test.py script:
import trio
from subprocess import DEVNULL

# cmd = 'sleep 20' # this is properly stopped
cmd = 'ping localhost | cat' # this command won't stop properly

async def test():
    async with trio.open_nursery() as ns:
        async def service_runner(task_status=trio.TASK_STATUS_IGNORED):
            with trio.CancelScope() as scope:
                task_status.started(scope)
                await trio.run_process(cmd, shell=True, check=False,
                                       stdout=DEVNULL, stderr=DEVNULL)
                ns.cancel_scope.cancel()

        service_scope = await ns.start(service_runner)
        await trio.sleep(5.0)
        service_scope.cancel()
        await trio.sleep(5.0)

trio.run(test)
Reading the docs page about run_process, it is implied that on Unix systems, processes are sent the SIGTERM signal and after 5 seconds SIGKILL is sent. If you want to kill subprocesses immediately, you might want to change deliver_cancel, though I've not worked with subprocesses extensively before and I might be missing something.
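For reference, the documented behaviour corresponds to something roughly like the following deliver_cancel (a paraphrased sketch, not Trio's actual implementation): send SIGTERM, wait up to 5 seconds, then fall back to SIGKILL.

import trio

async def default_style_deliver_cancel(process):
    # Ask nicely first (SIGTERM)...
    process.terminate()
    with trio.move_on_after(5):
        await process.wait()
        return
    # ...and force-kill (SIGKILL) if it is still alive after 5 seconds.
    process.kill()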
@CoolCat467 thanks for the suggestion.
I have tried to use deliver_cancel to force-stop the process, but it does not help.
Here is a new test scenario. It shows that deliver_cancel actually gets executed and the (shell) process does stop. But it looks like the children of the shell did not receive the termination signal. The ping process still gets orphaned and keeps running.
Do you have any other idea?
import trio
from subprocess import DEVNULL

# cmd = 'ping localhost'
cmd = 'ping localhost | cat'

async def test():
    async def stop_process(process):
        print('stopping...')
        process.kill()
        print('waiting...')
        await process.wait()
        print('done...')

    async with trio.open_nursery() as ns:
        async def service_runner(task_status=trio.TASK_STATUS_IGNORED):
            with trio.CancelScope() as scope:
                task_status.started(scope)
                await trio.run_process(cmd, shell=True, check=False,
                                       stdout=DEVNULL, stderr=DEVNULL,
                                       deliver_cancel=stop_process)
                ns.cancel_scope.cancel()

        service_scope = await ns.start(service_runner)
        await trio.sleep(5.0)
        service_scope.cancel()
        await trio.sleep(5.0)

trio.run(test)
This isn't trio-specific. I haven't tried looking more, but I suspect /bin/sh doesn't propagate signals to the processes in a pipeline.
alice@halimede /t/tmp.9WFtkHEbrd> cat ex2.py
import subprocess
import time
cmd = 'while true; do\necho "h"\nsleep 1\ndone | cat'
proc = subprocess.Popen(cmd, shell=True)
time.sleep(1)
proc.kill()
alice@halimede /t/tmp.9WFtkHEbrd> python ex2.py
h
h
alice@halimede /t/tmp.9WFtkHEbrd> h
h
h
On Python 3.11+ (on Unix at least) you can do this:
alice@halimede /t/tmp.9WFtkHEbrd> cat ex2.py
import subprocess
import time
import os
import signal
cmd = 'while true; do\necho "h"\nsleep 10\ndone | cat'
proc = subprocess.Popen(cmd, shell=True, process_group=0)
time.sleep(1)
os.killpg(proc.pid, signal.SIGTERM)
proc.wait()
alice@halimede /t/tmp.9WFtkHEbrd> python ex2.py
h
alice@halimede /t/tmp.9WFtkHEbrd>
and for earlier versions I think you need to use something like preexec_fn with os.setpgid(0, 0). Maybe trio should do this by default if shell=True (on Unix).
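An untested sketch of that pre-3.11 fallback, using plain subprocess: preexec_fn runs in the child just before exec, so os.setpgid(0, 0) puts the shell (and hence the whole pipeline) into its own process group, which os.killpg can then tear down.

import os
import signal
import subprocess
import time

cmd = 'ping localhost | cat'
# Put the shell into a brand-new process group whose pgid == the shell's pid.
proc = subprocess.Popen(cmd, shell=True,
                        preexec_fn=lambda: os.setpgid(0, 0))
time.sleep(1)
# Signal the whole group, not just the shell.
os.killpg(os.getpgid(proc.pid), signal.SIGTERM)
proc.wait()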
@A5rocks I am not exactly sure how to translate your suggestion to current trio defaults. Am I supposed to use trio.run_process or some other lower-level method? I would appreciate a short trio example to demonstrate start/stop of a shell pipeline.
This should work, though I haven't tested it:
import trio
import os # changed
import signal # changed
from subprocess import DEVNULL

# cmd = 'ping localhost'
cmd = 'ping localhost | cat'

async def test():
    async def stop_process(process):
        print('stopping...')
        os.killpg(process.pid, signal.SIGTERM) # changed
        print('waiting...')
        await process.wait()
        print('done...')

    async with trio.open_nursery() as ns:
        async def service_runner(task_status=trio.TASK_STATUS_IGNORED):
            with trio.CancelScope() as scope:
                task_status.started(scope)
                await trio.run_process(cmd, shell=True, check=False,
                                       stdout=DEVNULL, stderr=DEVNULL,
                                       deliver_cancel=stop_process, process_group=0) # changed
                ns.cancel_scope.cancel()

        service_scope = await ns.start(service_runner)
        await trio.sleep(5.0)
        service_scope.cancel()
        await trio.sleep(5.0)

trio.run(test)
@A5rocks thanks. This approach works.
It is sufficient for my application. Let me know if you want me to close the issue or keep it open as a reminder to improve trio, either with better defaults or at least a documentation update to highlight the problem and a solution.
There is another observation. trio correctly stops all subprocesses on CTRL-C in all cases, whether during the first 5-second interval or during the second one. The behaviour is correct even with the original example (that is, even after the subprocess was already improperly cancelled). This makes me believe there is another issue, namely that CTRL-C does better (expected) cleanup than a regular application exit. Not sure if this observation is related to the original issue.
I guess I should explain what I think the problem is.
The issue is that process.kill() kills a single process -- and processes don't have to propagate signals to their subprocesses. Instead, we can put them all in a group, which means we can kill the group, which sends a signal to every member.
The reason why Ctrl+C works by default is that subprocesses by default share a process group with the process that spawned them. So, the shell has the same process group as the python program. When you press Ctrl+C, every process in the process group receives a SIGINT. We can't use this for process.kill() because we don't want to kill the current process; just the subprocesses. Hence the process_group=0 to create a new process group (0 = "use my pid for pgid").
(I'm not sure where "it worked without pipes" fits into this. I assume the shell execs the command directly if there are no pipes, and spawns subprocesses if there are? That's the main bit that makes me couch this explanation in "what I think the problem is".)
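To make the group-sharing part of this concrete, here is a small untested sketch (plain subprocess, Python 3.11+ for process_group): by default a child reports the same pgid as its parent, while process_group=0 gives it a group of its own.

import os
import subprocess

shared = subprocess.Popen(['sleep', '30'])
separate = subprocess.Popen(['sleep', '30'], process_group=0)  # Python 3.11+

print('parent pgid:    ', os.getpgid(0))
print('default child:  ', os.getpgid(shared.pid))    # same as the parent's pgid
print('process_group=0:', os.getpgid(separate.pid))  # equals separate.pid

for p in (shared, separate):
    p.kill()
    p.wait()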
I think trio should improve its defaults here:
- make shell=True subprocesses pass process_group=0 by default, unless that overrides something
- if process_group is in the kwargs, then the default deliver_cancel should use os.killpg(os.getpgid(process.pid), signal) (a rough sketch of this is below)
- probably also backport process_group by updating preexec_fn on pre-3.11
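A rough sketch (not actual Trio code) of what the killpg-based default from the second bullet could look like, keeping the existing SIGTERM-then-SIGKILL escalation but aiming it at the whole process group:

import os
import signal
import trio

async def deliver_cancel_to_group(process):
    pgid = os.getpgid(process.pid)
    # Politely terminate every member of the group...
    os.killpg(pgid, signal.SIGTERM)
    with trio.move_on_after(5):
        await process.wait()
        return
    # ...and force-kill the group if the shell hasn't exited after 5 seconds.
    os.killpg(pgid, signal.SIGKILL)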
Though we should make sure we don't make it worse in response to ctrl+c (since the subprocesses would no longer share a process group with the Python process, I don't think they would automatically receive SIGINT. This is probably the correct behavior though? We want them to go through the typical shutdown behavior that deliver_cancel specifies.)
https://biriukov.dev/docs/fd-pipe-session-terminal/3-process-groups-jobs-and-sessions/ explains what’s going on here.
Not an expert, but I understand enough to say this is not a Trio-specific issue, nor a shell=True / pipe specific issue (though this does often come up in that use case).
Trio can’t in general enforce structured concurrency across the subprocess boundary: it can’t stop an arbitrary subprocess from e.g. spawning its own subprocess in a new session that Trio doesn’t know about / can’t supervise, and which outlives the ancestor Trio program.
But in the common enough case that you control all the descendant subprocesses you’re spawning with Trio (and can make sure they don’t misbehave), you can use the process group trick mentioned above along with os.killpg to honor structured concurrency. It’s a subtle enough yet common enough use case that it probably deserves at least a recipe in the docs, if not also some higher-level API support.
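One possible shape for such a recipe (an untested sketch; run_shell is a hypothetical helper name, not a Trio API): combine process_group=0 with a deliver_cancel that signals the whole group.

import os
import signal
import trio

async def _terminate_group(process):
    # Signal every process in the child's group, then wait for the shell.
    os.killpg(os.getpgid(process.pid), signal.SIGTERM)
    await process.wait()

async def run_shell(cmd, **kwargs):
    # shell=True plus its own process group, so cancellation reaches the
    # entire pipeline, not just the shell.
    return await trio.run_process(cmd, shell=True, process_group=0,
                                  deliver_cancel=_terminate_group, **kwargs)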
Would you happen to know why this only works on 3.12 and 3.14, not 3.13 or anything less than 3.12? (I'll spend some time debugging this at some point but this specific matrix of success makes no sense and yet it persists across reruns): https://github.com/A5rocks/trio/commit/c7991d4f754762ec3f604aadbf59c19c57e5c8ca
Here's the CI run: https://github.com/A5rocks/trio/actions/runs/14330484295
I haven't had time to look into the CI failures, sorry.
I did put together a (not-Trio-specific) killpg demo though. In case it's helpful: https://github.com/ctcjab/killpg-demo
Turns out it was just a race condition in the test, unless my tests locally were wrong. I guess we have to give a process some time to die? (I'd rather we only exit the nursery once the process is actually dead, but I'm not sure that's something Trio can do.)