Thespian
                                
                                
                                
                                    Thespian copied to clipboard
                            
                            
                            
                        ChildActorExited - Manually killing child PID does not result in ChildActorExited Message to Parent
Code
Below is example code which has reproduced this issue for me.
Project structure
.
├── app
│   ├── __init__.py
│   ├── parent.py
│   └── child.py
├── start.py
└── stop.py
start.py
from thespian.actors import ActorSystem
from app import parent
if __name__ == '__main__':
  asys = ActorSystem('multiprocTCPBase')
  actor = asys.createActor(parent.Parent)
  asys.tell(actor, parent.Initialize())
stop.py
from thespian.actors import ActorSystem
if __name__ == '__main__':
  ActorSystem('multiprocTCPBase').shutdown()
parent.py
from thespian.actors import ActorTypeDispatcher
import os
class Initialize:
  def __init__(self):
    pass
class Parent(ActorTypeDispatcher):
  def __init__(self, *args, **kwargs):
    super(Parent, self).__init__(*args, **kwargs)
    self._child = None
  def receiveMsg_Initialize(self, msg, sender):
    print(f'{self.myAddress} (P|:{os.getpid()}) Parent received Initialize command from {sender}')
    self._child = self.createActor('app.child.Child')
    self.send(self._child, Initialize())
  def receiveMsg_ChildActorExited(self, msg, sender):
    print(f'In Parent: ChildActorExited')
child.py
from thespian.actors import ActorTypeDispatcher
import os
class Child(ActorTypeDispatcher):
  def __init__(self, *args, **kwargs):
    super(Child, self).__init__(*args, **kwargs)
  def receiveMsg_Initialize(self, msg, sender):
    print(f'{self.myAddress} (P|:{os.getpid()}) Child received Initialize command from {sender}')
Procedure
- Navigate to the root of my project.
 
~$ cd ~/Documents/thespian/test
- Start the actor system using 
start.py 
~/Documents/thespian/test$ python start.py
ActorAddr-(T|:1111) (P|:5555) Parent received Initialize command from ActorAddr-(T|:0000)
ActorAddr-(T|:2222) (P|:6666) Child received Initialize command from ActorAddr-(T|:1111)
- Kill the child actor process by its PID.
 
~/Documents/thespian/test$ kill -9 6666
- After tests, shut down actor system.
 
~/Documents/thespian/test$ python stop.py
Problem
From the procedure above, upon killing the child process in step 3, it is to my understanding that the parent actor should immediately receive a ChildActorExited message and my example program should print out ChildActorExited to the terminal; this, however, does not happen. Instead the remaining parent actor will stay alive and report nothing.
I have tested this same functionality on MacOS and Windows with the same results. I also tried using multiprocUDPBase but again got the same results.
Another thing to note is after killing the child process and running stop.py, the actor system takes a bit longer than usual to shutdown however it does not print any additional information.
Environment
- These tests were performed in fresh virtual environments where the only pip package installed was 
thespian. - I have tested this on MacOS Monterey (12.1) as well as Windows 10 (20H2).
- On both operating systems, I have tested using various python versions (3.7.3, 3.8.6, 3.9.9).
 - I have also tested on a couple versions of Thespian (3.9.11, 3.10.6).
 
 
If you need any additional information, please let me know.
Thank you for the clear report with samples to check this.
Unfortunately, I cannot reproduce the problem: the ChildActorExited is delivered immediately for me when running  the tests you describe.
Python versions tested: 3.7.12, 3.8.12, 3.9.6
The main difference is that I'm running a Linux system (I don't presently have access to either a Windows or a MacOS system).
07:06:42 issue85$ export PS0="running command at start time \t\n"
running command at start time 07:06:46
07:06:46 issue85$ python start.py
running command at start time 07:06:50
ActorAddr-(T|:38691) (P|:1135063) Parent received Initialize command from ActorAddr-(T|:43313)
ActorAddr-(T|:42113) (P|:1135064) Child received Initialize command from ActorAddr-(T|:38691)
07:06:50 issue85$ kill 1135064
running command at start time 07:06:58
07:06:58 issue85$ In Parent: ChildActorExited
In Parent: ChildActorExited
07:06:59 issue85$ python stop.py
running command at start time 07:07:09
07:07:09 issue85$ 
07:07:14 issue85$ 
07:07:15 issue85$ python start.py
running command at start time 07:07:21
ActorAddr-(T|:34143) (P|:1135204) Parent received Initialize command from ActorAddr-(T|:36281)
ActorAddr-(T|:32769) (P|:1135205) Child received Initialize command from ActorAddr-(T|:34143)
07:07:21 issue85$ kill -9 1135205
running command at start time 07:07:25
07:07:25 issue85$ In Parent: ChildActorExited
07:07:27 issue85$ python stop.py
running command at start time 07:07:30
07:07:30 issue85$ 
Can you please try two things:
- Repeat your tests but without the 
-9argument tokill - Run the following and let me know the results with (a) 
python subproc.pyand no additional actions, (b)python subproc.pythenkillof the reported child process, and (c)python subproc.py+kill -9of the reported child process. It should report "Parent got signal 17" immediately on thekillcommands. 
subproc.py:
from multiprocessing import *
import signal
import time
import os
def sleeper(t):
    print('Process %s sleeping for %d seconds' % (os.getpid(), t))
    time.sleep(t)
    print('Process %s awakened and exiting' % os.getpid())
def gotsig(signum, frame):
    print('Parent got signal %s' % signum)
if __name__ == "__main__":
    wait_time = 10
    signal.signal(signal.SIGCHLD, gotsig)
    p = Process(target=sleeper, args=(wait_time,))
    p.start()
    time.sleep(wait_time + 2)
    print('Exiting')
                                    
                                    
                                    
                                
Repeat your tests but without the -9 argument to kill
When I do this, the process never gets killed and the actor stays alive. Both OSs exhibit the same behavior where I am prevented from killing ActorSystem child actor processes without forcefully killing them.
Note: In MacOS the signal number for SIGCHLD is 20 instead of 17.
Run the following and let me know the results with
python subproc.pyand no additional actions
Results:
~/test$ python subproc.py
    Process 42158 sleeping for 10 seconds
    Process 42158 awakened and exiting
    Parent got signal 20
    Exiting
~/test$
Run the following and let me know the results with
python subproc.pythen kill of the reported child process
Results:
~/test$ python subproc.py
    Process 53352 sleeping for 10 seconds
~/test$ kill 53352
    Parent got signal 20
    Exiting
~/test$
I received the exit signal immediately after killing the process.
Run the following and let me know the results with
python subproc.py+kill -9of the reported child process
Results:
~/test$ python subproc.py
    Process 56348 sleeping for 10 seconds
~/test$ kill -9 56348
    Parent got signal 20
    Exiting
~/test$
I received the exit signal immediately after killing the process.
I was unable to run the subproc.py tests on Windows since windows does not have any SIGCHLD signal or anything that I could use as a replacement.
I should have a Linux system up and running in the next couple days so I can get back to you with my test results on that, however it will be on an ARM Cortex processor so hopefully that doesn't throw a wrench into the mix. At least for now it seems like it should work and that this is more of a Mac and Windows specific issue.
This is certainly an odd issue... It almost feels like there may be a permissions issue or something. Are there any environment variables or permissions that I should verify on my end?
Additional environment info:
On the mac, I am using PyEnv to manage my multiple python versions.
On both mac and windows I am using python virtual environments created with python -m venv but I doubt that is the source of the issue.
Interesting... and a bit unexpected:
- The 
SIGCHLDvalue is usually17, not20, although that signal is not one of the explicit values listed in the POSIX standard, so it's possible it's implementation dependent. I get<Signals.SIGCHLD 17>when I run$ python -c 'import signal; print(repr(signal.SIGCHLD)). On my system,20isSIGTSTPwhich is a terminal stop signal (usually associated withCtrl-Zby default, although that's remappable elsewhere). What do you get for this python command? - Running kill without the 
-9should deliverSIGTERM(usually value15). What happens if you explicitly runkill -15 PID? - Can you run 
$ type killcommand? I would expect it to report "kill is a shell builtin". Also, what shell are you running? 
There seem to be some sort of shenanigans going on with signals that's causing the problem. Normally, thespian will watch for a child process exit by registering for a number of signals (see https://github.com/kquick/Thespian/blob/master/thespian/system/multiprocCommon.py#L678-L694 and https://github.com/kquick/Thespian/blob/master/thespian/system/multiprocCommon.py#L29-L40). Your results indicate that signals are being delivered promptly, but not necessarily the signal that was expected. It may be possible to resolve this by simply adding the right additional signal to https://github.com/kquick/Thespian/blob/master/thespian/system/multiprocCommon.py#L36, but I'd like to try to understand what's happening a little better first.
This situation should not be affected by the hardware architecture or the python virtual environment.
- I checked the reference man pages for OSX which provided the following:
 
     No    Name         Default Action       Description
     1     SIGHUP       terminate process    terminal line hangup
     2     SIGINT       terminate process    interrupt program
     3     SIGQUIT      create core image    quit program
     4     SIGILL       create core image    illegal instruction
     5     SIGTRAP      create core image    trace trap
     6     SIGABRT      create core image    abort program (formerly SIGIOT)
     7     SIGEMT       create core image    emulate instruction executed
     8     SIGFPE       create core image    floating-point exception
     9     SIGKILL      terminate process    kill program
     10    SIGBUS       create core image    bus error
     11    SIGSEGV      create core image    segmentation violation
     12    SIGSYS       create core image    non-existent system call invoked
     13    SIGPIPE      terminate process    write on a pipe with no reader
     14    SIGALRM      terminate process    real-time timer expired
     15    SIGTERM      terminate process    software termination signal
     16    SIGURG       discard signal       urgent condition present on
                                             socket
     17    SIGSTOP      stop process         stop (cannot be caught or
                                             ignored)
     18    SIGTSTP      stop process         stop signal generated from
                                             keyboard
     19    SIGCONT      discard signal       continue after stop
     20    SIGCHLD      discard signal       child status has changed
This is showing SICHLD on signal number 20. Here is the output of the requested command from your first point:
~/test$ python -c 'import signal; print(repr(signal.SIGCHLD))'
<Signals.SIGCHLD: 20>
- 
Running with
kill -15 PIDgives the same results askill PIDsince the default signal for the kill command in OSX is 15 (SIGTERM). On that note, in OSX, thekill -9 PIDcommand will deliver aSIGKILLto the process. - 
type killreturnskill is a shell builtin. I'm using ZSH on my Mac but I'm just using standard Powershell on my Windows computer. 
I'm looking through the code you linked and can't see why the exit signal isn't being caught since you're referencing it by name and not number.
I modified your test code in an attempt to view what signals are being received by the killed process.
from multiprocessing import *
import signal
import time
import os
def sleeper(t):
    print('Process %s sleeping for %d seconds' % (os.getpid(), t))
    time.sleep(t)
    print('Process %s awakened and exiting' % os.getpid())
def gotsig(signum, frame):
    print('Parent got signal %s' % signum)
if __name__ == '__main__':
    for i in ['SIGTERM', 'SIGKILL', 'SIGQUIT', 'SIGABRT', 'SIGCHLD']:
        try:
            signum = getattr(signal, i)
            signal.signal(signum, gotsig)
            print(f'{i}: {signum}')
        except (OSError, RuntimeError) as m: # os error
            print(f'Skipping {i}: {m}')
    
    wait_time = 10
    p = Process(target=sleeper, args=(wait_time,))
    p.start()
    time.sleep(wait_time+2)
    print('Exiting')
By running this code, I get the following output.
SIGTERM: 15
SIGKILL: 9
Skipping SIGKILL: [Errno 22] Invalid argument
SIGQUIT: 3
SIGABRT: 6
SIGCHLD: 20
Process 96152 sleeping for 10 seconds
Parent got signal 20
Exiting
Regardless of which signal I am killing the process with (SIGKILL: 9, SIGTERM: 15, SIGQUIT: 3, etc.) I get the same result signal in the output Parent got signal 20. I also did a bit of digging and found that for both Linux and Mac, their man pages specify the following:
The signals SIGKILL and SIGSTOP cannot be caught, blocked, or ignored.
This explains the invalid argument error for SIGKILL in my outputs.
Please let me know if there is any other info that may help.
Looks like only some of the signal values are defined by POSIX and the others can be remapped by different OS vendors (see https://en.wikipedia.org/wiki/Signal_(IPC) and search for "Portable number"), which is what accounts for the different numbers.
Thanks for trying the explicit -15 on the kill: I just wanted to check in case kill was mapped to a different default signal.
SIGKILL is a pretty emphatic termination signal and usually just used as a last resort, since the default SIGTERM is usually nicer in that it allows the application to perform any cleanup processing it might have setup via an associated signal handler.  Regardless, however the child exits, the parent should always get the SIGCHLD, which is what the latest script seems to verify.
I appreciate your patience in working through this with me. When you have a chance, the next set of tests is:
- Update your local thespian's 
thespian/system/transport/TCPTransport.pywith https://gist.github.com/kquick/371519aa6cb0bea16961cc100eb7cf3f THESPLOG_THRESHOLD=DEBUG THESPLOG_MAX_FILESIZE=1024000 THESPLOG_FILE=$(pwd)/run1.log python start.pyand then complete your original test process.- Send both 
run1.logto me (via gist or attachment here, after you've perused and tweaked them to remove any sensitive data you don't want exposed). 
That will give me some internal Thespian log information that should hopefully reveal what's different about running in your environment.
I've attached a few different logs here. All were generated by running the test code originally linked at the beginning of this issue.
kill_minus_15_SIGTERM.log
When running to generate kill_minus_15_SIGTERM.log, I attempted to kill the child process using kill -15 PID, which doesn't actually kill the process at all. After running the kill command, I ran stop.py to gracefully shut down the actorsystem.
kill_minus_9_SIGKILL.log
When running to generate kill_minus_9_SIGKILL.log, I used kill -9 PID to kill the child process. The child process was killed successfully but as expected didn't report back to the parent with a ChildActorExited message. I then gracefully shut down the actorsystem using stop.py.
I'm noticing in the kill_minus_15_SIGTERM.log file, on line 38, we get
2022-02-11 07:34:12.256399 p27057 Warn Actor app.child.Child @ ActorAddr-(T|:61345) got shutdown signal: 15
whereas no such line exists in the kill_minus_9_SIGKILL.log file.
I also tested using kill -3 and kill -6, both of which exhibited the exact same behavior as kill -15. They didn't result in the process being killed and the actor remained alive. They both did however show up in the log file saying got shutdown signal: 3 and got shutdown signal: 6
kill_minus_17_SIGSTOP.log
I then tested using kill -17 which is the SIGSTOP signal on MacOSX, which just like kill -9 (SIGKILL), "cannot be caught or ignored" according to the MacOSX and Linux man pages. This command behaved slightly differently from kill -9. After stopping (not killing since the process never actually died) the child actor process with SIGSTOP, then attempting to run stop.py to gracefully shut down the actorsystem, the shutdown process took longer than expected and resulted in the following error at the end of the log:
2022-02-11 07:55:48.015610 p46550 ERR  No response to Admin shutdown request; Actor system not completely shutdown
I think this is due to SIGSTOP essentially freezing a process which thespian recognizes as still being an existing process. So when thespian tries to kill the frozen process it gets stuck trying to kill the process but can't actually kill it and eventually gives up.
Quick update. On my Debian system I receive the ChildActorExited message as anticipated.
Thanks for the info and update. The log files are not showing some events I'm expecting to see. I'm looking into getting access to a MacOS environment to see if I can do some local testing/reproduction.
If there's something in particular you're looking for, I could do a bit of digging to get some more detailed info. Thanks again for your help. :)