tractor
tractor copied to clipboard
Do we need to detect KBI for debugging?
I've been running into this problem using the new debugger support from #129:
ctrl-cwhile already in debuggerSIGINTduring a request for the tty lock in the root-parent from a child process
In some cases this causes awkward crashing with errors other then the expected KBI and in the worst case the program actually hangs since the child causes deadlock trying to acquire the parent's tty lock (and the parent is waiting for the child to terminate - speaking of which this might be a good test as part of #145).
I got some tips in the trio gitter channel.
@oremanj said:
with the current Ctrl+C handling, I don't think it's possible -- you don't know the whole tree is getting cancelled, because the KI starts at whatever-the-main-task-is-doing and propagates upward cancelling one layer at a time. there's a somewhat stalled effort to improve this. you're probably best off implementing your own SIGINT handling
if you want a non-good way to directly detect a propagating KI exception, uh, I guess you can do something like:
task = trio.lowlevel.current_task()
nurseries_outward = task.child_nurseries[::-1]
while task.parent_nursery is not None:
nurseries_outward.append(task.parent_nursery)
task = task.parent_nursery.parent_task
for nursery in nurseries_outward:
if nursery.cancel_scope.cancel_called and any(isinstance(exc, KeyboardInterrupt) for exc in nursery._pending_excs:
return True
return False
@njsmith followed up with:
one quirk of SIGINT, if you're using subprocesses, is that usually the OS sends separate copies of the signal to all of your subprocesses
To which I suggested separate SIGINT handlers per proc, to which he responsed:
or else disable SIGINT in the child processes and rely on some other mechanism for propagating cancellation across process boundaries
@oremanj also mentioned:
the parent SIGINT will cancel
trio.run_processwhich will deliver the cancellation to the children using itsdeliver_cancelargument, you can modify that to send them a different signal, send some state via a non-signal mechanism, etc the default sends SIGTERM, waits, sends SIGKILL. yeah it sounds like you probably just want to block SIGINT in your subprocesses
So I think the solution might be best accomplished via the suggestion of ignoring SIGINT in child procs and using our cancel method call to accomplish that. Also, the deliver_cancel system @oremanj mentioned is on my list to check out.
Ok, so figured out what I think is the main bug to do with the #129 stuff.
- run a program with a
await tractor.breakpoint()in it - execute the
continuecommand to resume the program - use
ctrl-cto kill the process tree - the debugger (might) get(s) entered again
- execute
continueagain to hit the deadlock
This more often then not seems to result in some kind of deadlock where the parent is waiting on the child to terminate but the child is waiting on the lock from the parent somehow, I think.
Alright dug slightly into using deliver_cancel:
- it's only available when using
run_process()(which we're not). - if we wanted to implement something similar ourselves here are the impls
Disabled SIGINT in children and all other machinery seems to still work in the test suite; so I guess that's good?
It's still not solving the main issue with debugging where if you ctl-c from the pdbpp prompt and then continue the parent process is somehow getting in a super weird deadlock state with the children where the children aren't being cancelled. Killing the children manually using a system tool seems to then unblock the root actor and it tears down with KeyboardInterrupt. Might try disabling sigint handling while inside the debugger now.
I think the issue has been resolved by allocating a separate cancel scope for the service nursery scheduled in wait_for_parent_stdin_hijack. By simply cancelling this scope in Actor.cancel() deadlocks are avoided since the child task causing the lock is properly killed now (though i'm slightly confused because the task is scheduled on the service nursery, and that nursery is getting cancelled :shrug:).
Ahhh! It's probably that the channel server is blocking for the task to finish and since this tty lock request task is not put in the rpc task set the "all channels are ded" event never comes in because we have a task locked in the parent from the service nursery that's never cancelled. This looks to be the source of the deadlock and should be resolved now :surfer:
I'm gonna close this once some tests come in.
The good news is that working through the SIGINT stuff did allow finding some issues with Pdbs handler state mangling which we're now disabling in order to avoid nasty behavior if a user triggers ctl-c while inside the debugger prompt.