tractor icon indicating copy to clipboard operation
tractor copied to clipboard

Do we need to detect KBI for debugging?

Open goodboy opened this issue 5 years ago • 4 comments

I've been running into this problem using the new debugger support from #129:

  • ctrl-c while already in debugger
  • SIGINT during a request for the tty lock in the root-parent from a child process

In some cases this causes awkward crashing with errors other then the expected KBI and in the worst case the program actually hangs since the child causes deadlock trying to acquire the parent's tty lock (and the parent is waiting for the child to terminate - speaking of which this might be a good test as part of #145).

I got some tips in the trio gitter channel.

@oremanj said:

with the current Ctrl+C handling, I don't think it's possible -- you don't know the whole tree is getting cancelled, because the KI starts at whatever-the-main-task-is-doing and propagates upward cancelling one layer at a time. there's a somewhat stalled effort to improve this. you're probably best off implementing your own SIGINT handling

if you want a non-good way to directly detect a propagating KI exception, uh, I guess you can do something like:

task = trio.lowlevel.current_task()
nurseries_outward = task.child_nurseries[::-1]
while task.parent_nursery is not None:
    nurseries_outward.append(task.parent_nursery)
    task = task.parent_nursery.parent_task
for nursery in nurseries_outward:
    if nursery.cancel_scope.cancel_called and any(isinstance(exc, KeyboardInterrupt) for exc in nursery._pending_excs:
        return True
return False

@njsmith followed up with:

one quirk of SIGINT, if you're using subprocesses, is that usually the OS sends separate copies of the signal to all of your subprocesses

To which I suggested separate SIGINT handlers per proc, to which he responsed:

or else disable SIGINT in the child processes and rely on some other mechanism for propagating cancellation across process boundaries

@oremanj also mentioned:

the parent SIGINT will cancel trio.run_process which will deliver the cancellation to the children using its deliver_cancel argument, you can modify that to send them a different signal, send some state via a non-signal mechanism, etc the default sends SIGTERM, waits, sends SIGKILL. yeah it sounds like you probably just want to block SIGINT in your subprocesses

So I think the solution might be best accomplished via the suggestion of ignoring SIGINT in child procs and using our cancel method call to accomplish that. Also, the deliver_cancel system @oremanj mentioned is on my list to check out.

goodboy avatar Aug 19 '20 03:08 goodboy

Ok, so figured out what I think is the main bug to do with the #129 stuff.

  • run a program with a await tractor.breakpoint() in it
  • execute the continue command to resume the program
  • use ctrl-c to kill the process tree
  • the debugger (might) get(s) entered again
  • execute continue again to hit the deadlock

This more often then not seems to result in some kind of deadlock where the parent is waiting on the child to terminate but the child is waiting on the lock from the parent somehow, I think.

goodboy avatar Sep 09 '20 16:09 goodboy

Alright dug slightly into using deliver_cancel:

goodboy avatar Sep 27 '20 20:09 goodboy

Disabled SIGINT in children and all other machinery seems to still work in the test suite; so I guess that's good?

It's still not solving the main issue with debugging where if you ctl-c from the pdbpp prompt and then continue the parent process is somehow getting in a super weird deadlock state with the children where the children aren't being cancelled. Killing the children manually using a system tool seems to then unblock the root actor and it tears down with KeyboardInterrupt. Might try disabling sigint handling while inside the debugger now.

goodboy avatar Sep 27 '20 22:09 goodboy

I think the issue has been resolved by allocating a separate cancel scope for the service nursery scheduled in wait_for_parent_stdin_hijack. By simply cancelling this scope in Actor.cancel() deadlocks are avoided since the child task causing the lock is properly killed now (though i'm slightly confused because the task is scheduled on the service nursery, and that nursery is getting cancelled :shrug:).

Ahhh! It's probably that the channel server is blocking for the task to finish and since this tty lock request task is not put in the rpc task set the "all channels are ded" event never comes in because we have a task locked in the parent from the service nursery that's never cancelled. This looks to be the source of the deadlock and should be resolved now :surfer:

I'm gonna close this once some tests come in.

The good news is that working through the SIGINT stuff did allow finding some issues with Pdbs handler state mangling which we're now disabling in order to avoid nasty behavior if a user triggers ctl-c while inside the debugger prompt.

goodboy avatar Oct 02 '20 15:10 goodboy