pants icon indicating copy to clipboard operation
pants copied to clipboard

`@rules` which use blocking calls may not exit before the main thread with `--no-pantsd`

Open stuhood opened this issue 2 years ago • 1 comments

As determined by @thejcannon in #15771, when:

  1. the Python portion of a @rule takes a long time to complete (likely due to blocking code: in #15771, the run goal took a long time to tear down a sandbox which had been created using synchronous python-level tempdir facilities)
  2. --no-pantsd is set
  3. Ctrl+C / SIGINT is sent

...then the cancellation of the Session that is triggered by SIGINT will cause the Scheduler::execute loop to exit, which might cause the main thread to exit before @rules (which have been spawned into a background task by the Graph) can be cancelled (by reaching their next await point or return). This triggers a fatal error like:

Fatal Python error: This thread state must be current when releasing
Python runtime state: finalizing (tstate=0x2a10660)

To resolve this, we would need to ensure that regardless of why @rules are taking a long time to exit (blocking code or no), we wait for them to be cancelled before exiting. There are two potential levels at which to enforce this:

  1. the Scheduler/Graph - Scheduler::execute cancels tasks, but the cancellation is async: the tasks will continue running until their next await point (generally a few microseconds for non-blocking code). In general, this is good: it makes cancellation snappy, and ensures that work is in a clean state for the next request (in the case of pantsd). But in the case of a @goal_rule, an argument could be made that we should actually wait for it to exit before continuing (since it will still have access to the Console/Workspace).
  2. the Executor - All work running in the Graph (triggered by the Scheduler) is spawnd onto the Executor. If before exiting the main thread we ensured that the Executor had been drained, then we could be certain that no more background interaction with Python was happening. But that might represent an ongoing game of whack-a-mole, since all references to background spawn'd tasks would need to be dropped before the Executor would exit (likely meaning the need to null/None-out various engine fields to trigger GC).

stuhood avatar Jul 08 '22 21:07 stuhood

This relates to #11688, which suggests that one option would be to reify async tasks that we know must exit (rather than necessarily assuming that everything on the executor must exit).

stuhood avatar Sep 08 '22 19:09 stuhood

I'd like to make some progress on this ticket in the interest of attempting to address the:

FATAL: exception not rethrown

...error that has popped up recently in CI. It's a bit of a shot in the dark though, since "which" thread is failing to shut down in time is unclear.


#17974 takes a first step toward resolving this issue by unblocking using separate executors and teardown per Scheduler.

stuhood avatar Jan 11 '23 23:01 stuhood