cockpit kill subprocess when closing channel

Feb 14 '23 07:02 martinpitt

I feel like subprocess might be a better target for what you're trying to bring back, since in the default case, that is a socket...

That being said, I have this wild idea to replace the subprocess transport with the default standard library socket transport in the non-pty case...

But I keep remembering: the reason we brought all of this transport insanity in the first place was for two reasons:

stdio (which we used to have our ugly adapter class for)
subprocesses with ptys (which jelle tried to get working with the standaed library, to no avail)

Feb 14 '23 09:02 allisonkarlitskaya

I ported the tests to SubprocessTransport. I disabled the last two, as they hang forever → closing a SubprocessTransport does not seem to end the reading loop. This may just be me not understanding this fully and porting the tests wrongly, or it's an actual bug. Keeping as draft for that reason.

So far coverage is up from 92% to 94%.

Feb 14 '23 13:02 martinpitt

Closing a channel not actually killing the process sounds very much like what #18292 is trying to address. So I propose I disable these two extra unit tests for the time being, and look at them again after #18292 lands?

Feb 15 '23 06:02 martinpitt

lis | pitti: ya, so tl;dr: i think we should follow the stdlib behaviour here. it sends sigkill on close() lis | pitti: https://github.com/python/cpython/blob/main/Lib/asyncio/base_subprocess.py#L96

Feb 15 '23 09:02 martinpitt

I re-enabled the tests, with an explicit .kill() and a # FIXME comment. I'll look into actually implementing the kill-on-close as a next step.

Feb 15 '23 11:02 martinpitt

I now added a second commit which does the "kill-on-close" thing like asyncio does. Now the unit tests are "clean" again. @allisonkarlitskaya , PTAL?

Feb 15 '23 12:02 martinpitt

@allisonkarlitskaya Oh noes, this is no good -- the PermissionError in testSELinuxRestrictedUser could be ignored, but the other two child process pid 6300 exit status already read: will report returncode 255 sound more serious, it's stepping on its own toes. /me looks for suggestions

Feb 15 '23 12:02 martinpitt

@allisonkarlitskaya Oh noes, this is no good -- the PermissionError in testSELinuxRestrictedUser could be ignored, but the other two child process pid 6300 exit status already read: will report returncode 255 sound more serious, it's stepping on its own toes. /me looks for suggestions

This is the same issue we had before and it's because we're calling .kill() which is activating the code inside of subprocess which conflicts with our pidfd.

Feb 15 '23 13:02 allisonkarlitskaya

Initially I thought it would be the .poll() that would cause that, but indeed it could be the .kill() -- I faintly remember replacing a similar situation with some os.signal(KILL) to avoid messing with the child handler. I'll try that! (Just a drive-by comment between running and dinner, just wanted to capture the thought to my future self)

Feb 15 '23 16:02 martinpitt

Et voilá, green!

Feb 15 '23 20:02 martinpitt

lis | i think that the following needs to happen:
lis | 1) we need to add your code to kill the subprocess to the transport class
lis | 2) we need to remove the similar code from the subprocess stream channel
lis | 3) we need to add generic code to ProtocolChannel to handle close messages by shutting down the transport

Feb 16 '23 10:02 martinpitt

So, without the second commit and without the hack this still does not work.

lis | 1) we need to add your code to kill the subprocess to the transport class

There's SubprocessTransport._close() now which does that. However, it's currently only called in __del__ (which does not usually apply, at least not at the right time), and in abort(). The latter actually could be called by the _Transport.close() parent class, but it is skipped because of the if self._queue is not None: return. This says "abort() will be called from _write_ready() when it's done".

lis | 2) we need to remove the similar code from the subprocess stream channel lis | 3) we need to add generic code to ProtocolChannel to handle close messages by shutting down the transport

These both happened in commit b655ab8bfae0 . Hence this still isn't sufficient.

The tests work if I disable the if self._queue is not None: return short-circuit in _Transport.close(), but I suppose that's not intended. The comment says "abort() will be called from _write_ready() when it's done". _write_read() is called, but dutifully writes the buffer to its out_fd (219264 bytes in my current run), which works as the cat process is still running. So the two abort() cases in _write_ready() never hit.

But this comment/assumption seems a bit dubious to me -- when the channel gets closed, the protocol certainly shouldn't go on writing into the socket as if nothing has happened -- we want to close the channel now.

So with my naïve eyes, this would actually make more sense:

--- src/cockpit/transports.py
+++ src/cockpit/transports.py
@@ -227,11 +227,6 @@ class _Transport(asyncio.Transport):
 
         self._closing = True
         self._close_reader()
-
-        if self._queue is not None:
-            # abort() will be called from _write_ready() when it's done
-            return
-
         self.abort()
 
     def get_protocol(self) -> asyncio.BaseProtocol:

That fixes test_write_backlog_close(), but still breaks test_write_backlog_eof_and_close(). I think the latter needs to be reorganized after commit b655ab8bfae0.

@allisonkarlitskaya At this point I think your mental model of how this all should work is much better than mine. Could you please take over here?

Feb 18 '23 10:02 martinpitt

So the issue:

close() waits until the output drains before closing the transport. This matches Python stdlib behaviour, and this is indeed the very thing we're trying to test in these testcases. If you want "close right now", that's abort().
on the other hand, close() closes down the reader end immediately
cat blocks in writing because we're not reading, which means that it can't read anymore data, which means that our output queue also backs up. Again, that's exactly what we're trying to test.

This wasn't an issue before because closing the write end was an independent operation from closing the reader. Now that we call .close() on the subprocess, and the reading stops as a side-effect, we get a deadlock.

I think the next step should be to check what the standard library does this this situation.

Feb 20 '23 17:02 allisonkarlitskaya

I'm still confused -- when the wire protocol (JSON) closes a channel, I understood this as "zap it now". The "blocks until writing/draining the output queue" feels more like the "done" control message to me? Or should "close" in the protocol map to calling .abort(), not .do_close()?

Feb 20 '23 17:02 martinpitt

.done() is more like .write_eof(). It's a half-close.

We could argue about if we map the "close" control message to .close() or .abort(). Considering the fact that we just adjusted things to make sure we run tasks to completion in D-Bus on close, it would feel like a step backwards to me to introduce the "kill it in the middle" non-determinism here.

That being said: the research result

The standard library envisions exactly this sort of problem and includes a warning about it:

Warning: Use the communicate() method rather than process.stdin.write(), await process.stdout.read() or await process.stderr.read(). This avoids deadlocks due to streams pausing reading or writing and blocking the child process.

It's all irrelevant anyway, because on shutdown, the subprocess transport in the standard library closes all of the streams (via .close(), not .abort()) and then immediately sends SIGKILL to the subprocess. The underlying stdin stream would like to drain before finishing being closed, but it don't get the chance to — it gets POLLERR and gets closed immediately, dropping the buffer on the floor. The distinction here: if someone had passed stdin elsewhere, it would still be able to finish reading.

I think the solution to our problem then looks something like continuing to drain the queue, but in the case of a subprocess, doing the kill immediately rather than waiting until after stdin finished draining.

Unfortunately, that makes the subprocess transport no longer entirely suitable for what this testcase is trying to test... which is making sure that the queue gets a chance to be properly drained on shutdown.

Feb 20 '23 20:02 allisonkarlitskaya

I rebased that to current main, to pick up the "major peer initialization overhaul" and resolve the conflict. The situation is unchanged, though -- the first commit adds a "FIXME" hack, which the second one drops and replaces with "Kill subprocess when closing channel". Lis, you mentioned a different approach above, but I feel this is more like in your league.

Mar 17 '23 11:03 martinpitt

I am reviving this. Having these unit tests is good, and having more "fallbacks" for cleaning up spawned processes may help with actual bugs like #18976.

Jun 21 '23 05:06 martinpitt

TestNetworkingSettings.testOtherSettings fails quite often on ubuntu-stable. Need to check if it's really unrelated (it's not a spawn channel, but D-Bus)

Update: It was a flake, retry worked. Added to my todo list.

Jun 21 '23 05:06 martinpitt

@allisonkarlitskaya : Gentle review ping? If you have any doubts about the second commit, I'm happy to drop it, but I'd like to at least get the first one in. (You asked me to re-add unit tests for the transports, after all :grin: )

Jun 27 '23 04:06 martinpitt

Please see the suggested change. I'd probably opt for squashing everything together and landing it as one commit.

Jul 17 '23 14:07 allisonkarlitskaya

Excellent, thanks for cleaning this up! I'm fine with just squashing this, easier all around. Can you formally approve?

Jul 17 '23 15:07 martinpitt

Hmm, this failure happened twice now, this feels more systematic.

asyncio-ERROR: Task was destroyed but it is pending!

Jul 17 '23 15:07 martinpitt

Hmm, this failure happened twice now, this feels more systematic.
asyncio-ERROR: Task was destroyed but it is pending!

My guess: incorrect handling of BrokenPipeError in one of our subprocesses?

I guess we'd avoid that if we just killed the thing off :)

Jul 17 '23 16:07 allisonkarlitskaya

Rebased to see where things stand one year later with your fixup.

Apr 16 '24 08:04 martinpitt

So the "different approach" in commit 3dda7c4995f552451add3162ffdf1951ba781589 still causes lots of red (see statuses there). They all look like this:

asyncio-ERROR: Exception in callback None()
handle: <Handle cancelled>
Traceback (most recent call last):
File "/usr/lib/python3.11/asyncio/events.py", line 84, in _run
self._context.run(self._callback, *self._args)
File "/usr/lib/python3.11/site-packages/cockpit/transports.py", line 114, in _read_ready
keep_open = self._protocol.eof_received()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/site-packages/cockpit/router.py", line 234, in eof_received
endpoint.do_close()
File "/usr/lib/python3.11/site-packages/cockpit/channels/filesystem.py", line 317, in do_close
self._watch.close()
^^^^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'close'

Apr 16 '24 08:04 martinpitt

This still feels like an important bug fix to me, but then again it doesn't seem to hurt much in practice. "Pick your battles"

Nov 30 '24 03:11 martinpitt