mitogen icon indicating copy to clipboard operation
mitogen copied to clipboard

mitogen: first_stage: Break the while loop in case of EOF

Open mhartmay opened this issue 3 weeks ago • 23 comments

The current implementation can cause an infinite loop, leading to a process that hangs and consumes 100% CPU. This occurs because the EOF condition is not handled properly, resulting in repeated select(...) and read(...) calls.

The fix is to properly handle the EOF condition and break out of the loop when it occurs.

Fixes: https://github.com/mitogen-hq/mitogen/issues/1348

mhartmay avatar Dec 04 '25 16:12 mhartmay

Note: I did not test whether this actually fixes the problem since I did not reproduce the problem, but it's very likely as I had similar issues in the past.

mhartmay avatar Dec 04 '25 16:12 mhartmay

Question: Should we raise an error if not all data could be read?

Update: What speaks against raising an error is that it would be different to the "original code" behavior (before https://github.com/mitogen-hq/mitogen/pull/1307)

The original code reads:

        fp=os.fdopen(0,'rb')
        C=zlib.decompress(fp.read(PREAMBLE_COMPRESSED_LEN))
        fp.close()

fdopen returns an BufferedReader and the documentation for fdopen(..).read() says:

Fewer bytes may be returned than requested. [1]

So even with the old version it was possible that less than expected data was read and no exception was raised, so we should probably keep to that behavior for now.

[1] https://docs.python.org/3/library/io.html#io.BufferedIOBase.read

mhartmay avatar Dec 04 '25 17:12 mhartmay

Question: Should we raise an error if not all data could be read?

The call to zlib.decompress() raises an exception if the data is truncated.

moreati avatar Dec 05 '25 13:12 moreati

@moreati I've added timeout handling, but I'm not sure which kind of exception should be raised in case of a timeout (currently, I've Exception('TODO'). And I would rather use a large timeout as we do not know how fast the systems are.

Edit: Maybe even make the timeout value configurable.

mhartmay avatar Dec 05 '25 15:12 mhartmay

@moreati I've added timeout handling, but I'm not sure which kind of exception should be raised in case of a timeout (currently, I've Exception('TODO'). And I would rather use a large timeout as we do not know how fast the systems are.

Edit: Maybe even make the timeout value configurable.

I've added some preliminary steps to make it configurable and added a test for the timeout situation.

mhartmay avatar Dec 08 '25 08:12 mhartmay

By all means continue experimenting if you wish to, but be aware I'm unlikely to accept many new features into this chunk of code - particularly if they increase the size. My preference is to keep the first stage as minimal is possible.

To measure the size (after encoding and compression) use https://github.com/mitogen-hq/mitogen/blob/master/preamble_size.py.

I'm currently doing my experiments to pin down how {fobj,os}.read() behave in different circumstancess and across supported Python versions.

moreati avatar Dec 08 '25 09:12 moreati

Yep, right now I'm experimenting, especially in regard with the tests. Because it would be great if we have minimal tests for blocking and non-blocking STDIO. So the timeout test should be useful in all cases even for your SIGALRM solution :)

mhartmay avatar Dec 08 '25 09:12 mhartmay

By all means continue experimenting if you wish to, but be aware I'm unlikely to accept many new features into this chunk of code - particularly if they increase the size. My preference is to keep the first stage as minimal is possible.

Fine with me :) Reason why I'm thinking about making the timeout configurable is that the halting problem is undecidable and the timeout depends on the system/environment.

To measure the size (after encoding and compression) use https://github.com/mitogen-hq/mitogen/blob/master/preamble_size.py.

Will add the diff to all commits then :) Do we care more about the SSH command size? preamble? Or mitogen.parent? Or all of them?

I'm currently doing my experiments to pin down how {fobj,os}.read() behave in different circumstancess and across supported Python versions.

Good!

mhartmay avatar Dec 08 '25 09:12 mhartmay

Will add the diff to all commits then :) Do we care more about the SSH command size? preamble? Or mitogen.parent? Or all of them?

One before and after for a complete PR will be fine. SSH command size is what I minimize.

moreati avatar Dec 08 '25 09:12 moreati

@moreati I've now added two test cases, one that causes an timeout and the other that causes the EOF situation similar to the original bug.

mhartmay avatar Dec 08 '25 10:12 mhartmay

Based on https://github.com/moreati/nonblock_lab/commit/c4938b9e927a87b0a91b6ffddc4197b6bdc77fcf

  • other end of a fifo/pipe is closed (never opened)
    • Python 2.x and Pytrhon 3.x <x>.read() both return empty byte string
  • other end is open, but empty
    • Python 2.x <x>.read() raises an EAGAIN/EWOULDBLOCK derived exception
    • Python 3.x fobj.read() returns None
    • Python 2.x os.read() raises BlockingIOError

moreati avatar Dec 08 '25 12:12 moreati

@moreati I've now added one test case that runs the full stage code (tests.first_stage_test.CommandLineTest.test_stage).

I'm going to add multiple other test cases, e.g. stdin is a blocking FD, stdin is non-blocking, different pipe sizes, and maybe even passing a FD reading from a file.

e.g.

with open("input.file", "rb") as f: 
    proc = subprocess.Popen(
        ...
        stdin=f,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    out, err = proc.communicate()

Do you have other ideas? Not sure if the file case if even needed...

Edit: To me it seems like a good idea to add more test cases to first_stage_test.py as we have everything under control that we want to test and it's easier to debug than a larger stack.

mhartmay avatar Dec 08 '25 12:12 mhartmay

@moreati Is there a difference to run test first_stage_test via ./run_tests.sh or manually via Python unittest executor?

That's the command line I'm using during development:

$ PYTHONPATH=$(pwd)/tests:$PYTHONPATH python -m unittest -v tests.first_stage_test.CommandLineTest
test_eof_too_early (tests.first_stage_test.CommandLineTest.test_eof_too_early)
The boot command should write an ECO marker to stdout, read the ... ok
test_stage (tests.first_stage_test.CommandLineTest.test_stage)
Test that first stage works ... ok
test_stdin_blocking (tests.first_stage_test.CommandLineTest.test_stdin_blocking)
Test that first stage works with blocking STDIN ... ok
test_stdin_non_blocking (tests.first_stage_test.CommandLineTest.test_stdin_non_blocking)
Test that first stage works with non-blocking STDIN ... ok
test_timeout_error (tests.first_stage_test.CommandLineTest.test_timeout_error)
The boot command should write an ECO marker to stdout, read the ... ok
test_valid_syntax (tests.first_stage_test.CommandLineTest.test_valid_syntax)
Test valid syntax ... ok

----------------------------------------------------------------------
Ran 6 tests in 0.789s

OK

mhartmay avatar Dec 08 '25 16:12 mhartmay

@moreati Is there a difference to run test first_stage_test via ./run_tests.sh or manually via Python unittest executor?

Okay, I found one difference:

MITOGEN_LOG_LEVEL=debug is enabled, that results in a different output. Will change the test accordingly.

mhartmay avatar Dec 09 '25 09:12 mhartmay

Is there a difference to run test first_stage_test via ./run_tests.sh or manually via Python unittest executor?

Almost certainly yes, but I don't remember exactly how. Probably something to do with what $CWD is set to, how Pythons sys.path is constructed and whther test modules are imported as foo_test or tests.foo_test. I have resigned myself to letting GH actions run the test suite, until I can overhaul it.

moreati avatar Dec 09 '25 09:12 moreati

@moreati Two questions:

  1. Is it okay, if I pick your commit as fix for the timeout situation in this PR (or how should we proceed)?
  2. Can we revert the .ci/mitogen_tests.py changes from commit [1] as we now have a minimal test cases that provides coverage of the non-blocking FD case?

IMO, as soon as the test cases are stable and the small coding style nits are being resolved, it might be a good idea to merge the tests, together with the EOF file fix (but without my timeout fix/changes). What do you think?

[1] https://github.com/mitogen-hq/mitogen/commit/85d6046f2fbf4e80c472f7a06596bf8ad1fcd9e6

mhartmay avatar Dec 09 '25 12:12 mhartmay

I'm still getting my head around things. So I'm not sure how to proceed yet

  1. Currently I don't think my SIGALRMattempt is the right fix, particularly given feedback from rda0 that it left idle (orphaned?) processes in https://github.com/mitogen-hq/mitogen/issues/1348#issuecomment-3501719660
  2. Possibly, but at the moment it's providing coverage in the Ansible tests as well and thats where the original error report came from.

I think your PR, or something evolved from it, is the best direction - particularly your tests.

moreati avatar Dec 09 '25 12:12 moreati

@moreati I couldn’t come up with a non-racy solution, so I implemented custom dummy connectors that use blocking and non-blocking I/O for STDIN/STDOUT.

There is still one unstable test left:

FAIL: test_timeout_error (first_stage_test.CommandLineTest.test_timeout_error)
The boot command should write an ECO marker to stdout, read the
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/runner/work/mitogen/mitogen/tests/first_stage_test.py", line 249, in test_timeout_error
    self.assertIn(
    ~~~~~~~~~~~~~^
        b("TimeoutError"),
        ^^^^^^^^^^^^^^^^^^
        stderr,
        ^^^^^^^
    )
    ^
AssertionError: b'TimeoutError' not found in b'Traceback (most recent call last):\n  File "<string>", line 1, in <module>\n    import sys;sys.path=[p for p in sys.path if p];import binascii,os,select,zlib;exec(zlib.decompress(binascii.a2b_base64(sys.argv[1]),-15))\n                                                                                  ~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File "<string>", line 27, in <module>\nzlib.error: Error -5 while decompressing data: incomplete or truncated stream\n'

For 'some reasons' I do not get the error -5. I can change the expectation, but I would like to be able to understand and reproduce the issue on my local system.

mhartmay avatar Dec 09 '25 15:12 mhartmay

I've begun trying variations of your tests in https://github.com/mitogen-hq/mitogen/pull/1392, I'm finishing for the day now but tomorrow or later this week would you like to do a call (e.g. Jitsi, Zoom, Teams, ...) to compare notes and look at failing tests?

moreati avatar Dec 09 '25 16:12 moreati

I've begun trying variations of your tests in #1392, I'm finishing for the day now but tomorrow or later this week would you like to do a call (e.g. Jitsi, Zoom, Teams, ...) to compare notes and look at failing tests?

Sure, just send me an email.

I've added more tests and improved the tests, so most scenarios should now be covered.

mhartmay avatar Dec 10 '25 08:12 mhartmay

@moreati I've polished the commits/tests a little so they should be stable on all environments we support (I did not test with Windows Subsystem for Linux (WSL)).

The whole series results in the following size changes:

-SSH command size: 822
+SSH command size: 925

                         Original           Minimized           Compressed
-mitogen.parent        98746  96.4KiB  51215 50.0KiB 51.9%  12922 12.6KiB 13.1%
+mitogen.parent        99311  97.0KiB  51441 50.2KiB 51.8%  13009 12.7KiB 13.1%

Update: Can you please push your commit https://github.com/mitogen-hq/mitogen/pull/1392/commits/733f4bca81d14da8af70cef61da74f3cab813503 as it makes sense anyway. When it's pushed I can add some comments to the first stage.

Covered test scenarios:

  • Closed STDIN
  • Closed STDOUT
  • Closed STDERR
  • Timeout while waiting for data on STDIN
  • premature EOF on STDIN (reason for https://github.com/mitogen-hq/mitogen/issues/1348)

mhartmay avatar Dec 10 '25 11:12 mhartmay

Can you please push your commit 733f4bc as it makes sense anyway. When it's pushed I can add some comments to the first stage.

Done in #1393

moreati avatar Dec 10 '25 15:12 moreati

Thanks for all this. When you're ready I'll do a full review.

moreati avatar Dec 10 '25 15:12 moreati

Thanks for all this. When you're ready I'll do a full review.

I've reworked some parts, e.g. dropped the timeout handling in the first_stage. Instead we rely on the broker connection timeout handling.

I’m curious why that timeout handling approach by the broker didn’t work in the case of the bug report. Could it be that the reporters didn’t wait long enough for the connect timeout, or that the child process wasn’t actually killable? I think we should revisit this.

Update: When I'm not mistaken, then in the original issue https://github.com/mitogen-hq/mitogen/issues/1348 does exactly what I've thought about, they escalate the privileges by using passwordless sudo to become root. So the broker (running as non-root) is unable to kill the child in case of a timeout. Does that make any sense?

mhartmay avatar Dec 15 '25 12:12 mhartmay

Note: I kept the commit https://github.com/mitogen-hq/mitogen/pull/1389/commits/a28d32d86db32ac43dfac271cc3ee1069a1b205d only to highlight the differences between my previous version and the new one.

mhartmay avatar Dec 15 '25 13:12 mhartmay