mitogen: first_stage: Break the while loop in case of EOF
The current implementation can cause an infinite loop, leading to a process that hangs and consumes 100% CPU. This occurs because the EOF condition is not handled properly, resulting in repeated select(...) and read(...) calls.
The fix is to properly handle the EOF condition and break out of the loop when it occurs.
Fixes: https://github.com/mitogen-hq/mitogen/issues/1348
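For illustration only, a minimal sketch of the loop shape the fix aims for (read_exactly and the exact structure are illustrative, not the actual first-stage source): an empty os.read() result signals EOF, so the loop must break instead of calling select()/read() again forever.

import os
import select

def read_exactly(fd, count):
    buf = b''
    while len(buf) < count:
        # Wait until the descriptor is readable, then pull the remainder.
        select.select([fd], [], [])
        chunk = os.read(fd, count - len(buf))
        if not chunk:
            # Empty read means EOF: break instead of spinning at 100% CPU.
            break
        buf += chunk
    return buf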
Note: I did not verify that this actually fixes the problem, since I could not reproduce it, but it very likely does, as I have run into similar issues in the past.
Question: Should we raise an error if not all data could be read?
Update: What speaks against raising an error is that it would differ from the "original code" behavior (before https://github.com/mitogen-hq/mitogen/pull/1307).
The original code reads:
fp=os.fdopen(0,'rb')
C=zlib.decompress(fp.read(PREAMBLE_COMPRESSED_LEN))
fp.close()
fdopen returns a BufferedReader, and the documentation for fdopen(..).read() says:
Fewer bytes may be returned than requested. [1]
So even with the old version it was possible that less data than expected was read without an exception being raised, so we should probably keep that behavior for now.
[1] https://docs.python.org/3/library/io.html#io.BufferedIOBase.read
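To illustrate the point (my own example, assuming a pipe whose writer closes early; this is not mitogen code), a buffered read(10) returns only the bytes that were actually written, without raising:

import os

r, w = os.pipe()
os.write(w, b'abc')
os.close(w)                 # writer goes away: premature EOF
fp = os.fdopen(r, 'rb')
data = fp.read(10)          # asks for 10 bytes ...
fp.close()
print(repr(data))           # ... but gets b'abc', and no exception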
Question: Should we raise an error if not all data could be read?
The call to zlib.decompress() raises an exception if the data is truncated.
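A quick illustration of that behaviour (the payload here is made up; mitogen's first stage passes wbits=-15, which is omitted for brevity):

import zlib

payload = zlib.compress(b'x' * 1000)
try:
    zlib.decompress(payload[:10])   # hand it a truncated stream
except zlib.error as exc:
    print(exc)                      # "Error -5 ... incomplete or truncated stream"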
@moreati I've added timeout handling, but I'm not sure which kind of exception should be raised in case of a timeout (currently it's Exception('TODO')). And I would rather use a large timeout, as we do not know how fast the systems are.
Edit: Maybe even make the timeout value configurable.
I've added some preliminary steps to make it configurable and added a test for the timeout situation.
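For reference, a rough sketch of what such timeout handling could look like (read_with_timeout, the default of 10 seconds, and the plain Exception are placeholders mirroring the open questions above, not the actual patch):

import os
import select

def read_with_timeout(fd, count, timeout=10.0):
    buf = b''
    while len(buf) < count:
        rlist, _, _ = select.select([fd], [], [], timeout)
        if not rlist:
            raise Exception('TODO: timed out waiting for data')
        chunk = os.read(fd, count - len(buf))
        if not chunk:
            break               # EOF: stop instead of looping forever
        buf += chunk
    return buf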
By all means continue experimenting if you wish to, but be aware I'm unlikely to accept many new features into this chunk of code - particularly if they increase the size. My preference is to keep the first stage as minimal as possible.
To measure the size (after encoding and compression) use https://github.com/mitogen-hq/mitogen/blob/master/preamble_size.py.
I'm currently doing my experiments to pin down how {fobj,os}.read() behave in different circumstances and across supported Python versions.
Yep, right now I'm experimenting, especially with regard to the tests, because it would be great to have minimal tests for blocking and non-blocking STDIO. So the timeout test should be useful in all cases, even for your SIGALRM solution :)
By all means continue experimenting if you wish to, but be aware I'm unlikely to accept many new features into this chunk of code - particularly if they increase the size. My preference is to keep the first stage as minimal as possible.
Fine with me :) The reason I'm thinking about making the timeout configurable is that the halting problem is undecidable and the right timeout depends on the system/environment.
To measure the size (after encoding and compression) use https://github.com/mitogen-hq/mitogen/blob/master/preamble_size.py.
Will add the diff to all commits then :) Do we care more about the SSH command size? preamble? Or mitogen.parent? Or all of them?
I'm currently doing my experiments to pin down how {fobj,os}.read() behave in different circumstances and across supported Python versions.
Good!
Will add the diff to all commits then :) Do we care more about the SSH command size? preamble? Or mitogen.parent? Or all of them?
One before and after for a complete PR will be fine. SSH command size is what I minimize.
@moreati I've now added two test cases, one that causes a timeout and the other that causes the EOF situation, similar to the original bug.
Based on https://github.com/moreati/nonblock_lab/commit/c4938b9e927a87b0a91b6ffddc4197b6bdc77fcf
- other end of a fifo/pipe is closed (never opened)
  - Python 2.x and Python 3.x: <x>.read() both return an empty byte string
- other end is open, but empty
  - Python 2.x: <x>.read() raises an EAGAIN/EWOULDBLOCK derived exception
  - Python 3.x: fobj.read() returns None; os.read() raises BlockingIOError
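A small probe along those lines (my own sketch, not the nonblock_lab code) shows both cases on Python 3.x:

import fcntl
import os

r, w = os.pipe()
flags = fcntl.fcntl(r, fcntl.F_GETFL)
fcntl.fcntl(r, fcntl.F_SETFL, flags | os.O_NONBLOCK)

try:
    os.read(r, 1)               # other end open, but no data yet
except BlockingIOError as exc:
    print('raised:', exc)       # EAGAIN/EWOULDBLOCK

os.close(w)                     # now close the other end entirely
print(repr(os.read(r, 1)))      # b'' -> EOF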
@moreati
I've now added one test case that runs the full stage code (tests.first_stage_test.CommandLineTest.test_stage).
I'm going to add multiple other test cases, e.g. stdin is a blocking FD, stdin is non-blocking, different pipe sizes, and maybe even passing a FD reading from a file.
e.g.
with open("input.file", "rb") as f:
proc = subprocess.Popen(
...
stdin=f,
stdout=subprocess.PIPE,
stderr=subprocess.PIPE,
)
out, err = proc.communicate()
Do you have other ideas? Not sure if the file case is even needed...
Edit: To me it seems like a good idea to add more test cases to first_stage_test.py, as there we control everything that we want to test and it's easier to debug than a larger stack.
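For example, a self-contained sketch of the premature-EOF case (the child program and names here are hypothetical, not the actual first_stage_test code): the child asks for more stdin than it will ever get, the test closes stdin early, and we assert that the child still terminates.

import subprocess
import sys
import unittest

class EarlyEOFTest(unittest.TestCase):
    def test_child_exits_on_early_eof(self):
        # Child tries to read up to 1024 bytes, then reports it finished.
        child_src = 'import sys; sys.stdin.buffer.read(1024); print("done")'
        proc = subprocess.Popen(
            [sys.executable, '-c', child_src],
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
        )
        # communicate() writes a short payload, closes stdin (EOF), and
        # waits; the timeout guards against the child hanging forever.
        out, _ = proc.communicate(input=b'only a few bytes', timeout=10)
        self.assertEqual(out.strip(), b'done')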
@moreati
Is there a difference between running first_stage_test via ./run_tests.sh and running it manually via the Python unittest runner?
That's the command line I'm using during development:
$ PYTHONPATH=$(pwd)/tests:$PYTHONPATH python -m unittest -v tests.first_stage_test.CommandLineTest
test_eof_too_early (tests.first_stage_test.CommandLineTest.test_eof_too_early)
The boot command should write an ECO marker to stdout, read the ... ok
test_stage (tests.first_stage_test.CommandLineTest.test_stage)
Test that first stage works ... ok
test_stdin_blocking (tests.first_stage_test.CommandLineTest.test_stdin_blocking)
Test that first stage works with blocking STDIN ... ok
test_stdin_non_blocking (tests.first_stage_test.CommandLineTest.test_stdin_non_blocking)
Test that first stage works with non-blocking STDIN ... ok
test_timeout_error (tests.first_stage_test.CommandLineTest.test_timeout_error)
The boot command should write an ECO marker to stdout, read the ... ok
test_valid_syntax (tests.first_stage_test.CommandLineTest.test_valid_syntax)
Test valid syntax ... ok
----------------------------------------------------------------------
Ran 6 tests in 0.789s
OK
Okay, I found one difference:
MITOGEN_LOG_LEVEL=debug is enabled, which results in different output. Will change the test accordingly.
Is there a difference between running first_stage_test via ./run_tests.sh and running it manually via the Python unittest runner?
Almost certainly yes, but I don't remember exactly how. Probably something to do with what $CWD is set to, how Python's sys.path is constructed, and whether test modules are imported as foo_test or tests.foo_test. I have resigned myself to letting GH actions run the test suite, until I can overhaul it.
@moreati Two questions:
- Is it okay if I pick your commit as the fix for the timeout situation in this PR (or how should we proceed)?
- Can we revert the .ci/mitogen_tests.py changes from commit [1], as we now have a minimal test case that provides coverage of the non-blocking FD case?
IMO, as soon as the test cases are stable and the small coding style nits are resolved, it might be a good idea to merge the tests together with the EOF fix (but without my timeout fix/changes). What do you think?
[1] https://github.com/mitogen-hq/mitogen/commit/85d6046f2fbf4e80c472f7a06596bf8ad1fcd9e6
I'm still getting my head around things, so I'm not sure how to proceed yet.
- Currently I don't think my SIGALRM attempt is the right fix, particularly given feedback from rda0 that it left idle (orphaned?) processes in https://github.com/mitogen-hq/mitogen/issues/1348#issuecomment-3501719660
- Possibly, but at the moment it's providing coverage in the Ansible tests as well, and that's where the original error report came from.
I think your PR, or something evolved from it, is the best direction - particularly your tests.
@moreati I couldn’t come up with a non-racy solution, so I implemented custom dummy connectors that use blocking and non-blocking I/O for STDIN/STDOUT.
There is still one unstable test left:
FAIL: test_timeout_error (first_stage_test.CommandLineTest.test_timeout_error)
The boot command should write an ECO marker to stdout, read the
----------------------------------------------------------------------
Traceback (most recent call last):
File "/Users/runner/work/mitogen/mitogen/tests/first_stage_test.py", line 249, in test_timeout_error
self.assertIn(
~~~~~~~~~~~~~^
b("TimeoutError"),
^^^^^^^^^^^^^^^^^^
stderr,
^^^^^^^
)
^
AssertionError: b'TimeoutError' not found in b'Traceback (most recent call last):\n File "<string>", line 1, in <module>\n import sys;sys.path=[p for p in sys.path if p];import binascii,os,select,zlib;exec(zlib.decompress(binascii.a2b_base64(sys.argv[1]),-15))\n ~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File "<string>", line 27, in <module>\nzlib.error: Error -5 while decompressing data: incomplete or truncated stream\n'
For 'some reason' I do not get the error -5 on my local system. I can change the expectation, but I would first like to understand and reproduce the issue locally.
I've begun trying variations of your tests in https://github.com/mitogen-hq/mitogen/pull/1392, I'm finishing for the day now but tomorrow or later this week would you like to do a call (e.g. Jitsi, Zoom, Teams, ...) to compare notes and look at failing tests?
Sure, just send me an email.
I've added more tests and improved the existing ones, so most scenarios should now be covered.
@moreati I've polished the commits/tests a little so they should be stable on all environments we support (I did not test with Windows Subsystem for Linux (WSL)).
The whole series results in the following size changes:
-SSH command size: 822
+SSH command size: 925
Original Minimized Compressed
-mitogen.parent 98746 96.4KiB 51215 50.0KiB 51.9% 12922 12.6KiB 13.1%
+mitogen.parent 99311 97.0KiB 51441 50.2KiB 51.8% 13009 12.7KiB 13.1%
Update: Can you please push your commit https://github.com/mitogen-hq/mitogen/pull/1392/commits/733f4bca81d14da8af70cef61da74f3cab813503 as it makes sense anyway. When it's pushed I can add some comments to the first stage.
Covered test scenarios:
- Closed STDIN
- Closed STDOUT
- Closed STDERR
- Timeout while waiting for data on STDIN
- Premature EOF on STDIN (reason for https://github.com/mitogen-hq/mitogen/issues/1348)
Can you please push your commit 733f4bc as it makes sense anyway. When it's pushed I can add some comments to the first stage.
Done in #1393
Thanks for all this. When you're ready I'll do a full review.
I've reworked some parts, e.g. dropped the timeout handling in the first_stage. Instead, we rely on the broker's connection timeout handling.
I’m curious why that timeout handling approach by the broker didn’t work in the case of the bug report. Could it be that the reporters didn’t wait long enough for the connect timeout, or that the child process wasn’t actually killable? I think we should revisit this.
Update: If I'm not mistaken, the original issue https://github.com/mitogen-hq/mitogen/issues/1348 does exactly what I was thinking of: they escalate privileges using passwordless sudo to become root. So the broker (running as non-root) is unable to kill the child in case of a timeout. Does that make sense?
Note: I kept the commit https://github.com/mitogen-hq/mitogen/pull/1389/commits/a28d32d86db32ac43dfac271cc3ee1069a1b205d only to highlight the differences between my previous version and the new one.