mitogen icon indicating copy to clipboard operation
mitogen copied to clipboard

AssertionError: Docker container '...' contained extra running processes after test completed

Open s1113950 opened this issue 5 years ago • 4 comments

https://travis-ci.org/dw/mitogen/jobs/653211748?utm_medium=notification&utm_source=github_status

Sometimes tests fail with:

======================================================================
ERROR: tearDownClass (doas_test.DoasTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/travis/build/dw/mitogen/tests/testlib.py", line 534, in tearDownClass
    cls.dockerized_ssh.check_processes()
  File "/home/travis/build/dw/mitogen/tests/testlib.py", line 477, in check_processes
    counts
AssertionError: Docker container 'mitogen-test-a84b6c7898599fd6' contained extra running processes after test completed: {u'doas <defunct>': 1, u'ps': 1, u'sshd': 1}
======================================================================
ERROR: tearDownClass (sudo_test.NonEnglishPromptTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/travis/build/dw/mitogen/tests/testlib.py", line 534, in tearDownClass
    cls.dockerized_ssh.check_processes()
  File "/home/travis/build/dw/mitogen/tests/testlib.py", line 477, in check_processes
    counts
AssertionError: Docker container 'mitogen-test-b3754356ae2ff3e' contained extra running processes after test completed: {u'ps': 1, u'sshd': 1, u'sudo <defunct>': 1}
----------------------------------------------------------------------
Ran 670 tests in 103.324s
FAILED (errors=2, skipped=45)

s1113950 avatar Feb 20 '20 23:02 s1113950

discovered when testing https://github.com/dw/mitogen/pull/658 and https://github.com/dw/mitogen/pull/693

s1113950 avatar Feb 20 '20 23:02 s1113950

This test also sometimes has issues, maybe 4/10 test runs:

ERROR: tearDownClass (sudo_test.NonEnglishPromptTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/travis/build/dw/mitogen/tests/testlib.py", line 534, in tearDownClass
    cls.dockerized_ssh.check_processes()
  File "/home/travis/build/dw/mitogen/tests/testlib.py", line 477, in check_processes
    counts
AssertionError: Docker container 'mitogen-test-e82c96aa0f304f6a' contained extra running processes after test completed: {u'ps': 1, u'sshd': 1, u'sudo <defunct>': 1}

https://travis-ci.org/dw/mitogen/jobs/653673605?utm_medium=notification&utm_source=github_status example test fail

s1113950 avatar Feb 21 '20 23:02 s1113950

This still occurs intermittently on Azure Devops. While experimenting with Github Actions I'm seeing it reliably fail as follows, in py36-mode_mitogen-distro_centos7

test_unknown (utils_test.CastTest) ... ok
test_run_with_broker (utils_test.RunWithRouterTest) ... ok
test_with_broker (utils_test.WithRouterTest) ... ok
test_with_broker_preserves_attributes (utils_test.WithRouterTest) ... ok

======================================================================
ERROR: tearDownClass (ssh_test.BannerTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/runner/work/mitogen/mitogen/tests/testlib.py", line 654, in tearDownClass
    cls.dockerized_ssh.check_processes()
  File "/home/runner/work/mitogen/mitogen/tests/testlib.py", line 592, in check_processes
    processes,
AssertionError: Docker container 'mitogen-test-4fa107141314c30a' contained extra running processes after test completed: [('sshd', '/usr/sbin/sshd -D'), ('sshd', 'sshd: mitogen__has_sudo [priv]'), ('ps', 'ps -w -w -o ucomm= -o args=')]

----------------------------------------------------------------------
Ran 680 tests in 64.319s

FAILED (errors=1, skipped=66)

The following changes to testlib.py and ssh_test.py in isolation each made it go away, I haven't tested them in combination

@@ -609,6 +625,8 @@ def setUp(self):
        self.router = self.router_class(self.broker)

    def tearDown(self):
+       self.router.disconnect_all()
+       self.router.__exit__(None, None, None)
        del self.router
        super(RouterMixin, self).tearDown()

@@ -595,6 +614,7 @@ def tearDown(self):
            self.broker.shutdown()
        self.broker.join()
        del self.broker
+       time.sleep(1.0)
        super(BrokerMixin, self).tearDown()

    def sync_with_broker(self):
@@ -190,6 +190,7 @@ def test_verbose_enabled(self):
            self.dockerized_ssh.port,
        )
        self.assertEqual(name, context.name)
+       context.shutdown(wait=True)


class StubPermissionDeniedTest(StubSshMixin, testlib.TestCase):

My hypothesis is that the ssh connection is being left open by Mitogen, or it is relying on th GC to do cleanup, or we are closing the connection asynchronously and not waiting for it to complete.

moreati avatar Aug 23 '24 19:08 moreati

The following changes to testlib.py and ssh_test.py in isolation each made it go away, I haven't tested them in combination

Further testing revealed this is false. They might reduce occurence, but they haven't eliminated it.

moreati avatar Aug 26 '24 09:08 moreati

I'm starting to think this is a combination of

  1. Tests not closing their context with context.shutdown()
  2. cls.dockerized_ssh.check_processes() in teardownClass() is in a race condition with sshd cleaning up after the client disconnects.

My guesstimate: After the TCP socket is closed (by the client or the server) sshd will take some non-zero amount of time to exit or terminate the session processes. Since there is no communcation channel between the client and server, there is not way for the test suite to wait for the server processes to be gone.

If so, then the least worst option to address item 2 is waiting a short period before calling check_processes(), and or retrying a small number of times.

moreati avatar Sep 09 '24 12:09 moreati