
FAILED test_channels.py::test_tcpflushout[L-1] - OSError: [Errno 98] Address already in use

vt-alt opened this issue 2 months ago • 8 comments

Some tests are not robust, intermittently causing failures:

=========================== short test summary info ============================
FAILED test_channels.py::test_tcpflushout[L-1] - OSError: [Errno 98] Address ...
FAILED test_channels.py::test_tcpflushout[L-4000] - OSError: [Errno 98] Addre...
FAILED test_channels.py::test_tcpflushout[L-40000] - OSError: [Errno 98] Addr...
FAILED test_channels.py::test_tcpflushout[R-1] - ConnectionRefusedError: [Err...
FAILED test_channels.py::test_tcpflushout[R-4000] - OSError: [Errno 98] Addre...
FAILED test_channels.py::test_tcpflushout[R-40000] - OSError: [Errno 98] Addr...
======== 6 failed, 42 passed, 3 skipped, 1 xfailed, 19 rerun in 37.93s =========

Build and test logs glibc https://git.altlinux.org/beehive/logs/Sisyphus/x86_64/archive/2025/1012/error/dropbear-2025.88-alt1 musl https://git.altlinux.org/beehive/logs/Sisyphus/x86_64/archive/2025/1012/error/dropbear-musl-2025.88-alt1

There may be a high load average while the tests run.

vt-alt avatar Oct 12 '25 19:10 vt-alt

"Address In Use" suggests multiple tests might be running at once. Could that be happening in your setup? Currently it uses a fixed port 7788. I guess it could be randomised, but there would still be some small chance of collisions.

mkj avatar Oct 14 '25 02:10 mkj

Looking a bit more, it might be a change in python 3.13. Could you try changing

https://github.com/mkj/dropbear/blob/003c5fcaabc114430d5d14142e95ffdbbd2d19b6/test/test_dropbear.py#L78-L79 to

    allow_reuse_address = True
    allow_reuse_port = True

(I'm not sure if allow_reuse_addr was correct previously)
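For context, these are class attributes of `socketserver.TCPServer`; a minimal standalone example of what the suggested change does (an illustration, not the dropbear test code):

```python
import socket
import socketserver
import threading

class ReusableTCPServer(socketserver.TCPServer):
    # Sets SO_REUSEADDR so a restarted server can rebind while the old
    # socket is still lingering in TIME_WAIT.
    allow_reuse_address = True
    # allow_reuse_port (SO_REUSEPORT) is honoured from Python 3.11 onward;
    # on older versions this extra class attribute is simply ignored.
    allow_reuse_port = True

class EchoHandler(socketserver.BaseRequestHandler):
    def handle(self):
        # Echo one chunk back to the client.
        self.request.sendall(self.request.recv(1024))

if __name__ == "__main__":
    srv = ReusableTCPServer(("127.0.0.1", 0), EchoHandler)
    port = srv.server_address[1]
    threading.Thread(target=srv.serve_forever, daemon=True).start()
    with socket.create_connection(("127.0.0.1", port)) as c:
        c.sendall(b"ping")
        print(c.recv(1024))  # b'ping'
    srv.shutdown()
    srv.server_close()
```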

mkj avatar Oct 14 '25 02:10 mkj

About parallel runs: we don't change anything to make the tests run in parallel, and other builds on the build farm are network-namespaced, so they cannot interfere. I can add the change, but it will be hard to test quickly, since the failures happen only 1-2 times per week (we run a mass rebuild once per day). So we'll have to wait for the conditions to reproduce. Thanks!

vt-alt avatar Oct 14 '25 04:10 vt-alt

OK, I'll see if I can provoke it here to reproduce. Thanks

mkj avatar Oct 14 '25 04:10 mkj

I added the change to our package; we'll see in 1-2 weeks whether it has helped.

vt-alt avatar Oct 15 '25 08:10 vt-alt

With the change applied it started to fail even more frequently, with timeouts and ConnectionRefusedError. Logs with timeouts (these happen every day): http://git.altlinux.org/beehive/logs/Sisyphus/x86_64/archive/2025/1019/error/dropbear-2025.88-alt2 http://git.altlinux.org/beehive/logs/Sisyphus/i586/archive/2025/1018/error/dropbear-musl-2025.88-alt2 http://git.altlinux.org/beehive/logs/Sisyphus/x86_64/archive/2025/1017/error/dropbear-musl-2025.88-alt2

++ pytest3 --reruns=3 --hostkey=fakekey --dbclient=../dbclient --dropbear=../dropbear
============================= test session starts ==============================
platform linux -- Python 3.13.0, pytest-8.4.2, pluggy-1.6.0
rootdir: /usr/src/RPM/BUILD/dropbear-2025.88/test
plugins: rerunfailures-16.1
collecting ... 
collecting 28 items                                                            
collected 52 items                                                             

test_aslr.py .                                                           [  1%]
test_channels.py .x...................RRR.RRRFRRRFRR.
hasher-privd: parent: handle_io: idle time limit (3600 seconds) exceeded

Log with other errors (just once so far): https://git.altlinux.org/beehive/logs/Sisyphus/x86_64/archive/2025/1018/error/dropbear-musl-2025.88-alt2

=========================== short test summary info ============================
FAILED test_channels.py::test_tcpflushout[L-1] - ConnectionRefusedError: [Err...
FAILED test_channels.py::test_tcpflushout[L-4000] - ConnectionRefusedError: [...
FAILED test_channels.py::test_tcpflushout[L-40000] - ConnectionRefusedError: ...
FAILED test_channels.py::test_tcpflushout[R-1] - ConnectionRefusedError: [Err...
FAILED test_channels.py::test_tcpflushout[R-4000] - ConnectionRefusedError: [...
FAILED test_channels.py::test_tcpflushout[R-40000] - ConnectionRefusedError: ...
======== 6 failed, 42 passed, 3 skipped, 1 xfailed, 18 rerun in 48.68s =========

Perhaps the "Address already in use" errors have been converted into timeouts and ConnectionRefusedError. With errors like these it's better not to run the tests at all, because regular false positives would mask other, possibly real errors.

vt-alt avatar Oct 19 '25 14:10 vt-alt

OK, I think that change has made it progress further. (Confirmed locally here with strace that allow_reuse_address = True sets SO_REUSEADDR, which wasn't being set before).

Maybe on a heavily loaded system the 0.1s sleep isn't adequate. Could you try say 0.5s instead for https://github.com/mkj/dropbear/blob/003c5fcaabc114430d5d14142e95ffdbbd2d19b6/test/test_channels.py#L109-L110 ?

mkj avatar Oct 22 '25 14:10 mkj

Sorry, I cannot test it like this. I don't control our rebuild pipelines, and they only apply to packages released into the repository, so just to test this I'd have to update/commit the package, which would look weird to users and QA. We need to find a way to simulate the interfering load while the tests run; I don't have an idea for that yet, but I haven't stopped thinking about it.

vt-alt avatar Nov 12 '25 10:11 vt-alt

I was trying to reproduce it.

  • Method 1: Pin the qemu-kvm VM running pytest to 2 CPUs while running a stress-ng --cpu load on those 2 CPUs. This does not reproduce the high-load situation.
  • Method 2: Run qemu in TCG mode. This easily reproduces ConnectionRefusedError even without stress-ng or CPU pinning: test_channels.py .x..................FFFFFF

Now I am trying your suggestion to modify the time.sleep() value on line 109.

  • time.sleep(1) - tests still fail.
  • time.sleep(10) - tests still fail.
  • Modifying all time.sleep(0.1) calls in test_tcpflushout to time.sleep(1) - tests fail the same way.

Well, the TCG method is not ideal, since we will actually run under KVM, so I need to find a way to reproduce this under KVM to make it more realistic.

vt-alt avatar Dec 16 '25 03:12 vt-alt

Tests start to fail (on a normal run inside a KVM VM) even with time.sleep(0.2) (on line 109) or higher sleep values.

vt-alt avatar Dec 16 '25 03:12 vt-alt

I can increase the sleep to time.sleep(0.5) IF I also increase the sleep value on this line: r = dbclient(request, f"-{fwd_flag}", "7788:localhost:3344", "sleep 0.5; echo -n done", If the shell sleep value is smaller than the time.sleep value, the test_tcpflushout tests always fail with ConnectionRefusedError.
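(This looks like a startup race: the test connects to the forwarded port before the listener exists. One generic way to tolerate that, sketched here as a hypothetical helper rather than the suite's actual code, is to retry the connect until a deadline instead of guessing a sleep:)

```python
import socket
import time

def connect_with_retry(host: str, port: int, timeout: float = 5.0,
                       interval: float = 0.1) -> socket.socket:
    # Keep retrying until the listener actually accepts, instead of
    # assuming a fixed startup delay. ConnectionRefusedError is an
    # OSError subclass, so one except clause covers both.
    deadline = time.monotonic() + timeout
    while True:
        try:
            return socket.create_connection((host, port), timeout=interval)
        except OSError:
            if time.monotonic() >= deadline:
                raise
            time.sleep(interval)
```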

vt-alt avatar Dec 16 '25 04:12 vt-alt

If the shell sleep value is smaller than in test_tcpflushout the tests always fail with ConnectionRefusedError.

Maybe instead of

https://github.com/mkj/dropbear/blob/d193731630a62482855b450daa1d5a5e13a90125/test/test_channels.py#L107-L110

it should be something like

		r = dbclient(request, f"-{fwd_flag}", "7788:localhost:3344", "echo -n start; sleep 0.1; echo -n done",
			text=True, background=True, stdout=subprocess.PIPE, stderr=subprocess.DEVNULL)
		# wait to let the listener start
		assert r.stdout.read(5) == 'start'

Then the final sleep(0.1) might need changing too. Perhaps could poll() in a loop with a sleep or something.

https://github.com/mkj/dropbear/blob/d193731630a62482855b450daa1d5a5e13a90125/test/test_channels.py#L132-L134
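The poll-in-a-loop idea could look something like this (a hypothetical helper, assuming `r` is a `subprocess.Popen`; not tested against the dropbear suite):

```python
import subprocess
import sys
import time

def wait_for_exit(proc: subprocess.Popen, timeout: float = 5.0,
                  interval: float = 0.05) -> int:
    # Poll until the child exits rather than sleeping a fixed 0.1s.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        rc = proc.poll()  # non-blocking; returns None while still running
        if rc is not None:
            return rc
        time.sleep(interval)
    raise TimeoutError(f"child did not exit within {timeout:.1f}s")
```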

mkj avatar Dec 16 '25 04:12 mkj

After I added echo -n start and replaced time.sleep(0.1) with assert r.stdout.read(5) == 'start', the tests stopped failing on the TCG VM.

vt-alt avatar Dec 16 '25 04:12 vt-alt
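For reference, the readiness-handshake pattern that resolved this can be demonstrated standalone (a sketch using a stand-in child process, not the dropbear test itself):

```python
import subprocess
import sys

# Stand-in for the dbclient child: it prints a readiness token once its
# listener would be up, then does its work and prints a completion token.
child = subprocess.Popen(
    [sys.executable, "-c",
     "import sys, time;"
     "sys.stdout.write('start'); sys.stdout.flush();"
     "time.sleep(0.1);"
     "sys.stdout.write('done')"],
    text=True, stdout=subprocess.PIPE)

# Blocks until the child has actually written the token, so no guessed
# sleep is needed regardless of system load.
assert child.stdout.read(5) == 'start'
assert child.stdout.read() == 'done'
child.wait()
```

The key property is that the parent synchronises on an event the child reports, rather than on wall-clock time, so the test cannot be outrun by a slow or heavily loaded machine.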