FAILED test_channels.py::test_tcpflushout[L-1] - OSError: [Errno 98] Address already in use
Some tests are not robust and fail intermittently:
=========================== short test summary info ============================
FAILED test_channels.py::test_tcpflushout[L-1] - OSError: [Errno 98] Address ...
FAILED test_channels.py::test_tcpflushout[L-4000] - OSError: [Errno 98] Addre...
FAILED test_channels.py::test_tcpflushout[L-40000] - OSError: [Errno 98] Addr...
FAILED test_channels.py::test_tcpflushout[R-1] - ConnectionRefusedError: [Err...
FAILED test_channels.py::test_tcpflushout[R-4000] - OSError: [Errno 98] Addre...
FAILED test_channels.py::test_tcpflushout[R-40000] - OSError: [Errno 98] Addr...
======== 6 failed, 42 passed, 3 skipped, 1 xfailed, 19 rerun in 37.93s =========
Build and test logs:
glibc: https://git.altlinux.org/beehive/logs/Sisyphus/x86_64/archive/2025/1012/error/dropbear-2025.88-alt1
musl: https://git.altlinux.org/beehive/logs/Sisyphus/x86_64/archive/2025/1012/error/dropbear-musl-2025.88-alt1
There may be a high load average during the test runs.
"Address In Use" suggests multiple tests might be running at once. Could that be happening in your setup? Currently it uses a fixed port 7788. I guess it could be randomised, but there would still be some small chance of collisions.
Looking a bit more, it might be a change in python 3.13. Could you try changing
https://github.com/mkj/dropbear/blob/003c5fcaabc114430d5d14142e95ffdbbd2d19b6/test/test_dropbear.py#L78-L79 to
allow_reuse_address = True
allow_reuse_port = True
(I'm not sure if allow_reuse_addr was correct previously)
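For context, this is roughly what those attributes do. A minimal sketch, assuming the test listener subclasses socketserver.TCPServer (the class name here is made up for illustration, not taken from test_dropbear.py):

import socketserver

class ReusableTCPServer(socketserver.TCPServer):
    # server_bind() consults these class attributes and applies
    # SO_REUSEADDR / SO_REUSEPORT to the socket before bind(), so a
    # fresh test server can bind port 7788 even while sockets from a
    # previous test are still in TIME_WAIT.
    allow_reuse_address = True
    allow_reuse_port = True  # class attribute added in Python 3.11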
About parallel runs: we don't change anything to make the tests run in parallel, and other builds on the build farm are network-namespaced, so they cannot interfere. I can add the change, but it will be hard to test quickly, since the failures happen only 1-2 times per week (we do a mass rebuild once per day). So we'll have to wait for the conditions to reproduce. Thanks!
OK, I'll see if I can provoke it here to reproduce. Thanks
I added the change to our package; we'll see in 1-2 weeks whether it has helped.
With the change applied it started to fail even more frequently, with timeouts and ConnectionRefusedError. Logs with timeouts (these happen every day):
http://git.altlinux.org/beehive/logs/Sisyphus/x86_64/archive/2025/1019/error/dropbear-2025.88-alt2
http://git.altlinux.org/beehive/logs/Sisyphus/i586/archive/2025/1018/error/dropbear-musl-2025.88-alt2
http://git.altlinux.org/beehive/logs/Sisyphus/x86_64/archive/2025/1017/error/dropbear-musl-2025.88-alt2
++ pytest3 --reruns=3 --hostkey=fakekey --dbclient=../dbclient --dropbear=../dropbear
============================= test session starts ==============================
platform linux -- Python 3.13.0, pytest-8.4.2, pluggy-1.6.0
rootdir: /usr/src/RPM/BUILD/dropbear-2025.88/test
plugins: rerunfailures-16.1
collecting ...
collecting 28 items
collected 52 items
test_aslr.py . [ 1%]
test_channels.py .x...................RRR.RRRFRRRFRR.
hasher-privd: parent: handle_io: idle time limit (3600 seconds) exceeded
A log with other errors (seen just once so far): https://git.altlinux.org/beehive/logs/Sisyphus/x86_64/archive/2025/1018/error/dropbear-musl-2025.88-alt2
=========================== short test summary info ============================
FAILED test_channels.py::test_tcpflushout[L-1] - ConnectionRefusedError: [Err...
FAILED test_channels.py::test_tcpflushout[L-4000] - ConnectionRefusedError: [...
FAILED test_channels.py::test_tcpflushout[L-40000] - ConnectionRefusedError: ...
FAILED test_channels.py::test_tcpflushout[R-1] - ConnectionRefusedError: [Err...
FAILED test_channels.py::test_tcpflushout[R-4000] - ConnectionRefusedError: [...
FAILED test_channels.py::test_tcpflushout[R-40000] - ConnectionRefusedError: ...
======== 6 failed, 42 passed, 3 skipped, 1 xfailed, 18 rerun in 48.68s =========
Perhaps the Address already in use errors have been converted into timeouts and ConnectionRefusedError. With errors like these it is better not to run the tests at all, because regular false positives would let other, possibly real, errors go unnoticed.
OK, I think that change has made it progress further. (Confirmed locally here with strace that allow_reuse_address = True sets SO_REUSEADDR, which wasn't being set before).
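As an alternative to strace, the option can also be checked from Python itself; a quick illustrative snippet, not part of the test suite:

import socket
import socketserver

# A plain TCPServer leaves allow_reuse_address at its False default.
srv = socketserver.TCPServer(("127.0.0.1", 0), socketserver.BaseRequestHandler)
print(srv.socket.getsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR))  # 0: not set
srv.server_close()

class ReusableServer(socketserver.TCPServer):
    allow_reuse_address = True

srv = ReusableServer(("127.0.0.1", 0), socketserver.BaseRequestHandler)
print(srv.socket.getsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR))  # nonzero: set
srv.server_close()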
Maybe on a heavily loaded system the 0.1s sleep isn't adequate. Could you try, say, 0.5s instead for https://github.com/mkj/dropbear/blob/003c5fcaabc114430d5d14142e95ffdbbd2d19b6/test/test_channels.py#L109-L110?
Sorry, I cannot test it like this. I don't control our rebuild pipelines, and they only apply to released packages in the repository. So just to test this I would need to update and commit the package, which would look weird to users and QA. We need to find a way to simulate the interfering load while the tests run; I don't have an idea yet, but I haven't stopped thinking about it.
I was trying to reproduce it.
- Method 1: Pin the qemu-kvm VM running pytest to 2 CPUs while running stress-ng --cpuload on those same 2 CPUs -- this does not reproduce the high-load situation.
- Method 2: Run qemu in TCG mode -- this easily reproduces ConnectionRefusedError even without stress-ng or CPU pinning: test_channels.py .x..................FFFFFF
Now I'll try your suggestion to modify the time.sleep() value on line 109.
- time.sleep(1) - tests still fail.
- time.sleep(10) - tests still fail.
- Modify all time.sleep(0.1) in test_tcpflushout to time.sleep(1) - tests fail the same way.
Well, the TCG method is not very good, since in production we run under KVM, so I need to find a way to reproduce this with KVM to make it more realistic.
Tests start to fail (on a normal run inside a KVM VM) even with time.sleep(0.2) on line 109, or with higher sleep values.
I can increase the sleep time to time.sleep(0.5) IF I also increase the shell sleep value on this line:
r = dbclient(request, f"-{fwd_flag}", "7788:localhost:3344", "sleep 0.5; echo -n done",
If the shell sleep value is smaller than the time.sleep value, the test_tcpflushout tests always fail with ConnectionRefusedError (presumably because dbclient and its forwarded listener are already gone by the time the test connects to port 7788).
Maybe instead of
https://github.com/mkj/dropbear/blob/d193731630a62482855b450daa1d5a5e13a90125/test/test_channels.py#L107-L110
it should be something like
r = dbclient(request, f"-{fwd_flag}", "7788:localhost:3344", "echo -n start; sleep 0.1; echo -n done",
    text=True, background=True, stdout=subprocess.PIPE, stderr=subprocess.DEVNULL)
# wait to let the listener start
assert r.stdout.read(5) == 'start'
Then the final sleep(0.1) might need changing too. Perhaps it could poll() in a loop with a short sleep, or something similar.
https://github.com/mkj/dropbear/blob/d193731630a62482855b450daa1d5a5e13a90125/test/test_channels.py#L132-L134
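For the record, the poll-in-a-loop idea could look something like this. A sketch assuming the background dbclient handle behaves like a subprocess.Popen; wait_exited, the timeout, and the interval are invented for illustration, not taken from the test suite:

import time

def wait_exited(proc, timeout=5.0, interval=0.1):
    # Poll the background process instead of one fixed sleep: return as
    # soon as it exits, and only give up after the full timeout, so a
    # heavily loaded machine just waits longer instead of failing.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if proc.poll() is not None:
            return proc.returncode
        time.sleep(interval)
    raise TimeoutError("dbclient did not exit in time")

The same pattern would replace the final sleep(0.1) before checking the process result.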
After I added echo -n start and replaced time.sleep(0.1) with assert r.stdout.read(5) == 'start', the tests stopped failing on the TCG VM.