Race condition on listening port when SSH daemon is restarted
Describe the bug
When the SSH daemon is restarted, the Erlang port that handles the OS-level socket is closed asynchronously with respect to the return from ssh:stop_daemon(). As a result, the new daemon instance can sometimes try to open a listening socket on the same OS port before the old socket is closed (i.e. before the TCP port reaches TIME_WAIT). This results in {error, eaddrinuse}, which may be confusing in some use cases.
To Reproduce
The issue could be reproduced on our system when both CPU and network are heavily loaded (a few hundred Jenkins containers, each performing network-related tasks). I could not create such conditions on a local machine.
Expected behavior
No {error, eaddrinuse} is observed when the SSH daemon is restarted (assuming no other application tries to snatch the port).
Affected versions
OTP-25.3.2.6, OTP-26
Additional context
None
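The restart sequence described in the report can be sketched roughly as below. This is only an illustration of the call pattern, not a verified reproducer: the module and function names are hypothetical, Options is assumed to contain a working daemon configuration (for example a system_dir with host keys), the ssh application is assumed to be started already, and on an unloaded machine the loop will most likely never hit the error.

```erlang
-module(ssh_restart_repro).
-export([restart_loop/3]).

%% Hypothetical helper, not part of OTP.
%% Starts and stops a daemon on Port up to N times and reports the first
%% start that fails. Assumes ssh:start() has been called beforehand.
restart_loop(_Port, _Options, 0) ->
    ok;
restart_loop(Port, Options, N) when is_integer(N), N > 0 ->
    case ssh:daemon(any, Port, Options) of
        {ok, Daemon} ->
            %% Per the report, the OS-level socket may still be open
            %% when this call returns.
            _ = ssh:stop_daemon(Daemon),
            restart_loop(Port, Options, N - 1);
        {error, eaddrinuse} = Error ->
            %% The symptom described in this issue.
            {Error, {remaining, N}};
        {error, _} = Other ->
            Other
    end.
```

For example, ssh_restart_repro:restart_loop(2222, Options, 1000) returns ok if every restart succeeds; port 2222 is an arbitrary choice.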
Sorry for the delay. I hope this can be investigated soon.
Same issue here, but on OTP 24.2 (Ubuntu 22.04, if that matters).
It was decided to postpone work related to the ssh supervision tree. I would like to revise how the supervision tree is implemented and then decide how this should be fixed.
I would like to clarify:
- Which ssh:daemon function are you using?
- Are you providing it with the port number specified as an integer?
In my case, I used ssh:daemon/3, and the port was an integer.
Can you try to rebuild your OTP with this change https://github.com/erlang/otp/pull/8663 and verify whether the issue is still present?
For some reason, ssh:daemon/3 gives up after the first failure, whereas when the socket is provided by the user with ssh:daemon/2, several attempts at creating a listen socket are made before giving up.
As a first step, I would like to align the two behaviors.
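For callers who cannot rebuild OTP, the "several attempts before giving up" idea can be approximated on the caller side. The following is a minimal sketch under stated assumptions, not the code from the pull request: the module, the function names, and the 100 ms delay are all hypothetical, and it only retries when the result is exactly {error, eaddrinuse}.

```erlang
-module(ssh_restart_sketch).
-export([start_daemon_with_retry/4, retry/3]).

%% Hypothetical wrapper, not part of OTP: calls ssh:daemon/3 and retries
%% up to Retries more times, 100 ms apart (arbitrary delay), while the
%% port is reported as still in use.
start_daemon_with_retry(HostAddr, Port, Options, Retries) ->
    retry(fun() -> ssh:daemon(HostAddr, Port, Options) end, Retries, 100).

%% Calls Fun until it returns something other than {error, eaddrinuse}
%% or no retries remain, sleeping DelayMs between attempts.
%% The last result is returned unchanged.
retry(Fun, Retries, DelayMs)
  when is_function(Fun, 0), is_integer(Retries), Retries >= 0 ->
    case Fun() of
        {error, eaddrinuse} when Retries > 0 ->
            timer:sleep(DelayMs),
            retry(Fun, Retries - 1, DelayMs);
        Result ->
            Result
    end.
```

A call such as ssh_restart_sketch:start_daemon_with_retry(any, Port, Options, 5) would then make up to six attempts in total. Keeping the retry loop generic over a fun makes it possible to exercise the logic without starting a real daemon.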
PR-8663 is merged. I think this code alignment for an ssh daemon started with a TCP port specified might improve the behavior reported in this issue.
I will close the issue for now. Please re-open it if the issue is still present with PR-8663 (planned to be released in OTP-27.1, OTP-26.2.5.3, OTP-25.3.2.14).