otp ssh: ssh keep alive

This PR aims at introducing the equivalent of openssh ClientAliveInterval/ClientAliveCountMax (https://man.openbsd.org/sshd_config#ClientAliveCountMax) and ServerAliveInterval/ServerAliveCountMax

There's 2 notable differences:

In openssh "The default is 0, indicating that these messages will not be sent to the client.", but in this implementation infinity is used instead of 0 and ssh_options checks that a positive integer is presented;
Keep-alive messages can't be sent during renegotiation, but since this feature acts as a keep-alive and a timeout, an equivalent timeout is established for the renegotiation procedure if alive is enable. This is implemented with a timeout called renegotiation_alive

Nov 27 '24 19:11 alexandrejbr

CT Test Results

2 files 29 suites 20m 28s ⏱️ 485 tests 479 ✅ 6 💤 0 ❌ 1 690 runs 1 664 ✅ 26 💤 0 ❌

Results for commit 21018024.

:recycle: This comment has been updated with latest results.

To speed up review, make sure that you have read Contributing to Erlang/OTP and that all checks pass.

See the TESTING and DEVELOPMENT HowTo guides for details about how to run test locally.

Artifacts

// Erlang/OTP Github Action Bot

Nov 27 '24 19:11 github-actions[bot]

@u3s this a feature I promised long time ago, if you are still interested in it please continue reading :)

I know you are short on time, but if you could give me some tips on how to create some tests for this would be appreciated.

I wanted to add tests at least these 3 scenarios:

Normal scenario client and server send alive messages (This one I have one idea how to test which would be using the ssh_dbg module, ssh_dbg:on([ssh_messages]) reveals the sending of the messages but it's kind of ugly to use a regex to get the keep alive message);
Timeout scenario. I would like that the server or client become not responsive and I would like to check that the connections is terminated (this one we have tested before suspending the openssh process but it's not so nice);
The renegotiation scenario in which the during the renegotiation the peer becomes unreachable and the timeout should have effect.

Nov 27 '24 19:11 alexandrejbr

we would like to proceed with that one
can you convert it into PR? so we could start testing it and discussing implementation details
for testing, my 1st guess would be ssh_trpt_lib module and do it kind of whitebox way

Jun 12 '25 06:06 u3s

Sounds good. I’ll rebase and start working on the tests.

Jun 12 '25 14:06 alexandrejbr

@u3s I have just pushed changes that address most of the comments.

There's 2 tests now in place. I started to add a third test that checks that the renegotiation_alive timeout works, but I could not start the renegotiation. I left a comment on the test and mentioned you there to see if you can try to help spot what am I doing wrong.

Aug 13 '25 14:08 alexandrejbr

@u3s a colleague just brought to my attention that we are spending more time that we would need detecting that a connection is unstable if the rekey triggers. The reason for that is that we don't deduct that time spent since the last alive from the timer we set for the rekey.

I am happy to open a follow up PR fixing this after this one is merged as we seem to be in a very good state with the current PR. What do you think? Does this sound reasonable?

Nov 04 '25 08:11 alexandrejbr

@u3s a colleague just brought to my attention that we are spending more time that we would need detecting that a connection is unstable if the rekey triggers. The reason for that is that we don't deduct that time spent since the last alive from the timer we set for the rekey.

I am happy to open a follow up PR fixing this after this one is merged as we seem to be in a very good state with the current PR. What do you think? Does this sound reasonable?

You mean a situation when for example:

a peer with alive option (e.g {3, 100})
sends 2 out of 3 probes (this might take ~290ms)
then same peer makes a local decision about renegotiating keys
this cancels the normal alive timeout
and starts renegotiation timeout for 300ms
as a result detecting a dead connection will take 590ms instead of 300ms which user might expect

Yes. Would be nice to iron out this case in a separate PR.

Nov 13 '25 14:11 u3s

@u3s a colleague just brought to my attention that we are spending more time that we would need detecting that a connection is unstable if the rekey triggers. The reason for that is that we don't deduct that time spent since the last alive from the timer we set for the rekey. I am happy to open a follow up PR fixing this after this one is merged as we seem to be in a very good state with the current PR. What do you think? Does this sound reasonable?

You mean a situation when for example:

a peer with alive option (e.g {3, 100})

sends 2 out of 3 probes (this might take ~290ms)

then same peer makes a local decision about renegotiating keys

this cancels the normal alive timeout

and starts renegotiation timeout for 300ms

as a result detecting a dead connection will take 590ms instead of 300ms which user might expect

Yes. Would be nice to iron out this case in a separate PR.

Ok. I will open a PR for this soon.

Nov 13 '25 15:11 alexandrejbr

great. maybe adjust testcases in ssh_protocol_SUITE, so they verify that disconnect happens as quick as expected? op({expect,timeout,E}, S0) in ssh_trpt_lib might be the tool for the job?

Nov 13 '25 16:11 u3s

otp otp copied to clipboard

ssh: ssh keep alive

CT Test Results

Artifacts

otp
otp copied to clipboard