fix: Ensure SSH connections are closed after each command execution
This commit addresses an issue where SSH connections were not being properly closed after each command execution, leading to timeout errors during the kamal deploy process.
Problem
When executing commands using the execute method in SSHKit::Runner::Parallel, SSH connections were left open, causing subsequent commands to fail with timeout errors. This issue was particularly evident when running kamal build deliver after kamal registry login during the normal kamal deploy process
Solution
The execute method in the SSHKit::Runner::Parallel::CompleteAll module has been modified to ensure that all SSH connections are closed after each command execution. This is achieved by adding an ensure block to the thread creation logic, which calls SSHKit::Backend::Netssh.pool.close_connections after each command, regardless of whether an exception occurs.
Changes
- Added an
ensureblock to theexecutemethod inSSHKit::Runner::Parallel::CompleteAllto close SSH connections after each command execution.
Impact
This change ensures that SSH connections are properly closed, preventing timeout errors and improving the reliability of the kamal deploy process.
Testing
Tested the changes by running kamal deploy commands. Verified that the timeout errors no longer occur and the deployment process completes successfully.
Closes #857
For some strange reason the test on Ruby 3.1 failed, it looks like it might have been just a glitch in the test procedures as the code should not affect only one particular version of ruby but all versions.
ensure
SSHKit::Backend::Netssh.pool.close_connections
Would it be possible to run the tests again?
I've kicked that off. I'm not sure though about this change. Creating the SSH connections can be expensive especially with large numbers of servers to deploy to.
We configure keepalives with an interval of 30 seconds on the connections, so that generally should stop them from timing out. Maybe you could try reducing the keepalive interval and see if that makes any difference?
Thanks for the suggestion, I did try to decrease the .ssh/config ServerAliveInterval to 10, unfortunately that did not fix the issue.
:+1: on this one, as it's pretty much similar to delano/rye#63 I faced a while ago, and I think delano/rye#38 has more context on this
cc @djmb @jfanals
@jfanals - we overwrite the keepalive_interval in the Kamal config, so I'm not sure it will pick up that from .ssh/config.
Could you try:
ssh:
keepalive_interval: 10
log_level: debug
Setting the log_level to debug might also give us some useful feedback.
@jfanals - we overwrite the keepalive_interval in the Kamal config, so I'm not sure it will pick up that from
.ssh/config.Could you try:
ssh: keepalive_interval: 10 log_level: debugSetting the log_level to debug might also give us some useful feedback.
I had a similar problem, which brought me to this issue. Unfortunately setting the keepalive_interval and keepalive parameters is not possible from the configuration file, as they are hardcoded.
❯ bin/kamal config
ERROR (Kamal::ConfigurationError): ssh: unknown key: keepalive_interval
The fix for me was to increase the ufw imposed limit of 6 per 30 second.
@jfanals - we overwrite the keepalive_interval in the Kamal config, so I'm not sure it will pick up that from
.ssh/config. Could you try:ssh: keepalive_interval: 10 log_level: debugSetting the log_level to debug might also give us some useful feedback.
I had a similar problem, which brought me to this issue. Unfortunately setting the
keepalive_intervalandkeepaliveparameters is not possible from the configuration file, as they are hardcoded.❯ bin/kamal config ERROR (Kamal::ConfigurationError): ssh: unknown key: keepalive_intervalThe fix for me was to increase the
ufwimposed limit of 6 per 30 second.
@plattenschieber how did you apply your fix? Is it through the server?
Yes, just sshed into the machine and updated the firewall
Yes, just
sshed into the machine and updated the firewall
Mine does not have ufw active, so it might not be the same issue.
@jfanals - we overwrite the keepalive_interval in the Kamal config, so I'm not sure it will pick up that from
.ssh/config.Could you try:
ssh: keepalive_interval: 10 log_level: debugSetting the log_level to debug might also give us some useful feedback.
@jfanals - we overwrite the keepalive_interval in the Kamal config, so I'm not sure it will pick up that from
.ssh/config.Could you try:
ssh: keepalive_interval: 10 log_level: debugSetting the log_level to debug might also give us some useful feedback.
I tried setting the log_level to debug and keepalive_interval to 10 by manually updating the ssh.rb at /opt/homebrew/lib/ruby/gems/3.3.0/gems/kamal-2.2.2/lib/kamal/configuration, but it's still not working. It looks like when my NextJS app is building for a long time while it streams the logs of the build, the keepalive ping is not running.
INFO [b2d3a646] Finished in 339.110 seconds with exit status 0 (successful).
D, [2024-10-28T07:17:02.043992 #98262] DEBUG -- socket[71c]: using encrypt-then-mac
D, [2024-10-28T07:17:02.046520 #98262] DEBUG -- socket[71c]: queueing packet nr 17 type 90 len 32
D, [2024-10-28T07:17:02.046768 #98262] DEBUG -- socket[71c]: read 84 bytes
D, [2024-10-28T07:17:02.046836 #98262] DEBUG -- socket[71c]: received packet nr 24 type 82 len 16
I, [2024-10-28T07:17:02.046886 #98262] INFO -- net.ssh.connection.session[7bc]: global request failure
D, [2024-10-28T07:17:02.046914 #98262] DEBUG -- net.ssh.connection.keepalive[7f8]: keepalive response successful. Missed 0 keepalives
D, [2024-10-28T07:17:02.047055 #98262] DEBUG -- socket[71c]: sent 100 bytes
I, [2024-10-28T07:17:12.054207 #98262] INFO -- net.ssh.connection.keepalive[7f8]: sending keepalive 0
I, [2024-10-28T07:17:12.054477 #98262] INFO -- net.ssh.connection.session[7bc]: sending global request [email protected]
D, [2024-10-28T07:17:12.054771 #98262] DEBUG -- socket[71c]: using encrypt-then-mac
D, [2024-10-28T07:17:12.055230 #98262] DEBUG -- socket[71c]: queueing packet nr 18 type 80 len 32
D, [2024-10-28T07:17:12.055601 #98262] DEBUG -- socket[71c]: sent 100 bytes
I, [2024-10-28T07:17:22.059028 #98262] INFO -- net.ssh.connection.keepalive[7f8]: sending keepalive 1
I, [2024-10-28T07:17:22.059350 #98262] INFO -- net.ssh.connection.session[7bc]: sending global request [email protected]
D, [2024-10-28T07:17:22.059645 #98262] DEBUG -- socket[71c]: using encrypt-then-mac
D, [2024-10-28T07:17:22.060097 #98262] DEBUG -- socket[71c]: queueing packet nr 19 type 80 len 32
D, [2024-10-28T07:17:22.060466 #98262] DEBUG -- socket[71c]: sent 100 bytes
I, [2024-10-28T07:17:32.064610 #98262] INFO -- net.ssh.connection.keepalive[7f8]: sending keepalive 2
I, [2024-10-28T07:17:32.064725 #98262] INFO -- net.ssh.connection.session[7bc]: sending global request [email protected]
D, [2024-10-28T07:17:32.064872 #98262] DEBUG -- socket[71c]: using encrypt-then-mac
D, [2024-10-28T07:17:32.065100 #98262] DEBUG -- socket[71c]: queueing packet nr 20 type 80 len 32
D, [2024-10-28T07:17:32.065300 #98262] DEBUG -- socket[71c]: sent 100 bytes
I, [2024-10-28T07:17:42.069792 #98262] INFO -- net.ssh.connection.keepalive[7f8]: sending keepalive 3
I, [2024-10-28T07:17:42.070080 #98262] INFO -- net.ssh.connection.session[7bc]: sending global request [email protected]
D, [2024-10-28T07:17:42.070405 #98262] DEBUG -- socket[71c]: using encrypt-then-mac
D, [2024-10-28T07:17:42.070955 #98262] DEBUG -- socket[71c]: queueing packet nr 21 type 80 len 32
D, [2024-10-28T07:17:42.071385 #98262] DEBUG -- socket[71c]: sent 100 bytes
I, [2024-10-28T07:17:52.075686 #98262] INFO -- net.ssh.connection.keepalive[7f8]: sending keepalive 4
I, [2024-10-28T07:17:52.075939 #98262] INFO -- net.ssh.connection.session[7bc]: sending global request [email protected]
D, [2024-10-28T07:17:52.076247 #98262] DEBUG -- socket[71c]: using encrypt-then-mac
D, [2024-10-28T07:17:52.076688 #98262] DEBUG -- socket[71c]: queueing packet nr 22 type 80 len 32
D, [2024-10-28T07:17:52.077079 #98262] DEBUG -- socket[71c]: sent 100 bytes
I, [2024-10-28T07:18:02.078444 #98262] INFO -- net.ssh.connection.keepalive[7f8]: sending keepalive 5
I, [2024-10-28T07:18:02.078565 #98262] INFO -- net.ssh.connection.session[7bc]: sending global request [email protected]
D, [2024-10-28T07:18:02.078688 #98262] DEBUG -- socket[71c]: using encrypt-then-mac
D, [2024-10-28T07:18:02.078861 #98262] DEBUG -- socket[71c]: queueing packet nr 23 type 80 len 32
D, [2024-10-28T07:18:02.078970 #98262] DEBUG -- socket[71c]: sent 100 bytes
D, [2024-10-28T07:18:08.075602 #98262] DEBUG -- socket[71c]: read 0 bytes
D, [2024-10-28T07:18:08.076116 #98262] DEBUG -- socket[71c]: read 0 bytes
Finished all in 412.9 seconds
ERROR (IO::TimeoutError): Exception while executing on host 54.66.xxx.xx: Blocking operation timed out!
This PR actually fixed my issue. It might be worth considering merging this.
I'm going to close this one as we shouldn't need to re-create the connections for each command. There's something wrong here alright and it would be good to get to the bottom of it, but I don't think this brute force approach is the way to go.
our team has 30% chance of closing stream as of the latest version of kamal
https://github.com/basecamp/kamal/issues/1432