[Bug]: Fleet instance becomes unreachable if the SSH connection exceeds 3 seconds

Open peterschmidt85 opened this issue 10 months ago • 0 comments

Problem:

We have at least two problems with SSH fleet instances:

  1. [Indirect problem] The dstack server connects to each SSH fleet instance every 4 seconds to check shim health. This alone becomes a big problem with a large number of hosts, even if each SSH connection is fast. The ideal solution would be to cache connections and use some kind of SSH connection pool.
  2. [Direct problem] If establishing the SSH connection takes longer than 3 seconds (which can happen quite easily), dstack marks the instance unreachable and fails any running jobs. This is a blocker for using dstack with hosts that have an unstable or poor network connection.

Solution:

  1. [Preferred solution] Implement a pool of SSH connections
  2. [Workaround] Increase the SSH connection timeout from 3 seconds to at least 15 seconds
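
To make the preferred solution more concrete, here is a minimal sketch of what a per-host connection cache could look like. It is not dstack's actual code: `ConnectionPool`, `connect`, `is_alive`, and `close` are hypothetical names, and the SSH transport is injected as callables so the sketch does not assume any particular SSH library. The 15-second default only mirrors the workaround value above.

```python
import threading
from typing import Callable, Dict, Generic, TypeVar

C = TypeVar("C")  # type of the underlying connection object (hypothetical)


class ConnectionPool(Generic[C]):
    """Keeps one live connection per host and reuses it across health checks."""

    def __init__(
        self,
        connect: Callable[[str, float], C],  # opens a connection: (host, timeout) -> conn
        is_alive: Callable[[C], bool],       # cheap liveness probe for a cached connection
        close: Callable[[C], None],          # releases a connection
        connect_timeout: float = 15.0,       # assumption: the workaround value from this issue
    ) -> None:
        self._connect = connect
        self._is_alive = is_alive
        self._close = close
        self._connect_timeout = connect_timeout
        self._lock = threading.Lock()
        self._connections: Dict[str, C] = {}

    def get(self, host: str) -> C:
        """Return the cached connection for host, reconnecting only if it is dead."""
        # Simplification: one global lock, so connects to different hosts are
        # serialized. A real pool would likely use per-host locks.
        with self._lock:
            conn = self._connections.get(host)
            if conn is not None:
                if self._is_alive(conn):
                    return conn
                # Stale connection: close it and fall through to reconnect.
                self._close(conn)
                del self._connections[host]
            conn = self._connect(host, self._connect_timeout)
            self._connections[host] = conn
            return conn

    def discard(self, host: str) -> None:
        """Drop and close the cached connection, e.g. when the instance is removed."""
        with self._lock:
            conn = self._connections.pop(host, None)
        if conn is not None:
            self._close(conn)
```

With something like this, the periodic shim health check would call `pool.get(host)` and pay the SSH handshake cost only on the first check or after a connection drops, rather than every 4 seconds.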

peterschmidt85, Feb 23 '25 20:02