[Bug]: Fleet instance becomes unreachable if the SSH connection exceeds 3 seconds
Problem:
We have at least two problems with SSH fleet instances:
- [Indirect problem] The `dstack` server connects to SSH fleet instances every 4 seconds to check shim health. This alone becomes a big problem with a large number of hosts, even if the SSH connection is fast. The ideal solution would be to cache connections and use some kind of SSH connection pool.
- [Direct problem] If the SSH connection takes longer than 3 seconds (which can happen quite easily), `dstack` marks the instance unreachable and fails any running jobs. This is a blocker for using `dstack` with any hosts that may have an unstable or poor network connection.
Solution:
- [Preferred solution] Implement a pool of SSH connections
- [Workaround] Increase the SSH connection timeout from 3 seconds to at least 15 seconds
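
To illustrate the preferred solution, here is a minimal sketch of a per-host connection cache. It is not based on the actual `dstack` code: the class name `SSHConnectionPool` and the `connect`, `is_alive`, and `close` callables are hypothetical placeholders for whatever SSH client the server uses, and the 15-second `connect_timeout` default simply mirrors the workaround above.

```python
import threading
import time
from typing import Callable, Dict, Generic, Tuple, TypeVar

C = TypeVar("C")  # type of the underlying SSH connection object (assumption)


class SSHConnectionPool(Generic[C]):
    """Keeps at most one open connection per host and reuses it across health checks."""

    def __init__(
        self,
        connect: Callable[[str, float], C],  # hypothetical: opens a connection (host, timeout)
        is_alive: Callable[[C], bool],       # hypothetical: cheap liveness probe
        close: Callable[[C], None],          # hypothetical: closes a connection
        connect_timeout: float = 15.0,       # mirrors the proposed workaround value
        max_idle_seconds: float = 300.0,     # arbitrary illustrative default
    ) -> None:
        self._connect = connect
        self._is_alive = is_alive
        self._close = close
        self._connect_timeout = connect_timeout
        self._max_idle_seconds = max_idle_seconds
        self._lock = threading.Lock()
        # host -> (connection, monotonic timestamp of last use)
        self._entries: Dict[str, Tuple[C, float]] = {}

    def get(self, host: str) -> C:
        """Return a cached connection for host, reconnecting only when needed."""
        # A single lock keeps the sketch simple; it also serializes connects
        # to different hosts, so a real pool would likely lock per host.
        with self._lock:
            entry = self._entries.get(host)
            if entry is not None:
                conn, last_used = entry
                fresh = time.monotonic() - last_used <= self._max_idle_seconds
                if fresh and self._is_alive(conn):
                    self._entries[host] = (conn, time.monotonic())
                    return conn
                # Stale or dead: drop it and fall through to reconnect.
                del self._entries[host]
                self._close(conn)
            conn = self._connect(host, self._connect_timeout)
            self._entries[host] = (conn, time.monotonic())
            return conn

    def close_all(self) -> None:
        """Close and forget every cached connection."""
        with self._lock:
            for conn, _ in self._entries.values():
                self._close(conn)
            self._entries.clear()
```

With something like this, the periodic shim health check would call `pool.get(host)` and pay the SSH handshake cost only on the first check or after a connection drops, rather than on every check.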