The server response does not contain an SSH identification string.
When testing from a developer machine, the connection to the remote server succeeds, but when the code is deployed to an Azure Function I get this error when executing the Connect() method. The referenced IETF document is "Greek" to me.
Renci.SshNet.Common.SshConnectionException: The server response does not contain an SSH identification string. The connection to the remote server was closed before any data was received. More information on the Protocol Version Exchange is available here: https://tools.ietf.org/html/rfc4253#section-4.2
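For context, the failing call looks roughly like this (a minimal sketch only; the host, port and credentials are placeholders, and SftpClient is just one of the SSH.NET client types that calls Connect()):

```csharp
using System;
using Renci.SshNet;
using Renci.SshNet.Common;

// Placeholder connection details for illustration only.
using var client = new SftpClient("sftp.example.com", 22, "user", "password");

try
{
    // Connect() performs the protocol version exchange; the exception quoted above
    // is thrown when the connection is closed before the server sends its
    // "SSH-2.0-..." identification string (RFC 4253, section 4.2).
    client.Connect();
    Console.WriteLine("Connected: " + client.ConnectionInfo.ServerVersion);
}
catch (SshConnectionException ex)
{
    Console.WriteLine("Version exchange failed: " + ex.Message);
}
```

In plain terms, the error means the TCP connection was closed before the server sent that identification banner.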
I've reproduced this problem here: https://ci.appveyor.com/project/drieseng/ssh-net/builds/48584754
@Rob-Hague do you have any idea what might have happened?
The report in #1250 looks like the same issue as https://github.com/sshnet/SSH.NET/pull/1220#issuecomment-1774128167
I had guessed that it was related to the connections being re-established too quickly. I started looking at SO_REUSEADDR and SO_REUSEPORT, but I have quite a lot of learning to do there. I'm not sure whether @rkreisel's problem has the same cause or whether it is something different.
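For anyone following along, the socket options mentioned above look roughly like this at the .NET level (a sketch only: SSH.NET does not expose these, and the raw option numbers below are Linux-specific assumptions):

```csharp
using System;
using System.Net.Sockets;

var socket = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp);

// SO_REUSEADDR: allow binding an address/port that is still in TIME_WAIT,
// which matters when connections are torn down and re-established quickly.
socket.SetSocketOption(SocketOptionLevel.Socket, SocketOptionName.ReuseAddress, true);

// SO_REUSEPORT has no named member in SocketOptionName; on Linux it is
// option 15 at SOL_SOCKET (level 1), settable via the raw overload.
socket.SetRawSocketOption(1 /* SOL_SOCKET */, 15 /* SO_REUSEPORT */, BitConverter.GetBytes(1));
```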
Probably the best thing would be to get a packet capture by running tcpdump
on the docker instance... but I wouldn't know how to do that either 🙂
FWIW I've been troubleshooting a similar problem, i.e. it works without issues locally but fails intermittently in Azure.
The problem was that Azure Functions and App Services (except the ASE/Isolated tier, which costs an arm and a leg) have a list of outbound IP addresses they can "pick" from for outbound connections. The IP selected by Azure can change across Function executions, and the actual issue was that some of those IPs were allowed while others were blacklisted (the company hosting the SFTP server was unaware that some Azure IPs were blocked until I showed them the packet capture). Those IPs could have been used by other tenants engaging in "suspicious" activities before being assigned to your Function App.
Because the issue was intermittent and similar tickets suggested SSH.NET might not be handling connections properly, I initially looked into the SSH.NET code, but after debugging it extensively I concluded it had nothing to do with the library. Then I started looking at the network (where I should have started).
This manifested itself as a FIN/ACK packet sent by the remote server immediately after the TCP handshake.
The client starts the SSH protocol version exchange unaware that the server is already initiating the TCP connection termination (screenshot below).
That is why the server never returns its identification string: it is busy closing the connection.
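One way to check whether you are hitting the same thing, without a packet capture, is to log which outbound IP the Function App happened to use on each failed execution and see whether failures cluster on particular addresses. A rough sketch (the IP-echo endpoint, host and credentials are placeholders/assumptions, and it requires outbound HTTP to be allowed):

```csharp
using System;
using System.Net.Http;
using Renci.SshNet;
using Renci.SshNet.Common;

using var http = new HttpClient();

// Ask an external echo service which outbound IP Azure picked for this execution.
// "api.ipify.org" is just an example endpoint; any "what is my IP" service works.
string outboundIp = await http.GetStringAsync("https://api.ipify.org");

using var client = new SftpClient("sftp.example.com", 22, "user", "password");
try
{
    client.Connect();
    Console.WriteLine($"OK from {outboundIp}");
}
catch (SshConnectionException ex)
{
    // Failures piling up on specific IPs over time points at filtering on the
    // server side rather than at SSH.NET.
    Console.WriteLine($"Failed from {outboundIp}: {ex.Message}");
}
```

Caveat: the HTTP call and the SFTP connection are separate outbound connections, so Azure may not SNAT them through the same IP; this is only a rough signal, and the packet capture below is the reliable way to confirm it.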
You can run a packet capture from Azure by temporarily upgrading to a Premium plan (if you are running on a Consumption plan):
- Go to "Change App Service Plan" -> Select "Function Premium"
- Go to "Diagnose and solve problems" -> "Collect Network Trace"
- Use Wireshark or similar to review the packet capture
A few options to solve this particular problem are:
- Whitelist the Function App's data center outbound IPs, if you control the SFTP server or can convince the vendor (it's a long list)
https://learn.microsoft.com/en-us/azure/azure-functions/ip-addresses?tabs=portal#data-center-outbound-ip-addresses
- Route traffic from your Function App to a network appliance with a static IP (NAT gateway, outbound load balancer etc.)
https://learn.microsoft.com/en-us/azure/nat-gateway/nat-overview
- others...
After updating to 2024.1 we started to notice the same error message when trying to connect to an AWS SFTP server, but only about 50% of the time, seemingly at random.
After a lot of digging into things I really don't understand that well, my current understanding is that there is some kind of race condition when the protocol version exchange is sent very close in time after the ACK that completes the initial TCP handshake:
Here we see Wireshark logs of a failed connection attempt at about 11:53:50, which ends up in a loop of retransmission requests until the server gives up on us. I know too little about the internals of TCP connections to say who is to blame for the two sides no longer agreeing, but somehow they end up talking past each other.
The connection attempt at 11:58:50 ends up working though, and everything is fine...
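As a side note, the version exchange can be checked outside of SSH.NET with a plain TCP connection, which helps tell a library problem apart from a network one (hostname and port below are placeholders):

```csharp
using System;
using System.IO;
using System.Net.Sockets;

// Open a raw TCP connection and read whatever the server sends first. Per
// RFC 4253 section 4.2 the server should send a line like "SSH-2.0-..."
// terminated by CR LF, independently of anything the client sends.
using var tcp = new TcpClient();
await tcp.ConnectAsync("sftp.example.com", 22);
using var reader = new StreamReader(tcp.GetStream());

// If the connection is closed or reset before a line arrives, ReadLineAsync
// returns null or throws, which mirrors the "no SSH identification string" error.
string? banner = await reader.ReadLineAsync();
Console.WriteLine(banner ?? "<connection closed before identification string>");
```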
After some tinkering I found that this small change "fixes" the problem:
I'm not suggesting that this is a long-term solution for anyone, but perhaps it sheds enough light on the problem that someone can actually fix the bug before we need to upgrade to a new version.
Until then this hack seems to have resolved our immediate issues.
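For anyone who cannot patch the library, an application-level workaround for this kind of intermittent failure is simply retrying Connect() after a short delay. To be clear, this is not the change referred to above, just a sketch of a retry wrapper around the public API:

```csharp
using System;
using System.Threading;
using Renci.SshNet;
using Renci.SshNet.Common;

// Hypothetical helper: retry Connect() a few times when the version exchange
// fails, on the theory that a fresh TCP connection a moment later succeeds.
static void ConnectWithRetry(BaseClient client, int maxAttempts = 3)
{
    for (int attempt = 1; ; attempt++)
    {
        try
        {
            client.Connect();
            return;
        }
        catch (SshConnectionException) when (attempt < maxAttempts)
        {
            // Back off briefly before the next attempt.
            Thread.Sleep(TimeSpan.FromSeconds(2 * attempt));
        }
    }
}

// Usage (placeholder credentials):
using var sftp = new SftpClient("sftp.example.com", 22, "user", "password");
ConnectWithRetry(sftp);
```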