
SQL Server Multi-Subnet Cluster connection failures


Summary

We observe connection failures to SQL Server Multi-Subnet Clusters when the SQL client application is running inside an ambient mesh and connecting to SQL Servers outside of the mesh.

I think the cause is that ztunnel's TCP proxy "optimistically" completes the TCP handshake between the downstream client (i.e. the dotnet SQL client) and ztunnel before completing and validating the handshake between ztunnel and the upstream (i.e. the SQL Server multi-subnet cluster).

Details

My understanding of how SQL Server Multi-Subnet Clusters operate is that the nodes actively coordinate amongst each other so that only a single instance is listening for connections at any given time. SQL Server clients discover the cluster's servers via a DNS record, which returns the set of IP addresses bound to the cluster (example response in a screenshot below).

After the client obtains the list of cluster IPs via DNS, it attempts to open a connection to each of these IPs in parallel. In a normal connection flow, once the client successfully completes a TCP handshake to the single active IP, it uses that connection for the SQL client session and discards the other connection attempts.
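
As an illustration, here's a minimal sketch of that "parallel connect, first successful handshake wins" pattern (Rust with tokio and the futures crate; the function name and shape are mine, not the dotnet client's actual code):

```rust
use futures::future::select_ok;
use std::net::SocketAddr;
use tokio::net::TcpStream;

// Race a connect() against every candidate IP returned by DNS. The first
// attempt to complete a TCP handshake wins; dropping the remaining
// futures aborts the in-flight attempts. If every attempt fails,
// select_ok returns the last error instead.
async fn connect_first_healthy(addrs: &[SocketAddr]) -> std::io::Result<TcpStream> {
    let attempts = addrs.iter().map(|addr| Box::pin(TcpStream::connect(*addr)));
    let (stream, _remaining) = select_ok(attempts).await?;
    Ok(stream)
}
```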

Now, when the ambient mesh is mixed in, ztunnel's pass-through TCP proxy "optimistically" and "immediately" (within tens of microseconds) completes the TCP handshake for all connection attempts to all cluster IPs, regardless of whether the upstream connection is possible.

Due to race conditions, the downstream side of the proxy (between the SQL client and ztunnel) often completes its TCP handshake for the IP of the inactive server first. At this point, the SQL client believes it has successfully established a connection to the inactive server, and drops the other connection attempts, including the one to the active server. Subsequent writes to this otherwise invalid connection fail because the upstream side of the connection cannot be established.
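
To make the race concrete, here's a minimal sketch (not ztunnel's actual code) of the generic accept-then-connect proxy pattern described above, with a placeholder destination address:

```rust
use tokio::io::copy_bidirectional;
use tokio::net::{TcpListener, TcpStream};

async fn proxy_loop(listener: TcpListener) -> std::io::Result<()> {
    loop {
        // accept() returning means the downstream handshake has already
        // completed -- the SQL client now believes it is connected.
        let (mut downstream, _peer) = listener.accept().await?;
        tokio::spawn(async move {
            // Only now is the upstream connection attempted. Against the
            // inactive cluster IP this fails, but the client may have
            // already dropped its racing attempt to the active IP.
            let original_dst = "10.0.0.72:1433"; // placeholder address
            match TcpStream::connect(original_dst).await {
                Ok(mut upstream) => {
                    let _ = copy_bidirectional(&mut downstream, &mut upstream).await;
                }
                // Closing downstream here is the client's first hint that
                // the "connection" was never viable.
                Err(_) => drop(downstream),
            }
        });
    }
}
```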

Random Thoughts on Possible Solutions

There are probably better ideas, but to capture various options and references, here are some thoughts on how this might be resolved:

This could be solved if support is added for `excludeOutboundPorts` (ref: https://github.com/istio/istio/issues/49829). We'd exclude the SQL Server port and force all other traffic through the mesh.
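
For context, here's what the sidecar-mode annotation looks like today; the linked issue asks for an ambient equivalent (1433 is SQL Server's default port, assumed here):

```yaml
# Pod template annotation (sidecar mode only; no ambient equivalent yet).
metadata:
  annotations:
    traffic.sidecar.istio.io/excludeOutboundPorts: "1433"
```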

Another possible solution is to verify the upstream connection handshake prior to completing the downstream handshake... though I'd assume this would have other impacts and downsides.

Given that our workloads connect both to SQL Server and to other services inside the mesh, we'd prefer not to opt these workloads out of the mesh entirely via the istio.io/dataplane-mode: none option, but it's worth a call out.

Packet Captures

DNS

DNS response for the multi subnet cluster.

IP 10.x.x.72 is inactive; IP 10.x.x.7 is active.


Success Scenario

Here's a typical connection sequence when the mesh is inactive, generated by a test app that opens a connection every ~two seconds.

Successful connection to active server on 10.x.x.7:


Failed connections to inactive server on 10.x.x.72:


Failure Scenario

Here's a connection attempt to the invalid address, 10.x.x.72, on the downstream side of the connection between the SQL client and ztunnel. This shows the TCP handshake completing between the client and ztunnel, the subsequent data packet, and finally the connection termination due to no valid listener on the upstream side.


And corresponding upstream packets between ztunnel and the inactive sql server on 10.x.x.72, which all fail:


kppullin · Feb 13 '25

Thanks for the detailed report! The "every connection is accepted" behavior is definitely an area I would love to improve on.

The challenge is that Linux's TCP stack doesn't really give this level of control. It's theoretically possible by hooking into the lower layers (netfilter), but pretty complex, and I've never seen it done in the real world besides some crazy prototypes I have worked on.

howardjohn · Feb 13 '25

In a perfect world, what behavior would we want? Do users need to be able to choose a certain connection algorithm or policy?

keithmattix · Feb 14 '25

In a perfect world the Linux APIs look roughly like so:

```rust
// Hypothetical API: get the SYN but don't send the SYN-ACK yet.
let partial_socket = listener.partially_accept().await;
let Ok(upstream) = TcpStream::connect(partial_socket.original_dst).await else {
    // Upstream is unreachable: reject the downstream handshake too.
    partial_socket.reject();
    return;
};
// Upstream is good: finish the downstream handshake and start proxying.
let full_socket = partial_socket.complete();
copy(full_socket, upstream);
```

IMO

howardjohn · Feb 14 '25

Just as a reference, here's the parallel socket connection logic in the dotnet sql client: https://github.com/dotnet/SqlClient/blob/8d5e4f212921d489c4e556547fa2fd2eb0d4054a/src/Microsoft.Data.SqlClient/netcore/src/Microsoft/Data/SqlClient/SNI/SNITcpHandle.cs#L484

I had some hopes there might be a way to work around this on the client side, but not seeing it yet.

kppullin · Feb 14 '25

> In a perfect world the Linux APIs look roughly like so: [code sketch quoted above] IMO

We could do this with raw sockets or something like DPDK, but that's a huge lift. To address the problem in the short term, could we support excluding certain outbound ports?

keithmattix · Feb 14 '25

We can't work around this on the client side because MSFT explicitly designed this feature to hit all IPs and connect to the one that responds. I put the MSFT documentation into a different thread: https://github.com/istio/istio/issues/49829#issuecomment-2916724491

This is also independent of the 'MultiSubnetFailover=True' parameter - setting it to False produces the same failure pattern, presumably because the client sees the failed/partial connection and then doesn't keep trying.

pboehnke-relativity · May 28 '25

@kppullin A few questions:

  1. Are you currently configuring a ServiceEntry for the SQL servers? If so, could you please provide the manifest?
  2. Are you utilizing an egress gateway/waypoint for the SQL server?

jaellio · Aug 19 '25

Hi @jaellio - the configuration used above is plain DNS directly pointing to a SQL Server cluster outside the k8s network. There is no ServiceEntry nor egress gateway configuration in place. That said, I can't think of a reason why we couldn't use these features if required.

kppullin · Aug 19 '25

For us the requirement to add a ServiceEntry or other Istio configs would be burdensome, because we run thousands of SQL instances outside of our k8s clusters and need to talk to them all.

pboehnke-relativity · Aug 20 '25

@pboehnke-relativity and @kppullin - How is this currently working for you in sidecar mode? Are you excluding the port on outbound?

jaellio · Aug 20 '25

@jaellio we use a significant number of ports, so the only way we can do this is to exclude the IP address ranges for our SQL Server deployments.

On the sidecar it looks like this (screenshot omitted): in our Helm charts it is set in base_overrides.yaml under proxy - excludeIPRanges.
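
A hedged reconstruction of that setting (Istio Helm values override; the CIDR is a placeholder, not their actual range):

```yaml
# base_overrides.yaml -- bypass the mesh for the SQL Server subnets.
global:
  proxy:
    excludeIPRanges: "10.0.0.0/16"
```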

pboehnke-relativity · Aug 20 '25

@pboehnke-relativity @kppullin FYI, this is the feature review doc I've created. We discussed it in the last community meeting and are discussing it asynchronously in comments and on Slack. Please share your thoughts and context!

https://docs.google.com/document/d/1cNA376XwRSsmB2NXNee30iZTw8gaAj8t8I4QDEFK99s/edit?usp=sharing

jaellio · Aug 22 '25

@jaellio Thank you! The proposed solution looks like a viable option for our needs. We'd use the ambient.istio.io/resolution-strategy: parallel-first-healthy option, as that's close (identical?) to what the SQL Server client does itself.

For the other resolution strategies, I'd be curious what the timeouts would be before trying the next IP, and what use cases the other strategies solve... perhaps the "complexity" of those options can be deferred.

kppullin · Aug 26 '25