TCP connections don't terminate gracefully if a node is down
What happened?
Before a pod terminates, we make it unready so that new connections don't get routed to it. At that point, only the nodes which NAT the Service ExternalIP to the pod IP keep the pod IP entry in their IPVS table. If, during this time, the node that did the NAT of ExternalIP to the pod goes down, there is no way to reach the terminating pod.
What did you expect to happen?
Even if other nodes go down, as long as the pod has not terminated there should be a way to reach it.
How can we reproduce the behavior you experienced?
- Create a cluster with 2 nodes that are in two different regions.
- The Service has DSR and Maglev hashing enabled:
apiVersion: v1
kind: Service
metadata:
  annotations:
    kube-router.io/service.dsr: "tunnel"
    kube-router.io/service.scheduler: "mh"
    kube-router.io/service.schedflags: "flag-1,flag-2"
- There are 3 pods behind this service. All the pods are running on `eqx-sjc-kubenode1-staging`:
root@gce-del-km-staging-anupam:~/anupam/manifests $ kubectl get svc,endpoints
NAME                       TYPE        CLUSTER-IP       EXTERNAL-IP     PORT(S)    AGE
service/debian-server-lb   ClusterIP   192.168.97.188   199.27.151.10   8099/TCP   6d7h

NAME                         ENDPOINTS                                      AGE
endpoints/debian-server-lb   10.36.0.3:8099,10.36.0.5:8099,10.36.0.6:8099   6d7h

root@gce-del-km-staging-anupam:~/anupam/manifests $ kubectl get pods -o wide
NAME                            READY   STATUS    RESTARTS   AGE    IP          NODE
debian-server-8b5467777-cbwt2   1/1     Running   0          18m    10.36.0.6   eqx-sjc-kubenode1-staging
debian-server-8b5467777-vts6l   1/1     Running   0          2d5h   10.36.0.3   eqx-sjc-kubenode1-staging
debian-server-8b5467777-wxfrv   1/1     Running   0          19m    10.36.0.5   eqx-sjc-kubenode1-staging
- IPVS entries are successfully applied by kube-router:
root@eqx-sjc-kubenode1-staging:~ $ ipvsadm -L -n
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  192.168.97.188:8099 mh (mh-fallback,mh-port)
  -> 10.36.0.3:8099               Masq    1      0          0
  -> 10.36.0.5:8099               Masq    1      0          0
  -> 10.36.0.6:8099               Masq    1      0          0
FWM  3754 mh (mh-fallback,mh-port)
  -> 10.36.0.3:8099               Tunnel  1      0          0
  -> 10.36.0.5:8099               Tunnel  1      0          0
  -> 10.36.0.6:8099               Tunnel  1      0          0

root@tlx-dal-kubenode1-staging:~ $ ipvsadm -L -n
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  192.168.97.188:8099 mh (mh-fallback,mh-port)
  -> 10.36.0.3:8099               Masq    1      0          0
  -> 10.36.0.5:8099               Masq    1      0          0
  -> 10.36.0.6:8099               Masq    1      0          0
FWM  3754 mh (mh-fallback,mh-port)
  -> 10.36.0.3:8099               Tunnel  1      0          0
  -> 10.36.0.5:8099               Tunnel  1      0          0
  -> 10.36.0.6:8099               Tunnel  1      1          0
- In all the 3 pods, start a TCP server on port 8099 using `nc -lv 0.0.0.0 8099`.
- Create a session from a client which is closer to `tlx-dal-kubenode1-staging` using `nc <service-ip> 8099`.
- Make a pod unready. This keeps the pod IP entry in IPVS on `tlx-dal-kubenode1-staging` only:
NAME                            READY   STATUS    RESTARTS   AGE    IP          NODE
debian-server-8b5467777-cbwt2   0/1     Running   0          18m    10.36.0.6   eqx-sjc-kubenode1-staging
debian-server-8b5467777-vts6l   1/1     Running   0          2d5h   10.36.0.3   eqx-sjc-kubenode1-staging
debian-server-8b5467777-wxfrv   1/1     Running   0          19m    10.36.0.5   eqx-sjc-kubenode1-staging
root@tlx-dal-kubenode1-staging:~ $ ipvsadm -L -n
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  192.168.97.188:8099 mh (mh-fallback,mh-port)
  -> 10.36.0.3:8099               Masq    1      0          0
  -> 10.36.0.5:8099               Masq    1      0          0
  -> 10.36.0.6:8099               Masq    1      0          0
FWM  3754 mh (mh-fallback,mh-port)
  -> 10.36.0.3:8099               Tunnel  1      0          0
  -> 10.36.0.5:8099               Tunnel  1      0          0
  -> 10.36.0.6:8099               Tunnel  0      1          0

root@tlx-dal-kubenode1-staging:~/anupam/kr-ecv $ ipvsadm -Lcn
IPVS connection entries
pro expire state       source               virtual              destination
TCP 14:58  ESTABLISHED 103.35.125.24:41876  199.27.151.10:8099   10.36.0.6:8099

root@eqx-sjc-kubenode1-staging:~/anupam/kr-ecv $ ipvsadm -L -n
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  192.168.97.188:8099 mh (mh-fallback,mh-port)
  -> 10.36.0.3:8099               Masq    1      0          0
  -> 10.36.0.5:8099               Masq    1      0          0
FWM  3754 mh (mh-fallback,mh-port)
  -> 10.36.0.3:8099               Tunnel  1      0          0
  -> 10.36.0.5:8099               Tunnel  1      0          0
- Shut down `tlx-dal-kubenode1-staging`. Now the connection is completely broken.
System Information
- Kube-Router Version (`kube-router --version`): 2.5.0, built on 2025-02-14T20:20:43Z, go1.23.6
- Kube-Router Parameters:
--kubeconfig=/usr/local/kube-router/kube-router.kubeconfig
--run-router=true
--run-firewall=true
--run-service-proxy=true
--v=3
--peer-router-ips=103.35.124.1
--peer-router-asns=65322
--cluster-asn=65321
--enable-ibgp=false
--enable-overlay=false
--bgp-graceful-restart=true
--bgp-graceful-restart-deferral-time=30s
--bgp-graceful-restart-time=5m
--advertise-external-ip=true
--ipvs-graceful-termination
--runtime-endpoint=unix:///run/containerd/containerd.sock
--enable-ipv6=true
--routes-sync-period=1m0s
--iptables-sync-period=1m0s
--ipvs-sync-period=1m0s
--hairpin-mode=true
--advertise-pod-cidr=true
- Kubernetes Version (`kubectl version`): 1.29.14
- Cloud Type: on premise
- Kubernetes Deployment Type: manual
- Kube-Router Deployment Type: on host
- Cluster Size: 2 nodes
- Kernel Version: 5.10.0-34-amd64
Apart from checking the open connections in the IPVS table, does it make sense to also check the conntrack table on the host to see if there are any connections established with the pod?
https://github.com/cloudnativelabs/kube-router/blob/85e429e9c72b2bc7de93b5f1bcce20e7c924386d/pkg/controllers/proxy/network_service_graceful.go#L111-L113
If the deletion of the endpoint from the IPVS table is avoided, then even when the other node that initially handled the TCP SYN goes down, traffic will continue to be routed correctly to the backend pod on its current host. The persistent IPVS table entry, combined with Maglev hashing, ensures packets reach the right backend pod.
I could see the conntrack entry while the connection was open (103.35.124.22 is the IP of `tlx-dal-kubenode1-staging`):
root@eqx-sjc-kubenode1-staging: $ conntrack -L -d 10.36.0.6
unknown 4 378 src=103.35.124.22 dst=10.36.0.6 [UNREPLIED] src=10.36.0.6 dst=103.35.124.22 mark=0 use=1
conntrack v1.4.7 (conntrack-tools): 1 flow entries have been shown.
The XML-formatted output has more information: `conntrack -L -o xml -d 10.36.0.6`
<?xml version="1.0" encoding="utf-8"?>
<conntrack>
  <flow>
    <meta direction="original">
      <layer3 protonum="2" protoname="ipv4">
        <src>103.35.124.22</src>
        <dst>10.36.0.6</dst>
      </layer3>
      <layer4 protonum="4" protoname="unknown"></layer4>
    </meta>
    <meta direction="reply">
      <layer3 protonum="2" protoname="ipv4">
        <src>10.36.0.6</src>
        <dst>103.35.124.22</dst>
      </layer3>
      <layer4 protonum="4" protoname="unknown"></layer4>
    </meta>
    <meta direction="independent">
      <timeout>597</timeout>
      <mark>0</mark>
      <use>1</use>
      <id>3724150459</id>
      <unreplied/>
    </meta>
  </flow>
</conntrack>
Layer 4 protonum is 4, which is the IPIP protocol (https://elixir.bootlin.com/linux/v6.13.7/source/include/uapi/linux/in.h#L36). I think the check could be something like: if the configuration is DSR with Maglev, check whether there is a conntrack entry with layer 4 = IPIP and destination IP = pod IP.
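To make that concrete, here is a minimal sketch of what such a check could look like. This is not kube-router's actual code; the function names and the choice to shell out to the conntrack CLI (rather than using netlink) are purely illustrative assumptions.

```go
// Hypothetical sketch, not kube-router's implementation: during graceful
// termination of a DSR/Maglev endpoint, keep the IPVS entry not only while IPVS
// reports active/inactive connections, but also while the host conntrack table
// still has an IPIP (protocol 4) flow destined to the pod IP.
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

const protoIPIP = "4" // IPPROTO_IPIP; shows up as "unknown 4" in conntrack -L output

// hasIPIPConntrackEntry shells out to the conntrack CLI and reports whether any
// flow destined to podIP uses the IPIP protocol. Parsing the plain-text output
// is a simplification for this sketch; a real implementation would likely use netlink.
func hasIPIPConntrackEntry(podIP string) (bool, error) {
	out, err := exec.Command("conntrack", "-L", "-d", podIP).CombinedOutput()
	if err != nil {
		return false, fmt.Errorf("conntrack -L -d %s failed: %v", podIP, err)
	}
	for _, line := range strings.Split(string(out), "\n") {
		// Example line: "unknown  4 378 src=103.35.124.22 dst=10.36.0.6 [UNREPLIED] ..."
		fields := strings.Fields(line)
		if len(fields) >= 2 && fields[1] == protoIPIP {
			return true, nil
		}
	}
	return false, nil
}

// shouldKeepEndpoint is the proposed combined condition: keep the expiring IPVS
// destination while either IPVS itself still tracks connections or the node is
// still terminating tunneled (DSR) traffic for the pod.
func shouldKeepEndpoint(activeConns, inactiveConns int, dsrMaglev bool, podIP string) bool {
	if activeConns > 0 || inactiveConns > 0 {
		return true
	}
	if dsrMaglev {
		if ok, err := hasIPIPConntrackEntry(podIP); err == nil && ok {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(shouldKeepEndpoint(0, 0, true, "10.36.0.6"))
}
```

On `eqx-sjc-kubenode1-staging` a check like this would have matched the `unknown 4` flow shown above, even though its IPVS entry for the pod had already been withdrawn.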
Thanks for doing a lot of the legwork on this @anupamdialpad. I think that you're discovering a lot of interesting edge cases with Maglev hashing in kube-router. It's probably not one of the most used features inside kube-router, but I'm nonetheless glad that you are exercising it and trying to make it better for your use case and others.
Regarding this issue, I just want to make sure that I'm fully understanding your use case and expectation. So let me repeat back what I think I get from the information that you've put here, and then you tell me if it's what you are looking for.
- You have a client connected to a Maglev service endpoint that exists as pod-1 on node-1. This connection is established via node-2.
- You do something that marks pod-1 as unready. However, the pod is not dead and it is still servicing the above connection from the client.
  a. In IPVS the endpoint is now marked as expiring, and its weight for new connections is reduced to 0.
  b. However, this state is only known on node-2. On node-1, as there were no active IPVS connections, the endpoint is already withdrawn.
- node-2 is shut down. At this point, the client's connection is disconnected even though it could still be serviced via node-1 via ECMP.
So instead, you would like step 2a and beyond to progress as:
- You do something that marks pod-1 as unready. However, the pod is not dead and it is still servicing the above connection from the client.
  b. node-1 checks both active IPVS connections as well as its conntrack table to see if there is an established connection to an endpoint on its node. Seeing an active connection to the endpoint, it leaves the IPVS entry in graceful expiry mode.
- node-2 is shut down. At this point, the client retries its connection and gets routed to node-1.
  a. node-1 still has the endpoint enabled in IPVS.
  b. Even though the endpoint doesn't have any weight, because of Maglev and the sloppy_tcp setting (see: #1860) it is still correctly hashed to the expiring endpoint.
  c. The client successfully re-establishes its connection via node-1 and continues to communicate with the service.
Did I get that correct? Did I miss anything?
Thanks @aauren for looking at it! Yes, you got it right.
So for this, I think that your solution solves it neatly as long as you only have a 2 node cluster. Unfortunately, I don't see how you could solve the problem in a multi-node cluster. For instance, if you had 3 nodes:
- node-1 - contains pod-1 (the workload) with the service endpoint
- node-2 - is a generic worker node that happens to be advertising the service
- node-3 - is a generic worker node that also is advertising the service
Client establishes a connection to pod-1 via node-2. pod-1 becomes unready and the service goes into graceful termination.
If we were to merge the connection tracking change in addition to checking the IPVS statistics, then I would imagine the following would be true:
- node-1 - would enter graceful termination for the service because it contains an active connection tracking state for the service
- node-2 - would enter graceful termination for the service because it contains IPVS connection state for the client
- node-3 - would withdraw the endpoint from IPVS because it doesn't contain the endpoint nor did it have an active client connection
- (same for nodes 4-N if you had more nodes)
node-2 is shut down.
The client retries its existing connection and is BGP hashed to node-3 (or any node other than node-1), and it will still result in a connection refused, right?
It seems to me that you have some pretty specific requirements regarding TCP connections. In most app environments I've been exposed to, this is usually handled by re-establishing the TCP connection and having good retry and backoff logic in the client. Is that not something that is viable in your use-case?
> The client retries its existing connection and is BGP hashed to node-3 (or any node other than node-1), and it will still result in a connection refused, right?
Correct, the connection will fail.
> this is usually handled by re-establishing the TCP connection and having good retry and backoff logic in the client. Is that not something that is viable in your use-case?
Unfortunately the client is not within our control :( Even if the client retries, it will not be able to connect to the same backend pod after node-2 (in your example) goes down, right? Since only node-2 had the pod IP entry with weight 0 in its IPVS table, the other nodes don't have a pod-1 entry in their IPVS tables.
Another idea I have is to delay the removal of the pod endpoint from the IPVS table for the duration of its terminationGracePeriodSeconds, or until the pod is terminated. This ensures that all nodes retain the pod endpoint but set its weight to 0. As a result, existing connections can continue to work.
> Even if the client retries, it will not be able to connect to the same backend pod after node-2 (in your example) goes down, right? Since only node-2 had the pod IP entry with weight 0 in its IPVS table, the other nodes don't have a pod-1 entry in their IPVS tables.
I guess I was wondering whether sloppy_tcp would override this functionality or not. If IPVS won't allow the session to transition nodes when the endpoint weight is 0, then I'm not sure that there is a path forward at all. Because any other node the client tries to connect to will have to have the endpoint weight set to 0.
Do you know if this is true?
Looking at the sloppy_tcp patch, it seems that this feature just allows a TCP ACK to behave similarly to a TCP SYN for selecting the backend server:
/* No !th->ack check to allow scheduling on SYN+ACK for Active FTP */
rcu_read_lock();
- if (th->syn &&
+ if ((th->syn || sysctl_sloppy_tcp(ipvs)) && !th->rst &&
(svc = ip_vs_service_find(net, af, skb->mark, iph->protocol,
&iph->daddr, th->dest))) {
int ignored;
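As a side note, a quick way to confirm on a node whether this sloppy_tcp behaviour is actually enabled is to read the ip_vs sysctl directly. A minimal sketch, assuming the standard procfs path for net.ipv4.vs.sloppy_tcp (nothing kube-router specific):

```go
// Small standalone sketch (not part of kube-router): report whether the IPVS
// sloppy_tcp setting introduced by the patch above is enabled on this node.
package main

import (
	"fmt"
	"os"
	"strings"
)

func main() {
	const path = "/proc/sys/net/ipv4/vs/sloppy_tcp"
	raw, err := os.ReadFile(path)
	if err != nil {
		// e.g. the ip_vs module is not loaded on this node
		fmt.Println("could not read", path, ":", err)
		return
	}
	fmt.Println("sloppy_tcp enabled:", strings.TrimSpace(string(raw)) == "1")
}
```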
> Because any other node the client tries to connect to will have to have the endpoint weight set to 0.
Yes this statement needs to be true for this scenario to work.
That's what I was thinking: instead of removing the endpoint when the active/inactive connection count goes to 0, the logic would set the weight to 0 when the pod is unready, and remove the endpoint only when the pod is terminated or its terminationGracePeriodSeconds has expired.
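A minimal sketch of that rule, with purely illustrative type and field names (none of this is kube-router's existing code):

```go
// Hypothetical sketch of the proposed graceful-termination rule: keep the unready
// endpoint in IPVS on every node with weight 0, and delete it only when the pod is
// gone or its termination grace period has elapsed, independent of connection counts.
package main

import (
	"fmt"
	"time"
)

// endpointState models the minimum information the proposal needs; the field names
// are illustrative, not taken from kube-router.
type endpointState struct {
	podTerminated    bool          // the pod object is gone / its containers have exited
	unreadySince     time.Time     // when the endpoint became unready
	terminationGrace time.Duration // the pod's terminationGracePeriodSeconds
}

// desiredWeight: while unready but not yet removed, the endpoint gets weight 0 so it
// receives no new connections but stays hashable for existing or retrying clients.
func desiredWeight() int { return 0 }

// shouldRemove implements the proposed removal condition.
func shouldRemove(ep endpointState, now time.Time) bool {
	return ep.podTerminated || now.Sub(ep.unreadySince) >= ep.terminationGrace
}

func main() {
	ep := endpointState{
		unreadySince:     time.Now().Add(-10 * time.Second),
		terminationGrace: 30 * time.Second,
	}
	fmt.Println("weight:", desiredWeight(), "remove now:", shouldRemove(ep, time.Now()))
}
```

The point is that removal would be driven by the pod lifecycle rather than by the IPVS connection counters.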
@aauren any thoughts on my previous comment?
I guess that I'm still a bit dubious about whether or not setting the weight to 0, even with Maglev hashing, will allow you to get routed to that backend. Or whether doing it this way won't break other use cases that maybe rely on the endpoint being completely removed.
But you're welcome to try it out and let us know how it goes. If it works for you, feel free to submit a PR. If the PR still passes the upstream k8s conformance tests and doesn't break anything obvious it could potentially be accepted depending on how much logic it does or does not introduce.
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
This issue was closed because it has been stale for 5 days with no activity.