TCP connections don't terminate gracefully if a node is down
What happened?
Before a pod terminates, we make it unready so that new connections don't get routed to it. At that point, only the nodes which NAT the Service ExternalIP to the pod IP keep the pod IP entry in their IPVS table. If, during this time, the node that did the NAT of ExternalIP to the pod goes down, there is no way to reach the terminating pod.
What did you expect to happen?
Even if other nodes go down, as long as the pod has not terminated there should be a way to reach it.
How can we reproduce the behavior you experienced?
- Create a cluster with 2 nodes that are in two different regions.
- The Service has DSR and Maglev hashing enabled:
apiVersion: v1
kind: Service
metadata:
  annotations:
    kube-router.io/service.dsr: "tunnel"
    kube-router.io/service.scheduler: "mh"
    kube-router.io/service.schedflags: "flag-1,flag-2"
- There are 3 pods behind this service. All the pods are running on `eqx-sjc-kubenode1-staging`:
root@gce-del-km-staging-anupam:~/anupam/manifests $ kubectl get svc,endpoints
NAME                       TYPE        CLUSTER-IP       EXTERNAL-IP     PORT(S)    AGE
service/debian-server-lb   ClusterIP   192.168.97.188   199.27.151.10   8099/TCP   6d7h

NAME                         ENDPOINTS                                      AGE
endpoints/debian-server-lb   10.36.0.3:8099,10.36.0.5:8099,10.36.0.6:8099   6d7h

root@gce-del-km-staging-anupam:~/anupam/manifests $ kubectl get pods -o wide
NAME                            READY   STATUS    RESTARTS   AGE    IP          NODE
debian-server-8b5467777-cbwt2   1/1     Running   0          18m    10.36.0.6   eqx-sjc-kubenode1-staging
debian-server-8b5467777-vts6l   1/1     Running   0          2d5h   10.36.0.3   eqx-sjc-kubenode1-staging
debian-server-8b5467777-wxfrv   1/1     Running   0          19m    10.36.0.5   eqx-sjc-kubenode1-staging
- IPVS entries are successfully applied by kube-router:
root@eqx-sjc-kubenode1-staging:~ $ ipvsadm -L -n
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  192.168.97.188:8099 mh (mh-fallback,mh-port)
  -> 10.36.0.3:8099               Masq    1      0          0
  -> 10.36.0.5:8099               Masq    1      0          0
  -> 10.36.0.6:8099               Masq    1      0          0
FWM  3754 mh (mh-fallback,mh-port)
  -> 10.36.0.3:8099               Tunnel  1      0          0
  -> 10.36.0.5:8099               Tunnel  1      0          0
  -> 10.36.0.6:8099               Tunnel  1      0          0

root@tlx-dal-kubenode1-staging:~ $ ipvsadm -L -n
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  192.168.97.188:8099 mh (mh-fallback,mh-port)
  -> 10.36.0.3:8099               Masq    1      0          0
  -> 10.36.0.5:8099               Masq    1      0          0
  -> 10.36.0.6:8099               Masq    1      0          0
FWM  3754 mh (mh-fallback,mh-port)
  -> 10.36.0.3:8099               Tunnel  1      0          0
  -> 10.36.0.5:8099               Tunnel  1      0          0
  -> 10.36.0.6:8099               Tunnel  1      1          0
- In all the 3 pods, start a TCP server on port 8099 using `nc -lv 0.0.0.0 8099`.
- Create a session from a client which is closer to `tlx-dal-kubenode1-staging` using `nc <service-ip> 8099`.
- Make a pod unready. This keeps the pod IP entry in IPVS on `tlx-dal-kubenode1-staging` only:
NAME                            READY   STATUS    RESTARTS   AGE    IP          NODE
debian-server-8b5467777-cbwt2   0/1     Running   0          18m    10.36.0.6   eqx-sjc-kubenode1-staging
debian-server-8b5467777-vts6l   1/1     Running   0          2d5h   10.36.0.3   eqx-sjc-kubenode1-staging
debian-server-8b5467777-wxfrv   1/1     Running   0          19m    10.36.0.5   eqx-sjc-kubenode1-staging
root@tlx-dal-kubenode1-staging:~ $ ipvsadm -L -n
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  192.168.97.188:8099 mh (mh-fallback,mh-port)
  -> 10.36.0.3:8099               Masq    1      0          0
  -> 10.36.0.5:8099               Masq    1      0          0
  -> 10.36.0.6:8099               Masq    1      0          0
FWM  3754 mh (mh-fallback,mh-port)
  -> 10.36.0.3:8099               Tunnel  1      0          0
  -> 10.36.0.5:8099               Tunnel  1      0          0
  -> 10.36.0.6:8099               Tunnel  0      1          0

root@tlx-dal-kubenode1-staging:~/anupam/kr-ecv $ ipvsadm -Lcn
IPVS connection entries
pro expire state       source               virtual              destination
TCP 14:58  ESTABLISHED 103.35.125.24:41876  199.27.151.10:8099   10.36.0.6:8099

root@eqx-sjc-kubenode1-staging:~/anupam/kr-ecv $ ipvsadm -L -n
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  192.168.97.188:8099 mh (mh-fallback,mh-port)
  -> 10.36.0.3:8099               Masq    1      0          0
  -> 10.36.0.5:8099               Masq    1      0          0
FWM  3754 mh (mh-fallback,mh-port)
  -> 10.36.0.3:8099               Tunnel  1      0          0
  -> 10.36.0.5:8099               Tunnel  1      0          0
- Shut down `tlx-dal-kubenode1-staging`. Now the connection is completely broken.
System Information
- Kube-Router Version (`kube-router --version`): 2.5.0, built on 2025-02-14T20:20:43Z, go1.23.6
- Kube-Router Parameters:
--kubeconfig=/usr/local/kube-router/kube-router.kubeconfig
--run-router=true
--run-firewall=true
--run-service-proxy=true
--v=3
--peer-router-ips=103.35.124.1
--peer-router-asns=65322
--cluster-asn=65321
--enable-ibgp=false
--enable-overlay=false
--bgp-graceful-restart=true
--bgp-graceful-restart-deferral-time=30s
--bgp-graceful-restart-time=5m
--advertise-external-ip=true
--ipvs-graceful-termination
--runtime-endpoint=unix:///run/containerd/containerd.sock
--enable-ipv6=true
--routes-sync-period=1m0s
--iptables-sync-period=1m0s
--ipvs-sync-period=1m0s
--hairpin-mode=true
--advertise-pod-cidr=true
- Kubernetes Version (`kubectl version`): 1.29.14
- Cloud Type: on premise
- Kubernetes Deployment Type: manual
- Kube-Router Deployment Type: on host
- Cluster Size: 2 nodes
- Kernel Version: 5.10.0-34-amd64
Apart from checking the open connections in the IPVS table, does it make sense to also check the conntrack table on the host to see if there are any connections established with the pod?
https://github.com/cloudnativelabs/kube-router/blob/85e429e9c72b2bc7de93b5f1bcce20e7c924386d/pkg/controllers/proxy/network_service_graceful.go#L111-L113
If the deletion of the endpoint from the IPVS table is avoided, then even when the other node that initially handled the TCP SYN goes down, traffic will continue to be routed correctly to the backend pod on its current host. The persistent IPVS table entry, combined with Maglev hashing, ensures packets reach the right backend pod.
I could see the conntrack entry while the connection was open (103.35.124.22 is the IP of `tlx-dal-kubenode1-staging`):
root@eqx-sjc-kubenode1-staging: $ conntrack -L -d 10.36.0.6
unknown 4 378 src=103.35.124.22 dst=10.36.0.6 [UNREPLIED] src=10.36.0.6 dst=103.35.124.22 mark=0 use=1
conntrack v1.4.7 (conntrack-tools): 1 flow entries have been shown.
The XML-formatted output has more information: `conntrack -L -o xml -d 10.36.0.6`
<?xml version="1.0" encoding="utf-8"?>
<conntrack>
  <flow>
    <meta direction="original">
      <layer3 protonum="2" protoname="ipv4">
        <src>103.35.124.22</src>
        <dst>10.36.0.6</dst>
      </layer3>
      <layer4 protonum="4" protoname="unknown"></layer4>
    </meta>
    <meta direction="reply">
      <layer3 protonum="2" protoname="ipv4">
        <src>10.36.0.6</src>
        <dst>103.35.124.22</dst>
      </layer3>
      <layer4 protonum="4" protoname="unknown"></layer4>
    </meta>
    <meta direction="independent">
      <timeout>597</timeout>
      <mark>0</mark>
      <use>1</use>
      <id>3724150459</id>
      <unreplied/>
    </meta>
  </flow>
</conntrack>
Layer 4 protonum is 4, which is the IPIP protocol (https://elixir.bootlin.com/linux/v6.13.7/source/include/uapi/linux/in.h#L36). I think the check could be something like: if the configuration is DSR with Maglev, check whether there is a conntrack entry with layer 4 = IPIP and destination IP = pod IP.
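To make that concrete, here is a minimal sketch of what such a check could look like. This is not kube-router's actual code; the function names and the choice to shell out to the conntrack CLI (rather than using netlink) are purely illustrative assumptions.

```go
// Hypothetical sketch, not kube-router's implementation: during graceful
// termination of a DSR/Maglev endpoint, keep the IPVS entry not only while IPVS
// reports active/inactive connections, but also while the host conntrack table
// still has an IPIP (protocol 4) flow destined to the pod IP.
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

const protoIPIP = "4" // IPPROTO_IPIP; shows up as "unknown 4" in conntrack -L output

// hasIPIPConntrackEntry shells out to the conntrack CLI and reports whether any
// flow destined to podIP uses the IPIP protocol. Parsing the plain-text output
// is a simplification for this sketch; a real implementation would likely use netlink.
func hasIPIPConntrackEntry(podIP string) (bool, error) {
	out, err := exec.Command("conntrack", "-L", "-d", podIP).CombinedOutput()
	if err != nil {
		return false, fmt.Errorf("conntrack -L -d %s failed: %v", podIP, err)
	}
	for _, line := range strings.Split(string(out), "\n") {
		// Example line: "unknown  4 378 src=103.35.124.22 dst=10.36.0.6 [UNREPLIED] ..."
		fields := strings.Fields(line)
		if len(fields) >= 2 && fields[1] == protoIPIP {
			return true, nil
		}
	}
	return false, nil
}

// shouldKeepEndpoint is the proposed combined condition: keep the expiring IPVS
// destination while either IPVS itself still tracks connections or the node is
// still terminating tunneled (DSR) traffic for the pod.
func shouldKeepEndpoint(activeConns, inactiveConns int, dsrMaglev bool, podIP string) bool {
	if activeConns > 0 || inactiveConns > 0 {
		return true
	}
	if dsrMaglev {
		if ok, err := hasIPIPConntrackEntry(podIP); err == nil && ok {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(shouldKeepEndpoint(0, 0, true, "10.36.0.6"))
}
```

On `eqx-sjc-kubenode1-staging` a check like this would have matched the `unknown 4` flow shown above, even though its IPVS entry for the pod had already been withdrawn.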
Thanks for doing a lot of the legwork on this @anupamdialpad. I think that you're discovering a lot of interesting edge cases with Maglev hashing in kube-router. It's probably not one of the most used features inside kube-router, but I'm nonetheless glad that you are exercising it and trying to make it better for your use case and others.
Regarding this issue, I just want to make sure that I'm fully understanding your use case and expectation. So let me repeat back what I think I get from the information that you've put here, and then you tell me if it's what you are looking for.
- You have a client connected to a Maglev service endpoint that exists as pod-1 on node-1. This connection is established via node-2.
- You do something that marks pod-1 as unready. However, the pod is not dead and it is still servicing the above connection from the client.
  a. In IPVS the endpoint is now marked as expiring, and its weight for new connections is reduced to 0.
  b. However, this state is only known on node-2. On node-1, as there were no active IPVS connections, the endpoint is already withdrawn.
- node-2 is shut down. At this point, the client's connection is disconnected even though it could still be serviced via node-1 via ECMP.
So instead, you would like step 2a and beyond to progress as:
- You do something that marks pod-1 as unready. However, the pod is not dead and it is still servicing the above connection from the client.
  b. node-1 checks both active IPVS connections as well as its conntrack table to see if there is an established connection to an endpoint on its node. Seeing an active connection to the endpoint, it leaves the IPVS entry in graceful expiry mode.
- node-2 is shut down. At this point, the client retries its connection and gets routed to node-1.
  a. node-1 still has the endpoint enabled in IPVS.
  b. Even though the endpoint doesn't have any weight, because of Maglev and the sloppy_tcp setting (see: #1860) it is still correctly hashed to the expiring endpoint.
  c. The client successfully re-establishes its connection via node-1 and continues to communicate with the service.
Did I get that correct? Did I miss anything?
Thanks @aauren for looking at it! Yes, you got it right.
So for this, I think that your solution solves it neatly as long as you only have a 2 node cluster. Unfortunately, I don't see how you could solve the problem in a multi-node cluster. For instance, if you had 3 nodes:
- node-1 - contains pod-1 (the workload) with the service endpoint
- node-2 - is a generic worker node that happens to be advertising the service
- node-3 - is a generic worker node that also is advertising the service
Client establishes a connection to pod-1 via node-2. pod-1 becomes unready and the service goes into graceful termination.
If we were to merge the connection tracking change in addition to checking the IPVS statistics, then I would imagine the following would be true:
- node-1 - would enter graceful termination for the service because it contains an active connection tracking state for the service
- node-2 - would enter graceful termination for the service because it contains IPVS connection state for the client
- node-3 - would withdraw the endpoint from IPVS because it doesn't contain the endpoint nor did it have an active client connection
- (same for nodes 4-N if you had more nodes)
node-2 is shut down.
The client retries its existing connection and is BGP hashed to node-3 (or any node other than node-1), and it will still result in a connection refused, right?
It seems to me that you have some pretty specific requirements regarding TCP connections. In most app environments I've been exposed to, this is usually handled by re-establishing the TCP connection and having good retry and backoff logic in the client. Is that not something that is viable in your use-case?
> The client retries its existing connection and is BGP hashed to node-3 (or any node other than node-1), and it will still result in a connection refused, right?
Correct, the connection will fail.
> this is usually handled by re-establishing the TCP connection and having good retry and backoff logic in the client. Is that not something that is viable in your use-case?
Unfortunately the client is not within our control :( Even if the client retries, it will not be able to connect to the same backend pod after node-2 (in your example) goes down, right? Since only node-2 had the pod IP entry with weight 0 in its IPVS table, the other nodes don't have a pod-1 entry in their IPVS tables.
Another idea I have is to delay the removal of the pod endpoint from the IPVS table for the duration of its terminationGracePeriodSeconds, or until the pod is terminated. This ensures that all nodes retain the pod endpoint but set its weight to 0. As a result, existing connections can continue to work.
> Even if the client retries, it will not be able to connect to the same backend pod after node-2 (in your example) goes down, right? Since only node-2 had the pod IP entry with weight 0 in its IPVS table, the other nodes don't have a pod-1 entry in their IPVS tables.
I guess I was wondering whether sloppy_tcp would override this functionality or not. If IPVS won't allow the session to transition nodes when the endpoint weight is 0, then I'm not sure that there is a path forward at all. Because any other node the client tries to connect to will have to have the endpoint weight set to 0.
Do you know if this is true?
Looking at the sloppy_tcp patch, it seems that this feature just allows a TCP ACK to behave similarly to a TCP SYN for selecting the backend server:
/* No !th->ack check to allow scheduling on SYN+ACK for Active FTP */
rcu_read_lock();
- if (th->syn &&
+ if ((th->syn || sysctl_sloppy_tcp(ipvs)) && !th->rst &&
(svc = ip_vs_service_find(net, af, skb->mark, iph->protocol,
&iph->daddr, th->dest))) {
int ignored;
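As a side note, a quick way to confirm on a node whether this sloppy_tcp behaviour is actually enabled is to read the ip_vs sysctl directly. A minimal sketch, assuming the standard procfs path for net.ipv4.vs.sloppy_tcp (nothing kube-router specific):

```go
// Small standalone sketch (not part of kube-router): report whether the IPVS
// sloppy_tcp setting introduced by the patch above is enabled on this node.
package main

import (
	"fmt"
	"os"
	"strings"
)

func main() {
	const path = "/proc/sys/net/ipv4/vs/sloppy_tcp"
	raw, err := os.ReadFile(path)
	if err != nil {
		// e.g. the ip_vs module is not loaded on this node
		fmt.Println("could not read", path, ":", err)
		return
	}
	fmt.Println("sloppy_tcp enabled:", strings.TrimSpace(string(raw)) == "1")
}
```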
> Because any other node the client tries to connect to will have to have the endpoint weight set to 0.
Yes this statement needs to be true for this scenario to work.
That's what I was thinking: instead of removing the endpoint when the active/inactive connection count goes to 0, the logic would set the weight to 0 when the pod is unready, and remove the endpoint only when the pod is terminated or its terminationGracePeriodSeconds has expired.
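A minimal sketch of that rule, with purely illustrative type and field names (none of this is kube-router's existing code):

```go
// Hypothetical sketch of the proposed graceful-termination rule: keep the unready
// endpoint in IPVS on every node with weight 0, and delete it only when the pod is
// gone or its termination grace period has elapsed, independent of connection counts.
package main

import (
	"fmt"
	"time"
)

// endpointState models the minimum information the proposal needs; the field names
// are illustrative, not taken from kube-router.
type endpointState struct {
	podTerminated    bool          // the pod object is gone / its containers have exited
	unreadySince     time.Time     // when the endpoint became unready
	terminationGrace time.Duration // the pod's terminationGracePeriodSeconds
}

// desiredWeight: while unready but not yet removed, the endpoint gets weight 0 so it
// receives no new connections but stays hashable for existing or retrying clients.
func desiredWeight() int { return 0 }

// shouldRemove implements the proposed removal condition.
func shouldRemove(ep endpointState, now time.Time) bool {
	return ep.podTerminated || now.Sub(ep.unreadySince) >= ep.terminationGrace
}

func main() {
	ep := endpointState{
		unreadySince:     time.Now().Add(-10 * time.Second),
		terminationGrace: 30 * time.Second,
	}
	fmt.Println("weight:", desiredWeight(), "remove now:", shouldRemove(ep, time.Now()))
}
```

The point is that removal would be driven by the pod lifecycle rather than by the IPVS connection counters.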
@aauren any thoughts on my previous comment?
I guess that I'm still a bit dubious about whether or not setting the weight to 0, even with Maglev hashing, will allow you to get routed to that backend. Or whether doing it this way won't break other use cases that maybe rely on the endpoint being completely removed.
But you're welcome to try it out and let us know how it goes. If it works for you, feel free to submit a PR. If the PR still passes the upstream k8s conformance tests and doesn't break anything obvious it could potentially be accepted depending on how much logic it does or does not introduce.
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
This issue was closed because it has been stale for 5 days with no activity.