
NLB (type=internal) + ETP=local + Client IP Preservation needs to be fixed

[Open] tssurya opened this issue 1 year ago • 16 comments

What happened: Created a load balancer of type NLB:

$ oc describe svc -n openshift-ingress router-router-internal-nlb
Name:                     router-router-internal-nlb
Namespace:                openshift-ingress
Labels:                   app=router
                          ingresscontroller.operator.openshift.io/owning-ingresscontroller=router-internal-nlb
                          router=router-router-internal-nlb
Annotations:              service.beta.kubernetes.io/aws-load-balancer-healthcheck-healthy-threshold: 2
                          service.beta.kubernetes.io/aws-load-balancer-healthcheck-interval: 10
                          service.beta.kubernetes.io/aws-load-balancer-healthcheck-timeout: 4
                          service.beta.kubernetes.io/aws-load-balancer-healthcheck-unhealthy-threshold: 2
                          service.beta.kubernetes.io/aws-load-balancer-internal: true
                          service.beta.kubernetes.io/aws-load-balancer-type: nlb
                          traffic-policy.network.alpha.openshift.io/local-with-fallback: 
Selector:                 ingresscontroller.operator.openshift.io/deployment-ingresscontroller=router-internal-nlb
Type:                     LoadBalancer
IP Family Policy:         SingleStack
IP Families:              IPv4
IP:                       172.30.57.241
IPs:                      172.30.57.241
LoadBalancer Ingress:     a0935eb5632a340f6b3e5487d3298847-5cea615e1ac1af7e.elb.us-west-1.amazonaws.com
Port:                     http  80/TCP
TargetPort:               http/TCP
NodePort:                 http  31056/TCP
Endpoints:                10.131.0.17:80
Port:                     https  443/TCP
TargetPort:               https/TCP
NodePort:                 https  31569/TCP
Endpoints:                10.131.0.17:443
Session Affinity:         None
External Traffic Policy:  Local
HealthCheck NodePort:     30882
$ oc get pods -n openshift-ingress -owide                                                                                                
NAME                                          READY   STATUS    RESTARTS   AGE   IP            NODE                                        NOMINATED NODE   READINESS GATES  
router-default-6bc948b565-5pncp               1/1     Running   0          71m   10.131.0.7    ip-10-0-171-72.us-west-1.compute.internal   <none>           <none>           
router-default-6bc948b565-jcwtv               1/1     Running   0          71m   10.128.2.11   ip-10-0-190-96.us-west-1.compute.internal   <none>           <none>           
router-router-internal-nlb-7f68bc757f-dpgbr   1/1     Running   0          56s   10.131.0.17   ip-10-0-171-72.us-west-1.compute.internal   <none>           <none> 

router-router-internal-nlb-7f68bc757f-dpgbr is the backend of the LB svc.
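
For context, the relevant parts of the Service can be reconstructed from the describe output above. A minimal sketch (illustrative only, not the exact object the OpenShift ingress operator generates) that produces the same internal NLB + ETP=Local configuration would be:

$ cat <<'EOF' | oc apply -f -
# Sketch reconstructed from the annotations shown above; selector and
# ports are illustrative. Internal NLB + ExternalTrafficPolicy=Local.
apiVersion: v1
kind: Service
metadata:
  name: router-router-internal-nlb
  namespace: openshift-ingress
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: "nlb"
    service.beta.kubernetes.io/aws-load-balancer-internal: "true"
spec:
  type: LoadBalancer
  externalTrafficPolicy: Local
  selector:
    ingresscontroller.operator.openshift.io/deployment-ingresscontroller: router-internal-nlb
  ports:
  - name: http
    port: 80
    targetPort: http
  - name: https
    port: 443
    targetPort: https
EOF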

ETP=local, so let's pick the node where both the router-nlb backend pod and a router-default pod are running:

$ oc rsh -n openshift-ingress router-default-6bc948b565-5pncp
sh-4.4$ dig +short a0935eb5632a340f6b3e5487d3298847-5cea615e1ac1af7e.elb.us-west-1.amazonaws.com
10.0.163.109
sh-4.4$ curl --local-port 36363 10.0.163.109
curl: (7) Failed to connect to 10.0.163.109 port 80: Connection timed out

Traffic flow: pod (router-default-6bc948b565-5pncp) -> LB VIP (10.0.163.109). This flow breaks. Since Client IP Preservation is enabled, the packet leaves the cluster for the AWS LB node and comes back in with srcIP == nodeIP of the originating node (the pod's source IP is SNATed to the node IP on the way out, as the tcpdump below shows). This hairpin case, where the srcPod and dstPod are on the same node, doesn't work; it's a weird scenario that CNI plugins don't anticipate.
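
(For reference, client IP preservation is an attribute of the NLB's target group. Assuming access to the AWS account, the current value can be confirmed with the AWS CLI; <target-group-arn> below is a placeholder for the target group the cloud provider created for this Service:)

$ aws elbv2 describe-target-group-attributes \
    --target-group-arn <target-group-arn> \
    --query 'Attributes[?Key==`preserve_client_ip.enabled`]'
# In this setup the attribute should come back with Value "true".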

TCPDUMP:

Traffic going out of the cluster:
tcpdump: listening on any, link-type LINUX_SLL (Linux cooked v1), capture size 262144 bytes
09:53:21.301824   P 0a:58:0a:83:00:07 ethertype IPv4 (0x0800), length 76: (tos 0x0, ttl 64, id 25158, offset 0, flags [DF], proto TCP (6), length 60)
    10.131.0.7.36363 > 10.0.163.109.80: Flags [S], cksum 0xb825 (incorrect -> 0x530b), seq 3046883424, win 26583, options [mss 8861,sackOK,TS val 982481238 ecr 0,nop,wscale 7], length 0
09:53:21.302752 Out 06:d1:d3:93:8b:d9 ethertype IPv4 (0x0800), length 76: (tos 0x0, ttl 62, id 25158, offset 0, flags [DF], proto TCP (6), length 60)
    10.0.171.72.36363 > 10.0.163.109.80: Flags [S], cksum 0xa84c (correct), seq 3046883424, win 26583, options [mss 8861,sackOK,TS val 982481238 ecr 0,nop,wscale 7], length 0

Traffic coming back into the cluster:                                                                                                                                      
09:53:21.303227  In 06:b2:f4:da:68:9d ethertype IPv4 (0x0800), length 76: (tos 0x0, ttl 61, id 25158, offset 0, flags [DF], proto TCP (6), length 60)
    10.0.171.72.36363 > 10.0.171.72.31056: Flags [S], cksum 0x2961 (correct), seq 3046883424, win 26583, options [mss 8365,sackOK,TS val 982481238 ecr 0,nop,wscale 7], length 0

I think this should be fixed on the AWS side in one of two ways (a workaround sketch for option 1 follows this list):

  1. Turn off client IP preservation for this case (NLB + private LB VIP) so that the hairpin case works, and document it. (Users don't always have access to the AWS console to do this themselves; it needs to happen automatically at creation time.) Note, though, that by definition ETP=local means the client's srcIP MUST be preserved, which is what the other providers do as well.
  2. OR create load balancers that publish an IP instead of a hostname, so that CNI plugins can install in-cluster rules that short-circuit this traffic, like we do for GCP/Azure.
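
As a stopgap sketch of option 1, client IP preservation can be switched off on the target group either directly through the AWS CLI, or via a Service annotation if the Service is managed by the out-of-tree AWS Load Balancer Controller rather than the legacy in-tree provider this issue is filed against (the annotation name below belongs to that controller, not the in-tree provider). <target-group-arn> is a placeholder.

$ # Manual workaround: disable client IP preservation on the NLB target group.
$ aws elbv2 modify-target-group-attributes \
    --target-group-arn <target-group-arn> \
    --attributes Key=preserve_client_ip.enabled,Value=false
$ # Equivalent via annotation when the AWS Load Balancer Controller manages the Service:
$ oc -n openshift-ingress annotate svc router-router-internal-nlb \
    service.beta.kubernetes.io/aws-load-balancer-target-group-attributes=preserve_client_ip.enabled=false

Note the trade-off already called out in option 1: with preservation off, the hairpin works, but ETP=local no longer delivers the real client source IP to the backend.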

Reproducible on all versions.

What you expected to happen:

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version):
  • Cloud provider or hardware configuration:
  • OS (e.g. from /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:

/kind bug

tssurya · Jan 25 '23 10:01