cloud-provider-aws
NLB (type=internal) + ETP=local + Client IP Preservation needs to be fixed
What happened: Created a load balancer of type NLB:
$ oc describe svc -n openshift-ingress router-router-internal-nlb
Name: router-router-internal-nlb
Namespace: openshift-ingress
Labels: app=router
ingresscontroller.operator.openshift.io/owning-ingresscontroller=router-internal-nlb
router=router-router-internal-nlb
Annotations: service.beta.kubernetes.io/aws-load-balancer-healthcheck-healthy-threshold: 2
service.beta.kubernetes.io/aws-load-balancer-healthcheck-interval: 10
service.beta.kubernetes.io/aws-load-balancer-healthcheck-timeout: 4
service.beta.kubernetes.io/aws-load-balancer-healthcheck-unhealthy-threshold: 2
service.beta.kubernetes.io/aws-load-balancer-internal: true
service.beta.kubernetes.io/aws-load-balancer-type: nlb
traffic-policy.network.alpha.openshift.io/local-with-fallback:
Selector: ingresscontroller.operator.openshift.io/deployment-ingresscontroller=router-internal-nlb
Type: LoadBalancer
IP Family Policy: SingleStack
IP Families: IPv4
IP: 172.30.57.241
IPs: 172.30.57.241
LoadBalancer Ingress: a0935eb5632a340f6b3e5487d3298847-5cea615e1ac1af7e.elb.us-west-1.amazonaws.com
Port: http 80/TCP
TargetPort: http/TCP
NodePort: http 31056/TCP
Endpoints: 10.131.0.17:80
Port: https 443/TCP
TargetPort: https/TCP
NodePort: https 31569/TCP
Endpoints: 10.131.0.17:443
Session Affinity: None
External Traffic Policy: Local
HealthCheck NodePort: 30882
$ oc get pods -n openshift-ingress -owide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
router-default-6bc948b565-5pncp 1/1 Running 0 71m 10.131.0.7 ip-10-0-171-72.us-west-1.compute.internal <none> <none>
router-default-6bc948b565-jcwtv 1/1 Running 0 71m 10.128.2.11 ip-10-0-190-96.us-west-1.compute.internal <none> <none>
router-router-internal-nlb-7f68bc757f-dpgbr 1/1 Running 0 56s 10.131.0.17 ip-10-0-171-72.us-west-1.compute.internal <none> <none>
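To find such a node programmatically rather than by eyeballing the pod list, the two deployments' node names can be intersected. A rough sketch, assuming the `deployment-ingresscontroller` labels shown in the Service selector above:

```shell
# Nodes that host both a router-default pod and a router-internal-nlb pod:
# list each deployment's node names, then intersect the sorted sets with comm.
comm -12 \
  <(oc get pods -n openshift-ingress \
      -l ingresscontroller.operator.openshift.io/deployment-ingresscontroller=default \
      -o jsonpath='{range .items[*]}{.spec.nodeName}{"\n"}{end}' | sort -u) \
  <(oc get pods -n openshift-ingress \
      -l ingresscontroller.operator.openshift.io/deployment-ingresscontroller=router-internal-nlb \
      -o jsonpath='{range .items[*]}{.spec.nodeName}{"\n"}{end}' | sort -u)
```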
router-router-internal-nlb-7f68bc757f-dpgbr is the backend of the LB Service.
ETP=Local, so let's pick the node that hosts both the router-internal-nlb pod and a router-default pod:
$ oc rsh -n openshift-ingress router-default-6bc948b565-5pncp
sh-4.4$ dig +short a0935eb5632a340f6b3e5487d3298847-5cea615e1ac1af7e.elb.us-west-1.amazonaws.com
10.0.163.109
sh-4.4$ curl --local-port 36363 10.0.163.109
curl: (7) Failed to connect to 10.0.163.109 port 80: Connection timed out
Traffic flow: pod (router-default-6bc948b565-5pncp) -> LB VIP (10.0.163.109). This traffic flow breaks. Since Client IP Preservation is enabled, the packet leaves the cluster to the AWS LB node and comes back in with srcIP == nodeIP of the originating node. This hairpin case, where the srcPod and dstPod are on the same node, doesn't work, and it's a weird scenario that CNI plugins don't anticipate.
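Whether preservation is on can be confirmed from the AWS side: NLB target groups expose a preserve_client_ip.enabled attribute. A sketch, assuming the target group ARN behind this NLB has been looked up (the ARN below is a placeholder):

```shell
# Check the client IP preservation attribute on the NLB's target group
# (replace the placeholder ARN with the real one from describe-target-groups).
aws elbv2 describe-target-group-attributes \
  --target-group-arn arn:aws:elasticloadbalancing:us-west-1:123456789012:targetgroup/example/0123456789abcdef \
  --query "Attributes[?Key=='preserve_client_ip.enabled']"
```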
TCPDUMP:
Traffic going out of the cluster:
tcpdump: listening on any, link-type LINUX_SLL (Linux cooked v1), capture size 262144 bytes
09:53:21.301824 P 0a:58:0a:83:00:07 ethertype IPv4 (0x0800), length 76: (tos 0x0, ttl 64, id 25158, offset 0, flags [DF], proto TCP (6), length 60)
10.131.0.7.36363 > 10.0.163.109.80: Flags [S], cksum 0xb825 (incorrect -> 0x530b), seq 3046883424, win 26583, options [mss 8861,sackOK,TS val 982481238 ecr 0,nop,wscale 7], length 0
09:53:21.302752 Out 06:d1:d3:93:8b:d9 ethertype IPv4 (0x0800), length 76: (tos 0x0, ttl 62, id 25158, offset 0, flags [DF], proto TCP (6), length 60)
10.0.171.72.36363 > 10.0.163.109.80: Flags [S], cksum 0xa84c (correct), seq 3046883424, win 26583, options [mss 8861,sackOK,TS val 982481238 ecr 0,nop,wscale 7], length 0
Traffic coming back into the cluster:
09:53:21.303227 In 06:b2:f4:da:68:9d ethertype IPv4 (0x0800), length 76: (tos 0x0, ttl 61, id 25158, offset 0, flags [DF], proto TCP (6), length 60)
10.0.171.72.36363 > 10.0.171.72.31056: Flags [S], cksum 0x2961 (correct), seq 3046883424, win 26583, options [mss 8365,sackOK,TS val 982481238 ecr 0,nop,wscale 7], length 0
I think this should be fixed on the AWS side in one of two ways:
- Turn off Client IP Preservation for this NLB + private LB VIP case so that the hairpin works, and document it (users don't always have access to the AWS console, so this needs to be done automatically by AWS at creation time) - though note that by definition ETP=Local means the client srcIP MUST be preserved (which is what other providers do as well).
- OR create IP-type load balancers instead of hostnames, so that CNI plugins can install in-cluster rules that short-circuit this traffic, as we do for GCP/Azure.
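For anyone hitting this today, the first option can be approximated manually: client IP preservation can be switched off per target group with the AWS CLI. A workaround sketch only, not a fix (a placeholder ARN, and the cloud provider may reconcile the attribute back):

```shell
# Disable client IP preservation on the NLB's target group (placeholder ARN).
# Note: this sacrifices the ETP=Local source-IP semantics described above.
aws elbv2 modify-target-group-attributes \
  --target-group-arn arn:aws:elasticloadbalancing:us-west-1:123456789012:targetgroup/example/0123456789abcdef \
  --attributes Key=preserve_client_ip.enabled,Value=false
```

This has to be repeated for each target group (one per Service port), which is why doing it automatically at LB creation time is the preferable variant.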
Reproducible on all versions.
What you expected to happen: the hairpin connection (pod -> LB VIP -> backend pod on the same node) to succeed.
How to reproduce it (as minimally and precisely as possible): create an internal NLB Service with externalTrafficPolicy: Local, then curl the LB hostname from a pod on the same node as the backend pod.
Anything else we need to know?:
Environment:
- Kubernetes version (use kubectl version):
- Cloud provider or hardware configuration:
- OS (e.g. from /etc/os-release):
- Kernel (e.g. uname -a):
- Install tools:
- Others:
/kind bug