
NLB on AWS EKS with UDP-only targets, and IP target type behaves strangely with Cilium enabled

Open xyzzyz opened this issue 3 years ago • 7 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues

What happened?

When I create an NLB on EKS with UDP-only targets and use target-type: ip instead of instance, pods on some nodes receive UDP packets from the load balancer, but their responses never reach the sender.

This only happens on some nodes; pods on other nodes can reply to the sender just fine.

When I delete and recreate the AWS instances backing the worker nodes in the cluster, again only some nodes are affected while others work fine (it's just that the set of working/broken nodes is different now).

When I uninstall Cilium and reinstall the AWS CNI, all pods can receive UDP packets and reply to the originator. When I reinstall Cilium (deleting the AWS CNI and running "cilium install"), the pattern of some nodes being unable to reply reoccurs.

Cilium Version

cilium-cli: v0.10.6 compiled with go1.18.1 on linux/amd64
cilium image (default): v1.10.11
cilium image (stable): v1.11.5
cilium image (running): v1.11.3

Kernel Version

5.4.190-107.353.amzn2.x86_64

Kubernetes Version

Server Version: version.Info{Major:"1", Minor:"21+", GitVersion:"v1.21.9-eks-0d102a7", GitCommit:"eb09fc479c1b2bfcc35c47416efb36f1b9052d58", GitTreeState:"clean", BuildDate:"2022-02-17T16:36:28Z", GoVersion:"go1.16.12", Compiler:"gc", Platform:"linux/amd64"}

Sysdump

(file upload failed)

Relevant log output

No response

Anything else?

To repro: make sure you have 7+ nodes in your cluster, then create the echo server:

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    test-app: echo-udp
  name: echo-udp
spec:
  replicas: 15
  selector:
    matchLabels:
      test-app: echo-udp
  template:
    metadata:
      labels:
        test-app: echo-udp
    spec:
      containers:
      - name: echo
        image: n0r1skcom/echo
---
apiVersion: v1
kind: Service
metadata:
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip
    service.beta.kubernetes.io/aws-load-balancer-scheme: internet-facing
    service.beta.kubernetes.io/aws-load-balancer-type: external
  labels:
    test-app: echo-udp
  name: echo-udp
spec:
  type: LoadBalancer
  ports:
  - name: echo
    port: 3333
    protocol: UDP
    targetPort: 3333
  selector:
    test-app: echo-udp

Run this in a loop on your laptop:

for i in $(seq 1 1000); do echo $i; echo $i | nc -w 1 -u <IP-of-NLB> 3333 | grep Hostname || echo FAIL FAIL FAIL; sleep 1; done

(Try all IPs returned by the NLB's DNS record; I found that some fail more often than others.)
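To probe each NLB IP individually rather than relying on the round-robin DNS answer, a small Python sketch along these lines can be used (the function names and placeholder hostname are my own, not part of the original repro; the port matches the Service above):

```python
import socket

def probe_udp(ip, port, payload=b"ping", timeout=1.0):
    """Send one UDP datagram and report whether any reply arrives in time."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.settimeout(timeout)
        s.sendto(payload, (ip, port))
        try:
            s.recvfrom(65535)  # any reply counts as success
            return True
        except socket.timeout:
            return False  # reply never made it back, like the broken nodes

def nlb_ips(hostname):
    """Resolve every IPv4 address behind the NLB's DNS name."""
    infos = socket.getaddrinfo(hostname, None, socket.AF_INET, socket.SOCK_DGRAM)
    return sorted({info[4][0] for info in infos})

# Usage (placeholder hostname, as in the nc example above):
#   for ip in nlb_ips("<DNS-name-of-NLB>"):
#       print(ip, "ok" if probe_udp(ip, 3333) else "FAIL")
```

This makes it easier to see whether failures cluster on particular NLB IPs (and hence particular availability zones/targets) instead of being spread evenly.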

And in your cluster, run:

kubectl logs -f -l test-app=echo-udp --max-log-requests 15 | grep -A 4 UDP

to observe that all UDP packets are reaching the pods.

Code of Conduct

  • [X] I agree to follow this project's Code of Conduct

xyzzyz avatar May 27 '22 00:05 xyzzyz

I don't see how to investigate it without a sysdump and traces of each hop.

brb avatar May 27 '22 09:05 brb

https://drive.google.com/file/d/1TmWZ72rw5cTFEx9pOelAR9jOq5Y3T4mL/view?usp=sharing

Sorry, I couldn't upload it to GitHub.

You might be able to repro it using my deployment/service above. I reproduced it in multiple separate EKS clusters.

xyzzyz avatar May 27 '22 16:05 xyzzyz

(I uploaded sysdump above)

xyzzyz avatar Jun 02 '22 16:06 xyzzyz

cc @brb

aanm avatar Jun 03 '22 09:06 aanm

I still need to know which packets are affected and which hops they traverse.

brb avatar Jun 08 '22 12:06 brb

Sorry, I'm not sure what else you need, exactly. Can you be more specific? I'll be happy to provide it.

xyzzyz avatar Jun 10 '22 20:06 xyzzyz

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

github-actions[bot] avatar Aug 10 '22 02:08 github-actions[bot]

This issue has not seen any activity since it was marked stale. Closing.

github-actions[bot] avatar Aug 24 '22 02:08 github-actions[bot]