NLB on AWS EKS with UDP-only targets and IP target type behaves strangely with Cilium enabled
Is there an existing issue for this?
- [X] I have searched the existing issues
What happened?
When I create an NLB on EKS with UDP-only targets and use target-type: ip instead of instance, pods on some nodes receive UDP packets from the load balancer, but their responses never reach the sender.
This only happens on some nodes, pods on other nodes can reply to the sender just fine.
When I delete and recreate the AWS instances backing the worker nodes in the cluster, the problem again affects only some nodes while the others work fine; only the set of working and broken nodes changes.
When I uninstall Cilium, and reinstall AWS CNI, all pods can receive UDP packets and reply to the originator. When I reinstall Cilium (deleting AWS CNI and doing "cilium install"), the pattern of some nodes not being able to reply reoccurs.
Cilium Version
cilium-cli: v0.10.6 compiled with go1.18.1 on linux/amd64
cilium image (default): v1.10.11
cilium image (stable): v1.11.5
cilium image (running): v1.11.3
Kernel Version
5.4.190-107.353.amzn2.x86_64
Kubernetes Version
Server Version: version.Info{Major:"1", Minor:"21+", GitVersion:"v1.21.9-eks-0d102a7", GitCommit:"eb09fc479c1b2bfcc35c47416efb36f1b9052d58", GitTreeState:"clean", BuildDate:"2022-02-17T16:36:28Z", GoVersion:"go1.16.12", Compiler:"gc", Platform:"linux/amd64"}
Sysdump
(file upload failed)
Relevant log output
No response
Anything else?
To repro: make sure you have 7+ nodes in your cluster, then create an echo server:
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    test-app: echo-udp
  name: echo-udp
spec:
  replicas: 15
  selector:
    matchLabels:
      test-app: echo-udp
  template:
    metadata:
      labels:
        test-app: echo-udp
    spec:
      containers:
      - name: echo
        image: n0r1skcom/echo
---
apiVersion: v1
kind: Service
metadata:
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-nlb-target-type: ip
    service.beta.kubernetes.io/aws-load-balancer-scheme: internet-facing
    service.beta.kubernetes.io/aws-load-balancer-type: external
  labels:
    test-app: echo-udp
  name: echo-udp
spec:
  type: LoadBalancer
  ports:
  - name: echo
    port: 3333
    protocol: UDP
    targetPort: 3333
  selector:
    test-app: echo-udp
Then run this in a loop on your laptop:
for i in $(seq 1 1000); do echo $i; echo $i | nc -w 1 -u <IP-of-NLB> 3333 | grep Hostname || echo FAIL FAIL FAIL; sleep 1; done
(Try all IPs returned by the NLB's DNS record; I found that some fail more often than others.)
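To compare failure rates across the NLB's IPs, the loop above could be wrapped roughly as follows. This is only a sketch: it assumes `dig` is installed, that the echo server's reply contains the word "Hostname" (as the grep above does), and the names `probe_ip`, `tally`, `NLB_HOST`, `PORT` and `ROUNDS` are made up for illustration.

```shell
#!/bin/sh
# probe_ip IP PORT: send one datagram and succeed only if a reply
# containing "Hostname" comes back within nc's 1-second timeout.
probe_ip() {
  echo probe | nc -w 1 -u "$1" "$2" | grep -q Hostname
}

# tally: read "IP RESULT" lines (RESULT is OK or FAIL) on stdin and
# print one "IP ok=N fail=M" summary line per IP, sorted by IP string.
tally() {
  awk '{ seen[$1] = 1; if ($2 == "OK") ok[$1]++; else fail[$1]++ }
       END { for (ip in seen) printf "%s ok=%d fail=%d\n", ip, ok[ip], fail[ip] }' | sort
}

# Only probe when NLB_HOST (hypothetical variable: the NLB's DNS name) is set.
if [ -n "${NLB_HOST:-}" ]; then
  for ip in $(dig +short "$NLB_HOST"); do
    for i in $(seq 1 "${ROUNDS:-20}"); do
      if probe_ip "$ip" "${PORT:-3333}"; then
        echo "$ip OK"
      else
        echo "$ip FAIL"
      fi
      sleep 1
    done
  done | tally
fi
```

Saved as, say, `probe.sh`, it would be run with `NLB_HOST=<DNS-name-of-NLB> sh probe.sh`. Since the NLB spreads UDP flows across targets, a small number of rounds per IP may not be representative.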
In your cluster, run:
kubectl logs -f -l test-app=echo-udp --max-log-requests 15 | grep -A 4 UDP
This shows that all UDP packets do reach the pods.
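Since the symptom is per node, it may help to map pods to nodes and list the nodes whose pods never produced a reply that reached the client. The sketch below assumes you have saved the pod names seen in client-side replies to a file, one per line; that file, the `REPLIED_FILE` variable and the `silent_nodes` helper are hypothetical and not part of any tool.

```shell
#!/bin/sh
# silent_nodes PODMAP REPLIED
#   PODMAP:  file with "pod-name node-name" lines
#   REPLIED: file with one pod name per line whose reply reached the client
# Prints the nodes on which no pod had a reply reach the client.
silent_nodes() {
  awk 'FILENAME == ARGV[1] { replied[$1] = 1; next }
       { nodes[$2] = 1; if ($1 in replied) good[$2] = 1 }
       END { for (n in nodes) if (!(n in good)) print n }' "$2" "$1" | sort
}

# Only query the cluster when REPLIED_FILE (hypothetical variable) is set.
if [ -n "${REPLIED_FILE:-}" ]; then
  podmap=$(mktemp)
  kubectl get pods -l test-app=echo-udp --no-headers \
    -o custom-columns=POD:.metadata.name,NODE:.spec.nodeName > "$podmap"
  silent_nodes "$podmap" "$REPLIED_FILE"
  rm -f "$podmap"
fi
```

A node listed here is only a suspect: with 15 replicas and few samples, a healthy pod may simply not have been selected by the load balancer yet.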
Code of Conduct
- [X] I agree to follow this project's Code of Conduct
I don't see how to investigate it without sysdump and traces of each hop.
https://drive.google.com/file/d/1TmWZ72rw5cTFEx9pOelAR9jOq5Y3T4mL/view?usp=sharing
Sorry, I couldn't upload it to GitHub.
You might be able to repro it using my deployment/service above. I reproed it in multiple separate EKS clusters.
(I uploaded sysdump above)
cc @brb
I still need to know which packets are affected and at which hops.
Sorry, I'm not sure what else you need, exactly. Can you be more specific? I'll be happy to provide it.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
This issue has not seen any activity since it was marked stale. Closing.