amazon-vpc-cni-k8s
amazon-vpc-cni-k8s copied to clipboard
IPv6 containers experience connectivity issues with large simultaneous file downloads
What happened:
Observed behavior is that large simultaneous downloads stall out and eventually we receive a "connection reset by peer" error. Sometimes, we also see TLS connection errors and DNS resolution errors, which cause some downloads to immediately error out.
These errors only affect downloads from IPv6 servers/endpoints. IPv4 works perfectly fine.
Example error output
Sometimes we see errors around establishing connections over HTTPS:
test9 | Connecting to embed-ssl.wistia.com (embed-ssl.wistia.com)|2600:9000:244d:7800:1e:c86:4140:93a1|:443... connected.
test9 | Unable to establish SSL connection.
test9 | exit status 4
test3 | Resolving embed-ssl.wistia.com (embed-ssl.wistia.com)... failed: Try again.
test3 | wget: unable to resolve host address 'embed-ssl.wistia.com'
test3 | exit status 4
We host-mounted the CNI logs on the hosts we performed the testing, but didn't see any associated logs during our testing.
What you expected to happen:
Downloads complete without connection errors
How to reproduce it (as minimally and precisely as possible):
We have a Procfile that runs 9 downloads of a 700MB file in parallel.
Debian Slim Container
Launch a container: kubectl run -it --rm ipv6-reset-test-debian --image public.ecr.aws/debian/debian:bullseye-slim --command -- bash
apt-get update && apt-get install -y wget
ARCH="$(arch | sed s/aarch64/arm64/ | sed s/x86_64/amd64/)"
wget https://github.com/wistia/hivemind/releases/download/v1.1.1/hivemind-v1.1.1-wistia-linux-$ARCH.gz
gunzip hivemind-v1.1.1-wistia-linux-$ARCH.gz
mv hivemind-v1.1.1-wistia-linux-$ARCH hivemind
chmod +x hivemind
wget https://raw.githubusercontent.com/wistia/eks-ipv6-reset-example/main/Procfile
./hivemind -W Procfile
Alpine Container
Launch a container: kubectl run -it --rm ipv6-reset-test-debian --image public.ecr.aws/docker/library/alpine:3.19.1 --command -- ash
`
apk add wget # use non-busybox wget
ARCH="$(arch | sed s/aarch64/arm64/ | sed s/x86_64/amd64/)"
wget https://github.com/wistia/hivemind/releases/download/v1.1.1/hivemind-v1.1.1-wistia-linux-$ARCH.gz
gunzip hivemind-v1.1.1-wistia-linux-$ARCH.gz
mv hivemind-v1.1.1-wistia-linux-$ARCH hivemind
chmod +x hivemind
wget https://raw.githubusercontent.com/wistia/eks-ipv6-reset-example/main/Procfile
./hivemind -W Procfile
Anything else we need to know?:
Environment is a dualstack IPv4/IPv6 VPC. We've been able to reproduce this on both nodes on public/private subnets.
Environment: Kubernetes Versions:
- 1.28.5 (eks.7) w/ kube-proxy v1.28.2-eksbuild.2
- 1.29.0 (eks.1) w/ kube-proxy v1.29.0-eksbuild.2
Reproduced across AL2/Ubuntu/Bottlerocket with Kernel versions via EKS Managed Nodegroups:
-AL2: 5.10.209-198.858.amzn2.aarch64 / 5.10.209-198.858.amzn2.x86_64
- Ubuntu 22:
6.2.0-1017-aws #17~22.04.1-Ubuntu SMP - Ubuntu 20:
5.15.0-1048-aws #53~20.04.1-Ubuntu SMP - Bottlerocket: 1.18.0-7452c37e , 1.19.2-29cc92cc
Reproduced on AWS VPC CNI versions:
- v1.16.3-eksbuild.2
- v1.15.1-eksbuild.1
Instance types used:
- m6g.xlarge
- c6g.xlarge
- m7a.8xlarge
- m6a.8xlarge