
Add ipv6 support

Skarlso opened this pull request 2 years ago • 66 comments

TODOs

  • [x] Add missing things like:
    • [x] Routing
    • [x] Subnet definitions
    • [x] ENI support
    • [x] ... TBD
    • [x] Configure the aws-node vpc-cni addon with the environment variables ENABLE_IPv6 and ENABLE_PREFIX_DELEGATION set to true (see the sketch after this list)
  • [x] Add documentation
  • [x] Lots of validation
  • [x] Add e2e tests
  • [x] Add unit tests
  • [x] Refactor
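
As a side note, here is a minimal sketch of what that vpc-cni configuration step amounts to if done by hand (CAPA performs it through its addon reconciliation; the kubectl invocation below is illustrative only):

# Illustrative: enabling IPv6 on the VPC CNI manually.
# ENABLE_IPv6 and ENABLE_PREFIX_DELEGATION are the aws-node environment variables in question.
kubectl set env daemonset aws-node -n kube-system \
  ENABLE_IPv6=true ENABLE_PREFIX_DELEGATION=true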

What type of PR is this?

/kind feature

What this PR does / why we need it:

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):

Part 1 of https://github.com/kubernetes-sigs/cluster-api-provider-aws/issues/2420.

We are aiming to keep the scope as small as possible so the PR doesn't get too bloated.

Joint PR with @nikimanoledaki.

Special notes for your reviewer:

This is Part 1 of IPv6 support. What does Part 1 contain, you might ask?

  • create an IPv6 cluster (see the sketch after this list)
  • refactor the code to offer both options without needing a lot of conditional branching
  • validations for things that IPv6 doesn't support and that CAPA won't support initially
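
As a usage sketch (not taken from this PR), creating such a cluster from a template might look like the following once released; the eks-ipv6 flavor name is an assumption for illustration, not confirmed by this thread:

# Hypothetical flavor name; the flags are standard clusterctl.
clusterctl generate cluster ipv6-cluster --flavor eks-ipv6 \
  --kubernetes-version v1.22.9 \
  --worker-machine-count 1 | kubectl apply -f -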

E2E test run for EKS cluster:

Ran 1 of 3 Specs in 1488.316 seconds
SUCCESS! -- 1 Passed | 0 Failed | 0 Pending | 2 Skipped

Checklist:

  • [x] squashed commits
  • [x] includes documentation
  • [x] adds unit tests
  • [x] adds or updates e2e tests

Skarlso avatar Jun 08 '22 13:06 Skarlso

@Skarlso: This issue is currently awaiting triage.

If CAPA/CAPI contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Jun 08 '22 13:06 k8s-ci-robot

Hi @Skarlso. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Jun 08 '22 13:06 k8s-ci-robot

/assign Skarlso /assign nikimanoledaki

Skarlso avatar Jun 08 '22 13:06 Skarlso

@Skarlso: GitHub didn't allow me to assign the following users: nikimanoledaki.

Note that only kubernetes-sigs members, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time. For more information please see the contributor guide

In response to this:

/assign Skarlso /assign nikimanoledaki

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Jun 08 '22 13:06 k8s-ci-robot

/ok-to-test

richardcase avatar Jun 12 '22 08:06 richardcase

As a note, it's probably not worth spending test-run time on this yet, as I'm sure I messed up a bunch of tests 😅

Skarlso avatar Jun 12 '22 09:06 Skarlso

[screenshot]

Starting to get there. :)

Skarlso avatar Jun 16 '22 21:06 Skarlso

Working subnet creation

[screenshot]
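
(For anyone reproducing this: one way to confirm the subnets received IPv6 CIDRs is a standard describe-subnets query; the VPC ID below is a placeholder.)

# Check IPv6 CIDR associations on the cluster's subnets.
aws ec2 describe-subnets --filters Name=vpc-id,Values=<vpc-id> \
  --query 'Subnets[].{id:SubnetId,ipv6:Ipv6CidrBlockAssociationSet[].Ipv6CidrBlock}'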

Skarlso avatar Jun 17 '22 06:06 Skarlso

Successfully created an IPv6 cluster:

I0617 07:01:35.730194     606 eks.go:34]  "msg"="Reconciling EKS control plane"  "cluster-name"="ipv6-cluster" "cluster-namespace"="default"
I0617 07:01:35.730243     606 roles.go:56]  "msg"="Reconciling EKS Control Plane IAM Role"  
I0617 07:01:35.730254     606 roles.go:60]  "msg"="no eks control plane role specified, using default eks control plane role"  
I0617 07:01:35.730262     606 roles.go:67]  "msg"="using eks control plane role"  "role-name"="eks-controlplane.cluster-api-provider-aws.sigs.k8s.io"
I0617 07:01:36.587591     606 roles.go:90]  "msg"="Skipping, EKS control plane role policy assignment as role is unamanged"  
I0617 07:01:36.587657     606 cluster.go:47]  "msg"="Reconciling EKS cluster"  
I0617 07:01:39.233322     606 cluster.go:413]  "msg"="Created EKS cluster in AWS"  "cluster-name"="default_ipv6-cluster-control-plane"

Now, testing pod access, machines, communication, etc... :D

Skarlso avatar Jun 17 '22 07:06 Skarlso

[screenshot]

Skarlso avatar Jun 17 '22 07:06 Skarlso

First IPv6-enabled WORKING CAPA cluster, y'all!

k get pods -A --kubeconfig test.kubeconfig -o wide
NAMESPACE     NAME                       READY   STATUS    RESTARTS   AGE   IP                                       NODE                                           NOMINATED NODE   READINESS GATES
kube-system   aws-node-rp54r             1/1     Running   1          29m   2a05:d014:16d:7202:dff9:604c:cead:f657   ip-10-0-97-100.eu-central-1.compute.internal   <none>           <none>
kube-system   coredns-78666889b9-fsv9m   1/1     Running   0          37m   2a05:d014:16d:7202:560d::                ip-10-0-97-100.eu-central-1.compute.internal   <none>           <none>
kube-system   coredns-78666889b9-r62dn   1/1     Running   0          37m   2a05:d014:16d:7202:560d::1               ip-10-0-97-100.eu-central-1.compute.internal   <none>           <none>
kube-system   kube-proxy-tcc25           1/1     Running   0          29m   2a05:d014:16d:7202:dff9:604c:cead:f657   ip-10-0-97-100.eu-central-1.compute.internal   <none>           <none>

Skarlso avatar Jun 22 '22 14:06 Skarlso

+10,073 −178

About 9,000 of those lines are generated IAM mocks. :D
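
(Mocks like these are generated with mockgen against the AWS SDK interfaces; a rough sketch of the kind of invocation, with an illustrative destination path:)

# Illustrative mockgen call; the destination path is a guess.
mockgen -destination=mock_iamauth/iamapi_mock.go \
  github.com/aws/aws-sdk-go/service/iam/iamiface IAMAPI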

Skarlso avatar Jun 23 '22 13:06 Skarlso

Let's see whether we broke anything so far.

/retest

Skarlso avatar Jun 23 '22 13:06 Skarlso

/retest

Skarlso avatar Jun 23 '22 14:06 Skarlso

[screenshot]

Hmm... The fuzzing passes locally.

Skarlso avatar Jun 23 '22 15:06 Skarlso

/retest

Skarlso avatar Jun 23 '22 15:06 Skarlso

/retest

Skarlso avatar Jun 23 '22 19:06 Skarlso

This fuzzy test thing is very :suspect:

--- FAIL: TestFuzzyConversion (37.63s)
    --- FAIL: TestFuzzyConversion/for_AWSManagedControlPlane (37.63s)
        --- PASS: TestFuzzyConversion/for_AWSManagedControlPlane/spoke-hub-spoke (37.62s)
        --- FAIL: TestFuzzyConversion/for_AWSManagedControlPlane/hub-spoke-hub (0.01s)
FAIL

Passes locally.

I was looking at the wrong suite. 🤦

Skarlso avatar Jun 23 '22 19:06 Skarlso

I ran it locally a couple of dozen times and it didn't fail. Even though it's clearly failing on the newly added values. 🤔
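
(One way to chase a flake like this locally is to run the fuzzer repeatedly; the package path below is a guess at where the AWSManagedControlPlane conversion tests live:)

# Repeat the conversion fuzz test to try to reproduce the flake.
go test ./controlplane/eks/api/... -run TestFuzzyConversion -count=50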

/retest

Skarlso avatar Jun 23 '22 19:06 Skarlso

/retest

Skarlso avatar Jun 24 '22 05:06 Skarlso

Nice!! Now that we know nothing existing is breaking (as far as tests go :D), the real work can begin: extending unit tests, adding e2e tests, and adding documentation and validation. :)

Skarlso avatar Jun 24 '22 05:06 Skarlso

/retest

Skarlso avatar Jun 24 '22 08:06 Skarlso

Things are still working after the refactorings and small fixes:

clusterctl describe cluster ipv6-cluster
NAME                                                                READY  SEVERITY  REASON  SINCE  MESSAGE
Cluster/ipv6-cluster                                                True                     10m
├─ControlPlane - AWSManagedControlPlane/ipv6-cluster-control-plane  True                     10m
└─Workers
  └─MachineDeployment/ipv6-cluster-md-0                             True                     79s
    └─Machine/ipv6-cluster-md-0-64d8d795ff-58sm8                    True                     4m31s

Nope... something is not working: CoreDNS isn't coming up.

Found it: a missing IAM permission.

ipv6 k get pods -A --kubeconfig test.kubeconfig -o wide
NAMESPACE     NAME                       READY   STATUS    RESTARTS       AGE     IP                                       NODE                                       NOMINATED NODE   READINESS GATES
kube-system   aws-node-9jhx8             1/1     Running   1 (118s ago)   3m57s   2600:1f13:8a4:5102:db80:66c3:be32:ab6d   ip-10-0-95-89.us-west-2.compute.internal   <none>           <none>
kube-system   coredns-657694c6f4-g7t2f   1/1     Running   0              12m     2600:1f13:8a4:5102:979a::1               ip-10-0-95-89.us-west-2.compute.internal   <none>           <none>
kube-system   coredns-657694c6f4-rwx2m   1/1     Running   0              12m     2600:1f13:8a4:5102:979a::                ip-10-0-95-89.us-west-2.compute.internal   <none>           <none>
kube-system   kube-proxy-pzcng           1/1     Running   0              3m57s   2600:1f13:8a4:5102:db80:66c3:be32:ab6d   ip-10-0-95-89.us-west-2.compute.internal   <none>           <none>

Things are working 🎉
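
(The missing permission isn't named above. For IPv6 EKS clusters the node role typically needs ec2:AssignIpv6Addresses per the AWS VPC CNI docs, and a check along these lines would surface it; the role ARN is a placeholder:)

# Placeholder ARN; checks whether the node role may assign IPv6 addresses.
aws iam simulate-principal-policy \
  --policy-source-arn arn:aws:iam::<account-id>:role/<node-role> \
  --action-names ec2:AssignIpv6Addresses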

Skarlso avatar Jun 24 '22 11:06 Skarlso

NOTE: I will be on vacation for two weeks, so expect some downtime on this. :) I have not abandoned it; I'm just taking a bit of time off.

Things still to be added:

  • Documentation
  • E2E test
  • Validations
  • Lots of manual testing for security groups and communications

Not much left now. :)

Skarlso avatar Jun 24 '22 14:06 Skarlso

Nice, the new code skips updating the IPv6 environment properties on vpc-cni if they are already defined and set to true, avoiding an unnecessary update to the resource.

I0625 16:22:27.906893       9 cni.go:182]  "msg"="ipv6 environment properties already set, skipping update of vpc-cni addon" 
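
(A quick way to see what the controller inspects here is to dump the aws-node container env with a standard kubectl jsonpath query:)

# Print the aws-node env vars the reconciler checks before skipping.
kubectl get daemonset aws-node -n kube-system \
  -o jsonpath='{.spec.template.spec.containers[0].env}'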

Skarlso avatar Jun 25 '22 16:06 Skarlso

First, a trivial test: pod-to-pod ping on the same node.

bash-4.2# ping6 2600:1f13:859:ec02:545c::1
PING 2600:1f13:859:ec02:545c::1(2600:1f13:859:ec02:545c::1) 56 data bytes
64 bytes from 2600:1f13:859:ec02:545c::1: icmp_seq=1 ttl=64 time=0.047 ms
64 bytes from 2600:1f13:859:ec02:545c::1: icmp_seq=2 ttl=64 time=0.052 ms
64 bytes from 2600:1f13:859:ec02:545c::1: icmp_seq=3 ttl=64 time=0.049 ms
64 bytes from 2600:1f13:859:ec02:545c::1: icmp_seq=4 ttl=64 time=0.042 ms

Niceee. 🟢

Skarlso avatar Jun 25 '22 16:06 Skarlso

Let's create a couple of pods and see IP assignment with inflate.
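
("inflate" is the usual pause-container scaling test; a sketch of how it might have been created, with the image and replica count illustrative:)

# Illustrative; any small image works for exercising IP assignment.
kubectl create deployment inflate \
  --image=public.ecr.aws/eks-distro/kubernetes/pause:3.2 \
  --replicas=10 --kubeconfig test.kubeconfig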

Skarlso avatar Jun 25 '22 16:06 Skarlso

Nice, looking good:

➜  ipv6 k get pods -A --kubeconfig test.kubeconfig -o wide
NAMESPACE     NAME                       READY   STATUS    RESTARTS      AGE   IP                                      NODE                                       NOMINATED NODE   READINESS GATES
default       inflate-866ccdf4c8-7lm65   1/1     Running   0             22s   2600:1f13:859:ec02:545c::8              ip-10-0-93-63.us-west-2.compute.internal   <none>           <none>
default       inflate-866ccdf4c8-ghsmq   1/1     Running   0             22s   2600:1f13:859:ec02:545c::4              ip-10-0-93-63.us-west-2.compute.internal   <none>           <none>
default       inflate-866ccdf4c8-j7x2s   1/1     Running   0             22s   2600:1f13:859:ec02:545c::6              ip-10-0-93-63.us-west-2.compute.internal   <none>           <none>
default       inflate-866ccdf4c8-jsngz   0/1     Pending   0             22s   <none>                                  <none>                                     <none>           <none>
default       inflate-866ccdf4c8-jvnhn   0/1     Pending   0             22s   <none>                                  <none>                                     <none>           <none>
default       inflate-866ccdf4c8-m4xnv   1/1     Running   0             22s   2600:1f13:859:ec02:545c::7              ip-10-0-93-63.us-west-2.compute.internal   <none>           <none>
default       inflate-866ccdf4c8-mlllj   1/1     Running   0             93s   2600:1f13:859:ec02:545c::2              ip-10-0-93-63.us-west-2.compute.internal   <none>           <none>
default       inflate-866ccdf4c8-nw6dq   1/1     Running   0             22s   2600:1f13:859:ec02:545c::3              ip-10-0-93-63.us-west-2.compute.internal   <none>           <none>
default       inflate-866ccdf4c8-ptp6s   0/1     Pending   0             22s   <none>                                  <none>                                     <none>           <none>
default       inflate-866ccdf4c8-wxk5n   1/1     Running   0             22s   2600:1f13:859:ec02:545c::5              ip-10-0-93-63.us-west-2.compute.internal   <none>           <none>
kube-system   aws-node-4jcdw             1/1     Running   1 (10m ago)   12m   2600:1f13:859:ec02:7fb:b410:fe63:20e2   ip-10-0-93-63.us-west-2.compute.internal   <none>           <none>
kube-system   coredns-657694c6f4-hdz92   1/1     Running   0             23m   2600:1f13:859:ec02:545c::1              ip-10-0-93-63.us-west-2.compute.internal   <none>           <none>
kube-system   coredns-657694c6f4-l9btx   1/1     Running   0             23m   2600:1f13:859:ec02:545c::               ip-10-0-93-63.us-west-2.compute.internal   <none>           <none>
kube-system   kube-proxy-42kht           1/1     Running   0             12m   2600:1f13:859:ec02:7fb:b410:fe63:20e2   ip-10-0-93-63.us-west-2.compute.internal   <none>           <none>

Skarlso avatar Jun 25 '22 16:06 Skarlso

Node-to-node communication is working. This is all private and internal. Now let's try creating a service and putting nodes into a public subnet.
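
(A sketch of that next step with a plain kubectl expose; the port and service type are illustrative, since the thread doesn't show the actual service:)

# Illustrative service for the communication test described above.
kubectl expose deployment inflate --port=8080 --type=LoadBalancer \
  --kubeconfig test.kubeconfig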

Skarlso avatar Jun 25 '22 16:06 Skarlso

I just realised that EgressOnlyInternetGateway creation needs to be part of the network creation flow when IPv6 is enabled, but only if there are private subnets in the picture. I guess it's fine ¯\_(ツ)_/¯ we just pre-create it no matter what and simply don't use it if it's not needed. It shouldn't consume any resources.
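
(For verification, egress-only internet gateways can be listed with the standard EC2 CLI; the pre-created gateway should show up attached to the cluster's VPC:)

# List egress-only internet gateways in the region.
aws ec2 describe-egress-only-internet-gateways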

Skarlso avatar Jun 25 '22 17:06 Skarlso