cloud-provider-aws

Support for IPv6/dualstack

Open james-callahan opened this issue 11 months ago • 11 comments

What would you like to be added:

I'd like to start using dualstack in our Kubernetes cluster via the CloudDualStackNodeIPs feature gate. When I try to do so, I get errors such as:

I0814 03:40:03.764807       1 node_controller.go:427] Initializing node i-0a11a57aeffb69cf7 with cloud provider
E0814 03:40:04.070450       1 node_controller.go:236] error syncing 'i-0a11a57aeffb69cf7': failed to get node modifiers from cloud provider: provided node ip for node "i-0a11a57aeffb69cf7" is not valid: failed to get node address from cloud provider that matches ip: 2600:1f10:45a5:a900:33fc:a923:65e5:9414, requeuing

After some debugging, I think the cause is that the code at https://github.com/kubernetes/cloud-provider-aws/blob/d0551093673e8c355db17249b8f069767c014748/pkg/providers/v2/instances.go#L216C46-L216C64 doesn't look at Ipv6Addresses. It only iterates over the IPv4 addresses in PrivateIpAddresses.
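
Editor's sketch: the error text suggests that the node controller looks for the kubelet-provided node IP among the addresses reported by the cloud provider, and fails when none matches. The Go snippet below is a minimal illustration of that kind of check, not the actual controller code. The function name matchNodeIP and the IPv4 address in the example are hypothetical; only the IPv6 address comes from the log above.

```go
package main

import (
	"fmt"
	"net/netip"
)

// matchNodeIP (hypothetical helper) reports whether nodeIP appears among the
// addresses returned by the cloud provider. Addresses are compared as parsed
// values, so different spellings of the same IPv6 address still match.
func matchNodeIP(nodeIP string, cloudAddrs []string) (bool, error) {
	want, err := netip.ParseAddr(nodeIP)
	if err != nil {
		return false, fmt.Errorf("parse node ip %q: %w", nodeIP, err)
	}
	for _, s := range cloudAddrs {
		got, err := netip.ParseAddr(s)
		if err != nil {
			continue // skip hostnames and other non-IP entries
		}
		if got == want {
			return true, nil
		}
	}
	return false, nil
}

func main() {
	nodeIP := "2600:1f10:45a5:a900:33fc:a923:65e5:9414" // from the log above

	// Assumption: the provider only reported an IPv4 address (value is made up).
	ipv4Only := []string{"10.0.0.10"}
	ok, err := matchNodeIP(nodeIP, ipv4Only)
	fmt.Println("IPv4-only list matches:", ok, err)

	// With the IPv6 address included, the same lookup succeeds.
	both := append(ipv4Only, nodeIP)
	ok, err = matchNodeIP(nodeIP, both)
	fmt.Println("dual-stack list matches:", ok, err)
}
```

If the provider never adds the instance's IPv6 addresses to its list, a check of this shape cannot succeed for an IPv6 node IP, which is consistent with the error shown.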

Why is this needed:

The EC2 API returns IPv6 and IPv4 addresses in different fields.

/kind feature

james-callahan avatar Aug 14 '23 04:08 james-callahan

This issue is currently awaiting triage.

If cloud-provider-aws contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

k8s-ci-robot avatar Aug 14 '23 04:08 k8s-ci-robot

AWS CCM has been patching in both IPv6 and IPv4 IPs for quite some time. You just have to set NodeIPFamilies to something like ipv6 and ipv4.

See https://github.com/kubernetes/cloud-provider-aws/blob/master/pkg/providers/v1/aws.go#L1599

olemarkus avatar Aug 14 '23 08:08 olemarkus

> AWS CCM has been patching in both IPv6 and IPv4 IPs for quite some time. You just have to set NodeIPFamilies to something like ipv6 and ipv4.
>
> See https://github.com/kubernetes/cloud-provider-aws/blob/master/pkg/providers/v1/aws.go#L1599

I'm using v2, not v1.

james-callahan avatar Aug 14 '23 11:08 james-callahan

> I'm using v2, not v1.

As found in https://github.com/kubernetes/cloud-provider-aws/issues/677 I'm using v1 after all.

I gave this another attempt, setting the feature gate CloudDualStackNodeIPs=true, and the cloud provider failed with e.g.:

I1017 02:43:36.977098       1 node_controller.go:431] Initializing node i-02de3f9b2d02feaa7 with cloud provider
E1017 02:43:37.264596       1 node_controller.go:240] error syncing 'i-02de3f9b2d02feaa7': failed to get node modifiers from cloud provider: provided node ip for node "i-02de3f9b2d02feaa7" is not valid: failed to get node address from cloud provider that matches ip: 2600:1f10:45a5:a918:5d99:c7b9:243:210f, requeuing

I realised that NodeIPFamilies defaults to only ipv4, so I added ipv6 to my cloudconfig:

[Global]
NodeIPFamilies=ipv4,ipv6

Which I can verify works via the log line:

I1017 02:58:51.340872       1 aws.go:1433] The following IP families will be added to nodes: [ipv4,ipv6]

The controller is now failing with e.g.:

I1017 03:04:58.888797       1 node_controller.go:431] Initializing node i-083e6ed22b10ddf06 with cloud provider
E1017 03:04:59.302680       1 node_controller.go:240] error syncing 'i-083e6ed22b10ddf06': failed to get node modifiers from cloud provider: provided node ip for node "i-083e6ed22b10ddf06" is not valid: failed to get node address from cloud provider that matches ip: 10.24.152.220, requeuing
I1017 03:04:59.302717       1 node_controller.go:431] Initializing node i-083e6ed22b10ddf06 with cloud provider
E1017 03:04:59.548721       1 node_controller.go:240] error syncing 'i-083e6ed22b10ddf06': failed to get node modifiers from cloud provider: provided node ip for node "i-083e6ed22b10ddf06" is not valid: failed to get node address from cloud provider that matches ip: 10.24.152.220, requeuing
I1017 03:05:01.368647       1 node_controller.go:431] Initializing node i-083e6ed22b10ddf06 with cloud provider
E1017 03:05:01.690156       1 node_controller.go:240] error syncing 'i-083e6ed22b10ddf06': failed to get node modifiers from cloud provider: provided node ip for node "i-083e6ed22b10ddf06" is not valid: failed to get node address from cloud provider that matches ip: 10.24.152.220, requeuing
I1017 03:05:05.698132       1 node_controller.go:431] Initializing node i-083e6ed22b10ddf06 with cloud provider
E1017 03:05:06.089973       1 node_controller.go:240] error syncing 'i-083e6ed22b10ddf06': failed to get node modifiers from cloud provider: provided node ip for node "i-083e6ed22b10ddf06" is not valid: failed to get node address from cloud provider that matches ip: 10.24.152.220, requeuing
I1017 03:05:14.785853       1 node_controller.go:431] Initializing node i-083e6ed22b10ddf06 with cloud provider
E1017 03:05:15.083704       1 node_controller.go:240] error syncing 'i-083e6ed22b10ddf06': failed to get node modifiers from cloud provider: provided node ip for node "i-083e6ed22b10ddf06" is not valid: failed to get node address from cloud provider that matches ip: 10.24.152.220, requeuing

I'm not sure why it's failing to get the node address. Here is the output of aws ec2 describe-instances --instance-ids i-083e6ed22b10ddf06 | jq '.Reservations[].Instances[] | {PrivateIpAddress,Ipv6Address,NetworkInterfaces}':

{
  "PrivateIpAddress": "10.24.152.220",
  "Ipv6Address": "2600:1f10:45a5:a918:fd18:12af:1613:6c5d",
  "NetworkInterfaces": [
    {
      "Association": {
        "IpOwnerId": "amazon",
        "PublicDnsName": "ec2-3-85-73-150.compute-1.amazonaws.com",
        "PublicIp": "3.85.73.150"
      },
      "Attachment": {
        "AttachTime": "2023-10-17T03:03:45+00:00",
        "AttachmentId": "eni-attach-024b4933411c5f575",
        "DeleteOnTermination": true,
        "DeviceIndex": 0,
        "Status": "attached",
        "NetworkCardIndex": 0
      },
      "Description": "",
      "Groups": [
        {
          "GroupName": "internal-talos-worker-general",
          "GroupId": "sg-007b939554373cc2b"
        }
      ],
      "Ipv6Addresses": [
        {
          "Ipv6Address": "2600:1f10:45a5:a918:fd18:12af:1613:6c5d",
          "IsPrimaryIpv6": false
        }
      ],
      "MacAddress": "0e:41:8b:af:7f:5f",
      "NetworkInterfaceId": "eni-0aabf40c0e2dcd595",
      "OwnerId": "799078726966",
      "PrivateDnsName": "i-083e6ed22b10ddf06.ec2.internal",
      "PrivateIpAddress": "10.24.152.220",
      "PrivateIpAddresses": [
        {
          "Association": {
            "IpOwnerId": "amazon",
            "PublicDnsName": "ec2-3-85-73-150.compute-1.amazonaws.com",
            "PublicIp": "3.85.73.150"
          },
          "Primary": true,
          "PrivateDnsName": "i-083e6ed22b10ddf06.ec2.internal",
          "PrivateIpAddress": "10.24.152.220"
        }
      ],
      "SourceDestCheck": true,
      "Status": "in-use",
      "SubnetId": "subnet-00c5e1b9c4baddcb3",
      "VpcId": "vpc-060c91b3879fc8b83",
      "InterfaceType": "interface"
    }
  ]
}
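
Editor's sketch: the output above shows the IPv4 address under PrivateIpAddresses and the IPv6 address under Ipv6Addresses on the same network interface. The Go snippet below illustrates collecting both lists from a trimmed copy of that JSON. The struct types and the collectAddresses function are hypothetical and mirror only the field names visible in the output; this is not the provider's code or the AWS SDK types.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Hypothetical types covering only the fields used from the output above.
type privateIP struct {
	PrivateIpAddress string
}

type ipv6Addr struct {
	Ipv6Address string
}

type networkInterface struct {
	PrivateIpAddresses []privateIP
	Ipv6Addresses      []ipv6Addr
}

type instance struct {
	NetworkInterfaces []networkInterface
}

// collectAddresses returns every IPv4 and IPv6 address found on the
// instance's network interfaces, IPv4 first within each interface.
func collectAddresses(doc []byte) ([]string, error) {
	var inst instance
	if err := json.Unmarshal(doc, &inst); err != nil {
		return nil, err
	}
	var addrs []string
	for _, ni := range inst.NetworkInterfaces {
		for _, p := range ni.PrivateIpAddresses {
			addrs = append(addrs, p.PrivateIpAddress)
		}
		for _, a := range ni.Ipv6Addresses {
			addrs = append(addrs, a.Ipv6Address)
		}
	}
	return addrs, nil
}

func main() {
	// Trimmed copy of the describe-instances output shown above.
	doc := []byte(`{
	  "NetworkInterfaces": [{
	    "Ipv6Addresses": [{"Ipv6Address": "2600:1f10:45a5:a918:fd18:12af:1613:6c5d"}],
	    "PrivateIpAddresses": [{"PrivateIpAddress": "10.24.152.220"}]
	  }]
	}`)
	addrs, err := collectAddresses(doc)
	if err != nil {
		panic(err)
	}
	fmt.Println(addrs)
}
```

Both addresses are present in the API response, so the missing match has to come from how the provider filters or selects them, not from the EC2 data itself.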

james-callahan avatar Oct 17 '23 03:10 james-callahan

From poking around the code and seeing your info above, it's not apparent to me what went wrong yet. Would it be convenient to add additional logging? Would be curious what addresses get returned by the cloud provider given that the IP it's looking for is very apparent.

mmerkes avatar Oct 17 '23 04:10 mmerkes

> Would it be convenient to add additional logging?

Not really for our configuration; would have to set up a whole custom build pipeline where we currently use the upstream image.

> Would be curious what addresses get returned by the cloud provider given that the IP it's looking for is very apparent.

Yeah that's probably a good debug log to add. Might be good to add it in any case?

james-callahan avatar Oct 18 '23 01:10 james-callahan

> Not really for our configuration; would have to set up a whole custom build pipeline where we currently use the upstream image.

A repro would make it a lot easier to debug. Perhaps it could be set up via another mechanism, if it's an issue with the cloud provider.

> Yeah that's probably a good debug log to add. Might be good to add it in any case?

Yeah. There's not a lot of logging in the cloud provider, though some of this could make sense to add in kubernetes/kubernetes. It seems very reasonable to add some debug-level logging for exactly this kind of thing.

mmerkes avatar Oct 18 '23 03:10 mmerkes

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot avatar Feb 04 '24 18:02 k8s-triage-robot

/remove-lifecycle stale

I would love it if someone could just add some more debug logging around this in the cloud provider. Then once there's another release I'd be able to share debug logs.

james-callahan avatar Feb 21 '24 04:02 james-callahan

We face the same issue. I created a cloud-config file and set NodeIPFamilies, and I can see in the aws-cloud-controller-manager logs that it is in use. I also had to add --feature-gates=CloudDualStackNodeIPs=true to both the aws-cloud-controller-manager and the kubelet. When I set --node-ip=<IPv6 address>,<IPv4 address> on the kubelet, I got log lines like this, and the node was tainted with node.cloudprovider.kubernetes.io/uninitialized=true:NoSchedule.

2024-05-08T08:21:12.851616193Z E0508 08:21:12.851520       1 node_controller.go:240] error syncing 'i-08b4defa905155953.eu-west-1.compute.internal': failed to get node modifiers from cloud provider: provided node ip for node "i-08b4defa905155953.eu-west-1.compute.internal" is not valid: failed to get node address from cloud provider that matches ip: 2xxx:xxxx:xxxx:xxxx::c91a, requeuing

But I saw both the IPv6 and the IPv4 address in the InternalIP. Then I set --node-ip=:: for the kubelet and it suddenly started to work, but I saw only the IPv6 address in the InternalIP, which is more or less expected based on the kubelet documentation. This is our test cluster, so if you tell me which logs or tests you want, I can run them.

akunszt avatar May 08 '24 08:05 akunszt

I think I found what caused this. I added a lot of klog.* lines to the NodeAddressesByProviderID function. This was interesting:

        for _, family := range c.cfg.Global.NodeIPFamilies {
                klog.Infof( "family: %v", family )

It generated this log line:

I0508 10:10:40.861561     881 aws.go:1676] family: ipv4,ipv6

So the configuration is parsed as the single string ipv4,ipv6 instead of being split into an array of values. I dug a little deeper and found out how to set a multi-value configuration at https://pkg.go.dev/gopkg.in/gcfg.v1#example-ReadStringInto-Multivalue. After I changed the cloud-config.conf to this, everything started to work.

[Global]
NodeIPFamilies=ipv4
NodeIPFamilies=ipv6
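
Editor's sketch: the Go snippet below illustrates why the comma-separated form fails while repeated keys work. parseMulti is a simplified stand-in for a multi-value INI reader and is not the gcfg library itself (for example, it matches key names exactly). recognized assumes the provider compares each entry against the literal names ipv4 and ipv6, which the "family: ipv4,ipv6" log line above suggests but this thread does not show directly. Both function names are hypothetical.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// parseMulti collects the values of key from an INI-style config. Each
// "key=value" line contributes exactly one value; the value is never split
// on commas.
func parseMulti(conf, key string) []string {
	var values []string
	sc := bufio.NewScanner(strings.NewReader(conf))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		k, v, ok := strings.Cut(line, "=")
		if !ok || strings.TrimSpace(k) != key {
			continue // section headers, blank lines, other keys
		}
		values = append(values, strings.TrimSpace(v))
	}
	return values
}

// recognized counts the entries that are exactly "ipv4" or "ipv6"
// (assumed to be the only names the provider acts on).
func recognized(families []string) int {
	n := 0
	for _, f := range families {
		switch f {
		case "ipv4", "ipv6":
			n++
		}
	}
	return n
}

func main() {
	commaForm := "[Global]\nNodeIPFamilies=ipv4,ipv6\n"
	repeatedForm := "[Global]\nNodeIPFamilies=ipv4\nNodeIPFamilies=ipv6\n"

	for _, conf := range []string{commaForm, repeatedForm} {
		families := parseMulti(conf, "NodeIPFamilies")
		fmt.Printf("%q -> %d recognized\n", families, recognized(families))
	}
}
```

Under these assumptions the comma-separated form yields one unrecognized entry, so no address family is added to the node at all. That would explain why even the IPv4 address 10.24.152.220 failed to match earlier in the thread.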

I recommend including this in the documentation. It was a bit frustrating that I had to read the code, as I did not find any documentation about how to construct the cloud-config file (I even started with YAML at first).

akunszt avatar May 08 '24 10:05 akunszt