
After upgrading from v1.16 to v1.17.1, the agent fails to start with `No IPv6 support on node as ipv6 address is nil`

jungeonkim opened this issue 10 months ago • 13 comments

Is there an existing issue for this?

  • [x] I have searched the existing issues

Version

equal or higher than v1.17.1 and lower than v1.18.0

What happened?

time="2025-02-23T06:17:49.806446531Z" level=error msg="No IPv6 support on node as ipv6 address is nil" ipAddr.ipv4=10.0.0.* ipAddr.ipv6="<nil>" nodeName=node-1 subsys=daemon
time="2025-02-23T06:17:49.806464032Z" level=error msg="unable to connect to get node spec from apiserver" error="node node-1 does not have an IPv6 address" subsys=daemon
time="2025-02-23T06:17:49.806830253Z" level=info msg="Waiting for all endpoints' goroutines to be stopped." subsys=daemon
time="2025-02-23T06:17:49.806857642Z" level=info msg="All endpoints' goroutines stopped." subsys=daemon

After this, the agent terminates.

I did a little bit of debugging:

In v1.16.x, setDefaultPrefix in pkg/node/address.go sets the IPv6 address before the check if option.Config.EnableIPv6 && nodeIP6 == nil { is evaluated. In v1.17.2, however, setDefaultPrefix is executed later.
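
Below is a minimal, self-contained Go mock of that ordering difference. The names mirror the identifiers quoted above, but this is only a sketch to illustrate the behavior, not the actual Cilium source:

package main

import (
    "fmt"
    "net"
)

// setDefaultPrefix mimics the v1.16 fallback: if the node has no IPv6
// address yet, a fixed Unique Local Address is assigned (value illustrative).
func setDefaultPrefix(nodeIP6 net.IP) net.IP {
    if nodeIP6 == nil {
        return net.ParseIP("fc00::10ca:1")
    }
    return nodeIP6
}

// checkIPv6 mimics the hard check that produces the error seen in the logs.
func checkIPv6(nodeIP6 net.IP, enableIPv6 bool) error {
    if enableIPv6 && nodeIP6 == nil {
        return fmt.Errorf("node node-1 does not have an IPv6 address")
    }
    return nil
}

func main() {
    var nodeIP6 net.IP // the node reports no IPv6 address

    // v1.16-style ordering: default prefix first, then the check -> passes.
    fmt.Println("v1.16 ordering:", checkIPv6(setDefaultPrefix(nodeIP6), true))

    // v1.17-style ordering: the check runs first, the fallback is only
    // applied later (in ConfigureIPAM) -> the agent fails at startup.
    fmt.Println("v1.17 ordering:", checkIPv6(nodeIP6, true))
}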

How can we reproduce the issue?

helm values.yaml

autoDirectNodeRoutes: true
bgpControlPlane:
  enabled: true
  v2Enabled: false
enableIPv4Masquerade: false
enableIPv6Masquerade: false
ipam:
  operator:
    clusterPoolIPv4PodCIDRList: 10.0.0.0/16
    clusterPoolIPv6PodCIDRList: fd00::/108
ipv6:
  enabled: true
k8sServiceHost: 172.16.0.1
k8sServicePort: "6443"
kubeProxyReplacement: "true"
routingMode: native
envoy:
  enabled: false

If I roll back to v1.16.7 everything works fine.

Cilium Version

v1.17.1

Kernel Version

6.8.0-53-generic #55-Ubuntu

Kubernetes Version

v1.32.2

Regression

v1.16.7

Sysdump

No response

Relevant log output


Anything else?

No response

Cilium Users Document

  • [ ] Are you a user of Cilium? Please add yourself to the Users doc

Code of Conduct

  • [x] I agree to follow this project's Code of Conduct

jungeonkim avatar Feb 23 '25 09:02 jungeonkim

@jungeonkim Is it resolved if you set k8s.requireIPv6PodCIDR: true in your Helm values?
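
Expressed in Helm values.yaml form, that setting would be:

k8s:
  requireIPv6PodCIDR: true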

christarazi avatar Mar 24 '25 21:03 christarazi

@jungeonkim Is it resolved if you set k8s.requireIPv6PodCIDR: true in your Helm values?

I tried that but it didn't work.

jungeonkim avatar Mar 25 '25 04:03 jungeonkim

Observing the same issue here. A CiliumNode running 1.16 gets spec.addresses entries for both the IPv4 and IPv6 addresses of the node, while a 1.17 CiliumNode does not.

My base Node objects do not have their IPv6 address in their status.addresses list (running k3s dual-stack). Does Cilium now need this populated for IPv6 to work?

oivindoh avatar Apr 07 '25 15:04 oivindoh

I believe https://github.com/cilium/cilium/pull/38472 is the fix and is slated to be in v1.17.3. If you want to try it now, you can deploy the latest dev version of Cilium built from main.

christarazi avatar Apr 08 '25 23:04 christarazi

I just ran quay.io/cilium/cilium:v1.18.0-pre.1@sha256:88711d5016c6969e47e92e5f499ccd80e3df93ab52bdd7bc321c2b4a6a434a9e to test (as the release notes state that the patch is included).

Still seeing time="2025-04-09T13:23:11.205268888Z" level=fatal msg="failed to start: daemon creation failed: unable to connect to get node spec from apiserver: node server01 does not have an IPv6 address\nfailed to stop: unable to find controller ipcache-inject-labels" subsys=daemon

As a workaround, supplying the kind: Node objects with both IPv4 and IPv6 addresses (in my case by hardcoding the IPv4/IPv6 addresses in the k3s config instead of relying on the 0.0.0.0 magic for the kubelet node-ip argument) lets Cilium start up successfully.
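
In a k3s config.yaml, that amounts to pinning both address families in node-ip, e.g. (placeholder addresses):

# /etc/rancher/k3s/config.yaml
node-ip: "10.0.0.11,fd00::11"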

oivindoh avatar Apr 09 '25 13:04 oivindoh

@oivindoh Interesting. Sounds like a separate issue. The fact that your workaround is working confirms that the PR is the fix.

christarazi avatar Apr 09 '25 15:04 christarazi

Same issue here (I used Cilium v1.17.3). I tried to patch an IPv6 address onto the CiliumNode, but it seems Cilium reverts it.

cyclinder avatar Apr 27 '25 06:04 cyclinder

Same issue here with versions 1.17.1 and 1.17.3. With a rollback to 1.16.9 it works.

steled avatar Apr 29 '25 10:04 steled

Today I tested version 1.17.4 and hit the same problem, but version 1.16.10 works.

steled avatar May 28 '25 03:05 steled

@christarazi today I also tested the v1.18.0-pre.2 image and hit the same problem.

steled avatar Jun 02 '25 13:06 steled

I would like to fix this; I will look into it later.

cyclinder avatar Jun 03 '25 01:06 cyclinder

This is related to the bug discussed in this thread: https://github.com/siderolabs/talos/issues/8115#issuecomment-2307280220

A workaround is to reorder the pod and service subnets so that IPv6 is tried before IPv4, as sketched below.
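
For a Talos machine config, a sketch of that reordering looks roughly like this (placeholder CIDRs):

cluster:
  network:
    podSubnets:
      - fd00:10:244::/56
      - 10.244.0.0/16
    serviceSubnets:
      - fd00:10:96::/112
      - 10.96.0.0/12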

patkarcarasent avatar Jun 12 '25 13:06 patkarcarasent

(also hitting this right off main; my bond0 device did have an IPv6 address, though)

borkmann avatar Jun 13 '25 09:06 borkmann

I'm currently unable to repro this in a local kind cluster, but I believe that, as @jungeonkim suggested, the issue is related to the missing call to SetDefaultPrefix, which in v1.16 happened as part of the LocalNodeStore initialization, thus before newDaemon. That call set an initial IPv6 address in case one was missing (see makeIPv6HostIP).

This commit, part of https://github.com/cilium/cilium/pull/32381, removed that call from the LocalNodeStore initialization, leading to the above issue (there is another call to setDefaultPrefix in ConfigureIPAM, but it happens after the check for the node IPv6 address).

pippolo84 avatar Jun 18 '25 15:06 pippolo84

Digging a little bit more into the problem, I think this is also related to https://github.com/cilium/cilium/pull/28953

My understanding is that:

  • In https://github.com/cilium/cilium/pull/28953 the initial hard check on the node IPv6 address was added in order to get uniform behavior across different KPR settings when migrating nodes from single stack to dual stack (see https://github.com/cilium/cilium/issues/28909 and https://github.com/cilium/cilium/issues/34861#issuecomment-2651655820)
  • In https://github.com/cilium/cilium/pull/32381 the initial setup of an IPv6 Unique Local Address was postponed until after that hard check, leading to the failure above in environments where an IPv6 address is not immediately available

I opened https://github.com/cilium/cilium/pull/40125 to move the check added in https://github.com/cilium/cilium/pull/28953 after ConfigureIPAM in the daemon initialization, thus restoring things as they were before https://github.com/cilium/cilium/pull/32381. But I'm wondering whether that check actually fulfills its original intent. Since we set a ULA v6 address if the node does not have one yet, my understanding is that the check will always complete successfully. Am I missing something here, @vipul-21?

Also, I think the solution proposed by @hexchen in https://github.com/cilium/cilium/issues/34861#issuecomment-2652221651 (see his work-in-progress commit here) might be better than the current check, which seems to be responsible for other regressions similar to this one (see https://github.com/cilium/cilium/issues/34861).

cc @joestringer

pippolo84 avatar Jun 19 '25 11:06 pippolo84

@pippolo84 The understanding is correct. Is the setting of a ULA v6 address (if not present) something that was added later? Because earlier, when the check was added, the agent used to crash if the node did not have an IPv6 address. If we always have a v6 address, then yes, the check will always complete successfully.

vipul-21 avatar Jun 19 '25 14:06 vipul-21

@pippolo84 The understanding is correct. Is the setting of a ULA v6 address (if not present) something that was added later? Because earlier, when the check was added, the agent used to crash if the node did not have an IPv6 address. If we always have a v6 address, then yes, the check will always complete successfully.

Yep, the setting of the ULA address was there before the check was added (below is the code just before your commit):

https://github.com/cilium/cilium/blob/3f523d37af1d1a13a286d00c3ed9ff8747e49904/pkg/node/address.go#L128-L140

So I guess there was something else that cleared that value and led to the fatal error from https://github.com/cilium/cilium/blob/3f499b4568839c1bacbdbdb80904a9b9114b7e50/pkg/node/address.go#L210

However, I see that InitNodePortAddrs was removed in https://github.com/cilium/cilium/commit/7df31944b6e130d373ad4c02931b404a64006d54#diff-9143180298a1f5ba5dc10fd98d87945d1381ef45b7347a639f85e13e84285ca2 (part of https://github.com/cilium/cilium/pull/29033). With it, the error that you reported in https://github.com/cilium/cilium/issues/28909 was removed as well. Therefore, that should not happen anymore since v1.15.

Given all of this, I think it would be better to remove the check added in https://github.com/cilium/cilium/pull/28953 altogether.

If you have a chance to test the main branch again after the removal of the check and you find something wrong, we can investigate your case in depth and see what other solution we can put in place. WDYT?

pippolo84 avatar Jun 19 '25 16:06 pippolo84

If you have a chance to test the main branch again after the removal of the check and you find something wrong, we can investigate your case in depth and see what other solution we can put in place. WDYT?

Yes, that works. I can test and make sure it does not break the upgrade scenarios we have.

vipul-21 avatar Jun 19 '25 17:06 vipul-21