After upgrading from v1.16 to v1.17.1, the agent fails to start with `No IPv6 support on node as ipv6 address is nil`
Is there an existing issue for this?
- [x] I have searched the existing issues
Version
Equal to or higher than v1.17.1 and lower than v1.18.0
What happened?
time="2025-02-23T06:17:49.806446531Z" level=error msg="No IPv6 support on node as ipv6 address is nil" ipAddr.ipv4=10.0.0.* ipAddr.ipv6="<nil>" nodeName=node-1 subsys=daemon
time="2025-02-23T06:17:49.806464032Z" level=error msg="unable to connect to get node spec from apiserver" error="node node-1 does not have an IPv6 address" subsys=daemon
time="2025-02-23T06:17:49.806830253Z" level=info msg="Waiting for all endpoints' goroutines to be stopped." subsys=daemon
time="2025-02-23T06:17:49.806857642Z" level=info msg="All endpoints' goroutines stopped." subsys=daemon
After that, the agent terminates.
I did a little bit of debugging:
In v1.16.x, setDefaultPrefix in pkg/node/address.go sets the v6 address before `if option.Config.EnableIPv6 && nodeIP6 == nil {` is executed.
But in v1.17.2, setDefaultPrefix is executed later.
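To make the ordering difference concrete, here is a minimal, self-contained Go sketch of the two orderings. It is not the actual Cilium source; the helper names (`setDefaultPrefix`, `checkIPv6`) and the fallback address value are stand-ins for the behavior described above:

```go
package main

import (
	"fmt"
	"net"
)

// nodeIP6 stands in for the node's discovered IPv6 address; nil means the
// node has no IPv6 address yet (as in the log output above).
var nodeIP6 net.IP

// setDefaultPrefix is a stand-in for the SetDefaultPrefix behavior described
// above: if IPv6 is enabled and no address is present, fall back to a ULA
// host IP (the exact value is an assumption, see makeIPv6HostIP upstream).
func setDefaultPrefix(enableIPv6 bool) {
	if enableIPv6 && nodeIP6 == nil {
		nodeIP6 = net.ParseIP("fc00::10ca:1")
	}
}

// checkIPv6 is a stand-in for the hard check behind the
// "node ... does not have an IPv6 address" fatal error.
func checkIPv6(enableIPv6 bool) error {
	if enableIPv6 && nodeIP6 == nil {
		return fmt.Errorf("node does not have an IPv6 address")
	}
	return nil
}

func main() {
	// v1.16-style ordering: the fallback runs before the check, so it passes.
	nodeIP6 = nil
	setDefaultPrefix(true)
	fmt.Println("v1.16 ordering:", checkIPv6(true)) // <nil>

	// v1.17-style ordering: the check runs first (the fallback now runs later,
	// in ConfigureIPAM), so it fails when the node has no IPv6 address yet.
	nodeIP6 = nil
	fmt.Println("v1.17 ordering:", checkIPv6(true)) // error
	setDefaultPrefix(true)
}
```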
How can we reproduce the issue?
helm values.yaml
autoDirectNodeRoutes: true
bgpControlPlane:
  enabled: true
  v2Enabled: false
enableIPv4Masquerade: false
enableIPv6Masquerade: false
ipam:
  operator:
    clusterPoolIPv4PodCIDRList: 10.0.0.0/16
    clusterPoolIPv6PodCIDRList: fd00::/108
ipv6:
  enabled: true
k8sServiceHost: 172.16.0.1
k8sServicePort: "6443"
kubeProxyReplacement: "true"
routingMode: native
envoy:
  enabled: false
If I roll back to v1.16.7, everything works fine.
Cilium Version
v1.17.1
Kernel Version
6.8.0-53-generic #55-Ubuntu
Kubernetes Version
v1.32.2
Regression
v1.16.7
Sysdump
No response
Relevant log output
Anything else?
No response
Cilium Users Document
- [ ] Are you a user of Cilium? Please add yourself to the Users doc
Code of Conduct
- [x] I agree to follow this project's Code of Conduct
@jungeonkim Is it resolved if you set k8s.requireIPv6PodCIDR: true in your Helm values?
I tried that but it didn't work.
Observing the same issue here. With 1.16, the CiliumNode gets spec.addresses entries for both the IPv4 and IPv6 addresses of the node, while with 1.17 it does not.
My base Node objects do not have their IPv6 address in their status.addresses list (running k3s dual-stack). Does Cilium now need this populated for IPv6 to work?
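One way to answer that question is to inspect what the Node object actually advertises. Below is a small standalone checker (not a Cilium tool; the node name and kubeconfig handling are placeholder assumptions) that lists status.addresses and reports whether any entry is an IPv6 address:

```go
package main

import (
	"context"
	"fmt"
	"net"
	"os"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	kubeconfig := os.Getenv("KUBECONFIG") // placeholder: path to a kubeconfig
	nodeName := "node-1"                  // placeholder: node to inspect

	cfg, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	node, err := client.CoreV1().Nodes().Get(context.Background(), nodeName, metav1.GetOptions{})
	if err != nil {
		panic(err)
	}

	hasV6 := false
	for _, addr := range node.Status.Addresses {
		// A parseable address that has no IPv4 form is an IPv6 address.
		ip := net.ParseIP(addr.Address)
		if ip != nil && ip.To4() == nil {
			hasV6 = true
		}
		fmt.Printf("%s: %s\n", addr.Type, addr.Address)
	}
	fmt.Println("node advertises an IPv6 address:", hasV6)
}
```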
I believe https://github.com/cilium/cilium/pull/38472 is the fix and is slated to be in v1.17.3. If you want to try it now, you can deploy the latest dev version of Cilium based off main.
I just ran quay.io/cilium/cilium:v1.18.0-pre.1@sha256:88711d5016c6969e47e92e5f499ccd80e3df93ab52bdd7bc321c2b4a6a434a9e to test (as the release notes specify that patch is in).
Still seeing time="2025-04-09T13:23:11.205268888Z" level=fatal msg="failed to start: daemon creation failed: unable to connect to get node spec from apiserver: node server01 does not have an IPv6 address\nfailed to stop: unable to find controller ipcache-inject-labels" subsys=daemon
Supplying the kind: Node objects with both IPv4 and IPv6 addresses (in my case by hardcoding IPv4/IPv6 in the k3s config instead of relying on 0.0.0.0 magic for the kubelet node-ip arg) lets Cilium start up successfully, as a workaround.
@oivindoh Interesting. Sounds like a separate issue. The fact that your workaround is working confirms that the PR is the fix.
Same issue here (I used Cilium v1.17.3). I tried to patch an IPv6 address onto the CiliumNode, but it seems Cilium reverts it.
Same issue here with versions 1.17.1 and 1.17.3. After rolling back to 1.16.9 it works.
Today I tested version 1.17.4 and still hit the same problem, while version 1.16.10 works.
@christarazi Today I also tested the v1.18.0-pre.2 image and hit the same problem.
I would like to fix this; I will look into it later.
This is related to the bug discussed in this thread: https://github.com/siderolabs/talos/issues/8115#issuecomment-2307280220
There is a workaround where you change the pod and service subnets to list IPv6 before IPv4.
(also hitting this right off main; my bond0 device did have an IPv6 address, though)
I'm currently unable to repro this in a local kind cluster, but I believe that, as @jungeonkim suggested, the issue is related to the missing call to SetDefaultPrefix that in v1.16 happened as part of the LocalNodeStore initialization, thus before newDaemon. That call set an initial IPv6 address in case it was missing (see makeIPv6HostIP).
This commit, part of https://github.com/cilium/cilium/pull/32381, removed that call from the LocalNodeStore initialization, leading to the above issue (there is another call to setDefaultPrefix in ConfigureIPAM, after the check for the node IPv6 address).
Digging a little bit more into the problem, I think this is also related to https://github.com/cilium/cilium/pull/28953
My understanding is that:
- In https://github.com/cilium/cilium/pull/28953 the initial hard check on the node IPv6 address was added in order to have uniform behavior across different KPR settings when migrating from single-stack to dual-stack nodes (see https://github.com/cilium/cilium/issues/28909 and https://github.com/cilium/cilium/issues/34861#issuecomment-2651655820)
- In https://github.com/cilium/cilium/pull/32381 the initial setup of an IPv6 Unique Local Address was postponed until after the hard check, thus leading to the failure above in environments where the IPv6 address is not immediately available
I opened https://github.com/cilium/cilium/pull/40125 to move the check added in https://github.com/cilium/cilium/pull/28953 to after ConfigureIPAM in the daemon initialization, thus restoring things as they were before https://github.com/cilium/cilium/pull/32381. But I'm wondering whether that check actually fulfills its original intent. Since we set a ULA v6 address if the node does not have one yet, my understanding is that the check will always complete successfully. Am I missing something here, @vipul-21?
Also, I think the solution proposed by @hexchen in https://github.com/cilium/cilium/issues/34861#issuecomment-2652221651 (see his work-in-progress commit here) might be better than the current check, which seems to be responsible for other regressions similar to this one (see https://github.com/cilium/cilium/issues/34861).
cc @joestringer
@pippolo84 The understanding is correct. Was the setting of a ULA v6 address (if not present) added later? Because earlier, when the check was added, the agent used to crash if the node did not have an IPv6 address. If we always have a v6 address, then yes, the check will always complete successfully.
Yep, the setting of the ULA address was already there before the check was added (below is the code just before your commit):
https://github.com/cilium/cilium/blob/3f523d37af1d1a13a286d00c3ed9ff8747e49904/pkg/node/address.go#L128-L140
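For readers who do not want to follow the link, here is a self-contained paraphrase of what that snippet did. It is an approximation, not a verbatim copy; the helper name makeIPv6HostIP is mentioned above, but the exact fallback address and surrounding code may differ:

```go
package main

import (
	"fmt"
	"net"
)

// makeIPv6HostIP mirrors the upstream helper of the same name: it returns a
// fixed unique-local address used as the node's IPv6 host IP when none has
// been configured (the exact address is an assumption).
func makeIPv6HostIP() net.IP {
	ip := net.ParseIP("fc00::10CA:1")
	if ip == nil {
		panic("unable to parse fallback IPv6 host IP")
	}
	return ip
}

func main() {
	enableIPv6 := true
	var nodeIPv6 net.IP // nil: no IPv6 address discovered on the node yet

	// The linked code assigned the ULA fallback before any hard check ran,
	// which is why the fatal error could not trigger in earlier releases.
	if enableIPv6 && nodeIPv6 == nil {
		nodeIPv6 = makeIPv6HostIP()
	}
	fmt.Println("node IPv6 address:", nodeIPv6)
}
```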
So I guess there was something else that cleared that value and led to the fatal error from https://github.com/cilium/cilium/blob/3f499b4568839c1bacbdbdb80904a9b9114b7e50/pkg/node/address.go#L210
However, I see that InitNodePortAddrs was removed in https://github.com/cilium/cilium/commit/7df31944b6e130d373ad4c02931b404a64006d54#diff-9143180298a1f5ba5dc10fd98d87945d1381ef45b7347a639f85e13e84285ca2 (part of https://github.com/cilium/cilium/pull/29033), and with it the error that you reported in https://github.com/cilium/cilium/issues/28909. Therefore, that should not happen anymore since v1.15.
Given all of this, I think it would be better to remove the check added in https://github.com/cilium/cilium/pull/28953 altogether.
If you have a chance to test the main branch again after the removal of the check and find something wrong, we can investigate your case more deeply and see what other solution we can put in place. WDYT?
Yes, that works. I can test and make sure it does not break the upgrade scenarios we have.