Flatcar 3975.2.1 Bonding Config Bug
Description
We've encountered a problem with bonding configs after our most recent Flatcar upgrade from v3760.2.0 to v3975.2.1. The behavior is odd in that the bond0 interface actor churn does not always begin after the initial upgrade reboot; instead, it most frequently appears after a subsequent reboot.
- We can commonly recover by rebooting, but that does not always fix it
- We have tried bringing the affected bond0 interface down and back up, but that doesn't seem to have any effect
- We tried upgrading to the next known stable, 3975.2.2, but we see the same problem
- We tried downgrading to v3760.2.0 and that worked: the interface no longer enters churn
- We then tried upgrading back to 3975.2.1, rebooting after the upgrade reboot, and churn reappeared
Impact
Nodes that are rebooted after the initial upgrade reboot go into churn on the secondary bond0 interface and are subsequently unable to communicate with other nodes in the cluster.
Environment and steps to reproduce
- Set-up: bare-metal Flatcar OS 3760.2.0, upgraded via Nebraska to Flatcar OS 3975.2.1
- Task: after the node is upgraded and rebooted, it is rebooted a second time and churn appears, causing lag during node login and while running commands
- Action(s):
  a. Reboot the node after the initial upgrade reboot
  b. Node login and commands begin to hang and take many seconds to minutes to complete
  c. /proc/net/bonding/bond0 shows churn on the secondary interface and has no system MAC address present
- Error: other nodes were unable to communicate with the affected node
Expected behavior
Rebooted nodes are expected to continue communicating with the other nodes in the cluster.
Additional information
We were asked for the following:
- Underlying hardware (network devices / virt environments etc.): bare metal
- Problematic Flatcar OS: 3975.2.1 & 3975.2.2
- Working Flatcar OS: 3760.2.0
- systemd: 252
- Kubernetes versions: 1.28.x to 1.31.x
- Server: Dell R6515
- Switch: Juniper EX4300-48T
dmesg
[ 18.646307] ice 0000:41:00.1 enp65s0f1np1: NIC Link is up 25 Gbps Full Duplex, Requested FEC: RS-FEC, Negotiated FEC: FC-FEC/BASE-R, Autoneg Advertised: On, Autoneg Negotiated: False, Flow Control: None
[ 18.666616] bond0: (slave enp65s0f1np1): Enslaving as a backup interface with an up link
[ 18.675470] bond0: Warning: No 802.3ad response from the link partner for any adapters in the bond
[ 18.686452] ice 0000:41:00.1 enp65s0f1np1: Error ADDING CP rule for fail-over
[ 18.693782] ice 0000:41:00.1 enp65s0f1np1: Shared SR-IOV resources in bond are active
[ 18.702648] ice 0000:41:00.0: Primary interface not in switchdev mode - VF LAG disabled
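For comparing a working boot against a broken one, a minimal way to pull these kernel messages per boot (a sketch; the PCI address and interface names are taken from the dmesg above, adjust for your NICs):

```sh
# Kernel messages for the bond and the ice NICs from the current boot
journalctl -k -b 0 --no-pager | grep -iE 'bond0|ice 0000:41:00'

# Same for the previous boot (requires a persistent journal)
journalctl -k -b -1 --no-pager | grep -iE 'bond0|ice 0000:41:00'
```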
We were asked to try the following but are still seeing issues:
- Create /etc/systemd/network/98-bond-mac.link
- Add the following to the newly created /etc/systemd/network/98-bond-mac.link:
[Match]
Type=bond
[Link]
MACAddressPolicy=none
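To double-check that this .link file is actually applied to bond0 (rather than racing the bond creation), a quick sketch using standard udev tooling; the device path assumes the bond is named bond0:

```sh
# Dry-run udev's link setup for bond0; the output names the .link file chosen
udevadm test-builtin net_setup_link /sys/class/net/bond0

# Current bond MAC vs. the LACP system MAC the driver advertises
cat /sys/class/net/bond0/address
grep -i 'system mac' /proc/net/bonding/bond0
```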
Can you try the alpha releases between 3760 and 3975? This would help narrow it down:
- 3794
- 3815
- 3850 -> first with kernel 6.6
- 3874
- 3913 -> first with systemd 255
- 3941
@jepio
Please see the upgrade process and results below:
- 3975.2.1
  - Reboot X 3
  - Have `churn` on all 3 reboots
  - No `system mac address` on 2nd bond0 interface during all 3 reboots
  - Downgraded to 3760.2.0
- 3760.2.0
  - Reboot X 3
  - No `churn` on all three reboots
  - Has `system mac address` on 2nd bond0 interface during all 3 reboots
  - Upgraded to 3794.0.0
- 3794.0.0
  - Reboot X 3
  - No `churn` on all three reboots
  - Has `system mac address` on 2nd bond0 interface during all 3 reboots
  - Upgraded to 3815.0.0
- 3815.0.0
  - Reboot X 3
  - No `churn` on all three reboots
  - Has `system mac address` on 2nd bond0 interface during all 3 reboots
  - Upgraded to 3850.0.0

Please note the following is the first presence of churn

- 3850.0.0
  - Reboot X 3
  - Have `churn` on 1st reboot
  - No `system mac address` on 2nd bond0 interface during 1st reboot
  - No `churn` on 2nd reboot
  - Has `system mac address` on 2nd bond0 interface during 2nd reboot
  - Have `churn` on 3rd reboot
  - No `system mac address` on 2nd bond0 interface during 3rd reboot
  - Upgraded to 3874.0.0
- 3874.0.0
  - Reboot X 3
  - Have `churn` on 1st reboot
  - No `system mac address` on 2nd bond0 interface during 1st reboot
  - Have `churn` on 2nd reboot
  - No `system mac address` on 2nd bond0 interface during 2nd reboot
  - No `churn` on 3rd reboot
  - Has `system mac address` on 2nd bond0 interface during 3rd reboot
  - Upgraded to 3913.0.0

Please note the following has no presence of churn

- 3913.0.0
  - Reboot X 3
  - No `churn` on all three reboots
  - Has `system mac address` on 2nd bond0 interface during all 3 reboots
  - Upgraded to 3941.0.0

Please note churn returns in the following

- 3941.0.0
  - Reboot X 3
  - systemd 255
  - No `churn` on first 2 reboots
  - Has `system mac address` on 2nd bond0 interface during first 2 reboots
  - Have `churn` on 3rd reboot
  - No `system mac address` on 2nd bond0 interface during 3rd reboot
  - Upgraded to 3975.0.0
- 3975.0.0
  - Reboot X 3
  - systemd 255
  - No `churn` on first 2 reboots
  - Has `system mac address` on 2nd bond0 interface during first 2 reboots
  - Have `churn` on 3rd reboot
  - No `system mac address` on 2nd bond0 interface during 3rd reboot
  - Upgraded to 3975.2.1
- 3975.2.1
  - Reboot X 3
  - systemd 255
  - Have `churn` on 1st reboot
  - No `system mac address` on 2nd bond0 interface during 1st reboot
  - No `churn` on 2nd reboot
  - Has `system mac address` on 2nd bond0 interface during 2nd reboot
  - Have `churn` on 3rd reboot
  - No `system mac address` on 2nd bond0 interface during 3rd reboot
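For anyone reproducing the matrix above: the per-reboot checks were read from /proc/net/bonding/bond0. A minimal spot-check after each reboot could look like this (a sketch; field names come from the 802.3ad section of that file and the exact layout varies by kernel):

```sh
# LACP churn state and system MAC for the bond and each slave
grep -iE 'slave interface|churn|system mac' /proc/net/bonding/bond0
```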
Hello, this looks to be a concurrency issue between the unit that enforces/creates the bond and the unit that enforces the /etc/systemd/network/98-bond-mac.link. Can you give more details, if possible, on how the bonds are configured - is it a butane/ignition config or another configuration file/agent? This would be valuable to reproduce the issue locally.
Also, is it possible to maybe try a version of Flatcar with a different kernel / systemd to see if the issue does still happen (you can find a Flatcar image artifact here with kernel 6.11 https://github.com/flatcar/scripts/actions/runs/11594744048 and one Flatcar image artifact here with systemd 256 https://github.com/flatcar/scripts/actions/runs/11557455799). My bet would be on a different systemd version.
Thanks.
Can we also compare `networkctl list` and `networkctl status` between a working and a broken version?
@jepio
Please see the information you requested below:
3975.2.1
networkctl list
IDX LINK TYPE OPERATIONAL SETUP
1 lo loopback carrier unmanaged
2 enp65s0f0np0 ether enslaved configured
3 enp65s0f1np1 ether enslaved configured
4 bond0 bond routable configured
networkctl status
Interfaces: 1, 2, 3, 4
State: routable
Online state: online
Address: x.x.x.x on bond0
x.x.x.x on bond0
x:x:x:x on bond0
x:x:x:x on bond0
Gateway: x.x.x.x on bond0
x:x:x:x on bond0
DNS: x.x.x.x
x.x.x.x
Dec 03 20:23:17 systemd[1]: Starting systemd-networkd-wait-online.service - Wait for Network to be Configured...
Dec 03 20:23:17 systemd-networkd[1990]: bond0: Configuring with /etc/systemd/network/20-bond0.network.
Dec 03 20:23:17 systemd-networkd[1990]: enp65s0f0np0: Link UP
Dec 03 20:23:17 systemd-networkd[1990]: enp65s0f1np1: Link UP
Dec 03 20:23:17 systemd-networkd[1990]: bond0: Link UP
Dec 03 20:23:17 systemd-networkd[1990]: enp65s0f0np0: Gained carrier
Dec 03 20:23:17 systemd-networkd[1990]: enp65s0f1np1: Gained carrier
Dec 03 20:23:17 systemd-networkd[1990]: bond0: Gained carrier
Dec 03 20:23:17 systemd[1]: Finished systemd-networkd-wait-online.service - Wait for Network to be Configured.
Dec 03 20:23:19 systemd-networkd[1990]: bond0: Gained IPv6LL
3760.2.0
networkctl list
IDX LINK TYPE OPERATIONAL SETUP
1 lo loopback carrier unmanaged
2 ens3f0 ether enslaved configured
3 ens3f1 ether enslaved configured
4 bond0 bond routable configured
networkctl status
State: routable
Online state: online
Address: x.x.x.x on bond0
x.x.x.x on bond0
x:x:x:x on bond0
x:x:x:x on bond0
Gateway: x.x.x.x on bond0
x:x:x:x on bond0
DNS: x.x.x.x
x.x.x.x
Dec 03 20:50:49 systemd[1]: Starting systemd-networkd-wait-online.service - Wait for Network to be Configured...
Dec 03 20:50:49 systemd-networkd[1714]: ens3f0: Configuring with /etc/systemd/network/00-nic.network.
Dec 03 20:50:49 systemd-networkd[1714]: bond0: Link UP
Dec 03 20:50:49 systemd-networkd[1714]: ens3f1: Link UP
Dec 03 20:50:49 systemd-networkd[1714]: ens3f0: Link UP
Dec 03 20:50:49 systemd-networkd[1714]: ens3f1: Gained carrier
Dec 03 20:50:49 systemd-networkd[1714]: bond0: Gained carrier
Dec 03 20:50:49 systemd-networkd[1714]: ens3f0: Gained carrier
Dec 03 20:50:49 systemd[1]: Finished systemd-networkd-wait-online.service - Wait for Network to be Configured.
Dec 03 20:50:51 systemd-networkd[1714]: bond0: Gained IPv6LL
@ader1990
First Question
> Hello, this looks to be a concurrency issue between the unit that enforces/creates the bond and the unit that enforces the /etc/systemd/network/98-bond-mac.link. Can you give more details, if possible, on how the bonds are configured - is it a butane/ignition config or another configuration file/agent? This would be valuable to reproduce the issue locally.
- We use an ignition file. Here is the `networkd` part of that file:
    networkd:
      units:
        - name: 00-nic.network
          contents: |
            [Match]
            Name=!bond0
            MACAddress={{.mac1}} {{.mac2}} {{.mac_add}}
            [Network]
            Bond=bond0
        - name: 10-bond0.netdev
          contents: |
            [NetDev]
            Name=bond0
            Kind=bond
            MACAddress={{.mac1}}
            [Bond]
            TransmitHashPolicy=layer3+4
            MIIMonitorSec=.1
            UpDelaySec=.2
            DownDelaySec=.2
            Mode=802.3ad
            LACPTransmitRate=fast
        - name: 20-bond0.network
          contents: |
            [Match]
            Name=bond0
            [Network]
            DNS={{ .dns1 }}
            DNS={{ .dns2 }}
            [Address]
            Address={{.public_ip4}}
            [Address]
            Address={{.public_ip6}}
            [Address]
            Address={{.private_ip4}}
            [Route]
            Destination=x.x.x.x/x
            Gateway={{.public_gw4}}
            [Route]
            Destination=x:x:x/x
            Gateway={{.public_gw6}}
            [Route]
            Destination=x.x.x.x/x
            Gateway={{.private_gw4}}
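As a side note on the ordering/concurrency question, a quick way to confirm which of these files networkd actually applied on a given boot (a sketch; interface names taken from the 3975.2.1 `networkctl list` output above):

```sh
# The "Network File:" line shows which .network file matched each link
networkctl status bond0 --no-pager
networkctl status enp65s0f0np0 --no-pager

# Drop-ins are processed in lexical order, so the file names matter
ls -1 /etc/systemd/network/
```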
Second Question
> Also, is it possible to maybe try a version of Flatcar with a different kernel / systemd to see if the issue does still happen (you can find a Flatcar image artifact here with kernel 6.11 https://github.com/flatcar/scripts/actions/runs/11594744048 and one Flatcar image artifact here with systemd 256 https://github.com/flatcar/scripts/actions/runs/11557455799). My bet would be on a different systemd version.
- I couldn't find a link to a release specifically
- What do I need to do?
Hello @DFYT42,
Can you try the Flatcar image from the artifacts tab produced by this github action run: https://github.com/flatcar/scripts/actions/runs/12375552738?pr=2145 ?
This image has a newer version of systemd (v256).
Thank you, Adrian.
@ader1990 Sorry for the delayed response. It looks like over the holiday break these artifacts may have expired. Can you let us know if there's somewhere else we can grab the necessary artifacts? Or do you have any idea if this will be in one of the upcoming alpha releases that we'd be able to test out?
Or separately, if there's somewhere we can point the installer to grab this beforehand, we'd be happy to test from there as well.
@tylerauerbeck Yes, systemd v256 will be included in the alpha release next week. You could use the artifacts from that release to test, or you could look into the images from the nightlies.
@ader1990 @sayanchowdhury
We upgraded our test box from 3975.2.1 to 4230.0.0. Unfortunately, the churn problem still exists:
1st boot into 4230.0.0, after the upgrade: churn
2nd reboot: churn
3rd reboot: churn
4th reboot: no churn
5th reboot: churn & kicked out
6th reboot: no churn
7th reboot: no churn
8th reboot: churn
What do you think next steps should be for troubleshooting?
Thank you in advance!
Please pardon my naive question (I'm not really literate re: Linux network bonding), but shouldn't we first create the bond0 netdev and then add network devices to it? The way I read the config above, 00-nic.network runs before 10-bond0.netdev; shouldn't it be the other way around?
Hi @t-lo
This particular configuration has worked for other Flatcar versions and node spin ups.
Hi there @t-lo !
I apologize for the long response time, but I did not actually hit comment on my response. This turned out to be a good thing, because I had applied your suggestion incorrectly (I simply switched the order of the bonding configs without actually changing the number prefixes).
So, after changing the number prefixes and seeing the order change to the following, we are still seeing churn:
- 00-bond0.netdev
- 10-nic.network
- 20-bond0.network
We are on Flatcar 3975.2.1.
Hello, after some more digging I believe the issue may have been introduced in kernel v6.6.
We see the error `Error ADDING CP rule for fail-over`, which was added in this commit and first shipped in kernel v6.6.
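For reference, one way to locate the change that introduced this message is git's pickaxe search on a kernel checkout (a sketch; the in-tree format string may differ slightly from what dmesg prints, hence the broader grep first):

```sh
cd linux  # a checkout of the upstream kernel tree
# Find where the message lives in the ice driver today
git grep -n "CP rule" -- drivers/net/ethernet/intel/ice
# Find the commit(s) that added or changed that string
git log -S "CP rule" --oneline -- drivers/net/ethernet/intel/ice
```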
Here are some additional logs between boots which resulted in a working and not working bond.
working:
[ 18.801464] bond0: (slave enp65s0f0np0): Enslaving as a backup interface with an up link
[ 18.801753] ice 0000:41:00.0 enp65s0f0np0: Shared SR-IOV resources in bond are active
[ 19.043889] ice 0000:41:00.1 enp65s0f1np1: Interface added to non-compliant SRIOV LAG aggregate
[ 19.043934] bond0: (slave enp65s0f1np1): Enslaving as a backup interface with an up link
[ 21.151260] bond0: active interface up!
not working:
[ 18.192189] bond0: (slave enp65s0f0np0): Enslaving as a backup interface with an up link
[ 18.212257] ice 0000:41:00.0 enp65s0f0np0: Shared SR-IOV resources in bond are active
[ 18.455885] bond0: (slave enp65s0f1np1): Enslaving as a backup interface with an up link
[ 18.456494] ice 0000:41:00.1 enp65s0f1np1: Error ADDING CP rule for fail-over
[ 18.471107] ice 0000:41:00.1 enp65s0f1np1: Shared SR-IOV resources in bond are active
[ 20.489834] bond0: active interface up!
I think some noteworthy differences are:
- `Error ADDING CP rule for fail-over` is seen on boots when bonding has issues.
- `Shared SR-IOV resources in bond are active` is seen twice on boots when bonding has issues.
- `Interface added to non-compliant SRIOV LAG aggregate` is not logged for the second interface on boots when bonding has issues.
In addition to the logs, /proc/net/bonding/bond0 reports different Aggregator IDs for the first and second interface on boots where bonding is not working.
This may also relate to the `Shared SR-IOV resources in bond are active` log message being shown twice on boots where bonding is not working but only once when it is working.
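A quick way to see the aggregator mismatch described above (a sketch; on a working boot both slaves should report the same Aggregator ID as the active aggregator):

```sh
# Active aggregator plus the per-slave aggregator IDs
grep -E 'Slave Interface|Aggregator ID' /proc/net/bonding/bond0
```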
It was suggested to try the latest dev build for the next release of Flatcar using Linux Kernel 6.12. Since upgrading, we have not been able to reproduce the issue.
@DFYT42 @t-lo Is there any rough estimate on when the 6.12 kernel will land in stable?
> @DFYT42 @t-lo Is there any rough estimate on when the 6.12 kernel will land in stable?
I would roughly say around October/November (we usually have 5-6 months between two new major stables, and the last one was at the end of June). I will let @sayanchowdhury confirm.
That said, if someone has time to bisect the kernel to find which commit is causing the breakage, we can try to submit a fix for kernel 6.6 (we have done that a few times).
I agree with @tormath1. Stable is still a couple of months away. A good approach could be to backport the patch here.
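If someone does pick up the kernel bisect, the flow would look roughly like the sketch below. Assumptions: a local stable kernel tree, the ability to build and boot test kernels on the affected hardware, and that the last good Flatcar release before 3850 shipped a 6.1-series kernel (adjust the good tag to the actual version):

```sh
# Any mirror of the stable kernel tree works
git clone https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git
cd linux

# Limit the bisect to the suspect drivers to cut down the number of steps
git bisect start -- drivers/net/ethernet/intel/ice drivers/net/bonding
git bisect bad v6.6     # Flatcar 3850, first release showing churn
git bisect good v6.1    # assumed kernel of the last good release

# At each step: build, boot the test kernel on the affected node, reboot a few
# times, check /proc/net/bonding/bond0 for churn, then mark the result:
#   git bisect good    # or: git bisect bad
# Repeat until git names the first bad commit, then clean up with:
#   git bisect reset
```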