
NetworkManager occasionally fails to activate AP profile properly

Open majorz opened this issue 2 years ago • 1 comments

When NetworkManager creates an access point with Internet connection sharing enabled, it adds a number of iptables rules. Such an access point profile, when activated on boot, may partially fail due to a collision with balenaEngine: balenaEngine creates iptables rules at the same time and holds the xtables.lock file while doing so, so some of the rules NetworkManager creates may fail. The problem appears to be in NetworkManager, as it does not use the -w (--wait) option of iptables.

This issue means that our hotspot documentation example does not always work: https://www.balena.io/docs/reference/OS/network/2.x/#creating-a-hotspot

After we reproduce it locally and confirm that the missing wait in NetworkManager's usage of iptables is the root cause, we may file a bug report with NetworkManager and possibly include a NetworkManager patch temporarily until the issue is resolved upstream.

The current workaround is to create the AP profile from a container, or reactivate it there, or inject any missing rules from a container.
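A minimal sketch of the last option (injecting missing rules from a container), assuming the container has enough privileges and host networking to modify the host's iptables rules, and an `iptables` binary on its PATH. The names `ensure_rule` and `run` are hypothetical helpers, not part of NetworkManager or balenaOS; `--wait`, `--check` and `--insert` are standard iptables options.

```python
import subprocess
from typing import Callable, List, Sequence

# A runner takes an argv list and returns the process exit status.
Runner = Callable[[Sequence[str]], int]


def run(argv: Sequence[str]) -> int:
    """Run a command and return its exit status without raising."""
    return subprocess.run(list(argv), check=False).returncode


def ensure_rule(
    table: str,
    chain: str,
    rule_spec: List[str],
    runner: Runner = run,
) -> bool:
    """Insert a rule only if it is missing. Returns True if it was inserted.

    Every call passes --wait so that iptables blocks until the xtables lock
    is free instead of failing while another process holds it.
    """
    base = ["iptables", "--wait", "--table", table]

    # --check exits with 0 when an identical rule already exists.
    if runner(base + ["--check", chain, *rule_spec]) == 0:
        return False

    status = runner(base + ["--insert", chain, *rule_spec])
    if status != 0:
        raise RuntimeError(f"iptables --insert failed with exit status {status}")
    return True
```

For example, `ensure_rule("filter", "FORWARD", ["--in-interface", "wlan0", "--out-interface", "wlan0", "--jump", "ACCEPT"])` would re-add one of the rules from the log below if it is absent. Because `--insert` places the rule at the top of the chain, the order in which missing rules are re-added matters and should follow the order NetworkManager uses.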

An example log that illustrates the failure (it was reported that this can happen on arbitrary rules):

Apr 05 22:00:12 9e8f192 NetworkManager[1571]: [1649196012.3579] Executing: /usr/sbin/iptables --table filter --insert INPUT --in-interface wlan0 --protocol tcp --destination-port 53 --jump ACCEPT
Apr 05 22:00:12 9e8f192 NetworkManager[1571]: [1649196012.3666] Executing: /usr/sbin/iptables --table filter --insert INPUT --in-interface wlan0 --protocol udp --destination-port 53 --jump ACCEPT
Apr 05 22:00:12 9e8f192 NetworkManager[1571]: [1649196012.3769] Executing: /usr/sbin/iptables --table filter --insert INPUT --in-interface wlan0 --protocol tcp --destination-port 67 --jump ACCEPT
Apr 05 22:00:12 9e8f192 NetworkManager[1571]: [1649196012.3867] Executing: /usr/sbin/iptables --table filter --insert INPUT --in-interface wlan0 --protocol udp --destination-port 67 --jump ACCEPT
Apr 05 22:00:12 9e8f192 NetworkManager[1571]: [1649196012.3971] Executing: /usr/sbin/iptables --table filter --insert FORWARD --in-interface wlan0 --jump REJECT
Apr 05 22:00:12 9e8f192 NetworkManager[1571]: [1649196012.4172] Executing: /usr/sbin/iptables --table filter --insert FORWARD --out-interface wlan0 --jump REJECT
Apr 05 22:00:12 9e8f192 NetworkManager[1571]: [1649196012.4306] Executing: /usr/sbin/iptables --table filter --insert FORWARD --in-interface wlan0 --out-interface wlan0 --jump ACCEPT
Apr 05 22:00:12 9e8f192 NetworkManager[1571]: [1649196012.4387] Command returned exit status 4.
Apr 05 22:00:12 9e8f192 NetworkManager[1571]: [1649196012.4389] Executing: /usr/sbin/iptables --table filter --insert FORWARD --source 10.42.0.0/255.255.255.0 --in-interface wlan0 --jump ACCEPT
Apr 05 22:00:12 9e8f192 NetworkManager[1571]: [1649196012.4478] Command returned exit status 4.
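To find which rules were dropped in a log like the one above, the "Executing:" lines can be paired with the exit status lines that follow them. This is a rough sketch that assumes each "Command returned exit status" line refers to the most recent "Executing:" line, which matches the ordering shown here; the function name `failed_commands` is hypothetical. As far as I know, exit status 4 is the code iptables uses for a resource problem, which is what it reports when it cannot take the xtables lock.

```python
import re
from typing import Iterable, List, Tuple

# Patterns match the NetworkManager journal format shown in this issue.
EXEC_RE = re.compile(r"Executing: (?P<cmd>\S*iptables\b.*)$")
STATUS_RE = re.compile(r"Command returned exit status (?P<status>\d+)\.")


def failed_commands(log_lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (exit_status, command) for each iptables call that failed."""
    failures: List[Tuple[int, str]] = []
    last_cmd = None
    for raw in log_lines:
        line = raw.strip()
        match = EXEC_RE.search(line)
        if match:
            last_cmd = match.group("cmd")
            continue
        match = STATUS_RE.search(line)
        if match and last_cmd is not None:
            status = int(match.group("status"))
            if status != 0:
                failures.append((status, last_cmd))
            # Each status line is consumed by at most one command.
            last_cmd = None
    return failures


# Lines copied from the log excerpt in this issue.
SAMPLE_LOG = [
    "Apr 05 22:00:12 9e8f192 NetworkManager[1571]: [1649196012.4172] Executing: /usr/sbin/iptables --table filter --insert FORWARD --out-interface wlan0 --jump REJECT",
    "Apr 05 22:00:12 9e8f192 NetworkManager[1571]: [1649196012.4306] Executing: /usr/sbin/iptables --table filter --insert FORWARD --in-interface wlan0 --out-interface wlan0 --jump ACCEPT",
    "Apr 05 22:00:12 9e8f192 NetworkManager[1571]: [1649196012.4387] Command returned exit status 4.",
    "Apr 05 22:00:12 9e8f192 NetworkManager[1571]: [1649196012.4389] Executing: /usr/sbin/iptables --table filter --insert FORWARD --source 10.42.0.0/255.255.255.0 --in-interface wlan0 --jump ACCEPT",
    "Apr 05 22:00:12 9e8f192 NetworkManager[1571]: [1649196012.4478] Command returned exit status 4.",
]

if __name__ == "__main__":
    for status, cmd in failed_commands(SAMPLE_LOG):
        print(status, cmd)
```

In this excerpt the two failures are both ACCEPT rules in the FORWARD chain, while the REJECT rules before them succeeded, which would leave forwarding for hotspot clients blocked.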

majorz avatar Apr 13 '22 15:04 majorz

[majorz] This issue has attached support thread https://jel.ly.fish/1cddcde2-71d0-478a-8187-14ff3982f0da

jellyfish-bot avatar Apr 13 '22 15:04 jellyfish-bot

[majorz] This has attached https://jel.ly.fish/be7aa45f-1dd6-453e-b3ae-45be377cd696

jellyfish-bot avatar Nov 18 '22 08:11 jellyfish-bot

Related: https://github.com/balena-os/meta-balena/issues/2872. We need to update NetworkManager to the latest stable release before patching it, because that part of the code base was refactored.

majorz avatar Nov 21 '22 13:11 majorz

Starting to work on this now.

majorz avatar Dec 16 '22 09:12 majorz

I could not reproduce this manually, so I created a small application to automate the process: https://github.com/balena-io-experimental/iptables-racing-2581. Going to try this in different ways.

majorz avatar Dec 19 '22 14:12 majorz

So far I have not been able to reproduce the race condition on a few device types with different NetworkManager versions. I am going to try some extra ideas.

majorz avatar Dec 20 '22 17:12 majorz

If you are able to reproduce this here, as Balena support has done before, I am happy for you to use this device for testing: https://dashboard.balena-cloud.com/devices/eae2c9461555b2b5adda532ff1f04c6c

This device is also a more accurate reflection of our current-generation systems. It is offline at the moment, but I've asked for it to be turned back on for your testing: https://dashboard.balena-cloud.com/devices/c209aebb5589435f8ce40d7b21a3ff75

louisburton avatar Dec 20 '22 19:12 louisburton

Hope you've had good holidays, Zahari. Were you able to attempt reproduction on the devices above? I believe you reproduced the problem on these previously. Let me know if there's anything else I can provide, @majorz. Thanks!

louisburton avatar Dec 28 '22 13:12 louisburton

I was able to find a way to reproduce this locally each time (after getting back from a long holiday). Starting to work on the patch now.

majorz avatar Jan 10 '23 12:01 majorz

The following PR fixes this (still in draft state): https://github.com/balena-os/meta-balena/pull/2963

I also created an issue upstream to hear the maintainers' feedback on this: https://gitlab.freedesktop.org/NetworkManager/NetworkManager/-/issues/1182

majorz avatar Jan 10 '23 17:01 majorz

Thx! Great to see the upstream fix as well. I would note, though, that CI is failing; not sure if that's a warning for this current draft PR: https://gitlab.freedesktop.org/NetworkManager/NetworkManager/-/commit/686a5e792f6ac0f71347c836d490df43e949ceb3#note_1716991

Once this is rolled into an OS upgrade, I am happy for the devices linked above to be used to confirm the issue, upgrade, and confirm it is resolved.

louisburton avatar Jan 11 '23 11:01 louisburton

I've now asked whether the CI failure is related to the change. I am awaiting the answer and will incorporate the upstream fix instead of mine if everything is alright.

majorz avatar Jan 11 '23 15:01 majorz

Great news on the merge, thx for your efforts @majorz! Do we have to wait for a NetworkManager release? I am wary that it will take a further 2 weeks. Can we perhaps do a test OS with a branch NM release to confirm that you don't see evidence of reproduction on one of our test devices? This issue has unfortunately dragged on for a long time and I'm keen to get a fix out asap.

louisburton avatar Jan 16 '23 11:01 louisburton

I included the upstream version of the patch in the related PR.

majorz avatar Jan 16 '23 11:01 majorz

Now waiting for the CI checks to pass. I will try to merge it asap, so that we get a new release soon as well.

majorz avatar Jan 16 '23 11:01 majorz

@majorz The related PR was also kindly merged yesterday by Alex.

Does this mean we're now able to produce an OS release that includes this?

Can we please produce a release for the apolloOS balena device type, and could we then upgrade the above linked systems and confirm what was reproducible is no longer reproducible? 🙏 (it is less obvious for me to spot when iptables is corrupt): https://github.com/balena-os/meta-balena/issues/2581#issuecomment-1360061901

Thanks!

louisburton avatar Jan 19 '23 15:01 louisburton

@louisburton Just sent you an update on the other thread. Closing this now as the fix already landed in meta-balena.

Fixed by #2963.

majorz avatar Jan 24 '23 15:01 majorz