ath79-generic (e.g. WR1043ND v4) - WLAN Mesh broken when upgrading to v2022.1.x because of timing issue in boot process

Open rotanid opened this issue 2 years ago • 23 comments

When upgrading from Gluon v2021.1.x to v2022.1.x, WLAN mesh no longer works on a TP-Link TL-WR1043ND v4. The 802.11s mesh interface is shown in "iwinfo" but not on the status page or in "batctl if". The upgrade process was tested from the latest v2021.1.x branch (fresh install) to Gluon v2022.1, v2022.1.1 and v2022.1.2.

The problem does not appear when flashing with "forget settings" and reconfiguring the v2022.1.x firmware from scratch. The problem does not appear with WR1043ND v2 or v3.

A problem with the migration from ar71xx-generic to ath79-generic may have happened, although @AiyionPrime stated in #2431 that everything was working fine - so maybe the issue was introduced later than v2022.1(.0) ?

rotanid avatar Feb 04 '23 02:02 rotanid

This is a v4, that's been active and updated for the last four years: https://hannover.freifunk.net/karte/#/en/map/8416f99bd2d0

With vH31 it is currently running gluon v2022.1.1 (ab1fb054f6be33fdba74c3f3344d9ef746287c68).

Meshing does work and can be seen without problems on its status page: http://[2001:678:978:213:8616:f9ff:fe9b:d2d0]/cgi-bin/status

Though I certainly might have overlooked something, I think the migration was fine. Our sample size is fairly limited though; most of the routers (75) are still on vH25 (ar71xx) and only three or four are ath79.

AiyionPrime avatar Feb 04 '23 05:02 AiyionPrime

tom reported on IRC that in darmstadt there was a similar issue with a TL-WR1043N v5: https://forum.darmstadt.freifunk.net/t/meshausfall-nach-2-6-0-in-dieburger-innenstadt/944

rotanid avatar Feb 04 '23 11:02 rotanid

i just tested the tag v2022.1 and the issue also happens with the initial release of this branch.

@AiyionPrime can you be sure, the linked device was never reconfigured manually? one probably cannot see this from the available data.

i'm currently trying to get my hands on a TL-WR1043N v5 to check if i can reproduce it.

it would also help to check it on a second TL-WR1043ND v4 - does anyone have one lying around and can test?

rotanid avatar Feb 06 '23 00:02 rotanid

No, I can not. We can send the owner an email and ask though, if that helps.

AiyionPrime avatar Feb 06 '23 15:02 AiyionPrime

Just some ideas. If they don't apply, then pls bear with me :) @rotanid Have you taken a look at what logread returns? (Maybe there are errors in there?) Once anyone gets their hands on an affected device: Is this reproducible on a fresh install? (Install 2021, set a few things in config mode, boot once, then update to 2022)

Djfe avatar Feb 06 '23 16:02 Djfe

> Have you taken a look at what logread returns? (Maybe there are errors in there?)

serious question? well ok: no relevant info there as far as i can see.

> Once anyone gets their hands on an affected device: Is this reproducible on a fresh install? (Install 2021, set a few things in config mode, boot once, then update to 2022)

that i have already done and also written in my bug report, i added "(fresh install)" now as it seems it hasn't been clear enough.

rotanid avatar Feb 07 '23 20:02 rotanid

ok my first suggestions were very basic. 😑😅

You probably figured this out, but maybe you can replicate what they did in Darmstadt and compare configs before and after saving config mode:

1. install 2021 fresh
2. config mode 2021, save
3. update 2021 to 2022
4. get relevant info (uci show, /etc/config/wireless, ...)
5. config mode 2022, save
6. get relevant info (uci show, /etc/config/wireless, ...) again
7. compare the two
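The final comparison step can be done with any diff tool. As a minimal sketch in Python (run on a workstation after copying the dumps off the router), the helper name `config_diff` and the sample text are made up for illustration and are not taken from a real device:

```python
import difflib


def config_diff(before: str, after: str) -> list:
    """Return unified-diff lines between two config dumps (empty if identical)."""
    return list(difflib.unified_diff(
        before.splitlines(), after.splitlines(),
        fromfile="after-upgrade", tofile="after-config-mode", lineterm=""))


# Hypothetical sample dumps, only to show the shape of the output.
BEFORE = """config interface 'loopback'
\toption proto 'static'
"""
AFTER = BEFORE + """
config interface 'mesh_radio0'
\toption proto 'gluon_mesh'
"""

if __name__ == "__main__":
    print("\n".join(config_diff(BEFORE, AFTER)))
```

Lines prefixed with `+` exist only in the second dump, which makes a section that the upgrade failed to create easy to spot.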

Djfe avatar Feb 07 '23 23:02 Djfe

The only commit, that happened after adding the device would be https://github.com/openwrt/openwrt/commit/e826b642945ee7b196044a07faddd71c1bd6c6ef

Looking back at the old definition: https://github.com/openwrt/openwrt/blob/openwrt-19.07/target/linux/ar71xx/files/arch/mips/ath79/mach-tl-wr1043nd-v4.c#L62-L67

The absolute MAC offsets on flash are v4 0x1ff50008 vs v5 0x1ff00008. It's obvious that the MACs are placed in the same partition, but the partition has a different offset. In the commit, the definition is moved from the dtsi down to the correct dts files. This should have improved upgrading from the old ar71xx target, but maybe the commit broke something else. They also moved the calibration data for wifi along.

Maybe I found the issue?

#define TL_WR1043_V4_EEPROM_ADDR		0x1fff0000
#define TL_WR1043_V4_WMAC_CALDATA_OFFSET	0x1000

Old address of calibration data 0x1fff1000

New definition based on art partition, but art != eeprom(?)

art: partition@ff0000
mtd-cal-data = <&art 0x1000>;

New address of calibration data 0x1ff01000 (I added 0x1 manually, I assume it is calculated this way)

Possible fix? mtd-cal-data = <&art 0xf1000>;
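The address arithmetic above can be double-checked with a few lines. This is only a sketch: `FLASH_BASE = 0x1f000000` is an assumption that is not stated in the thread (the SPI flash being memory-mapped at that address), and the helper `absolute` is a made-up name.

```python
# Assumption (not stated in the thread): the SPI flash is memory-mapped at
# 0x1f000000, so absolute address = base + partition offset + offset in partition.
FLASH_BASE = 0x1F000000

# Values quoted in the comment above.
OLD_EEPROM_ADDR = 0x1FFF0000     # TL_WR1043_V4_EEPROM_ADDR
CALDATA_OFFSET = 0x1000          # TL_WR1043_V4_WMAC_CALDATA_OFFSET
ART_PARTITION_OFFSET = 0xFF0000  # art: partition@ff0000


def absolute(partition_offset: int, offset: int, base: int = FLASH_BASE) -> int:
    """Absolute address of `offset` inside a partition, under the assumed base."""
    return base + partition_offset + offset


old_addr = OLD_EEPROM_ADDR + CALDATA_OFFSET                # old C definition
new_addr = absolute(ART_PARTITION_OFFSET, 0x1000)          # <&art 0x1000>
proposed_addr = absolute(ART_PARTITION_OFFSET, 0xF1000)    # <&art 0xf1000>

if __name__ == "__main__":
    print(hex(old_addr), hex(new_addr), hex(proposed_addr))
```

If the assumed base is right, `partition@ff0000` plus `0x1000` resolves to 0x1fff1000, the same address as the old definition rather than 0x1ff01000, and an offset of `0xf1000` would point well beyond it.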

Djfe avatar Feb 07 '23 23:02 Djfe

the only difference in config i could find is as follows:

> config interface 'mesh_radio0'
> 	option proto 'gluon_mesh'
> 

looking at the code, there may be an issue with "get_wlan_mac" during the upgrade from 2021.1.x and therefore the upgrade script returns before setting up the above section. https://github.com/freifunk-gluon/gluon/blob/master/package/gluon-core/luasrc/lib/gluon/upgrade/200-wireless#L129

this correlates with what @Djfe found in OpenWrt.

can anyone else follow this reasoning? @blocktrron @NeoRaider @adschm ?

rotanid avatar Feb 07 '23 23:02 rotanid

I feel like we should print an error when there is no wmac to be found (Lines 130-131). This error could happen again for other devices and is best caught by observing some form of log. Since logread is probably not an option for init scripts, could we create a script that writes a log to flash that is only overwritten the next time init scripts are run? (Yes, there are devices with small flash storage, so we either have to keep the log small and pipe it through gzip on write after init completes, or we disable this for tiny-style targets.)

Such a log could be useful for adding new devices, too. It also allows catching mistakes in new code/dts files regarding the initialization. It would be useful alone for all silent returns in the lua file @rotanid linked above.
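To make the idea concrete, here is a rough Python model of such a log: messages are collected in memory, capped in number, and written once as a gzip file that the next run overwrites. The class name `UpgradeLog`, the line cap and the file path are hypothetical, and an on-device version would need to be written in Lua or shell.

```python
import gzip


class UpgradeLog:
    """Collects messages in memory and writes them once, gzip-compressed."""

    def __init__(self, path, max_lines=200):
        self.path = path
        self.max_lines = max_lines  # cap to keep the file small on tiny flash
        self.lines = []

    def log(self, script, message):
        # Messages beyond the cap are dropped silently.
        if len(self.lines) < self.max_lines:
            self.lines.append(f"{script}: {message}")

    def flush(self):
        # Mode "wt" truncates the file, so each run replaces the previous log.
        with gzip.open(self.path, "wt", encoding="utf-8") as f:
            f.write("\n".join(self.lines) + "\n")


def read_log(path):
    """Return the stored log lines."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return f.read().splitlines()
```

A silent return in an upgrade script would then become something like `log.log("200-wireless", "no wmac found, skipping")` followed by the return.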

Djfe avatar Feb 08 '23 00:02 Djfe

i bought a TL-WR1043N v5 and tested it. this device has exactly the same problem :-( it was therefore erroneously tested as "working" in #2483

rotanid avatar Feb 09 '23 00:02 rotanid

> Possible fix? mtd-cal-data = <&art 0xf1000>;

@Djfe i tested this, it doesn't work and instead soft-bricks the device on upgrade

rotanid avatar Feb 09 '23 02:02 rotanid

after looking a bit into it with the help of rmilecki from openwrt it seems like the issue may not be in the OpenWrt dts ...

rotanid avatar Feb 13 '23 21:02 rotanid

after a discussion in today's Gluon meetup we want to debug the band migration as well:
https://github.com/freifunk-gluon/gluon/blob/master/package/gluon-core/luasrc/lib/gluon/upgrade/200-wireless#L211
https://github.com/freifunk-gluon/gluon/blob/master/package/gluon-core/luasrc/lib/gluon/upgrade/005-wireless-migration#L10

An idea was to add "prints" or write the debug output to some persistent file to find out which line breaks the upgrade scripts.

rotanid avatar Feb 15 '23 20:02 rotanid

so after many hours i'm closer to the problem - without a solution.

during first boot after upgrade, when the upgrade scripts run, the call to get_htmode in 200-wireless fails and therefore no config update (the lines after) is written: https://github.com/freifunk-gluon/gluon/blob/master/package/gluon-core/luasrc/lib/gluon/upgrade/200-wireless#L198

get_htmode fails in this line when trying to find the phy: https://github.com/freifunk-gluon/gluon/blob/master/package/gluon-core/luasrc/lib/gluon/upgrade/200-wireless#L80

find_phy from wireless.lua fails here: https://github.com/freifunk-gluon/gluon/blob/master/package/gluon-core/luasrc/usr/lib/lua/gluon/wireless.lua#L56
this is the call to find_phy_by_path.

during this first boot-upgrade-run the path is set to platform/qca956x_wmac but there is no path containing qca956x_wmac in /sys/devices/platform; therefore, neither of the following lines in find_phy_by_path can find any phy:
https://github.com/freifunk-gluon/gluon/blob/master/package/gluon-core/luasrc/usr/lib/lua/gluon/wireless.lua#L22
https://github.com/freifunk-gluon/gluon/blob/master/package/gluon-core/luasrc/usr/lib/lua/gluon/wireless.lua#L27

the actual path of the phy would be /sys/devices/platform/ahb/18100000.wmac (at least on TL-WR1043ND v4) and later this path is correctly set. therefore, an additional run of the 200-wireless script fixes the problem.
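the failure mode can be modelled in a few lines. this is a simplified Python sketch, not Gluon's Lua code: `lookup_phy` and `make_fake_sysfs` are hypothetical helpers, and the `ieee80211/phy0` layout below the device directory is an assumption about how the phy shows up in sysfs.

```python
import os


def lookup_phy(sysfs_devices, path):
    """Return the first phy name below <sysfs_devices>/<path>/ieee80211, or None."""
    phy_dir = os.path.join(sysfs_devices, path, "ieee80211")
    if not os.path.isdir(phy_dir):
        return None  # stale or not-yet-migrated path: nothing to find
    names = sorted(os.listdir(phy_dir))
    return names[0] if names else None


def make_fake_sysfs(root):
    """Create a fake device tree with the phy at the path reported above."""
    os.makedirs(os.path.join(
        root, "platform", "ahb", "18100000.wmac", "ieee80211", "phy0"))
    return root
```

with the old path `platform/qca956x_wmac` the lookup returns nothing, while `platform/ahb/18100000.wmac` resolves to the phy, which matches the observation that a second run after the path has been migrated succeeds.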

seems to me like a timing issue during first boot after the sysupgrade. anyone with further ideas for debugging/fixing , e.g. @blocktrron @NeoRaider ?

rotanid avatar Feb 19 '23 23:02 rotanid

after talking about it with @NeoRaider on IRC we found out that it may be a timing issue. in rare cases the hotplug.d scripts seem to be run too late in procd context, and therefore an important migration for ar71xx->ath79 is missing when gluon's upgrade scripts run.

this is the openwrt migration script: https://git.openwrt.org/?p=openwrt/openwrt.git;a=blob;f=target/linux/ath79/base-files/etc/hotplug.d/ieee80211/00-wifi-migration;h=f7393a0d0371bab38a70a7fdb93d558689c5c074;hb=refs/heads/openwrt-22.03
this should be run by procd.

the upgrade scripts are run by uci-defaults and this call starts it: https://git.openwrt.org/?p=openwrt/openwrt.git;a=blob;f=package/base-files/files/etc/init.d/boot;h=749d9e971141c63542e220bbd5c175f40041b174;hb=refs/heads/openwrt-22.03#l50

i verified the theory by adding a 5 second delay in one of gluon's first upgrade scripts here: https://github.com/freifunk-gluon/gluon/blob/master/package/gluon-core/luasrc/lib/gluon/upgrade/005-wireless-migration#L2

with that sleep-hack, the upgrade works fine!

so the already existing hack in OpenWrt seems to be too little: https://git.openwrt.org/?p=openwrt/openwrt.git;a=blob;f=package/base-files/files/etc/init.d/boot;h=749d9e971141c63542e220bbd5c175f40041b174;hb=refs/heads/openwrt-22.03#l46

it would be nice to find a solution that doesn't depend on timing but is deterministic... maybe @NeoRaider comes up with an idea, otherwise we might need to add some seconds of sleep in Gluon
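one middle ground between a fixed sleep and a fully deterministic fix would be a bounded wait that polls for the condition and gives up after a timeout. the following Python sketch only illustrates that pattern; it is not the workaround used in Gluon, `wait_until` is a made-up helper, and which condition to poll (for example, whether the migrated phy path exists) is left open here.

```python
import time


def wait_until(predicate, timeout=5.0, interval=0.5,
               sleep=time.sleep, clock=time.monotonic):
    """Poll `predicate` until it returns True or `timeout` seconds have passed.

    `sleep` and `clock` are injectable so the helper can be tested without
    real delays.
    """
    deadline = clock() + timeout
    while True:
        if predicate():
            return True   # condition met, continue immediately
        if clock() >= deadline:
            return False  # give up; the caller decides how to proceed
        sleep(interval)
```

this still depends on timing in the worst case, but it returns as soon as the condition holds and makes the failure case explicit instead of silent.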

rotanid avatar Feb 20 '23 01:02 rotanid

i created a pull request for the workaround: https://github.com/freifunk-gluon/gluon/pull/2792

this issue stays open as long as we have no deterministic fix

rotanid avatar Feb 20 '23 23:02 rotanid

removing the issue from the milestones as workarounds have been implemented.

rotanid avatar Feb 25 '23 23:02 rotanid

If I remember correctly, the wifi startup somehow happens asynchronously and you simply cannot depend on it during procd startup. That's why you have to rely on these hotplug.d scripts if you want to configure anything after they have come up. But I might be wrong, it's been a while since I dealt with this stuff.

adschm avatar Mar 05 '23 19:03 adschm

should we revert this commit now? gluon master doesn't support upgrading from v2021.1.x any longer atm. (the bridges were burned unless anyone wants to step up and keep them maintained for the v2023.2 release)

Djfe avatar Sep 13 '23 17:09 Djfe

@Djfe I don't think this issue is specific to the update from 2021.1.x to 2022.1.x; it could easily occur on any upgrade that requires updating the wireless UCI config.

neocturne avatar Sep 13 '23 18:09 neocturne

@neocturne this could be the fix for our issue as well, no? https://git.openwrt.org/?p=project/netifd.git;a=commitdiff;h=516ab774cc16d4b04b3b17a067cbf2649f1adaeb;hp=40ed7363caf2b22b6e29ed9d9948189c2bc4c8f3
the issue which led to this commit is here: https://github.com/openwrt/openwrt/issues/13598

rotanid avatar Nov 07 '23 23:11 rotanid

forget the above "fix", because jow wrote on IRC:

10:27:33 < jow > rotanid: well there is ieee80211 hotplug events which work for that. I think the reason for this particular hack is the fact that there's uci-defaults scripts which want to mangle the default wifi config
10:27:42 < jow > rotanid: and those uci-defaults script run very early
10:28:14 < jow > rotanid: a proper solution would be moving whatever logic is needed from uci-defaults into the wifi reconf code path

maybe someone has an idea how to implement this in order to replace the sleep hack

rotanid avatar Nov 09 '23 23:11 rotanid