v2018.2.x wifi mesh link loss

Open 2tata opened this issue 5 years ago • 37 comments

I've noticed that the wifi meshing functionality of gluon-v2018.2.1-13 (eed810aac1b0f6795622907b0de7dbc0fbfadc9d) seems to be partly broken. I've updated some devices from gluon-v2017.1.8-7 (e968a225be0efafbd2d4638d0bd530e2d4e841f3) to gluon-v2018.2.1-13 (eed810aac1b0f6795622907b0de7dbc0fbfadc9d), and 11s meshing seems broken towards nodes on the older firmware as well as, partly, towards other gluon-v2018.2.1-13 nodes. Are there known problems with the wifi driver?

Bug report

What is the problem?

The node TataWilh210-SO (gluon-v2018.2.1-13) used to see ffnw-galaxy2 (gluon-v2017.1.8-7) at nearly 80-90%, but now it is lower than 10%. The same goes for the link from TataWilh510-NW (gluon-v2018.2.1-13; it got bricked while I was trying to downgrade it from v2018.2.1-13 to v2017.1.8-7) to OldenburgRaiffeisenLappan5 (gluon-v2018.2.1-13): before, the link was at about 60-70%, and now it is gone.

A more extreme case: both OldenburgRaiffeisenMosle5 (gluon-v2018.2.1-13) and OldenburgRaiffeisenLappan5 (gluon-v2018.2.1-13) were able to see Oldenburg-Lappan-kismet (gluon-v2017.1.8-7) at almost 100% in both directions. Now it is 96% - 0%. The zero (or 1%) in one direction is striking.

Affected devices were:
  • TP-Link TL-WR1043N/ND v1
  • Ubiquiti Loco M XW
  • TP-Link TL-WR741N/ND v4
  • Ubiquiti UniFi-AC-PRO
  • TP-Link CPE210 v1.1
  • TP-Link CPE510 v1.1

Site Configuration:

  • gluon-v2018.2.1-13-gc41a1a64: site.conf
  • gluon-v2017.1.8-7: site.conf

2tata avatar Jun 23 '19 08:06 2tata

do you have graphs backing/showing this? we didn't see an issue with ~60 experimental nodes and @mweinelt doesn't see the issue either. Also, do you have devices which aren't affected? If there really is a problem, it may be possible to narrow it down to a driver.

nevertheless, if there really is a problem, it isn't a blocker for v2018.2.2 release as it would most likely not be a regression from v2018.2.1

EDIT: please note, the "bug" tag only means that someone has reported a bug, not that it definitely is one.

rotanid avatar Jun 23 '19 11:06 rotanid

  • Please check whether the v2018.2 release is affected
  • To find the cause of the issue, please bisect to find the first affected commit (on one device, while keeping the other side unchanged)

Note that for downgrades, you must always flash without keeping config (sysupgrade -n).

neocturne avatar Jun 23 '19 13:06 neocturne

I just saw the same problem with one node in a larger mesh. It fixed itself right away, by the next stats run, so I was unable to get a screenshot in time. The node had 97% on one side, 0% on the other. At least for my setup, it solved itself without a reboot. [Maybe some OGMs got lost? Just a guess.]

What you mean is something like that, right?

image

kevin-olbrich avatar Jun 24 '19 08:06 kevin-olbrich

Kloepfer2 kloepfer3

I know this problem and it's worse. It's probably not a wifi-related issue; I got this even with (one) cable mesh link. One direction suddenly drops to 0%; nodes go offline but are in some cases still reachable from their direct (one-sided) next hop. I've got this more than once a day; most of the time it repairs itself within some hours; otherwise a reboot resolves the problem. The issue is new and happens to links that have existed for years.

I guess it needs a bigger mesh with some bad connections to happen; but the affected connection can be quite good.

Examples:
  • https://map.ffmuc.net/#!/de/map/6466b37ba442-6466b37b8d9e (cable)
  • https://map.ffmuc.net/#!/de/map/68725112d104-60e327e6fe16 (wifi)

Affected models/links:
  • Loco-841 (wifi)
  • NSM-841 (wifi)
  • wdr3600-wdr3600 (cable)
  • Loco-NSM (wifi)
  • wdr4300-wdr4300 (wifi)

DerKalle avatar Jun 25 '19 18:06 DerKalle

It immediately fixed itself until the next stats

So this was not a reboot? Can you obtain a syslog of such an event and push it to a pastebin?

Adorfer avatar Jun 26 '19 01:06 Adorfer

Not sure, but some comments on this issue (e.g. by @DerKalle) sound similar to #1148. It could be related to https://bugs.openwrt.org/index.php?do=details&task_id=863 (but it could be something different, too)

ghost avatar Jun 26 '19 01:06 ghost

I have an ArcherC7v2 with gluon2018.2.1 that has the same problem (loss of mesh link) from time to time. But this problem has existed for some years now, also with earlier builds. My workaround is a watchdog that pings the uploader-IPv6 and does a "wifi" if the ping fails. A message sent via webhook after each successful "wifi" tells me that in one week it happens one to three times a day, and in the next week it doesn't happen for many days. Another 841 on gluon 2018.2.x (and earlier) with uplink (in heavy use) and mesh neighbours needs a watchdog for the wireless link like: iwinfo | grep -A 6 $DEV | grep "Bit Rate: unknown" && { echo "wireless mesh down!" ...... so that it gets fixed from time to time with a "wifi".
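For anyone who wants to try a similar check, here is a minimal sketch of the detection part only. It assumes that iwinfo starts each interface block with the unindented interface name and prints "Bit Rate: unknown" when the link is down, as in the one-liner above. The function name mesh_is_down and the interface name mesh0 are made up for illustration, and this is not the watchdog actually running on those nodes.

```shell
#!/bin/sh
# Sketch of a mesh-link check (assumption: iwinfo output starts each
# interface block with the unindented interface name and prints
# "Bit Rate: unknown" when the interface has no usable link).

# mesh_is_down DEV
# Reads iwinfo output on stdin. Exit status 0 if the block that belongs
# to DEV contains "Bit Rate: unknown", 1 otherwise.
mesh_is_down() {
    awk -v dev="$1" '
        # an unindented line starts a new interface block
        /^[^ \t]/ { in_block = ($1 == dev) }
        in_block && /Bit Rate: unknown/ { found = 1 }
        END { exit(found ? 0 : 1) }
    '
}

# Hypothetical usage on a node (mesh0 is an assumed interface name):
#   if iwinfo | mesh_is_down mesh0; then
#       logger "wireless mesh down, restarting wifi"
#       wifi
#   fi
```

Matching on the interface block instead of a fixed "grep -A 6" window avoids picking up the bit rate line of a neighbouring interface if the iwinfo layout differs between versions.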

But on the other hand most of the other 600+ nodes in our net are running without these problems.

tackin avatar Jun 28 '19 06:06 tackin

@tackin for the record: which kind of "loss" are we talking about now? To me it looks like we are mixing different scenarios/symptoms of "wifi not as good as in 2016.2" into one pot:

How to categorize?
  a) total loss of mesh links at a node "since update to version x" (systematic)
  b) sporadic loss of mesh links at a node "since update to version x" ("after runtime")
  c) constant degradation of mesh links at a node "since update to version x" (systematic)
  d) sporadic degradation of mesh links at a node "since update to version x" ("after runtime")

To me, this issue from @2tata looks more like "scenario c", while you are talking about "scenario b". Correct?

Adorfer avatar Jun 29 '19 17:06 Adorfer

@Adorfer Correct. I'm talking about "scenario b".

tackin avatar Jun 29 '19 20:06 tackin

The problem is also visible with the current v2018.2.2 at another location, this time in Osnabrück. Reboots, reflashing etc. don't help. The problem remains persistent. https://map.ffnw.de/#!/en/map/f81a67eeda96

2tata avatar Jun 30 '19 20:06 2tata

Strangely, it does not affect all links, but the ones that are affected are always affected.

2tata avatar Jun 30 '19 20:06 2tata

Could "minimum-MCS" (the various "rates") play a role here? For example, "the old firmware still allowed 2MBit/s links", or even 6000 ones (I don't have the rates in my head right now).

Adorfer avatar Jul 01 '19 01:07 Adorfer

On July 1, 2019 1:06:14 AM UTC, Adorfer [email protected] wrote:

Could "minimum-MCS" (the various "rates") play a role here? For example, "the old firmware still allowed 2MBit/s links", or even 6000 ones (I don't have the rates in my head right now).

I don't think so. The affected links were previously shown at 90-100%.

It rather looks as if the devices can partly no longer see each other, or as if something about the timing doesn't fit during the 11s link setup.

2tata avatar Jul 01 '19 05:07 2tata

I don't think so. The affected links were previously shown at 90-100%.

please be aware that TQ ("packet loss") is not related to link speed (MCS etc). So you may have a "95% TQ" link at 11b (2 MBit/s) and a "45% TQ" link at 54 MBit/s. Of course, both values are just examples. But I'd like you to verify, if possible (by going back to the old FW on such a pair of nodes which lost their link totally): what was the link speed you had with 2016.2(?) on the affected nodes? You will probably need iw/iwinfo for that.
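To compare link speeds before and after a firmware change, something like the following sketch could help. It assumes the usual "iw dev IFACE station dump" output with "tx bitrate:" and "rx bitrate:" lines per station; the function name station_rates and the interface name in the usage comment are made up for illustration.

```shell
#!/bin/sh
# station_rates
# Reads "iw dev IFACE station dump" output on stdin and prints one line
# per peer: MAC address, tx bitrate and rx bitrate ("?" if not reported).
station_rates() {
    awk '
        function emit() { if (mac != "") print mac, "tx=" tx, "rx=" rx }
        # a "Station <mac> (on <iface>)" line starts a new peer
        $1 == "Station" { emit(); mac = $2; tx = "?"; rx = "?" }
        $1 == "tx" && $2 == "bitrate:" { tx = $3 " " $4 }
        $1 == "rx" && $2 == "bitrate:" { rx = $3 " " $4 }
        END { emit() }
    '
}

# Hypothetical usage on a node (mesh0 is an assumed interface name):
#   iw dev mesh0 station dump | station_rates
```

Running this on the same pair of nodes with the old and the new firmware gives the link speed next to the TQ shown on the map, so the two can be looked at separately.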

Adorfer avatar Jul 01 '19 08:07 Adorfer

I don't think so. The affected links were previously shown at 90-100%.

please be aware that TQ ("packet loss") is not related to link speed (MCS etc). So you may have a "95% TQ" link at 11b (2 MBit/s) and a "45% TQ" link at 54 MBit/s. Of course, both values are just examples.

Ah sure this could be.

But I'd like you to verify, if possible (by going back to the old FW on such a pair of nodes which lost their link totally): what was the link speed you had with 2016.2(?) on the affected nodes? You will probably need iw/iwinfo for that.

No, as I wrote in the opening text, the update was from gluon-v2017.1.8-7 (e968a22) to gluon-v2018.2.1-13 (eed810a). I was assuming the 2 MBit/s (802.11b) rate had already been dropped in v2016.2? Or am I missing some changes in v2018.2.x?

2tata avatar Jul 01 '19 08:07 2tata

WiFi rates have been configurable since 6ff94aca3573913f957bf31b9042d143e46d57e1.

That has been deprecated in 4f60f6dbc6705e4b7d40f1e508c121d342c0c8f8, where 802.11b is now disabled by default.

mweinelt avatar Jul 01 '19 09:07 mweinelt

I hope to get some more information during a bigger event on the weekend, when we create a big mesh of different routers. Maybe we will gather some useful information there.

But be aware we are running BATMAN_V so I don't know if our problem will match ...

awlx avatar Jul 16 '19 12:07 awlx

@awlx ~~But please be aware, our problem is depending on CLIENTS not on MESH-neighbors.~~ oops ... sorry, I posted to the wrong issue.

tackin avatar Jul 16 '19 14:07 tackin

So ... the first evening passed in this big mesh: https://map.ffmuc.net/#!/en/map/6466b3ded7df

And we already tracked down some issues.

  1. The ath9k-broken-wifi workaround caused many aborts of mesh links ... we built an experimental firmware without it
  2. gw_sel_class 1500 seems to be waaaayyy too low for BATMAN_V; we will raise it to the BATMAN_V default => 5000 https://www.open-mesh.org/issues/366
  3. Wifi Mesh seems to trigger the issue way more often

awlx avatar Jul 19 '19 22:07 awlx

No, as I wrote in the opening text, the update was from gluon-v2017.1.8-7 (e968a22) to gluon-v2018.2.1-13 (eed810a). I was assuming the 2 MBit/s (802.11b) rate had already been dropped in v2016.2? Or am I missing some changes in v2018.2.x?

@2tata so in Gluon the forced removal was only done in the master branch, not in v2018.2.x. Have you changed your site.conf in this regard?

rotanid avatar Aug 31 '19 14:08 rotanid

No, as I wrote in the opening text, the update was from gluon-v2017.1.8-7 (e968a22) to gluon-v2018.2.1-13 (eed810a). I was assuming the 2 MBit/s (802.11b) rate had already been dropped in v2016.2? Or am I missing some changes in v2018.2.x?

@2tata so in Gluon the forced removal was only done in the master branch, not in v2018.2.x. Have you changed your site.conf in this regard?

I've just got the default from the gluon docs: ffnw site

2tata avatar Sep 09 '19 18:09 2tata

ok, so it was already that way before upgrading according to the history of the linked site.conf..... old and new fw seem to use 11s , too... i'm out of ideas i guess :(

rotanid avatar Sep 09 '19 23:09 rotanid

I stumbled over this issue while trying to figure out what's happening in a network ... my thing appears to be unrelated, but here are some ideas anyway.

First, it is important to identify the culprit. The numbers / stats on Freifunk maps are influenced by so many different systems that it is almost impossible to identify the actual cause from them:

  1. Looking at node maps / stats shows numbers from a web-application, usually aggregated from a time series database. Bugs in queries (especially influxdb time aggregation) can result in seeing non-existing values (e.g. having a link at 100% and a point in time not processed yet, yields an average at 50% in certain influxdb queries => You will see 50% link quality from time to time). If you want to know the actual numbers, you have to query influxdb or respondd manually.

  2. All statistics are gathered and transmitted using respondd. It is not given per-se, that respondd is bug free. To verify their correctness, one has to connect to a router and execute the actual command. Also, packet loss due to overload situations can result in respondd packets getting lost.

  3. After ssh'ing to a broken device, you can debug the setup using batctl. It offers a lot of options for diagnoses. https://downloads.open-mesh.org/batman/manpages/batctl.8.html - as a first step, try to check if the values on the map actually correspond to things you're observing using batctl.

  4. batman evaluates links using broadcasts. It is possible that the numbers are correct but do not reflect the quality of a link for internet traffic. Typically, batman-adv uses broadcast frames to evaluate the quality of a link (let's ignore batman_v for the moment). Unlike general traffic, broadcasts use a fixed data rate and no retransmissions. This causes:

    1. broadcasts to be prone to loss in high-load situations. If two senders transmit at the same time, the broadcast is lost, whereas regular data is retransmitted.
    2. broadcasts to have a longer range. The slow data rate is excellent for links spanning a few hundred meters (up to kilometers). Regular internet traffic, however, is prone to loss due to timeouts (if a link is too long, an acknowledgement arrives too late and the packet is lost) and due to minstrel probing different rates. A nice illustration is given on page 8 in https://www.researchgate.net/publication/273069659_Link_Calibration_in_QoS-Aware_Wireless_Back-Haul_Networks_for_Rural_Areas
  5. Now things get difficult. You have to debug whether batman-adv is correctly discovering the topology (so the bad numbers reflect the actual network), whether batman-adv fails to detect the topology as specified or, as a last resort, whether the topology is skewed and there is likely a bug in the kernel or (wifi) driver. Typically, you start by analyzing the topology and the radio / VPN situation. iw does a good job in debugging the 802.11s topology. First, iw dev $WIFI_MESH_IFACE station dump gives you the status of all connected nodes. This is the topology reported by the Linux wifi system (only 1-hop links).

Station 00:0a:f5:2b:40:f0 (on wlan1)
        inactive time:  8230 ms
        rx bytes:       135761
        rx packets:     1123
        tx bytes:       227923
        tx packets:     569
        tx retries:     26
        tx failed:      5
        rx drop misc:   0
        signal:         -33 [-35, -36] dBm
        signal avg:     -33 [-36, -36] dBm
        tx bitrate:     150.0 MBit/s MCS 7 40MHz short GI
        rx bitrate:     6.0 MBit/s
        expected throughput:    46.875Mbps
        authorized:     yes
        authenticated:  yes
        associated:     yes
        preamble:       short
        WMM/WME:        yes
        MFP:            no
        TDLS peer:      no
        DTIM period:    2
        beacon interval:100
        short preamble: yes
        short slot time:yes
        connected time: 1869 seconds

Observe tx retries: this value denotes retransmissions of unicast data. A retransmission of a unicast frame corresponds to the loss of a broadcast frame, unless it is caused by minstrel. However, excluding minstrel by setting fixed data rates is a last resort and amounts to calibrating the link. Try to get some intuition for these values instead.
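To get a feeling for these counters, it can help to put them in relation to the number of transmitted packets. The following is only a sketch under the assumption that the station dump looks like the one above; the function name retry_ratio is made up, and the counters are cumulative since the peer connected, so the result is a long-term average and not a current value.

```shell
#!/bin/sh
# retry_ratio
# Reads "iw dev IFACE station dump" output on stdin and prints, per peer,
# "tx retries" and "tx failed" as a percentage of "tx packets".
# Peers without any transmitted packets are skipped.
retry_ratio() {
    awk '
        function emit() {
            if (mac != "" && pkts > 0)
                printf "%s retries=%.1f%% failed=%.1f%%\n", mac, 100 * retries / pkts, 100 * failed / pkts
        }
        # a "Station <mac> (on <iface>)" line starts a new peer
        $1 == "Station" { emit(); mac = $2; pkts = retries = failed = 0 }
        $1 == "tx" && $2 == "packets:" { pkts = $3 }
        $1 == "tx" && $2 == "retries:" { retries = $3 }
        $1 == "tx" && $2 == "failed:"  { failed = $3 }
        END { emit() }
    '
}

# Usage on a node, with the interface variable used above:
#   iw dev "$WIFI_MESH_IFACE" station dump | retry_ratio
```

For the station shown above (569 tx packets, 26 retries, 5 failed) this gives roughly 4.6% retries and 0.9% failed. Comparing two runs a few minutes apart says more than a single reading.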

You can also try iw dev mesh mpath dump (c.f. https://github.com/o11s/open80211s/wiki/mpath) - however, that's a bit boring, if you're using batman-adv instead of HWMP. Another important metric is airtime. It somewhat gives you the remaining capacity for transmitting data and loss due to overload.

iw dev wlan0 survey dump
Survey data from wlan0
        frequency:                      2412 MHz
        channel active time:            3 ms
        channel busy time:              3 ms
        channel receive time:           0 ms
        channel transmit time:          0 ms
Survey data from wlan0
        frequency:                      2417 MHz
Survey data from wlan0
        frequency:                      2422 MHz
Survey data from wlan0
        frequency:                      2427 MHz
Survey data from wlan0
        frequency:                      2432 MHz [in use]
        noise:                          -88 dBm
        channel active time:            1473079612 ms
        channel busy time:              191941128 ms
        channel receive time:           118895968 ms
        channel transmit time:          48977721 ms

All values are based on wall-clock time, and you can observe the packets. Mind that, due to DCF, an available airtime of "50%" does not imply a beacon loss of "50%", but it helps to understand packet loss due to overload situations.
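The airtime share can be read off the survey dump by dividing busy, receive and transmit time by the active time of the channel in use. A small sketch, assuming the survey dump format shown above; the function name channel_load is made up, and since the counters are cumulative, the result is an average over the whole time the radio has been on that channel.

```shell
#!/bin/sh
# channel_load
# Reads "iw dev IFACE survey dump" output on stdin and prints busy,
# receive and transmit time as a percentage of the active time, using
# only the survey block whose frequency is marked "[in use]".
# Exit status 1 if no usable data was found.
channel_load() {
    awk '
        # every survey block has a frequency line; remember if it is in use
        $1 == "frequency:" { in_use = ($0 ~ /\[in use\]/) }
        # e.g. "channel busy time: 191941128 ms" -> t["busy"] = 191941128
        in_use && $1 == "channel" && $3 == "time:" { t[$2] = $4 }
        END {
            if (t["active"] > 0) {
                printf "busy=%.1f%% rx=%.1f%% tx=%.1f%%\n", 100 * t["busy"] / t["active"], 100 * t["receive"] / t["active"], 100 * t["transmit"] / t["active"]
            } else {
                exit 1
            }
        }
    '
}

# Usage on a node:
#   iw dev wlan0 survey dump | channel_load
```

For the dump above, the channel at 2432 MHz comes out at roughly 13% busy, of which about 8% is receive time and about 3% is transmit time. To see the current load, one would take two dumps some seconds apart and use the differences of the counters.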

Some rules of thumb, that could help:

  1. Detect and avoid overload situations (CPU, packets). They'll have a severe impact on batman-adv's ability to detect links correctly
  2. Although discussed here, issues related to batman-adv and wifi drivers do not concern gluon alone and ought to be addressed to the openmesh and openwrt communities (respectively)
  3. Try to understand the wireless situation (e.g. using Android apps or Wireshark). Are there a lot of retransmissions? Are there transmissions at all? Do you see 802.11s traffic in Wireshark, and does it look reasonable?
  4. Do not use Freifunk map information for debugging. Those values give you some idea of problems but are not detailed enough for debugging.

yanosz avatar Mar 02 '20 11:03 yanosz

@yanosz Thanks, maybe move it to the wiki instead?

@2tata I don't see this issue anywhere here, is this issue still relevant?

mweinelt avatar Mar 03 '20 00:03 mweinelt

ok - what wiki are you thinking of? Maybe the gluon docs suit better (e.g. a section "Debugging Mesh")? I'm not sure.

The guide addresses the debugging that happened in this ticket and tries to point out that Freifunk maps are hardly suitable for detailed debugging. Nevertheless, I skipped some remarks on detecting node problems (kernel ringbuffer, sys/proc filesystem) that appeared to be irrelevant in this case. IMHO it'd be great if other people could share their experience, too, and we had a wiki article on that.

Regarding this ticket, I think it is likely ok to close it as "invalid". Nevertheless, I'm not the one to decide this. IMHO the information in this ticket is rather vague and has hardly anything to do with gluon as of this github project. It concerns only data from a 3rd party web application that does not allow deriving any details relevant for gluon ..

From my understanding, the configuration generated and used in gluon as of this github project is correct. Hence, the ticket should be filed for

  • OpenWRT, if it's WLAN driver problem
  • batman-adv, if it's a topology discovery issue
  • respondd, if data is wrong
  • To the map software, if data is not displayed correctly
  • To local Freifunk communities, if somebody tries to build a network that does not perform

Nevertheless, we can make this a ticket for the gluon docs, like: "How to detect problems in your mesh and report problems accordingly". For project hygiene, I'd suggest opening a new ticket instead. This saves people from having to read all the comments.

[Edit] Personal note: Having gluon based on an OpenWRT branch not released by the OpenWRT project can lead to WLAN / driver and kernel combinations that are not used by many people in the OpenWRT community. Hence, problems are less well known. IMHO this is somewhat unfortunate (both projects would profit if releases were in sync), but this is out of scope for this ticket. [/Edit]

yanosz avatar Mar 03 '20 10:03 yanosz

@2tata I don't see this issue anywhere here, is this issue still relevant?

We have plenty of them: https://map.ffnw.de/#!/en/map/10feedaf4426-788a20f295c3

It looks very much like the good old ath9k bug...

2tata avatar Mar 03 '20 19:03 2tata

@2tata, I guess it would help to provide more information according to the guide I posted.

yanosz avatar Mar 03 '20 22:03 yanosz

as there were no detailed reports about such an issue with any of the newer releases, i'm closing this issue.

if you want to report a new/similar issue, please also read what yanosz wrote about the details which need to be provided, we can't work with "there's a problem" and most likely such an issue is not a Gluon one, as yanosz also wrote.

rotanid avatar Sep 02 '20 23:09 rotanid

Affected Nodes:
  • https://map.ffnw.de/#!/c46e1f415bcc-6872510a334e
  • https://map.ffnw.de/#!/0c8063e70d13-6872510a334e
  • https://map.ffnw.de/#!/b04e26b09bf8-8416f9c8ab1a
  • https://map.ffnw.de/#!/60e327cf0982-68725138caed
  • https://map.ffnw.de/#!/f09fc27b3754-74da88ef2288
  • https://map.ffnw.de/#!/8416f9c8b5a4-74da88ef2288

I have access to none of them. But we have around 350 nodes affected. It looks quite like the behavior of the ATH9K bug, but I assume there are also ATH10K devices among them.

Gluon version is: gluon-v2020.1.2-3-g281f2ea6

Therefore I think we could reopen this issue

2tata avatar Sep 03 '20 07:09 2tata

you may be right. @mbaumga also noticed a few issues in our network.

but: this has to be tracked and fixed upstream at OpenWrt or even further up, nothing will happen here at Gluon no matter how much we post or how long we have this open issue ;-)

rotanid avatar Sep 03 '20 22:09 rotanid