OpenThread children attaching timeout
Is your enhancement proposal related to a problem? Please describe.
OpenThread uses the Settings subsystem from Zephyr. In the final part of the attaching process, a parent candidate (router) processes a Child ID Request from a child that wants to join the network, and, if it accepts the request, the parent uses the Settings API to read all of its current children info saved on flash and to write the new entry. It then sends the Child ID Response.
Depending on which hardware/platform you are using, accessing settings-based storage (read + write) can take so long that the child dismisses that router as a parent candidate.
Note that the router performs a settings read for every child already in its table and, generally, only a single settings write for the new child.
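For context, here is a minimal sketch (not the actual OpenThread code) of the settings traffic this attach path generates, using the OpenThread platform settings API; the buffer size and loop structure are assumptions:

```c
#include <openthread/error.h>
#include <openthread/instance.h>
#include <openthread/platform/settings.h>

static otError add_child_info(otInstance *instance,
                              const uint8_t *child, uint16_t child_len)
{
    uint8_t buf[32]; /* assumed big enough for one serialized ChildInfo */

    /* One flash read per child already attached... */
    for (int index = 0;; index++) {
        uint16_t len = sizeof(buf);

        if (otPlatSettingsGet(instance, OT_SETTINGS_KEY_CHILD_INFO,
                              index, buf, &len) != OT_ERROR_NONE) {
            break; /* no more stored entries */
        }
        /* ...the real code validates/deduplicates each entry here. */
    }

    /* ...followed by a single flash write for the new child. */
    return otPlatSettingsAdd(instance, OT_SETTINGS_KEY_CHILD_INFO,
                             child, child_len);
}
```

The loop is why the cost grows linearly with the number of children already attached, while the write cost stays constant.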
Describe alternatives you've considered
We are currently upgrading to LTS2 and migrating to NVS settings, because it is faster than LFS on v2.7.0 and reduces the problem we were having in earlier versions with children timing out a parent candidate. When using internal flash, our application can reach ~50 children per parent with NVS, compared to about 10-20 with LFS.
Describe the solution you'd like
After sending a Child ID Request, the child waits 1250 ms before timing out the parent candidate. That means the total time a parent has to process the request is bounded by:
time_to_read * number_of_children_attached + time_to_write + other_steps < 1250 ms
Since I understand the settings reads to be the bottleneck (explained later), a single read should take no more than ~2 ms if we want to reach 511 children attached to a single router device, which is the maximum the protocol allows.
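To make that concrete, a rough worked number (taking the ~40 ms NVS write time reported further below and neglecting the other steps):

time_to_read < (1250 ms - 40 ms) / 511 ≈ 2.4 ms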
I'm not sure what could be done, but I think it's worth discussing, given that the practical number of children in some cases is only 10% of what the protocol allows. I don't know whether major performance improvements can still be made in the settings backends, or whether OpenThread's usage of Settings could be optimized. If nothing can be done about it, it would be nice to at least have a warning on the CONFIG_OPENTHREAD_MAX_CHILDREN symbol noting that storage performance can limit its practical value.
Additional context
I suppose some HW OpenThread tests are run with QSPI external flash. I haven't had problems when using QSPI, since it is considerably faster. However, our router devices use an nRF52840 custom board without external flash.
I've forked Zephyr and created a simple application so others can reproduce this. I'm using a u-blox NINA-B3 EVK board in this test, but I expect other boards like the nRF52840 DK to show the same results. I flashed the router code on one board and the child code on another, kept the router plugged into the computer to collect its stats over logging, and kept rebooting the child device with the Reset button so the router's child table kept growing.
To test it, I've done:
west init --mr child_attaching_timeout -m https://github.com/rodrigobrochado/zephyr.git zephyrproject
cd zephyrproject/zephyr
west update
cd samples/net/openthread/settings_lfs
# on child device:
west build --board ubx_evkninab3_nrf52840 -- -DCONFIG_OPENTHREAD_MTD=y -DCONFIG_OPENTHREAD_MTD_SED=y
# on router device:
west build --board ubx_evkninab3_nrf52840 -- -DCONFIG_OPENTHREAD_FTD=y -DCONFIG_SETTINGS_FS=y
west flash
To switch the router to NVS, I used -DCONFIG_SETTINGS_NVS=y instead of -DCONFIG_SETTINGS_FS=y.
With the branch above cherry-picked on top of v3.0.0, I could attach only about 40 children on LFS; but if you keep trying to add children, it eventually compresses the settings file, which speeds up the Settings processing so more children can be added. This happens a few times, which is not ideal, but it could reach more than 100 children in the end. With NVS I could attach no more than 58 children to the router.
With the branch on top of main (c31cb007, the default), the LFS backend starts having problems earlier, at ~25 children, and it was harder to keep adding children than on v3.0.0. The same 58-children limit is reached with the NVS router.
I was preparing to bring a bunch of stats about the time elapsed in each step, but the runs are not strictly comparable, and it is easier to see with the logging I've added on the fork. To sum up: in the example application above, the write takes at most 40ms on NVS and generally 100ms on LFS. That seems a reasonable value, and since the write happens only once, it is not the bottleneck. On the other hand, the read time increases proportionally to the number of children in the router table (Settings), as expected, and it can exceed 35ms on both backends.
These read times are well above the 2ms budget I mentioned, causing attaching problems, but maybe I'm being too demanding of a Cortex-M4 internal flash.
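For reference, here is a minimal sketch of the kind of instrumentation behind the "Read in: ...ms" log lines shown below; the fork's actual logging may differ, and settings_get_impl() is a hypothetical name standing in for the real body of Zephyr's otPlatSettingsGet():

```c
#include <zephyr.h>
#include <logging/log.h>
#include <openthread/platform/settings.h>

LOG_MODULE_REGISTER(net_otPlat_settings, LOG_LEVEL_INF);

/* Hypothetical: the unmodified read implementation. */
extern otError settings_get_impl(otInstance *aInstance, uint16_t aKey,
                                 int aIndex, uint8_t *aValue,
                                 uint16_t *aValueLength);

otError otPlatSettingsGet(otInstance *aInstance, uint16_t aKey, int aIndex,
                          uint8_t *aValue, uint16_t *aValueLength)
{
    int64_t start = k_uptime_get();
    otError error = settings_get_impl(aInstance, aKey, aIndex,
                                      aValue, aValueLength);

    LOG_INF("Read in: %lldms", k_uptime_get() - start);
    return error;
}
```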
I appreciate any comments. Thanks!
I'm now concluding that this is actually a bug.
As I mentioned, we are updating to and testing Zephyr v2.7.0 (with NCS 1.8.0) on our nRF52840 custom boards, with NVS settings. We are conducting a lab test with 40 OpenThread SED devices, 3 routers (FTD), and one Raspberry Pi 3B border router. We've enabled logging on one router and collect its logs over SWD (J-Link). After a clean wipe of the flash (nrfjprog -e), the time elapsed for reading a ChildInfo and writing one is around 1 ms and 50 ms, respectively.
The test is in a noisy environment, and the routers are intentionally not very close to each other, with RSSI around -60 to -80 dBm.
As time passes, the router sometimes reboots itself due to the network watchdog but otherwise keeps operating normally. However, the time to read a ChildInfo starts increasing, as if the settings storage were accumulating garbage entries that are never collected; as if the storage is aging. The log below was captured at the moment of a child attaching, 2 or 3 days after the clean flash wipe, and you can see that the time to read one ChildInfo has increased to ~100 ms, even though the router has only 2 children in its table. I will try to keep the router running for more days, but it will certainly continue to increase.
Log
```
[12:20:06.966,796] net_openthread: [INFO]-MAC-----: Received IPv6 UDP msg, len:114, chksum:ab87, from:7a7f71b789ef80b2, sec:no, prio:net, rss:-68.0
[12:20:06.967,071] net_openthread: [INFO]-MAC-----: src:[fe80:0:0:0:787f:71b7:89ef:80b2]:19788
[12:20:06.967,346] net_openthread: [INFO]-MAC-----: dst:[fe80:0:0:0:3872:36ba:4133:1d09]:19788
[12:20:06.968,750] net_openthread: [INFO]-MLE-----: Receive Child ID Request (fe80:0:0:0:787f:71b7:89ef:80b2)
[12:20:06.969,360] net_openthread: [INFO]-MLE-----: Child 0x0000 IPv6 address[1]=fd50:638d:bfa7:0:ac36:c404:c4d2:a31
[12:20:06.969,635] net_openthread: [INFO]-MLE-----: Child 0x0000 has 1 registered IPv6 address, 1 address stored
[12:20:07.072,540] net_otPlat_settings: Read in: 103ms
[12:20:07.173,797] net_otPlat_settings: Read in: 101ms
[12:20:07.265,686] net_otPlat_settings: Read in: 92ms
[12:20:07.265,991] net_openthread: [INFO]-CORE----: [settings] Read ChildInfo {rloc:0x2c1a, extaddr:06ccccb3d22a6ae3, timeout:240, mode:0x04, version:2}
[12:20:07.357,177] net_otPlat_settings: Read in: 91ms
[12:20:07.357,513] net_openthread: [INFO]-CORE----: [settings] Read ChildInfo {rloc:0x2c12, extaddr:6ee5df952b45aab1, timeout:240, mode:0x0d, version:2}
[12:20:07.457,885] net_otPlat_settings: Read in: 100ms
[12:20:07.559,722] net_otPlat_settings: Check in: 101ms
[12:20:07.647,094] instruction_listener: Response is missing payload. Code: 68
[12:20:07.656,158] net_otPlat_settings: Saved in: 97ms
[12:20:07.656,616] net_openthread: [INFO]-CORE----: [settings] Added ChildInfo {rloc:0x2c1b, extaddr:7a7f71b789ef80b2, timeout:300, mode:0x04, version:3}
[12:20:07.658,416] net_openthread: [INFO]-MLE-----: Send Child ID Response (fe80:0:0:0:787f:71b7:89ef:80b2,0x2c1b)
[12:20:07.658,996] net_openthread: [INFO]-MAC-----: Rx data poll, src:0x2c1b, qed_msgs:0, rss:-67, ack-fp:0
[12:20:07.659,515] net_openthread: [INFO]-MAC-----: Rx data poll, src:0x2c1b, qed_msgs:0, rss:-67, ack-fp:0
[12:20:07.660,034] net_openthread: [INFO]-MAC-----: Rx data poll, src:0x2c1b, qed_msgs:0, rss:-67, ack-fp:0
[12:20:07.660,552] net_openthread: [INFO]-MAC-----: Rx data poll, src:0x2c1b, qed_msgs:0, rss:-66, ack-fp:0
[12:20:07.661,071] net_openthread: [INFO]-MAC-----: Rx data poll, src:0x2c1b, qed_msgs:0, rss:-67, ack-fp:0
[12:20:07.661,590] net_openthread: [INFO]-MAC-----: Rx data poll, src:0x2c1b, qed_msgs:0, rss:-67, ack-fp:0
[12:20:07.662,170] net_openthread: [INFO]-CORE----: Notifier: StateChanged (0x00000400) [Child+]
[12:20:07.662,780] net_openthread: [INFO]-MAC-----: Prepping indir tx IPv6 UDP msg, len:185, chksum:e135, to:7a7f71b789ef80b2, sec:no, prio:net
[12:20:07.668,151] net_openthread: [INFO]-MAC-----: Rx data poll, src:0x2c1b, qed_msgs:1, rss:-67, ack-fp:1
[12:20:07.675,384] net_openthread: [INFO]-MAC-----: Frame tx attempt 1/2 failed, error:NoAck, len:126, seqnum:181, type:Data, src:3a7236ba41331d09, dst:7a7f71b789ef80b2, sec:yes, ackre
[12:20:07.688,781] net_openthread: [INFO]-MAC-----: Rx data poll, src:0x2c1b, qed_msgs:1, rss:-67, ack-fp:1
[12:20:07.695,220] net_openthread: [INFO]-MAC-----: Sent IPv6 UDP msg, len:185, chksum:e135, to:7a7f71b789ef80b2, sec:yes, prio:net
[12:20:07.695,495] net_openthread: [INFO]-MAC-----: src:[fe80:0:0:0:3872:36ba:4133:1d09]:19788
[12:20:07.695,770] net_openthread: [INFO]-MAC-----: dst:[fe80:0:0:0:787f:71b7:89ef:80b2]:19788
```
Our current workaround is to periodically iterate over the settings entries using otPlatSettingsGet() and check how long it takes to read all ChildInfo entries. If it takes more than, let's say, 700 ms, we wipe the flash and restart the device.
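A minimal sketch of that workaround (Zephyr v2.7 headers); the 700 ms threshold, buffer size, and function name are ours, and the real FOTA'd patch may differ:

```c
#include <zephyr.h>
#include <power/reboot.h>
#include <openthread/platform/settings.h>

#define CHILD_INFO_READ_BUDGET_MS 700

static void check_settings_health(otInstance *instance)
{
    uint8_t buf[32]; /* assumed big enough for one serialized ChildInfo */
    int64_t start = k_uptime_get();

    /* Read every stored ChildInfo, exactly as the attach path would. */
    for (int index = 0;; index++) {
        uint16_t len = sizeof(buf);

        if (otPlatSettingsGet(instance, OT_SETTINGS_KEY_CHILD_INFO,
                              index, buf, &len) != OT_ERROR_NONE) {
            break;
        }
    }

    if (k_uptime_get() - start > CHILD_INFO_READ_BUDGET_MS) {
        /* Storage has degraded: wipe the OT settings and reboot. */
        otPlatSettingsWipe(instance);
        sys_reboot(SYS_REBOOT_COLD);
    }
}
```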
We've flashed this "patch" by FOTA to the other two routers on the test network, and the one with more SED devices next to it (about 15) was already taking 800 ms to read all its ChildInfo.
Should we change this issue's label from "enhancement" to "bug"? I appreciate any help! Thanks!
Hi @rodrigobrochado, regardless of the slow settings subsystem needing to be fixed, there could be ways to mitigate the problem within the OpenThread code: https://github.com/openthread/openthread/pull/7497
Awesome @edmont!!!!!!! I was afraid the position of SetChildStateToValid() could not be changed. That will probably end my nightmare and be a totally scalable solution (in the number of children attached to a router)! It should also reduce the number of data polls a SED needs to send, as was required before (the SED would keep sending one every 100 ms while the router was accessing flash). I will try to cherry-pick this soon and introduce it into our lab test. Thank you!
About the degradation of settings performance, we should still investigate it. I will try to bring something today in a more reproducible way than waiting for days in a lab test.
@rodrigobrochado, could it be that OT adds items to storage that are never deleted? If that is the case, many garbage collections will be triggered, resulting in slow behavior and finally no more space. There is no way for settings to know that certain items are no longer valid; this is the responsibility of the user.
Have you found any solution to this problem? Could doing NVS operations in a different thread decouple the delay?
@brian-level, this problem should be greatly reduced by enabling the NVS cache. This can be done using the config option CONFIG_NVS_LOOKUP_CACHE.
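For reference, enabling it in the same flag style used earlier in this thread (the cache size value is only an illustration and, if I recall correctly, must be a power of two; tune CONFIG_NVS_LOOKUP_CACHE_SIZE to your settings usage):

```
west build --board ubx_evkninab3_nrf52840 -- -DCONFIG_OPENTHREAD_FTD=y -DCONFIG_SETTINGS_NVS=y -DCONFIG_NVS_LOOKUP_CACHE=y -DCONFIG_NVS_LOOKUP_CACHE_SIZE=512
```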
Thank you! That indeed fixes issues I was seeing.
I tried the cache with LFS, but as it was not recommended for OT, I changed to NVS. I don't remember whether I tested the NVS cache, though. We've made a workaround to wipe the flash when adding a child to the list takes too long, but we will be stuck on Zephyr v2.7.0 (NCS 1.8.0), and probably on this workaround, for a while.
@brian-level, can I ask which Zephyr version you are using and how many children, total and per router, you are handling? This can better guide our future tests with the NVS lookup cache.
I am using NCS 2.1.0. I am trying to scale to very large networks (>200 nodes) but haven't built that many yet. I am also thinking OT should not have to use non-volatile storage at all in FTD mode (always powered), with each boot set up by the application, so I am looking into creating a RAM-only backend for the OT settings.
Thanks for sharing. Have you seen CONFIG_OPENTHREAD_SETTINGS_RAM? It was recommended only for testing when I saw it; not sure if that has changed since. Some more here and here.
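For completeness, a hedged example in the same flag style, assuming the option still exists and keeping the testing-only caveat above in mind:

```
west build --board ubx_evkninab3_nrf52840 -- -DCONFIG_OPENTHREAD_FTD=y -DCONFIG_OPENTHREAD_SETTINGS_RAM=y
```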
I asked @rlubos about progress on this issue in Discord and he linked to merged PR https://github.com/zephyrproject-rtos/zephyr/pull/46839. Can anyone confirm these PR changes fix or improve this issue?
Hi @rlubos,
This issue, marked as an Enhancement, was opened a while ago and did not get any traction. It was just assigned to you based on the labels. If you don't consider yourself the right person to address this issue, please re-assign it to the right person.
Please take a moment to review if the issue is still relevant to the project. If it is, please provide feedback and direction on how to move forward. If it is not, has already been addressed, is a duplicate, or is no longer relevant, please close it with a short comment explaining the reason.
@rodrigobrochado you are also encouraged to help move this issue forward by providing additional information and confirming this request/issue is still relevant to you.
Thanks!
Closing due to inactivity; please reopen if the issue still persists with the latest Zephyr.