Extended Advertising: TX freezes after ~65535 seconds
We are debugging some strange behavior we encountered while experimenting with extended advertisements: we have 14 nodes periodically transmitting data to a central sink node via extended advertisements. The way we transmit the data is to write it into an unused advertising instance (we configure a total of 10 instances) and then advertise it exactly 3 times.
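For reference, the TX path looks roughly like the sketch below (simplified, not our exact code; inst, data, and len are placeholders and error paths are trimmed). The "exactly 3 times" part comes from the max_events argument of ble_gap_ext_adv_start():

#include "os/os_mbuf.h"
#include "host/ble_gap.h"

/* sketch: queue len bytes on advertising instance inst and
 * advertise them exactly 3 times */
static int
adv_send_3x(uint8_t inst, const uint8_t *data, uint16_t len)
{
    struct os_mbuf *buf = os_msys_get_pkthdr(len, 0);
    if (buf == NULL) {
        return -1;
    }
    if (os_mbuf_append(buf, data, len) != 0) {
        os_mbuf_free_chain(buf);
        return -1;
    }
    /* mbuf ownership passes to the host stack */
    if (ble_gap_ext_adv_set_data(inst, buf) != 0) {
        return -1;
    }
    /* duration = 0 (no time limit), max_events = 3: the controller
     * stops after three advertising events and the host raises
     * BLE_GAP_EVENT_ADV_COMPLETE */
    return ble_gap_ext_adv_start(inst, 0, 3);
}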
Now we wanted to run a 24h experiment, and everything went fine for the first ~18.1 hours. But after ~65500 seconds, all 14 of our nodes stopped sending out data at the same time. What we can observe is that NimBLE stops triggering BLE_GAP_EVENT_ADV_COMPLETE events, and the counters we put into the PHY code show that the nodes are no longer transmitting any packets.
I don't have a lead yet on what could cause this behavior, but here are some thoughts we have collected so far:
- as all 14 nodes show exactly the same behavior at the same time (within the same second), my guess is that this is a systematic software bug somewhere in NimBLE
- the time between reboot (all nodes are rebooted at the beginning of each experiment) and the faulty behavior is awfully close to UINT16_MAX seconds, which intuitively points to some kind of (timer?) overflow issue
- as the nodes stop sending out any packets on the radio, could this be related to the controller's scheduler?
Does anyone have experience running extended advertisements for longer periods of time? Or does anyone have an idea what could be causing this? With a paper deadline approaching, it would be great to have our setup stable ASAP :-)
Build/configuration context:
- NimBLE @ RIOT
- NimBLE version b9c20addd87f1692ea27600ede82f641c77315be
- Using the following configuration for extended advertisements:
MYNEWT_VAL_MSYS_1_BLOCK_SIZE=294
MYNEWT_VAL_MSYS_1_BLOCK_COUNT=32
MYNEWT_VAL_BLE_MULTI_ADV_INSTANCES=9
MYNEWT_VAL_BLE_EXT_ADV=1
MYNEWT_VAL_BLE_EXT_ADV_MAX_SIZE=1650
MYNEWT_VAL_BLE_HCI_EVT_HI_BUF_COUNT=12
MYNEWT_VAL_BLE_HCI_EVT_BUF_SIZE=257
MYNEWT_VAL_BLE_LL_EXT_ADV_AUX_PTR_CNT=10
MYNEWT_VAL_BLE_LL_CFG_FEAT_LL_EXT_ADV=1
Can you share the instance configuration?
of course:
_adv_params.scannable = 0;
_adv_params.own_addr_type = nimble_riot_own_addr_type;
_adv_params.channel_map = BLE_GAP_ADV_DFLT_CHANNEL_MAP;
_adv_params.filter_policy = 0;
_adv_params.itvl_min = _config.advertiser_itvl_min;
_adv_params.itvl_max = _config.advertiser_itvl_max;
_adv_params.tx_power = 127;
_adv_params.primary_phy = BLE_HCI_LE_PHY_1M;
_adv_params.secondary_phy = BLE_HCI_LE_PHY_1M;
_adv_params.directed = 1;
_adv_params.peer.type = nimble_riot_own_addr_type;
// and then setting the peer address, of course ...
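(For completeness, the struct is then handed to ble_gap_ext_adv_configure(), roughly like this; _on_gap_evt is a placeholder for our GAP event callback:)

/* sketch: apply the params above to the instance */
int rc = ble_gap_ext_adv_configure(instance, &_adv_params,
                                   NULL,         /* selected_tx_power (out) */
                                   _on_gap_evt,  /* gets BLE_GAP_EVENT_ADV_COMPLETE */
                                   NULL);        /* cb_arg */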
As said, for the first ~18 hours this configuration works flawlessly; it's only then that something breaks :-)
Quick update: yesterday and tonight my 24h experiment was running for a second time. From the logfiles I can see that exactly 65565 seconds after the nodes were rebooted, all 15 nodes in use stopped transmitting within the same second. At the beginning of the experiment, the first minute or so is spent configuring the nodes using RIOT shell commands. Somewhere in there we also configure the NimBLE scan and advertising parameters, so my guess (not confirmed yet) is that some uint16-based counter is started in that process, and this counter then overflows after 65535 seconds, rendering all nodes mute at the same time.
So the observations stated above can be reproduced...
So this happens at the time os_cputime wraps from 0x7fffffff to 0x80000000: with cputime ticking at 32768 Hz, 0x80000000 ticks is 2^31 / 32768 = 65536 seconds, i.e. ~18.2 h, which matches the observed failure time. This most likely means that we have some place in the code that does not compare os_cputime values properly, i.e. via the macros that cast the difference to int32_t. However, I cannot really see such a place, but will keep digging...
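To illustrate the bug class with a generic sketch (this is not the actual NimBLE code location): comparing two tick values after casting each to int32_t misorders them exactly when the counter crosses from 0x7fffffff to 0x80000000, while casting the difference to int32_t stays correct across that boundary:

#include <stdint.h>
#include <stdio.h>

/* wrap-safe ordering for a free-running 32-bit tick counter:
 * interpret the *difference* as signed */
#define TICKS_LT(a, b)   ((int32_t)((a) - (b)) < 0)

int main(void)
{
    uint32_t before = 0x7fffff00;   /* just before the half-way point */
    uint32_t after  = 0x80000100;   /* just after it */

    /* broken: 'after' turns into a large negative number */
    printf("%d\n", (int32_t)before < (int32_t)after);  /* prints 0 */

    /* correct across the boundary */
    printf("%d\n", TICKS_LT(before, after));           /* prints 1 */
    return 0;
}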
Turns out it was much faster to just grep for > instead of looking through the code ;-)
https://github.com/apache/mynewt-nimble/pull/1114
btw, for testing you can force bsptmr_cntr->tmr_rtc = 0x7f000000; in hal_timer.c. It will then take less than 9 mins for os_cputime to reach 0x80000000.
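A quick sanity check on that shortcut, assuming the 32768 Hz RTC rate of cputime:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* ticks left from the forced start value to the half-way point */
    uint32_t ticks = 0x80000000u - 0x7f000000u;   /* 0x01000000 */
    printf("%u ticks -> %u s (~%.1f min)\n",
           ticks, ticks / 32768u, ticks / 32768.0 / 60.0);
    /* prints: 16777216 ticks -> 512 s (~8.5 min) */
    return 0;
}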
Unfortunately #1114 did not fix this issue: I ran another 24h experiment on Saturday using the latest NimBLE master, and it showed the same behavior as described above; after ~18h, all nodes stopped transmitting at the same time.
Is it possible that some os_cputime comparisons were still missed somewhere?
@haukepetersen is the app used for testing available somewhere so I can test/debug on RIOT?
Yes, I published everything as part of the artifact evaluation of our paper: https://github.com/ilabrg/artifacts-ccr-ipbleadv
The code for the app is under ea/app_jelling/. For our initial experiments, we were using the RIOT and NimBLE branches as referenced via git submodules in that repository. But for my latest tests I have been using the following work branches:
- RIOT: https://github.com/haukepetersen/RIOT/tree/ipbleadv
- NimBLE: https://github.com/haukepetersen/mynewt-nimble/tree/ipbleadv_updt (updated to master, pulling in #1114)