mynewt-nimble
mynewt-nimble copied to clipboard
ll/rand: restructure rand initialization
This one took me some time to track down :-) Under certain situations, NimBLE on RIOT hard-faulted on nRF boards. This happend maybe every 5th or so time a board was rebootet, mainly showing for RIOTs nimble_scanner
example application. It turned out, that the late initialization of the pseudo random number generator as introduced in #883 has caused this:
- the first ever call to
jrand48()
causes a memory allocation (malloc
) inside the libc code - as
ble_ll_rand()
is partly called from interrupt context, this could trigger the late initialization in the interrupt context, leading to amalloc
to be executed in interrupt context - end of last year, a thread-safe malloc wrapper was introduced to RIOT, which does not allow for
malloc
calls from interrupt context, triggering a failed assertion (see https://github.com/RIOT-OS/RIOT/pull/15606) - this leads to kind of a race condition for the
nimble_scanner
example: if the first call toble_ll_rand()
is coming from thread context, everything is fine, if its by chance coming from interrupt context, the node hard-faults due to a failed assertion.
In this PR I propose a pretty simple fix: why don't we simplify the ble_ll_rand()
initialization per se and merge all three initialization steps into the ble_ll_rand_init()
function?! This saves a little RAM (no static init
var), makes the code more concise and prevents the error described above. I don't really see a drawback, as the split between ble_ll_rand_init()
and ble_ll_rand_start()
seems rather arbitrary to me, both are called exactly once and right after each other from the controllers init code anyway...
Forgot to reference - here is the path towards the bug in RIOT: https://github.com/RIOT-OS/RIOT/pull/16317
The commit you mentioned has this in commit message "Use jrand48 instead and provide wrapper for late initialization. This is to avoid seeding on init beacouse trng might not be accessible yet." Wouldn't this commit re-introduce this behaviour?
Yes and no. Seeding the jrand48 during init of course takes away the (non-deterministic) delay between TRNG init and jrand48 seeding. But if I understand ble_ll_rand_data_get()
correctly, it should only return after the needed number of random bytes was actually created, right?
Or do I miss something about the trng might not be accessible yet
? Does this mean, that the tnrg_read()
loop in ble_ll_rand_get_data()
gets stuck? Or might that loop not return any randomness? If so a solution might be to do the late seeding of jrand only for the case that TRNG is used, and stick to the seeding on init if that is not the case. In any case, I would prefer a deterministic behavior of the rand
module, so its not dependent on who and when the initial calls to its functions are made...
But I just discovered that I messed up the scope in the init function, right now jrand is never seeded in case the TRNG is used... Will fix.
Style check summary
No suggestions at this time!
@andrzej-kaczmarek could you check on how this would affect dialog hw?
@andrzej-kaczmarek any chance you got a minute to look at this?
@haukepetersen this breaks dialog as it basically reverts to old behavior, but I wonder if you can simply call jrand48() on init and discard that value? It would be called in task context so malloc() should work, then you subsequent calls won't malloc() anymore. Does that make sense?
we could also use jrand48_r in bt_ll_rand() which shouldn't require any allocations
@andrzej-kaczmarek I was expecting this, though I was still hoping we could simplify this part of the code a little bit by streamlining it... So do I get it correctly, that the dialog port breaks because the hardware RNG is not ready at the time of controller initialization? And do I understand correctly that the dialog hw using the MYNEWT_VAL(TRNG)
code path?
@sjanc at least for the RIOT build, jrand48_r
does not seem to be included in the used stdlib. At least simply using it instead of jrand48
does not compile. But my stdlib knowledge might not be the best, so maybe its something simple that I missed...
@haukepetersen we don't have it in mynewt baselibc either (although for us implementation would be trivial wrapper around jrand48)
ok, then its not just RIOT :-). Anyway, I still think a clean initialization during startup would be best, also considering timing effects of the current late initialization, who knows where that might have an effect in the future...
Do you think there is any way we go for the clean initialization and only apply the late-init variant specifically for dialog targets?
but I wonder if you can simply call jrand48() on init and discard that value?
This would do the trick - and this is what I implemented as a work around in the NimBLE setup code in RIOT. But from the knowledge we/I know gained, I wonder if calling ble_ll_rand()
from e.g. ble_ll_rand_start()
while discarding the returned value as generic fix would still break on dialog targets? As in the end this does nothing else then the code proposed in this PR, right?!
so how this ended up? can we close this or something is stil required on nimble side?