mynewt-nimble ll/rand: restructure rand initialization

This one took me some time to track down :-) Under certain situations, NimBLE on RIOT hard-faulted on nRF boards. This happend maybe every 5th or so time a board was rebootet, mainly showing for RIOTs nimble_scanner example application. It turned out, that the late initialization of the pseudo random number generator as introduced in #883 has caused this:

the first ever call to jrand48() causes a memory allocation (malloc) inside the libc code
as ble_ll_rand() is partly called from interrupt context, this could trigger the late initialization in the interrupt context, leading to a malloc to be executed in interrupt context
end of last year, a thread-safe malloc wrapper was introduced to RIOT, which does not allow for malloc calls from interrupt context, triggering a failed assertion (see https://github.com/RIOT-OS/RIOT/pull/15606)
this leads to kind of a race condition for the nimble_scanner example: if the first call to ble_ll_rand() is coming from thread context, everything is fine, if its by chance coming from interrupt context, the node hard-faults due to a failed assertion.

In this PR I propose a pretty simple fix: why don't we simplify the ble_ll_rand() initialization per se and merge all three initialization steps into the ble_ll_rand_init() function?! This saves a little RAM (no static init var), makes the code more concise and prevents the error described above. I don't really see a drawback, as the split between ble_ll_rand_init() and ble_ll_rand_start() seems rather arbitrary to me, both are called exactly once and right after each other from the controllers init code anyway...

Apr 14 '21 16:04 haukepetersen

Forgot to reference - here is the path towards the bug in RIOT: https://github.com/RIOT-OS/RIOT/pull/16317

Apr 14 '21 16:04 haukepetersen

The commit you mentioned has this in commit message "Use jrand48 instead and provide wrapper for late initialization. This is to avoid seeding on init beacouse trng might not be accessible yet." Wouldn't this commit re-introduce this behaviour?

Apr 15 '21 07:04 sjanc

Yes and no. Seeding the jrand48 during init of course takes away the (non-deterministic) delay between TRNG init and jrand48 seeding. But if I understand ble_ll_rand_data_get() correctly, it should only return after the needed number of random bytes was actually created, right?

Or do I miss something about the trng might not be accessible yet? Does this mean, that the tnrg_read() loop in ble_ll_rand_get_data() gets stuck? Or might that loop not return any randomness? If so a solution might be to do the late seeding of jrand only for the case that TRNG is used, and stick to the seeding on init if that is not the case. In any case, I would prefer a deterministic behavior of the rand module, so its not dependent on who and when the initial calls to its functions are made...

But I just discovered that I messed up the scope in the init function, right now jrand is never seeded in case the TRNG is used... Will fix.

Apr 15 '21 07:04 haukepetersen

Style check summary

No suggestions at this time!

Apr 15 '21 07:04 apache-mynewt-bot

@andrzej-kaczmarek could you check on how this would affect dialog hw?

Apr 19 '21 10:04 sjanc

@andrzej-kaczmarek any chance you got a minute to look at this?

May 12 '21 06:05 haukepetersen

@haukepetersen this breaks dialog as it basically reverts to old behavior, but I wonder if you can simply call jrand48() on init and discard that value? It would be called in task context so malloc() should work, then you subsequent calls won't malloc() anymore. Does that make sense?

May 19 '21 09:05 andrzej-kaczmarek

we could also use jrand48_r in bt_ll_rand() which shouldn't require any allocations

May 19 '21 11:05 sjanc

@andrzej-kaczmarek I was expecting this, though I was still hoping we could simplify this part of the code a little bit by streamlining it... So do I get it correctly, that the dialog port breaks because the hardware RNG is not ready at the time of controller initialization? And do I understand correctly that the dialog hw using the MYNEWT_VAL(TRNG) code path?

@sjanc at least for the RIOT build, jrand48_r does not seem to be included in the used stdlib. At least simply using it instead of jrand48 does not compile. But my stdlib knowledge might not be the best, so maybe its something simple that I missed...

May 21 '21 08:05 haukepetersen

@haukepetersen we don't have it in mynewt baselibc either (although for us implementation would be trivial wrapper around jrand48)

May 21 '21 13:05 sjanc

ok, then its not just RIOT :-). Anyway, I still think a clean initialization during startup would be best, also considering timing effects of the current late initialization, who knows where that might have an effect in the future...

Do you think there is any way we go for the clean initialization and only apply the late-init variant specifically for dialog targets?

but I wonder if you can simply call jrand48() on init and discard that value?

This would do the trick - and this is what I implemented as a work around in the NimBLE setup code in RIOT. But from the knowledge we/I know gained, I wonder if calling ble_ll_rand() from e.g. ble_ll_rand_start() while discarding the returned value as generic fix would still break on dialog targets? As in the end this does nothing else then the code proposed in this PR, right?!

May 21 '21 14:05 haukepetersen

so how this ended up? can we close this or something is stil required on nimble side?

Apr 13 '22 10:04 sjanc

mynewt-nimble mynewt-nimble copied to clipboard

ll/rand: restructure rand initialization

Style check summary

No suggestions at this time!

mynewt-nimble
mynewt-nimble copied to clipboard