lwip, submac: deadlock on user transmission
Description
Sending any (user) packet over the IEEE 802.15.4 SubMAC using the lwIP stack deadlocks. I traced the issue down to a race condition between the main thread, which requests the BH before setting the FSM state to PREPARE, and the lwip_netdev_mux thread, which will happily try to handle the BH while the FSM state is still RX.
This does not happen with GNRC as all submac interaction happens on a separate thread there.
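To make the ordering problem concrete, here is a deterministic toy model of it. This is only a sketch: the `toy_*` names, the state enum and the single `bh_pending` flag are made up for illustration and are not the RIOT SubMAC definitions. The "preemption" by the handler thread is simulated by calling the handler at the point where the context switch would happen.

```c
#include <stdbool.h>

/* Hypothetical, heavily simplified FSM states (not the RIOT definitions). */
typedef enum { TOY_RX, TOY_PREPARE, TOY_TX, TOY_INVALID } toy_state_t;

typedef struct {
    toy_state_t state;
    bool bh_pending;    /* BH requested but not yet handled */
} toy_submac_t;

/* What the handler thread does once it gets scheduled. */
static void toy_handle_bh(toy_submac_t *m)
{
    if (!m->bh_pending) {
        return;
    }
    m->bh_pending = false;
    /* In this toy model a BH is only valid while in PREPARE. */
    m->state = (m->state == TOY_PREPARE) ? TOY_TX : TOY_INVALID;
}

/* Ordering from the report: BH is requested while the state is still RX. */
bool toy_tx_started_bh_first(void)
{
    toy_submac_t m = { TOY_RX, false };
    m.bh_pending = true;    /* sender requests the BH ... */
    toy_handle_bh(&m);      /* ... and the handler thread runs right away */
    /* The sender would set PREPARE only now, which is too late. */
    return m.state == TOY_TX;
}

/* Alternative ordering: state is PREPARE before the BH becomes visible. */
bool toy_tx_started_state_first(void)
{
    toy_submac_t m = { TOY_RX, false };
    m.state = TOY_PREPARE;
    m.bh_pending = true;
    toy_handle_bh(&m);      /* handler now sees PREPARE + BH */
    return m.state == TOY_TX;
}
```

In this model `toy_tx_started_bh_first()` ends in the invalid state, matching the `RX--(BH)->INVALID` line in the log below, while `toy_tx_started_state_first()` reaches TX.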
Steps to reproduce the issue
Print out thread names on debug prints with
diff --git a/core/lib/include/debug.h b/core/lib/include/debug.h
index 620de78267..1d4bdef600 100644
--- a/core/lib/include/debug.h
+++ b/core/lib/include/debug.h
@@ -121,7 +121,7 @@ extern "C" {
* @details If a variable is only accessed by `DEBUG()`, the compiler will
* warn about unused variables when `ENABLE_DEBUG` is set to `0`.
*/
-#define DEBUG(...) do { if (ENABLE_DEBUG) { DEBUG_PRINT(__VA_ARGS__); } } while (0)
+#define DEBUG(...) do { if (ENABLE_DEBUG) { puts(thread_get_name(thread_get_active())); DEBUG_PRINT(__VA_ARGS__); } } while (0)
/**
* @def DEBUG_PUTS
@@ -129,7 +129,7 @@ extern "C" {
* @brief Print debug information to stdout using puts(), so no stack size
* restrictions do apply.
*/
-#define DEBUG_PUTS(str) do { if (ENABLE_DEBUG) { puts(str); } } while (0)
+#define DEBUG_PUTS(str) do { if (ENABLE_DEBUG) { puts(thread_get_name(thread_get_active())); puts(str); } } while (0)
/** @} */
/**
Enable debug prints for cpu/nrf52/radio/nrf802154/nrf802154_radio.c, drivers/netdev_ieee802154_submac/netdev_ieee802154_submac.c and pkg/lwip/contrib/netdev/lwip_netdev.c.
Run LWIP_IPV6=1 make -C examples/networking/coap/gcoap_dtls BOARD=nrf52840dk flash term -j
Expected results
No race condition, submac stuff should be handled on a single thread I guess?
Actual results
coap get coap://[fe80::1]/st
2025-11-04 14:23:33,375 # coap get coap://[fe80::1]/
2025-11-04 14:23:33,379 # gcoap_cli: sending msg ID 64789, 6 bytes
2025-11-04 14:23:33,380 # main
2025-11-04 14:23:33,387 # IEEE802154 submac: ieee802154_submac_process_ev(): IEEE802154_FSM_STATE_RX + REQUEST_TX
2025-11-04 14:23:33,388 # main
2025-11-04 14:23:33,391 # [nrf802154] Device state: DISABLED
2025-11-04 14:23:33,391 # main
2025-11-04 14:23:33,394 # [nrf802154] Send a packet
2025-11-04 14:23:33,394 # main
2025-11-04 14:23:33,399 # [nrf802154] send: putting 64 bytes into the frame buffer
2025-11-04 14:23:33,399 # main
2025-11-04 14:23:33,406 # IEEE802154 submac: ieee802154_submac_bh_request(): post NETDEV_EVENT_ISR
2025-11-04 14:23:33,406 # main
2025-11-04 14:23:33,409 # [lwip_netdev] NETDEV_EVENT_ISR
2025-11-04 14:23:33,410 # lwip_netdev_mux
2025-11-04 14:23:33,413 # [lwip_netdev] handle netdev isr
2025-11-04 14:23:33,414 # lwip_netdev_mux
2025-11-04 14:23:33,419 # IEEE802154 submac: _isr(): NETDEV_SUBMAC_FLAGS_BH_REQUEST
2025-11-04 14:23:33,421 # lwip_netdev_mux
2025-11-04 14:23:33,428 # IEEE802154 submac: ieee802154_submac_process_ev(): IEEE802154_FSM_STATE_RX + BH
2025-11-04 14:23:33,429 # lwip_netdev_mux
2025-11-04 14:23:33,431 # RX--(BH)->INVALID
2025-11-04 14:23:38,382 # gcoap: timeout for msg ID 64789
The result is a deadlock, because the main thread waits for TX_DONE.
Versions
Current master.
which requests BH before setting the fsm state to PREPARE
Could you check whether https://github.com/RIOT-OS/RIOT/pull/21578 fixes the issue?
Otherwise, the SubMAC is not thread-safe and should run with some locking mechanism (or at least it has to be ensured that the functions are called in the right order and not concurrently).
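As a rough illustration of the locking idea, assuming a POSIX `pthread_mutex_t` as a stand-in for whatever lock RIOT would actually use, and with all `lk_*` names being hypothetical: if both the TX request and the BH handler take the same lock, the handler can never observe the half-updated "BH requested, state still RX" situation.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical toy states, not the RIOT SubMAC FSM. */
typedef enum { LK_RX, LK_PREPARE, LK_TX, LK_INVALID } lk_state_t;

typedef struct {
    pthread_mutex_t lock;   /* stand-in for a RIOT mutex (assumption) */
    lk_state_t state;
    bool bh_pending;
} lk_submac_t;

void lk_init(lk_submac_t *m)
{
    pthread_mutex_init(&m->lock, NULL);
    m->state = LK_RX;
    m->bh_pending = false;
}

/* Called from the sending thread. */
void lk_request_tx(lk_submac_t *m)
{
    pthread_mutex_lock(&m->lock);
    /* Both updates become visible to the handler together, so their
     * order inside the critical section no longer matters. */
    m->bh_pending = true;
    m->state = LK_PREPARE;
    pthread_mutex_unlock(&m->lock);
}

/* Called from the handler thread. */
void lk_handle_bh(lk_submac_t *m)
{
    pthread_mutex_lock(&m->lock);
    if (m->bh_pending) {
        m->bh_pending = false;
        m->state = (m->state == LK_PREPARE) ? LK_TX : LK_INVALID;
    }
    pthread_mutex_unlock(&m->lock);
}
```

This only shows the shape of the approach. Whether a lock can be held across the real SubMAC transitions (for example while the radio is busy or from ISR context) is not something this sketch answers.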
Thanks for your quick reaction!
Could you check whether #21578 fixes the issue?
Looks like it very well could. Unfortunately it's not as straightforward to apply since it needs a rebase. Will try to look into this one of these days, unless @Stopkaa feels like rebasing :)
Maybe we could also summon @fabian18 who recently fixed some bug on the submac layer :)
I have rebased, but need to test it tomorrow
This is probably a duplicate of https://github.com/RIOT-OS/RIOT/issues/17208