
lwip, submac: deadlock on user transmission

Open mguetschow opened this issue 1 month ago • 5 comments

Description

Sending any (user) packet over the IEEE 802.15.4 SubMAC using the lwIP stack will deadlock. I traced the issue down to a race condition between the main thread, which requests the BH before setting the FSM state to PREPARE, and the lwip_netdev_mux thread, which will happily try to handle the BH while the FSM state is still RX.

This does not happen with GNRC as all submac interaction happens on a separate thread there.
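
To make the ordering problem concrete, here is a minimal toy model in plain C. It is not RIOT code: every name in it (`toy_submac_t`, `toy_handle_bh`, `toy_send_bh_first`, `toy_send_state_first`) is hypothetical, and the preemption by the handler thread is modeled as a direct function call right after the BH is requested. The only behavior taken from this report is that a BH handled while the state is still RX ends in INVALID.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy states, loosely named after the ones in the log. */
typedef enum { FSM_RX, FSM_PREPARE, FSM_TX, FSM_INVALID } fsm_state_t;

typedef struct {
    fsm_state_t state;
    bool bh_pending;
} toy_submac_t;

/* What the handler thread does when it gets to run: consume a pending BH.
 * In this toy model a BH is only valid in PREPARE (assumption). */
static fsm_state_t toy_handle_bh(toy_submac_t *mac)
{
    assert(mac != NULL);
    if (mac->bh_pending) {
        mac->bh_pending = false;
        mac->state = (mac->state == FSM_PREPARE) ? FSM_TX : FSM_INVALID;
    }
    return mac->state;
}

/* Order described in the issue: the BH is requested first, the handler
 * runs immediately (simulated preemption) and still sees RX. */
static fsm_state_t toy_send_bh_first(toy_submac_t *mac)
{
    mac->bh_pending = true;
    fsm_state_t seen = toy_handle_bh(mac); /* handler preempts here */
    /* The sender would set PREPARE only now, which is too late. */
    return seen;
}

/* Alternative order: publish PREPARE before the BH can be handled. */
static fsm_state_t toy_send_state_first(toy_submac_t *mac)
{
    mac->state = FSM_PREPARE;
    mac->bh_pending = true;
    return toy_handle_bh(mac);
}
```

Starting from `FSM_RX`, `toy_send_bh_first()` yields `FSM_INVALID` while `toy_send_state_first()` yields `FSM_TX`. Whether reordering alone is a sufficient fix for the real SubMAC is not something this sketch can show.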

Steps to reproduce the issue

Print out thread names on debug prints with

diff --git a/core/lib/include/debug.h b/core/lib/include/debug.h
index 620de78267..1d4bdef600 100644
--- a/core/lib/include/debug.h
+++ b/core/lib/include/debug.h
@@ -121,7 +121,7 @@ extern "C" {
  * @details If a variable is only accessed by `DEBUG()`, the compiler will
  *          warn about unused variables when `ENABLE_DEBUG` is set to `0`.
  */
-#define DEBUG(...) do { if (ENABLE_DEBUG) { DEBUG_PRINT(__VA_ARGS__); } } while (0)
+#define DEBUG(...) do { if (ENABLE_DEBUG) { puts(thread_get_name(thread_get_active())); DEBUG_PRINT(__VA_ARGS__); } } while (0)
 
 /**
  * @def DEBUG_PUTS
@@ -129,7 +129,7 @@ extern "C" {
  * @brief Print debug information to stdout using puts(), so no stack size
  *        restrictions do apply.
  */
-#define DEBUG_PUTS(str) do { if (ENABLE_DEBUG) { puts(str); } } while (0)
+#define DEBUG_PUTS(str) do { if (ENABLE_DEBUG) { puts(thread_get_name(thread_get_active())); puts(str); } } while (0)
 /** @} */
 
 /**

Enable debug prints for cpu/nrf52/radio/nrf802154/nrf802154_radio.c, drivers/netdev_ieee802154_submac/netdev_ieee802154_submac.c and pkg/lwip/contrib/netdev/lwip_netdev.c.

Run LWIP_IPV6=1 make -C examples/networking/coap/gcoap_dtls BOARD=nrf52840dk flash term -j

Expected results

No race condition. SubMAC handling should probably happen on a single thread, I guess?

Actual results

coap get coap://[fe80::1]/st
2025-11-04 14:23:33,375 # coap get coap://[fe80::1]/
2025-11-04 14:23:33,379 # gcoap_cli: sending msg ID 64789, 6 bytes
2025-11-04 14:23:33,380 # main
2025-11-04 14:23:33,387 # IEEE802154 submac: ieee802154_submac_process_ev(): IEEE802154_FSM_STATE_RX + REQUEST_TX
2025-11-04 14:23:33,388 # main
2025-11-04 14:23:33,391 # [nrf802154] Device state: DISABLED
2025-11-04 14:23:33,391 # main
2025-11-04 14:23:33,394 # [nrf802154] Send a packet
2025-11-04 14:23:33,394 # main
2025-11-04 14:23:33,399 # [nrf802154] send: putting 64 bytes into the frame buffer
2025-11-04 14:23:33,399 # main
2025-11-04 14:23:33,406 # IEEE802154 submac: ieee802154_submac_bh_request(): post NETDEV_EVENT_ISR
2025-11-04 14:23:33,406 # main
2025-11-04 14:23:33,409 # [lwip_netdev] NETDEV_EVENT_ISR
2025-11-04 14:23:33,410 # lwip_netdev_mux
2025-11-04 14:23:33,413 # [lwip_netdev] handle netdev isr
2025-11-04 14:23:33,414 # lwip_netdev_mux
2025-11-04 14:23:33,419 # IEEE802154 submac: _isr(): NETDEV_SUBMAC_FLAGS_BH_REQUEST
2025-11-04 14:23:33,421 # lwip_netdev_mux
2025-11-04 14:23:33,428 # IEEE802154 submac: ieee802154_submac_process_ev(): IEEE802154_FSM_STATE_RX + BH
2025-11-04 14:23:33,429 # lwip_netdev_mux
2025-11-04 14:23:33,431 # RX--(BH)->INVALID
2025-11-04 14:23:38,382 # gcoap: timeout for msg ID 64789

This is followed by a deadlock, because the main thread waits for TX_DONE.

Versions

Current master.

Nov 04 '25 13:11 mguetschow

> which requests BH before setting the fsm state to PREPARE

Could you check whether https://github.com/RIOT-OS/RIOT/pull/21578 fixes the issue?

Otherwise, the SubMAC is not thread-safe and should run with some locking mechanism (or at least it has to be ensured that its functions are called in the right order and never concurrently).
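
As a rough illustration of the locking idea, the sketch below guards a toy state machine with a mutex so that the handler cannot observe the half-finished RX to PREPARE step. It uses POSIX `pthread_mutex_t` only so that it runs on a desktop; on RIOT one would presumably reach for `mutex_t` from `mutex.h` instead. All `lk_*` names are hypothetical, and spawning the handler thread inside `lk_send()` is a simplification to keep the example self-contained (in the issue the handler thread already exists).

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy states and device, not the RIOT SubMAC types. */
typedef enum { LK_RX, LK_PREPARE, LK_TX, LK_INVALID } lk_state_t;

typedef struct {
    pthread_mutex_t lock;
    lk_state_t state;
    bool bh_pending;
} lk_submac_t;

static void lk_init(lk_submac_t *mac)
{
    pthread_mutex_init(&mac->lock, NULL);
    mac->state = LK_RX;
    mac->bh_pending = false;
}

/* Handler thread: the BH is only processed while holding the lock. */
static void *lk_bh_thread(void *arg)
{
    lk_submac_t *mac = arg;
    pthread_mutex_lock(&mac->lock);
    if (mac->bh_pending) {
        mac->bh_pending = false;
        mac->state = (mac->state == LK_PREPARE) ? LK_TX : LK_INVALID;
    }
    pthread_mutex_unlock(&mac->lock);
    return NULL;
}

/* Sender: request the BH first (as in the issue), but keep the lock
 * until PREPARE is set, so the handler has to wait. */
static lk_state_t lk_send(lk_submac_t *mac)
{
    pthread_t handler;
    pthread_mutex_lock(&mac->lock);
    mac->bh_pending = true;
    int rc = pthread_create(&handler, NULL, lk_bh_thread, mac);
    assert(rc == 0);
    (void)rc;
    mac->state = LK_PREPARE;          /* handler is still blocked here */
    pthread_mutex_unlock(&mac->lock);
    pthread_join(handler, NULL);      /* wait until the BH was handled */
    return mac->state;
}
```

With the lock held across both steps, `lk_send()` on a freshly initialized device ends in `LK_TX` even though the BH is requested before the state change. The trade-off is that every entry point into the state machine has to take the same lock.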

Nov 04 '25 13:11 jia200x

Thanks for your quick reaction!

> Could you check whether #21578 fixes the issue?

Looks like it very well could. Unfortunately it's not as straightforward to apply since it needs a rebase. Will try to look into this one of these days, unless @Stopkaa feels like rebasing :)

Nov 04 '25 13:11 mguetschow

Maybe we could also summon @fabian18 who recently fixed some bug on the submac layer :)

Nov 04 '25 14:11 mguetschow

I have rebased, but need to test it tomorrow

Nov 04 '25 17:11 Stopkaa

This is probably a duplicate of https://github.com/RIOT-OS/RIOT/issues/17208

Nov 10 '25 08:11 mguetschow