tinyusb
tinyusb copied to clipboard
Stack overflow with the new NCM driver
Operating System
Linux
Board
Custom ESP32 S3
Firmware
My own, but this is not relevant - will explain below
What happened ?
The new USB NCM driver - introduced in #2227 - is causing a stack overflow.
It causes the problem like this:
tud_network_recv_renew will call into recv_transfer_datagram_to_glue_logic. recv_transfer_datagram_to_glue_logic will in turn call tud_network_recv_cb if there's data - and when the cb called, the recv_glue_ntb is not cleared yet.
The problem is that some tud_network_recv_cb, like the one in esp_tinyusb, will unconditionally call tud_network_recv_renew. So the whole thing happens again and again, causing infinite recursion.
So, it's either Espressif will need to change their code ( @ESP-Coco ) and documentation added in TinyUSB to prohibit calling tud_network_recv_renew directly within tud_network_recv_cb, or the logic within recv_transfer_datagram_to_glue_logic need to be changed to prevent this infinite recursion.
@HiFiPhile @rgrr could I ask for your opinions on this?
How to reproduce ?
If you are not using ESP32, call tud_network_recv_renew directly within tud_network_recv_cb like the ESP code mentioned above.
If you are using ESP32 with ESP-IDF, you can try to replace the src/class/net folder of the tinyusb component to the latest master branch of tinyusb.
Debug Log as txt file (LOG/CFG_TUSB_DEBUG=2)
Screenshots
No response
I have checked existing issues, dicussion and documentation
- [X] I confirm I have checked existing issues, dicussion and documentation.
I think the current situation on reception handling is confusing and probably wrong. tud_network_recv_renew seems to be called twice in a single packet reception, the first time by tud_network_recv_renew_r which is in turn called by netd_xfer_cb, and another time by user logic when it finished processing the current packet. This also means recv_try_to_start_new_reception is called twice in a single reception.
OK, I understand that we may process multiple NCM "subpackets" in one callback, and that's why we need to repeatedly call tud_network_recv_renew.
So, I propose a solution to this issue to add re-entrance detection logic to tud_network_recv_renew function.
void tud_network_recv_renew(void) {
TU_LOG_DRV("tud_network_recv_renew()\n");
static bool processing = false;
static bool should_process = false;
should_process = true;
if (processing) {
TU_LOG_DRV("Re-entrant into tud_network_recv_renew, will process later\n");
return;
}
while (should_process) {
should_process = false;
processing = true;
// If the current function is called within recv_transfer_datagram_to_glue_logic,
// should_process will become true, and the loop will run again
// Otherwise the loop will not run again
recv_transfer_datagram_to_glue_logic();
processing = false;
}
recv_try_to_start_new_reception(ncm_interface.rhport);
} // tud_network_recv_renew
Hmmm... not getting the point. Do you think the problem is in the NCM driver or in the glue code?
Perhaps my own glue code could be of help?: https://github.com/rgrr/yapicoprobe/blob/master/src/net/net_glue.c
I think it's in the NCM driver, which cause compatibility issues with older codes that calls tud_network_recv_renew directly within tud_network_recv_cb. And this can be fixed by my function above.
ok... understood. Good point. Loop above seems to be too complicated for my eyes. Is something like this working:
void tud_network_recv_renew(void) {
TU_LOG_DRV("tud_network_recv_renew()\n");
static bool should_process = false; // should go into ncm_interface_t
if (should_process) {
TU_LOG_DRV("Re-entrant into tud_network_recv_renew, will process later\n");
return;
}
should_process = true;
while (should_process) {
should_process = false;
// If the current function is called within recv_transfer_datagram_to_glue_logic,
// should_process will become true, and the loop will run again
// Otherwise the loop will not run again
recv_transfer_datagram_to_glue_logic();
}
recv_try_to_start_new_reception(ncm_interface.rhport);
} // tud_network_recv_renew
Or will it end up again in an endless loop?
No, this will not work, because when you call recv_transfer_datagram_to_glue_logic(), should_process is false, so the if (should_process) block won't be executed. Keep in mind that the second call to tud_network_recv_renew is inside recv_transfer_datagram_to_glue_logic.
Hmmm... you are right (perhaps I need some time to rethink it ;-))
Perhaps a simple semaphore could do?:
void tud_network_recv_renew(void) {
static bool sema = false;
if ( !sema) {
sema = true;
TU_LOG_DRV("tud_network_recv_renew()\n");
recv_transfer_datagram_to_glue_logic();
sema = false;
}
recv_try_to_start_new_reception(ncm_interface.rhport);
} // tud_network_recv_renew
This will not work when there are 3 ntbs handled in one call to netd_xfer_cb.
I think either a loop or a recursion is needed in case tud_network_recv_renew is called directly within tud_network_recv_cb. One call to recv_transfer_datagram_to_glue_logic only process on ntb. As we need to process multiple ntbs, it's obvious that multiple calls to recv_transfer_datagram_to_glue_logic is needed. We don't want recursions (because that might cause stack overflow on MCUs), so this means we need loops.
you are right (again)... perhaps it could be solved by letting recv_transfer_datagram_to_glue_logic() return an indication that some work has been done (or that there is some work still waiting)?
Yes, but I think should_process in my code is this indication.
got that, but to be honest I'm not happy with two variables for this logic.
Could you please check PR #2713