tinyusb icon indicating copy to clipboard operation
tinyusb copied to clipboard

Stack overflow with the new NCM driver

Open showier-drastic opened this issue 1 year ago • 12 comments

Operating System

Linux

Board

Custom ESP32 S3

Firmware

My own, but this is not relevant - will explain below

What happened ?

The new USB NCM driver - introduced in #2227 - is causing a stack overflow.

It causes the problem like this:

tud_network_recv_renew will call into recv_transfer_datagram_to_glue_logic. recv_transfer_datagram_to_glue_logic will in turn call tud_network_recv_cb if there's data - and when the cb called, the recv_glue_ntb is not cleared yet.

The problem is that some tud_network_recv_cb, like the one in esp_tinyusb, will unconditionally call tud_network_recv_renew. So the whole thing happens again and again, causing infinite recursion.

So, it's either Espressif will need to change their code ( @ESP-Coco ) and documentation added in TinyUSB to prohibit calling tud_network_recv_renew directly within tud_network_recv_cb, or the logic within recv_transfer_datagram_to_glue_logic need to be changed to prevent this infinite recursion.

@HiFiPhile @rgrr could I ask for your opinions on this?

How to reproduce ?

If you are not using ESP32, call tud_network_recv_renew directly within tud_network_recv_cb like the ESP code mentioned above.

If you are using ESP32 with ESP-IDF, you can try to replace the src/class/net folder of the tinyusb component to the latest master branch of tinyusb.

Debug Log as txt file (LOG/CFG_TUSB_DEBUG=2)

1.txt

Screenshots

No response

I have checked existing issues, dicussion and documentation

  • [X] I confirm I have checked existing issues, dicussion and documentation.

showier-drastic avatar Jul 14 '24 13:07 showier-drastic

I think the current situation on reception handling is confusing and probably wrong. tud_network_recv_renew seems to be called twice in a single packet reception, the first time by tud_network_recv_renew_r which is in turn called by netd_xfer_cb, and another time by user logic when it finished processing the current packet. This also means recv_try_to_start_new_reception is called twice in a single reception.

showier-drastic avatar Jul 14 '24 13:07 showier-drastic

OK, I understand that we may process multiple NCM "subpackets" in one callback, and that's why we need to repeatedly call tud_network_recv_renew.

So, I propose a solution to this issue to add re-entrance detection logic to tud_network_recv_renew function.

void tud_network_recv_renew(void) {
  TU_LOG_DRV("tud_network_recv_renew()\n");

  static bool processing = false;
  static bool should_process = false;

  should_process = true;
  if (processing) {
    TU_LOG_DRV("Re-entrant into tud_network_recv_renew, will process later\n");
    return;
  }

  while (should_process) {
    should_process = false;
    processing = true;
    // If the current function is called within recv_transfer_datagram_to_glue_logic,
    // should_process will become true, and the loop will run again
    // Otherwise the loop will not run again
    recv_transfer_datagram_to_glue_logic();
    processing = false;
  }
  recv_try_to_start_new_reception(ncm_interface.rhport);
} // tud_network_recv_renew

showier-drastic avatar Jul 14 '24 14:07 showier-drastic

Hmmm... not getting the point. Do you think the problem is in the NCM driver or in the glue code?

Perhaps my own glue code could be of help?: https://github.com/rgrr/yapicoprobe/blob/master/src/net/net_glue.c

rgrr avatar Jul 14 '24 18:07 rgrr

I think it's in the NCM driver, which cause compatibility issues with older codes that calls tud_network_recv_renew directly within tud_network_recv_cb. And this can be fixed by my function above.

showier-drastic avatar Jul 15 '24 02:07 showier-drastic

ok... understood. Good point. Loop above seems to be too complicated for my eyes. Is something like this working:

void tud_network_recv_renew(void) {
  TU_LOG_DRV("tud_network_recv_renew()\n");

  static bool should_process = false;    // should go into ncm_interface_t

  if (should_process) {
    TU_LOG_DRV("Re-entrant into tud_network_recv_renew, will process later\n");
    return;
  }
  should_process = true;

  while (should_process) {
    should_process = false;
    // If the current function is called within recv_transfer_datagram_to_glue_logic,
    // should_process will become true, and the loop will run again
    // Otherwise the loop will not run again
    recv_transfer_datagram_to_glue_logic();
  }
  recv_try_to_start_new_reception(ncm_interface.rhport);
} // tud_network_recv_renew

Or will it end up again in an endless loop?

rgrr avatar Jul 15 '24 06:07 rgrr

No, this will not work, because when you call recv_transfer_datagram_to_glue_logic(), should_process is false, so the if (should_process) block won't be executed. Keep in mind that the second call to tud_network_recv_renew is inside recv_transfer_datagram_to_glue_logic.

showier-drastic avatar Jul 15 '24 07:07 showier-drastic

Hmmm... you are right (perhaps I need some time to rethink it ;-))

Perhaps a simple semaphore could do?:

void tud_network_recv_renew(void) {
  static bool sema = false;

  if ( !sema) {
    sema = true;

    TU_LOG_DRV("tud_network_recv_renew()\n");

    recv_transfer_datagram_to_glue_logic();
    sema = false;
  }
  recv_try_to_start_new_reception(ncm_interface.rhport);
} // tud_network_recv_renew

rgrr avatar Jul 15 '24 07:07 rgrr

This will not work when there are 3 ntbs handled in one call to netd_xfer_cb.

I think either a loop or a recursion is needed in case tud_network_recv_renew is called directly within tud_network_recv_cb. One call to recv_transfer_datagram_to_glue_logic only process on ntb. As we need to process multiple ntbs, it's obvious that multiple calls to recv_transfer_datagram_to_glue_logic is needed. We don't want recursions (because that might cause stack overflow on MCUs), so this means we need loops.

showier-drastic avatar Jul 15 '24 07:07 showier-drastic

you are right (again)... perhaps it could be solved by letting recv_transfer_datagram_to_glue_logic() return an indication that some work has been done (or that there is some work still waiting)?

rgrr avatar Jul 15 '24 08:07 rgrr

Yes, but I think should_process in my code is this indication.

showier-drastic avatar Jul 15 '24 08:07 showier-drastic

got that, but to be honest I'm not happy with two variables for this logic.

rgrr avatar Jul 15 '24 08:07 rgrr

Could you please check PR #2713

rgrr avatar Jul 15 '24 16:07 rgrr