esp-iot-bridge

USB RNDIS Failed to transmit from ESP32 to PC after random time (AEGHB-797)

Open wuyuanyi135 opened this issue 1 year ago • 6 comments

Hello all, sorry for the vague problem description. I have not been able to pin down the origin of the problem because it occurs randomly, so I will describe what I have observed.

I made an ESP32S3 gateway that communicates with a PC over the USB RNDIS interface implemented in this project. The gateway talks to other ESP32S3 nodes over WiFi and forwards traffic between the PC and the wireless nodes. The gateway must also act as a provisioner and fill some other roles, so it cannot simply be built by using L2Tap to bridge the USB interface to the softAP.

What I have found is that after some random period of continuous running with constant traffic (about 70KBps), the gateway stops communicating with the PC. The time to failure varies from several minutes to tens of hours, so I guess uptime doesn't matter much.

I could confirm that:

  1. The gateway did not crash (its indicator LEDs are still blinking)
  2. It was the TX direction (ESP32S3 to PC via RNDIS) that failed rather than RX (PC to ESP32S3): when I restarted the network interface in Control Panel, the log showed DHCP requests from the PC being received repeatedly. The ESP32S3 gateway tried to respond, but the PC never received the responses.
  3. When the gateway failed, ping and HTTP requests did not work (obviously, because the gateway could not transmit data)
  4. I recorded the RAM usage periodically and didn't find any growth of the RAM/PSRAM usage.
  5. When the RNDIS feature failed, the ACM CDC console still works (printing logs)
  6. The softAP is still alive and the wireless node can still talk to the gateway.

Based on these observations, I think the CPU is working, the RAM is sufficient, and the USB peripheral is alive. I suspect a bug somewhere in the RNDIS TX code. Could you suggest how I can narrow down the origin of the issue? Thank you!

Alternatively, I would like to know how to check whether a USB transmission succeeded, so that I could reboot the gateway when it fails.
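As a stopgap for the reboot idea, one option is to count consecutive failed sends and trigger recovery once a threshold is reached. The following is only a minimal sketch: `tx_watchdog_t`, `tx_watchdog_init` and `tx_watchdog_record` are hypothetical names, not part of esp-iot-bridge or TinyUSB, and it assumes the send path can report success or failure for each packet.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of esp-iot-bridge or TinyUSB. */
typedef struct {
    uint32_t consecutive_failures; /* failed sends since the last success */
    uint32_t threshold;            /* failures tolerated before recovery */
} tx_watchdog_t;

void tx_watchdog_init(tx_watchdog_t *wd, uint32_t threshold)
{
    assert(wd != NULL && threshold > 0);
    wd->consecutive_failures = 0;
    wd->threshold = threshold;
}

/* Record the outcome of one send attempt.
 * Returns true when the caller should trigger recovery (e.g. a reboot). */
bool tx_watchdog_record(tx_watchdog_t *wd, bool sent_ok)
{
    if (sent_ok) {
        wd->consecutive_failures = 0; /* any success clears the streak */
        return false;
    }
    if (wd->consecutive_failures < wd->threshold) {
        wd->consecutive_failures++;   /* saturate at the threshold */
    }
    return wd->consecutive_failures >= wd->threshold;
}
```

On the device, the caller could invoke `esp_restart()` (from `esp_system.h`) when `tx_watchdog_record` returns true. The threshold has to be large enough that a PC that is merely suspended or unplugged does not cause a reboot loop; that tuning is left open here.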

wuyuanyi135 avatar Sep 02 '24 12:09 wuyuanyi135

I would like to ask which version of idf you are using?

tswen avatar Sep 11 '24 08:09 tswen

@tswen 5.0~5.2. I have been working on this firmware continuously and have observed the same issue throughout. Since 5.3 this project no longer compiles, so I didn't try anything newer.

wuyuanyi135 avatar Sep 11 '24 08:09 wuyuanyi135

Our team has started to pin down the issue. The first thing that struck me was the global variable can_xmit.

Is it possible that can_xmit is not reset because of a race condition?

wuyuanyi135 avatar Sep 13 '24 21:09 wuyuanyi135

I can confirm that can_xmit is the problem. I added logging code as follows:

esp_err_t pkt_netif2usb(void* buffer, uint16_t len) {
  if (!tud_ready()) {
    ESP_LOGE(TAG, "TUD not ready");
    return ERR_USE;
  }

  if (tud_network_wait_xmit(500)) {
    /* if the network driver can accept another packet, we make it happen */
#if ESP_IDF_VERSION >= ESP_IDF_VERSION_VAL(5, 0, 0)
    if (tud_network_can_xmit(len)) {
#else
    if (tud_network_can_xmit()) {
#endif /* ESP_IDF_VERSION >= 5.0.0 */
      // ESP_LOG_BUFFER_HEXDUMP(" netif ==> usb", buffer, len, ESP_LOG_INFO);
      ESP_LOGD(TAG, "netif => usb: %d bytes", len);
      tud_network_xmit(buffer, len);
    } else {
      ESP_LOGE(TAG, "tud_network_can_xmit false");
    }
  } else {
    ESP_LOGE(TAG, "tud_network_wait_xmit timed out");
  }

  return ESP_OK;
}

When the issue occurs, the "tud_network_wait_xmit timed out" is flooding the log.

wuyuanyi135 avatar Sep 20 '24 16:09 wuyuanyi135

Apologies, our team is currently busy with other tasks. Are there any updates on this issue? If you have any new findings or could provide a PR, we would greatly appreciate it!

tswen avatar Oct 21 '24 07:10 tswen

@tswen Hello. After I changed the code to reset can_xmit after a retry, the issue has not reproduced over a few hours. I need to run longer and more thorough tests because of the random nature of this issue. If stability is confirmed, I will report here.
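For readers following along, the workaround can be sketched roughly as below. This is a standalone simulation, not the actual patch: `tx_state_t`, `wait_can_xmit`, `try_send` and `on_tx_complete` are hypothetical names, the polling loop stands in for the real timed wait, and it assumes "resetting" means forcing the flag back to its ready state after the wait times out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the driver's "ready to transmit" flag. */
typedef struct {
    volatile bool can_xmit;  /* true when another packet may be queued */
    uint32_t forced_resets;  /* how often the recovery path was taken */
} tx_state_t;

/* Stand-in for a timed wait: poll the flag up to max_polls times.
 * Real firmware would block on a semaphore or sleep between polls. */
bool wait_can_xmit(const tx_state_t *s, uint32_t max_polls)
{
    for (uint32_t i = 0; i < max_polls; i++) {
        if (s->can_xmit) {
            return true;
        }
    }
    return false;
}

/* Normally called from the transfer-complete callback. */
void on_tx_complete(tx_state_t *s)
{
    s->can_xmit = true;
}

/* Try to queue one packet. If the wait times out, assume the completion
 * event was lost, force the flag back to ready, and drop this packet so
 * that the next call can proceed. */
bool try_send(tx_state_t *s, uint32_t max_polls)
{
    if (!wait_can_xmit(s, max_polls)) {
        s->can_xmit = true;
        s->forced_resets++;
        return false;
    }
    s->can_xmit = false; /* transfer in flight until on_tx_complete() */
    return true;
}
```

Forcing the flag while a transfer is in fact still in flight may not be safe on real hardware, so this is best treated as a mitigation to be validated with long test runs rather than a root-cause fix.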

wuyuanyi135 avatar Oct 23 '24 02:10 wuyuanyi135