USB RNDIS Failed to transmit from ESP32 to PC after random time (AEGHB-797)
Hello all, sorry for the vague problem description. I was not able to pin down the origin of the problem due to its random nature, so I will describe what I have observed here.
I made an ESP32S3 gateway that communicates with a PC using the USB RNDIS interface implemented in this project. The gateway talks to other ESP32S3 nodes over WiFi and forwards the traffic between the PC and the wireless nodes. The gateway must also act as a provisioner and fill some other roles, so it cannot simply be built using L2Tap from the USB interface to the softAP.
What I have found is that, after a random period of continuous running with constant traffic (about 70 KBps), the gateway stops communicating with the PC. The time varies from several minutes to tens of hours, so I guess uptime doesn't matter much.
I could confirm that:
- The gateway did not crash (its indicator LEDs are still blinking)
- It was the TX (from ESP32S3 to PC via RNDIS) that failed rather than the RX (from PC to ESP32S3), because when I restarted the network interface in Control Panel, the log showed that DHCP requests from the PC were received repeatedly. The ESP32S3 gateway tried to respond, but the responses never reached the PC.
- When the gateway failed, ping and HTTP requests did not work (obviously, because the gateway could not transmit data)
- I recorded the RAM usage periodically and didn't find any growth of the RAM/PSRAM usage.
- When the RNDIS feature failed, the ACM CDC console still worked (printing logs)
- The softAP is still alive and the wireless node can still talk to the gateway.
Based on what I have found, I think the CPU is working, the RAM is sufficient, and the USB peripheral is alive. I guess something in the RNDIS TX code triggers the bug. Please suggest some hints to help me narrow down the possible origin of the issue. Thank you!
Alternatively, I would like to know how to verify whether a USB transmission succeeded, so that I could reboot the gateway when it did not.
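One way to approach the reboot fallback asked about above is to count consecutive failed transmit attempts and trigger recovery once a threshold is reached. The following is only a minimal sketch under stated assumptions: `tx_health_t`, `tx_health_init` and `tx_health_report` are hypothetical names that do not exist in this project, and the threshold value is arbitrary.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper (not part of the project): tracks consecutive
 * failed transmit attempts so the caller can decide when to recover. */
typedef struct {
    uint32_t consecutive_failures; /* failures since the last success */
    uint32_t threshold;            /* failures that trigger recovery  */
} tx_health_t;

static void tx_health_init(tx_health_t *h, uint32_t threshold)
{
    h->consecutive_failures = 0;
    h->threshold = threshold;
}

/* Report the outcome of one transmit attempt.
 * Returns true when the failure streak has reached the threshold. */
static bool tx_health_report(tx_health_t *h, bool tx_ok)
{
    if (tx_ok) {
        h->consecutive_failures = 0; /* any success clears the streak */
        return false;
    }
    if (h->consecutive_failures < UINT32_MAX) {
        h->consecutive_failures++;   /* saturate instead of wrapping */
    }
    return h->consecutive_failures >= h->threshold;
}
```

On the device, the transmit path could call `tx_health_report()` with the outcome of each attempt and call `esp_restart()` when it returns true. Note that this only detects that the driver stopped accepting packets; it cannot confirm that the PC actually received the data.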
Which version of ESP-IDF are you using?
@tswen 5.0~5.2. I have been working on this firmware continuously and observe the same issue on all of them. Since 5.3 this project no longer compiles, so I didn't try anything newer.
Our team has started to pin down the issue. The first thing that struck me was the global variable can_xmit.
Is it possible that can_xmit is sometimes not reset because of a race condition?
I can confirm that can_xmit was the problem. I added the following logging code:
esp_err_t pkt_netif2usb(void* buffer, uint16_t len) {
    if (!tud_ready()) {
        ESP_LOGE(TAG, "TUD not ready");
        return ERR_USE;
    }
    if (tud_network_wait_xmit(500)) {
        /* if the network driver can accept another packet, we make it happen */
#if ESP_IDF_VERSION >= ESP_IDF_VERSION_VAL(5, 0, 0)
        if (tud_network_can_xmit(len)) {
#else
        if (tud_network_can_xmit()) {
#endif /* ESP_IDF_VERSION >= 5.0.0 */
            // ESP_LOG_BUFFER_HEXDUMP(" netif ==> usb", buffer, len, ESP_LOG_INFO);
            ESP_LOGD(TAG, "netif => usb: %d bytes", len);
            tud_network_xmit(buffer, len);
        } else {
            ESP_LOGE(TAG, "tud_network_can_xmit false");
        }
    } else {
        ESP_LOGE(TAG, "tud_network_wait_xmit timed out");
    }
    return ESP_OK;
}
When the issue occurs, "tud_network_wait_xmit timed out" floods the log.
Apologies, our team is currently busy with other tasks. Are there any updates on this issue? If you have any new findings or could provide a PR, we would greatly appreciate it!
@tswen Hello. After I changed the code to reset can_xmit after a retry, the issue has not been reproduced within a few hours. I need to run a longer and more thorough test due to the random nature of this issue. If the stability is confirmed, I will report here.
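The idea of resetting the flag after repeated timeouts can be illustrated with a small standalone model. This is a sketch only, not the actual change: the thread does not show how the driver uses can_xmit, so the structure below assumes the flag is cleared when a packet is queued and set again by a transfer-complete callback. `xmit_gate_t`, `xmit_gate_try_acquire` and `xmit_gate_release` are hypothetical names.

```c
#include <stdbool.h>

/* Hypothetical model of a transmit gate guarded by a can_xmit flag.
 * Assumption: the real driver clears the flag when a packet is queued
 * and sets it again when the transfer completes. */
typedef struct {
    volatile bool can_xmit; /* true when a new packet may be queued */
    unsigned timeouts;      /* consecutive failed acquire attempts  */
    unsigned max_timeouts;  /* attempts before the flag is forced   */
} xmit_gate_t;

/* Try to claim the gate for one packet. Returns true if the caller
 * may transmit now. */
static bool xmit_gate_try_acquire(xmit_gate_t *g)
{
    if (g->can_xmit) {
        g->can_xmit = false; /* gate is busy until release */
        g->timeouts = 0;
        return true;
    }
    if (++g->timeouts >= g->max_timeouts) {
        /* Assume the completion event was lost: force the flag back
         * so that the next attempt can proceed. */
        g->can_xmit = true;
        g->timeouts = 0;
    }
    return false;
}

/* Called when a transfer completes normally. */
static void xmit_gate_release(xmit_gate_t *g)
{
    g->can_xmit = true;
}
```

Forcing the flag is a workaround rather than a root-cause fix. If the earlier transfer is in fact still in flight, a forced reset could let a second packet overlap with it, so the retry limit should be generous.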