esp-idf SDK 5.3.1 WiFi still bugged in TCP/IP stack (IDFGH-14128)

Answers checklist.

[X] I have read the documentation ESP-IDF Programming Guide and the issue is not addressed there.
[X] I have updated my IDF branch (master or release) to the latest version and checked that the issue is present there.
[X] I have searched the issue tracker for a similar issue and not found a similar issue.

General issue report

SDK 5.3.1 has a very deep bug in the IP/Wi-Fi stack: TCP gets stuck while UDP keeps working.

Also, the Wi-Fi sometimes acts weird during de-handshake.

Using WebSocket makes the problem worse.

The only way to solve it is to deinitialize lwIP and Wi-Fi and recreate everything again:

Merge branch 'bugfix/fix_some_wifi_bugs_241024_v5.3' into 'release/v5.3' fix(wifi): fix some wifi bugs 241024 v5.3 See merge request espressif/esp-idf!34420

Nov 25 '24 13:11 filzek

@filzek Could you share more detail to reproduce this issue?

Nov 26 '24 08:11 AxelLin

We have not yet found a straightforward way to replicate this process. However, our analysis so far indicates that btm_rrm_t is not being properly destroyed when the Wi-Fi/LWIP stack is reinitialized. This oversight results in the continuous creation of new tasks, leading to task duplication and potential resource exhaustion.

Nov 27 '24 13:11 filzek

The Wi-Fi layer is still getting corrupted and stops working under complex multitasking workloads. We have had these problems for so many versions that we would like to know of any truly robust system running on the ESP32 right now without network issues. For the last 4 years the problem seems to have stayed the same: the Wi-Fi stack halts, loses the connection and never comes back. Now the TCP/IP layer has the same problems. SDK 5.3.1 is not okay.

We really want to understand why keeping the connection working is this deeply broken. Why do we need to create a lot of patches to try to make the Wi-Fi and IP stack barely workable in a production environment?

I am about to open an offer of thousands of USD to show that the solutions are not working at all at the development level inside Espressif, and that the CPUs being sold, which could be extremely effective, can't keep working in a production environment.

Things have already gone too far and no true answer has come to the table to solve it. Everyone at Espressif passes it from one to another and no one there really takes it on!

It's time for someone to come aboard and solve the problem with the Wi-Fi and IP layer. Thousands of offline devices that need to be powered off and on again is not a true solution for this kind of service.

@euripedesrocha can someone come aboard to solve the problem for real?

Nov 29 '24 04:11 filzek

The Wi-Fi layer is still getting corrupted and stops working under complex multitasking workloads. We have had these problems for so many versions that we would like to know of any truly robust system running on the ESP32 right now without network issues. For the last 4 years the problem seems to have stayed the same: the Wi-Fi stack halts, loses the connection and never comes back. Now the TCP/IP layer has the same problems. SDK 5.3.1 is not okay.

Do you mean the older sdk versions (e.g. 5.2.x, 5.1.x) also have the same issue?

Nov 29 '24 05:11 AxelLin

@AxelLin NO WAY. The latest STABLE Wi-Fi/IP stack is in SDK 3; in all other versions the Wi-Fi on the ESP32 is completely BUGGED and makes devices crash randomly. Espressif knows all about it and has not come out in public about the problem so far. The dev team tries to hide the problem as if nothing were happening at all. We always try to show them the problem so that they can fix it, but they tend to just go deaf instead of acknowledging that this is symptomatic and widespread, can occur randomly, and needs a physical reset to make the hardware work again. This is why 3.1 is out for use, but what about the 1.1 and 3.0 hardware in the market, and the customers whose devices aren't stable because of a software failure that hangs the Wi-Fi interface and makes it irreparable?

Nov 30 '24 23:11 filzek

So why can't the Wi-Fi/TCP stack survive, and why does it start to degrade at random? This happens everywhere, so why is it not clear what to do and what not to do? We can release the code that gets stuck and halts all internal ESP32 registers; even after a CPU restart they remain a mess, so only a full power-down and power-on can recover the internal registers, and it is very simple to make it happen. This is related to the current chaos, but it is not the main point here.

The big question is why the Wi-Fi can't keep running and collapses. Why does it not report any information about the problem?

Dec 02 '24 03:12 filzek

@AxelLin I think the problem could start with something related to software/hardware BLE/Wi-Fi coexistence; some things point in that direction.

Dec 02 '24 05:12 filzek

@filzek, could you attach an sdkconfig? We've also had some networking troubles, but coexistence seems to be working pretty well for us.

Dec 06 '24 16:12 bryghtlabs-richard

@filzek sorry for the late reply. Could you provide the specific model of the AP you used, or a Wi-Fi Wireshark capture and the logs related to the issue?

Feb 12 '25 03:02 hansw123

Hi @hansw123 @bryghtlabs-richard @AxelLin

We are tracking the issue down to the lowest level possible, but we can't make the problem happen on the development bench; so far it only happens on released devices in the field.

In our tracking, the problem happens as follows:

1 - The board loses Wi-Fi and can't find the AP anymore. We detect this and take action: stop the Wi-Fi, restore the defaults, set the parameters, start it again and try to connect again.

2 - After a Wi-Fi reconnection the board sometimes loses the IP and can't get it back, so a manual DHCP stop and start must be done, but the IP address must be cleared to all zeroes first.

3 - DHCP loses the IP while the Wi-Fi layer is still connected. Just repeat the same as above.

Item 3 is tracked by continuously pinging the device's own IP on the STA interface: while it is connected, ping loss should be minimal, which means it is working, but if the IP stops answering pings, the lwIP/DHCP layer is somehow broken. Fix as in step 2.

When problem 2 happens, the Wi-Fi event handler acts on the IP event but there is no IP on the interface, so applying the fix described above from another part of the code can resolve the issue. The log says it got the IP, but it really didn't, and there is no activity on any HTTP server, WebSocket or any other part. DHCP simply doesn't work as intended, and forcing solution 2 fixes everything.

Item 1 is sometimes extremely difficult to track because the Wi-Fi simply stops working without calling any disconnection or Wi-Fi event handlers, which makes field-deployed devices very hard to track. As a workaround, a set of supervisory IPs, pings, external actions and layer checks is used to detect the Wi-Fi failure, and then solution 1 is applied.
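The three cases above could be folded into one supervisor decision, roughly as in the sketch below. This is only an illustration in plain C, not the code running on our devices: the type and function names (link_health_t, recovery_action_t, choose_recovery) are hypothetical, and the three health flags are assumed to be filled in by whatever supervisory checks the firmware already runs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical recovery actions matching the solutions described above. */
typedef enum {
    RECOVER_NONE,          /* link looks healthy, do nothing */
    RECOVER_RESTART_DHCP,  /* solution 2: zero the IP, DHCP stop, DHCP start */
    RECOVER_RESTART_WIFI   /* solution 1: stop Wi-Fi, defaults, start, connect */
} recovery_action_t;

/* Hypothetical health snapshot produced by the supervisory checks. */
typedef struct {
    bool wifi_associated;  /* supervisor believes the STA is on the AP */
    bool has_ip;           /* STA interface holds a non-zero IPv4 address */
    bool self_ping_ok;     /* ping to the STA's own IP answered recently */
} link_health_t;

/* Pick the least invasive action that matches the observed failure. */
recovery_action_t choose_recovery(const link_health_t *h)
{
    assert(h != NULL);
    if (!h->wifi_associated) {
        return RECOVER_RESTART_WIFI;   /* item 1: Wi-Fi lost, AP not found */
    }
    if (!h->has_ip) {
        return RECOVER_RESTART_DHCP;   /* item 2: reconnected but no IP */
    }
    if (!h->self_ping_ok) {
        return RECOVER_RESTART_DHCP;   /* item 3: IP stopped answering pings */
    }
    return RECOVER_NONE;
}
```

Checking the Wi-Fi state first matters because a DHCP restart cannot succeed while the STA is not associated; the actions themselves are left to the caller.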

NimBLE on the latest SDK 5.4, at the latest commit as of Feb 24 2025, is working as intended, but there are still problems with asserts.

Tomorrow I will add the sdkconfig here.

We have worked around the bugs with alternative corrections, but it would be best if they were fixed in the Wi-Fi driver; the issues tracked above could be something easy for the Wi-Fi/lwIP team to patch.

Feb 25 '25 02:02 filzek

Updating findings: sometimes the Wi-Fi never comes back alive, so there is no reconnection and it can't even find the AP anymore.

Feb 25 '25 10:02 filzek

@filzek In summary, you are currently experiencing two problems:

1 - Sometimes there are problems reconnecting to the AP.

2 - Sometimes the STA loses its IP and can't get it again.

Could you provide some logs and a Wi-Fi capture? Otherwise we can't find the root cause. You can set the log level to debug and try to reproduce.

Feb 28 '25 03:02 hansw123

our analysis so far indicates that btm_rrm_t is not being properly destroyed when the Wi-Fi/LWIP stack is reinitialized

At least this part should be possible to test with an MCVE as follows:

count number of btm_rrm_t, should be 0
bring up Wi-Fi & LWIP stacks
count number of btm_rrm_t, should be 1
Take down Wi-Fi & LWIP stacks
count number of btm_rrm_t, should be 0 or possibly 1
bring up Wi-Fi & LWIP stacks
count number of btm_rrm_t, must be 1
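
The pass/fail logic of those steps can be written down as a small host-side sketch. It assumes that btm_rrm_t shows up as a task name in a snapshot of the RTOS task list, which the earlier comment suggests but does not confirm; on a device the names would have to come from the task list, while here they are just an array of strings. The helper names count_tasks_named and mcve_counts_ok are made up for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Count how many entries in a task-name snapshot match `target`.
 * The snapshot is a plain array of strings (assumption for host testing). */
size_t count_tasks_named(const char *const names[], size_t n,
                         const char *target)
{
    assert(target != NULL);
    assert(names != NULL || n == 0);
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (names[i] != NULL && strcmp(names[i], target) == 0) {
            count++;
        }
    }
    return count;
}

/* Expected counts from the steps above:
 *   before any bring-up      -> 0
 *   after first bring-up     -> 1
 *   after teardown           -> 0 or possibly 1
 *   after second bring-up    -> exactly 1 (2 or more means a leak) */
bool mcve_counts_ok(size_t before, size_t up1, size_t down, size_t up2)
{
    return before == 0 && up1 == 1 && down <= 1 && up2 == 1;
}
```

A run that observes two matching tasks after the second bring-up fails the check, which is the duplication described earlier in the thread.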

Jul 08 '25 13:07 bryghtlabs-richard

@filzek @hansw123 @bryghtlabs-richard @AxelLin

Hello, I'm encountering a similar problem. The device, based on the esp32c2 (esp8684) and the latest release IDF 5.2.5, works most of the time, but sometimes loses its IP address and cannot renew it using DHCP Discover while still being connected to Wi-Fi. Restarting the router (in order to forcibly disconnect the device from the Wi-Fi network) does nothing: the device comes back to the Wi-Fi network once the router is alive and still cannot renew the IP. The only thing (aside from power-cycling the device) that brings the IP back is starting the device in AP+STA mode with the DHCP server ON (captive portal) and then turning it OFF (back to STA only).

I've managed to sniff the network using Wireshark on a PC connected to the same Wi-Fi and found that the "bricked" device tries to renew its IP using malformed DHCP Discover packets. Wireshark presents them as raw BOOTP (the DHCP base protocol), reports a bogus UDP packet length (316 vs 308), and apparently cannot find the "DHCP magic cookie" (63 82 53 63) because it is shifted to the left by 8 bytes. Clearly the hw_addr padding field, the server hostname field or the boot file field is missing data, since the Client MAC Address field has the correct value and position. At first I thought it might be caused by low memory, but luckily I have diagnostic stats in my captive portal that show it wasn't.
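
For anyone who wants to check their own captures for the same symptom, here is a minimal sketch in plain C. In a well-formed packet the fixed BOOTP header is 236 bytes (28 bytes up to giaddr, 16 bytes chaddr, 64 bytes sname, 128 bytes file), so the magic cookie should start at offset 236 of the UDP payload. The function names are made up for this example, and the search simply takes the first match, so a payload that happens to contain 63 82 53 63 earlier (for example in xid) would fool it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Offset of the magic cookie in a well-formed BOOTP/DHCP payload. */
#define BOOTP_COOKIE_OFFSET 236

static const uint8_t DHCP_MAGIC_COOKIE[4] = { 0x63, 0x82, 0x53, 0x63 };

/* Return the offset of the first magic cookie in the UDP payload, or -1. */
long dhcp_find_cookie(const uint8_t *payload, size_t len)
{
    if (payload == NULL || len < sizeof DHCP_MAGIC_COOKIE) {
        return -1;
    }
    for (size_t i = 0; i + sizeof DHCP_MAGIC_COOKIE <= len; i++) {
        if (memcmp(payload + i, DHCP_MAGIC_COOKIE,
                   sizeof DHCP_MAGIC_COOKIE) == 0) {
            return (long)i;
        }
    }
    return -1;
}

/* Report how far the cookie is from its expected position.
 * Negative means it appears too early ("shifted to the left").
 * Returns false if no cookie is present at all. */
bool dhcp_cookie_shift(const uint8_t *payload, size_t len, long *shift)
{
    assert(shift != NULL);
    long at = dhcp_find_cookie(payload, len);
    if (at < 0) {
        return false;
    }
    *shift = at - BOOTP_COOKIE_OFFSET;
    return true;
}
```

Fed with a payload whose cookie sits at offset 228, dhcp_cookie_shift reports a shift of -8, which matches what the capture shows.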

I don't know how to reproduce it, as it happens randomly. One factor may be the Wi-Fi signal, which is quite low in the place where it happens (RSSI -87).

The device had been up continuously for about 30 days since its last power cycle, and the DHCP issue happened twice; both times the fix was enabling the device's captive portal.

I've attached a pcap from Wireshark containing the truncated packets from the 68:25:dd:94:c3:70 device and the correct DHCP Discovers from 80:65:99:f3:3f:9e, to compare the payloads.

dhcp issue.zip

Today I updated it and added the fix from this commit in ESP-IDF 5.5.1 (which increases LWIP_DHCP_OPTIONS_LEN to 69) and a custom DHCP watchdog that restarts the DHCP client if it cannot obtain an address for more than 30 s. We will see if it helps.
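
The timing logic of such a watchdog can be sketched in plain C as below. This is not the exact code on the device, only the general idea: the names dhcp_watchdog_t, dhcp_watchdog_init and dhcp_watchdog_poll are made up, the caller is assumed to supply a millisecond tick and a "has IP" flag, and the restart itself is left to the caller (on ESP-IDF that would presumably be esp_netif_dhcpc_stop() followed by esp_netif_dhcpc_start()).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Restart the DHCP client after more than 30 s without an address. */
#define DHCP_WATCHDOG_TIMEOUT_MS 30000u

typedef struct {
    bool     waiting;           /* true while no address is held */
    uint32_t waiting_since_ms;  /* tick at which the wait (re)started */
} dhcp_watchdog_t;

void dhcp_watchdog_init(dhcp_watchdog_t *wd)
{
    assert(wd != NULL);
    wd->waiting = false;
    wd->waiting_since_ms = 0;
}

/* Call periodically. Returns true when the caller should restart
 * the DHCP client. */
bool dhcp_watchdog_poll(dhcp_watchdog_t *wd, bool has_ip, uint32_t now_ms)
{
    assert(wd != NULL);
    if (has_ip) {
        wd->waiting = false;            /* healthy: disarm */
        return false;
    }
    if (!wd->waiting) {
        wd->waiting = true;             /* address just lost: arm */
        wd->waiting_since_ms = now_ms;
        return false;
    }
    /* Unsigned subtraction stays correct across tick wraparound. */
    if ((uint32_t)(now_ms - wd->waiting_since_ms) > DHCP_WATCHDOG_TIMEOUT_MS) {
        wd->waiting_since_ms = now_ms;  /* re-arm so restarts are spaced out */
        return true;
    }
    return false;
}
```

Re-arming after each restart gives the DHCP client another full 30 s window before the next attempt instead of restarting it on every poll.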

Sep 29 '25 08:09 marmurek